Summary:

In the previous article of the BERT Model Primer series, "Introduction to the Attention Mechanism", we used machine translation examples to explain the Encoder-Decoder model and the basic principles of the Attention model. This article implements that model and explains it in detail alongside the previous article, which helps deepen our understanding. The examples are based on Python and TensorFlow 2.4; it doesn't matter if you haven't used them before, as all the key points in the code are explained in detail.

Before you begin, it’s easier to learn with a clear goal in mind:

  1. Text preprocessing
  2. Encoder implementation
  3. Decoder implementation
  4. Attention model implementation
  5. Model training
  6. English -> Chinese translation

The code for this article has been pushed to github.com/rotbit/nmt. The goal of this code is not to build a usable commercial product, so it does not chase the best possible results; if it helps us better understand the Encoder, Decoder, and Attention models, it has achieved its purpose.

1. Text preprocessing

The task of text preprocessing is simply to convert our sentences into vectors of numbers, for example turning "Have you eaten yet?" into the vector [2, 543, 56, 12, 76]. We do this because computers can't read text; they only understand numbers, so the text has to be turned into something a computer can work with. How is it done in detail? Here is the overall flow.

The process is very simple. The main steps are: read in the file, preprocess the text, build the dictionary, and convert the text to vectors. Great oaks from little acorns grow, so let's start with the most basic functions.

seg_char: split Chinese text into individual characters

```python
import re

# Split Chinese text into single characters while leaving English words intact,
# e.g. "我爱tensorflow" -> ['我', '爱', 'tensorflow']
def seg_char(sent):
    pattern_char_1 = re.compile(r'([\W])')
    parts = pattern_char_1.split(sent)
    parts = [p for p in parts if len(p.strip()) > 0]
    # Split on every Chinese character ([\u4e00-\u9fa5])
    pattern = re.compile(r'([\u4e00-\u9fa5])')
    chars = pattern.split(sent)
    chars = [w for w in chars if len(w.strip()) > 0]
    return chars
```

We just need to know what the input is and what the output looks like.
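
As a quick sanity check (a minimal sketch; the exact output depends on the regular expressions above), calling seg_char on a mixed Chinese/English sentence splits the Chinese into single characters while keeping the English word whole:

```python
print(seg_char("我爱tensorflow"))
# ['我', '爱', 'tensorflow']
```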

preprocess_sentence: sentence preprocessing

```python
def preprocess_sentence(w, type):
    # type == 0: English
    if type == 0:
        # Put a space around punctuation so it becomes a separate token
        w = re.sub(r"([?.!,¿])", r" \1 ", w)
        w = re.sub(r'[" "]+', " ", w)
    # type == 1: Chinese
    if type == 1:
        # seg_list = jieba.cut(w)
        seg_list = seg_char(w)
        w = " ".join(seg_list)
    w = '<start> ' + w + ' <end>'
    return w
```

Let's run it and see what the inputs and outputs look like.

en = "I love tensorflow." pre_en = preprocess_sentence(en, 0) print("pre_en=", Print ("pre_cn=", pre_cn) print("pre_cn=", pre_cn)Copy the code

Output:

```
pre_en= <start> I love tensorflow . <end>
pre_cn= <start> 我 爱 tensorflow <end>
```

Take a look at the output. The input text is now separated by spaces, and <start> and <end> are added at the beginning and the end. These identifiers are used to mark the start and end of the text during model training later on.

create_dataset: text loading and preprocessing

```python
import io

def create_dataset(path, num_examples):
    lines = io.open(path, encoding='utf-8').read().strip().split('\n')
    english_words = []
    chinese_words = []
    for l in lines[:num_examples]:
        word_arrs = l.split('\t')
        if len(word_arrs) < 2:
            continue
        english_w = preprocess_sentence(word_arrs[0], 0)
        chinese_w = preprocess_sentence(word_arrs[1], 1)
        english_words.append(english_w)
        chinese_words.append(chinese_w)
    # e.g. (['<start> Hi . <end>', ...], ['<start> 嗨 。 <end>', ...])
    return english_words, chinese_words
```

The dataset we are going to use is cmn.txt, which you can download here. Let's look at a few rows. Each row in the dataset is one sample and is split into three tab-separated columns: the first column is the English sentence, the second is the corresponding Chinese translation, and the third we don't need, so we simply throw it away. The create_dataset function reads this file and returns the processed English and Chinese lists.

```
Hi.	嗨。	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #891077 (Martha)
Hi.	你好。	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4857568 (musclegirlxyp)
Run.	你用跑的。	CC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #3748344 (egg0073)
Wait!	等等！	CC-BY 2.0 (France) Attribution: tatoeba.org #1744314 (Belgavox) & #4970122 (WZHD)
```

As usual, run through the code to see what it actually outputs.

```python
inp_lang, targ_lang = create_dataset('cmn.txt', 4)
print("inp_lang={}, targ_lang={}".format(inp_lang, targ_lang))
```

inp_lang[0]='<start> Hi . <end>', and the corresponding Chinese translation is targ_lang[0]='<start> 嗨 。 <end>'.

```
inp_lang=['<start> Hi . <end>', '<start> Hi . <end>', '<start> Run . <end>', '<start> Wait ! <end>'],
targ_lang=['<start> 嗨 。 <end>', '<start> 你 好 。 <end>', '<start> 你 用 跑 的 。 <end>', '<start> 等 等 ！ <end>']
```

load_dataset, tokenize: building the dictionary and converting text to vectors

```python
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

def load_dataset(path, num_examples=None):
    inp_lang, targ_lang = create_dataset(path, num_examples)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
```

Run load_dataset:

```python
inp_tensor, targ_tensor, inp_tokenizer, targ_tokenizer = load_dataset("cmn.txt", 4)
print("inp_tensor={}, inp_tokenizer={}".format(inp_tensor, inp_tokenizer.index_word))
```

Take a look at the output

```
inp_tensor=[[1 4 3 2] [1 4 3 2] [1 5 3 2] [1 6 7 2]],
inp_tokenizer={1: '<start>', 2: '<end>', 3: '.', 4: 'hi', 5: 'run', 6: 'wait', 7: '!'}
```

inp_tokenizer is the constructed dictionary; it is built by assigning a unique integer ID to every word. inp_tensor is the vectorized text, and every element in each vector corresponds to a word in the dictionary.
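
To make the mapping concrete, here is a small check (a sketch, assuming the four-sample dictionary built above): texts_to_sequences turns words into their IDs, and index_word maps the IDs back.

```python
print(inp_tokenizer.texts_to_sequences(['<start> hi . <end>']))   # [[1, 4, 3, 2]]
print([inp_tokenizer.index_word[i] for i in [1, 4, 3, 2]])        # ['<start>', 'hi', '.', '<end>']
```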

This is the end of text preprocessing.

2. Encoder implementation

How the Encoder is used was covered in "BERT Model Primer series: Introduction to the Attention Mechanism", so I won't repeat it here. In the following implementation, the Encoder consists of two parts: an Embedding layer and an RNN layer. Look at the code first.

```python
import tensorflow as tf

# Encoder
class Encoder(tf.keras.Model):
    # vocab_size: dictionary size
    # embedding_dim: word embedding dimension
    # enc_units: number of output units of the encoder RNN
    # batch_sz: batch size
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz    # batch size
        self.enc_units = enc_units  # number of RNN units
        # Maps each integer word ID to a fixed-length dense vector
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # Create an RNN layer
        self.rnn = tf.keras.layers.SimpleRNN(self.enc_units,
                                             return_sequences=True,
                                             return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.rnn(x, initial_state=hidden)
        return output, state

    # On the concept of a Tensor, see https://www.tensorflow.org/guide/tensor
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
```

Let’s parse what the parameters mean

The meanings of the parameters to the __init__ function:

vocab_size: the size of the dictionary, i.e. the number of distinct words it contains. The dictionary is the one built by the load_dataset function.

embedding_dim: the word embedding dimension. As mentioned earlier, we assign each word an integer so that our sentences can be encoded as vectors, but this encoding is flawed: it cannot capture the relationship between two words. So after the input is integer-encoded, it is re-encoded by an Embedding layer into fixed-length dense vectors; embedding_dim is the dimension of the vectors produced by the Embedding layer. For why an Embedding layer is needed after integer encoding, see word embeddings.
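
As a minimal illustration (not part of the project code; the numbers here are made up), an Embedding layer maps a batch of integer word IDs of shape (batch, sequence length) to dense vectors of shape (batch, sequence length, embedding_dim):

```python
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=10, output_dim=4)  # vocabulary of 10 words, 4-dim vectors
ids = tf.constant([[1, 4, 3, 2]])                            # one sentence of four word IDs
print(emb(ids).shape)                                        # (1, 4, 4)
```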

enc_units: the number of output units of the encoder RNN. This example uses only a single RNN layer, but multiple layers could be stacked; enc_units refers to the number of units in the final output layer.

batch_sz: the batch size. In deep learning, the loss used for each parameter update is not computed from a single {data: label} pair, but from a group of batch_size {data: label} pairs.

Besides __init__, the Encoder also has a call function, which contains the logic that actually performs the encoding. Let's look at its parameters.

Meaning of the call function's parameters and output:

x: the training samples, i.e. the vectorized text returned by load_dataset. It is a BATCH_SIZE * sample-length matrix; in other words, x holds BATCH_SIZE samples.

hidden: a BATCH_SIZE * enc_units matrix. The hidden state of a recurrent neural network depends not only on the current input x but also on the hidden state from the previous step, so the previous hidden state has to be passed in. Here call is being invoked at the initial step, so we only need to give it an initial value.

So the question is: why is hidden a BATCH_SIZE * enc_units matrix?

To put it simply, during training we feed in BATCH_SIZE samples at a time, and our RNN defines enc_units neurons, meaning that for each input word there are enc_units output values. The full sequence of hidden outputs of our RNN therefore has shape BATCH_SIZE * word_size * enc_units, where word_size is the number of words in a sample.

For the initial value, however, we only need the state of a single time step for each of the BATCH_SIZE samples, i.e. the hidden parameter of the call function is a BATCH_SIZE * enc_units matrix, exactly what initialize_hidden_state returns.

output: a BATCH_SIZE * word_size * enc_units tensor, where word_size is the number of words in a sample.

Understanding input and output is very helpful in understanding code, so here’s a picture to summarize the above.

Encoder data flow chart

With the analysis above you should have a good sense of the inputs and outputs, so let's run it directly and look at the results.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
import encoder
import preprocess

input_tensor, target_tensor, inp_lang, targ_lang = preprocess.load_dataset("./cmn.txt")
input_tensor_train, input_tensor_val, target_tensor_train, \
    target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 32
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256   # word embedding dimension
units = 512           # number of RNN units
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train))
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
example_input_batch, example_target_batch = next(iter(dataset))

# Define the encoder
encoder = encoder.Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
# Initialize a hidden state
sample_hidden = encoder.initialize_hidden_state()
# Run the encoder
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
```

Encoder output result:

```
output shape: (batch size, sequence length, units) (32, 36, 512)
Hidden state shape: (batch size, units) (32, 512)
```

This is the end of Encoder implementation, but… We’re not done yet.

Here we start on the Decoder implementation. The Decoder's job is to translate the text encoded by the Encoder into the target text; yes, its function is that simple. We also implement the Decoder with an RNN. Without further ado, look at the code first.

3. Decoder implementation

```python
import tensorflow as tf
import attention

class Decoder(tf.keras.Model):
    # vocab_size: dictionary size
    # embedding_dim: word embedding dimension
    # dec_units: number of output units of the decoder RNN
    # batch_sz: batch size
    # attention: the attention layer used to compute the context vector
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz, attention):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.rnn = tf.keras.layers.SimpleRNN(self.dec_units,
                                             return_sequences=True,
                                             return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = attention

    # x: previously translated word(s), hidden: previous hidden state, enc_output: encoder output
    def call(self, x, hidden, enc_output):
        # context_vector shape == (batch_size, hidden size)
        # attention_weights shape == (batch_size, sequence length, 1)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # Concatenate the context vector with the embedded input
        # x shape after concat == (batch_size, 1, embedding_dim + hidden size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # output shape == (batch_size, time_step, units)
        output, state = self.rnn(x)
        # Flatten to (batch_size * time_step, units)
        output = tf.reshape(output, (-1, output.shape[2]))
        # x shape == (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights
```

Let’s also parse the parameters of Decoder

Call function parameter parsing

x: the translation result from the previous step. For example, for "machine learning" => "机器 学习":

1. If the word currently being translated is "机器" (machine), then x is the start identifier "<start>".

2. If the word currently being translated is "学习" (learning), then x is "机器" (machine).

This is teacher forcing: using the target output of the previous step as part of the current input, a fast and effective way to train recurrent neural network models; see "A New Algorithm for Training Recurrent Networks". A tiny sketch of the idea follows.
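
As a tiny illustration of teacher forcing (illustrative IDs only, not from the real dictionary; the full loop appears later in train_step): at step t the decoder is fed the ground-truth word from the target sequence, regardless of what it predicted at the previous step.

```python
import tensorflow as tf

# Ground-truth target IDs for a batch of two sentences (made-up IDs)
targ = tf.constant([[1, 7, 9, 2],
                    [1, 5, 6, 2]])
t = 2
# With teacher forcing, the decoder input at step t is the true word from step t-1
dec_input = tf.expand_dims(targ[:, t - 1], 1)
print(dec_input.numpy())   # [[7], [5]]
```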

hidden: the hidden state returned by the Encoder; its shape is BATCH_SIZE * enc_units.

enc_output: the Encoder output; its shape is BATCH_SIZE * word_size * enc_units.

The Decoder also has an attention parameter: the layer that computes attention, passed in when the Decoder is constructed. How attention is computed was explained in "BERT Model Primer series: Introduction to the Attention Mechanism", so we won't go into detail here; we use the dot product as the attention scoring method.

4. Attention model implementation:

```python
import tensorflow as tf

class DotProductAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(DotProductAttention, self).__init__()

    # query: decoder hidden state, shape (batch_size, units)
    # value: encoder output, shape (batch_size, sequence length, units)
    def call(self, query, value):
        # hidden shape == (batch_size, units, 1), e.g. 32 * 512 * 1
        hidden = tf.expand_dims(query, -1)
        # score shape == (batch_size, sequence length, 1)
        score = tf.matmul(value, hidden)
        # Normalize the scores into attention weights
        attention_weights = tf.nn.softmax(score, axis=1)
        # Weighted sum of the encoder outputs -> context vector (batch_size, units)
        context_vector = attention_weights * value
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
```

Now that the main parts are defined, let’s run them

```python
import tensorflow as tf
import decoder
import attention
import encoder
import preprocess

input_tensor, target_tensor, inp_lang, targ_lang = preprocess.load_dataset("./cmn.txt")

BUFFER_SIZE = len(input_tensor)
BATCH_SIZE = 32
steps_per_epoch = len(input_tensor) // BATCH_SIZE
embedding_dim = 256   # word embedding dimension
units = 512           # number of RNN units
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

# Build the dataset
dataset = tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
example_input_batch, example_target_batch = next(iter(dataset))

# Define the encoder
encoder = encoder.Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

# Define the attention layer
attention_layer = attention.DotProductAttention()
context_vector, attention_weights = attention_layer(sample_hidden, sample_output)
print('context_vector shape: {}'.format(context_vector.shape))
print('attention_weights shape: {}'.format(attention_weights.shape))

# Define the decoder
dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
decoder = decoder.Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE, attention_layer)
dec_output, dec_state, attention_weights = decoder(dec_input, sample_hidden, sample_output)
print('decoder shape: (batch size, sequence length, units) {}'.format(dec_output.shape))
print('decoder Hidden state shape: (batch size, units) {}'.format(dec_state.shape))
```

5. Model training:

Now that the Encoder, Decoder, and Attention have been implemented, we can start defining the training steps. In the data preparation step we already used

```python
dataset.batch(BATCH_SIZE, drop_remainder=True)
```

to group the training data into batches of BATCH_SIZE, so the smallest unit of each training step is a batch of BATCH_SIZE samples. Let's look at the concrete training steps.

Single BATCH training

```python
import tensorflow as tf
import optimizer

# Train the model on a single batch
# encoder: the Encoder model defined above    decoder: the Decoder model defined above
# inp: tensor of source-language training data (text to be translated)
# targ: tensor of target-language training data
# targ_lang: target-language tokenizer
# enc_hidden: initial encoder hidden state
def train_step(encoder, decoder, inp, targ, targ_lang, enc_hidden, BATCH_SIZE):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Loop over the target sequence, one time step of the whole batch at a time
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            # Accumulate the loss for this batch
            loss += optimizer.loss_function(targ[:, t], predictions)
            # Teacher forcing - use the target word as the next input
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
```

Overall training process:

```python
def train(epochs):
    for epoch in range(epochs):
        enc_hidden = encoder.initialize_hidden_state()
        total_loss = 0
        # The dataset yields steps_per_epoch batches per pass
        for (batch, (inp, targ)) in enumerate(dataset.take(len(input_tensor))):
            batch_loss = train_function.train_step(encoder, decoder, inp, targ,
                                                   targ_lang, enc_hidden, BATCH_SIZE)
            total_loss += batch_loss
            if batch % 100 == 0:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))
```

6. English -> Chinese translation

```python
def evaluate(sentence):
    sentence = preprocess.preprocess_sentence(sentence, 0)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    # max_length_inp / max_length_targ are the maximum source / target sequence lengths,
    # i.e. input_tensor.shape[1] and target_tensor.shape[1]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    # Generate at most max_length_targ words
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        attention_weights = tf.reshape(attention_weights, (-1,))
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence
        # The predicted word is fed back as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, sentence


def translate(sentence):
    result, sentence = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))
```

Run it:

```python
train(20)
translate("hello")
translate("he is swimming in the river")
```

The output

```
Epoch 20 Batch 300 Loss 0.5712
Epoch 20 Batch 400 Loss 0.4970
Epoch 20 Batch 500 Loss 0.5692
Epoch 20 Batch 600 Loss 0.6004
Epoch 20 Batch 700 Loss 0.6078
Input: <start> hello <end>
Predicted translation: 你 好 。 <end>
Input: <start> he is swimming in the river <end>
Predicted translation: 我 <end>
```

Finally, this example does not perform very well, for several reasons:

1. The amount of data is insufficient; a dataset of a few tens of thousands of sentence pairs is small for machine translation.

2. The number of training iterations is insufficient; increasing the number of epochs would improve things further.

3. The Attention model still has room for optimization; we only used the dot product to compute attention, and better scoring methods exist.

4. A plain RNN is used; it could be replaced with an LSTM, GRU, or other recurrent network for further experimentation, as in the sketch below.
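
For example, swapping the SimpleRNN in the Encoder for a GRU only requires changing the layer definition, since a GRU with return_state=True returns its output and state the same way (a hypothetical tweak, not part of the repository code):

```python
# In Encoder.__init__, replace the SimpleRNN with a GRU:
self.rnn = tf.keras.layers.GRU(self.enc_units,
                               return_sequences=True,
                               return_state=True,
                               recurrent_initializer='glorot_uniform')
# Encoder.call stays unchanged: output, state = self.rnn(x, initial_state=hidden)
```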

Reference:

www.tensorflow.org/tutorials/t…

github.com/rotbit/nmt