
Introduction:

This is a repost of my previous blog post on CSDN, on Neural Machine Translation (NMT).

Neural Machine Translation

In this class, we use RNNs to do machine translation. There are many kinds of machine translation models; in this class, we introduce the Seq2Seq model and use it to translate English into German. Machine translation is a many-to-many problem: both the input and the output are sequences.

First, we have to deal with the data.

Machine Translation Data

Here, we’re just using a small data set for learning purposes. As shown in the picture, the file has two columns, with English sentences on the left and German sentences on the right. One English sentence can correspond to multiple German sentences; given an English sentence, the translation is considered correct if it matches any one of them. We need to preprocess the data first, including converting the text to lowercase, removing punctuation marks, and so on.
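
A minimal sketch of this preprocessing step, assuming a hypothetical tab-separated file deu.txt with an English column and a German column (the file name and column layout are assumptions, not necessarily the lesson's actual data set):

```python
import re

def preprocess(text):
    """Lowercase the sentence and drop punctuation, keeping letters and spaces."""
    text = text.lower()
    return re.sub(r"[^a-zäöüß ]", "", text).strip()

# Hypothetical tab-separated file: English sentence \t German sentence
pairs = []
with open("deu.txt", encoding="utf-8") as f:
    for line in f:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            eng, deu = preprocess(cols[0]), preprocess(cols[1])
            pairs.append((eng, "\t" + deu + "\n"))   # add start/stop markers to German
```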

After preprocessing, we do Tokenization, splitting each sentence into words, characters, or other units. Note that Tokenization requires two different Tokenizers, one for each language, and two dictionaries are created after Tokenization.

Tokenization can be done at the character level or the word level. In this lesson, the character level is used for simplicity. In real systems, where the training data sets are large, word-level tokenization is the norm.

It’s also easy to understand why two different Tokenizers are used. At the character level, different languages usually have different alphabets, so the two languages need two different vocabularies. This is even more true at the word level, and different languages also split words in different ways.

The Tokenizer provided by Keras is used, and the dictionaries are generated automatically when it finishes. On the left is the English dictionary: 26 letters plus the space character. On the right is the German dictionary: 26 letters, the space character, and two special characters, one for start and one for stop, \t for start and \n for stop.
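
As a rough sketch of how this might look with the Keras Tokenizer (char_level=True gives character-level tokenization; the variable names build on the pairs list from the preprocessing sketch above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

eng_texts = [e for e, d in pairs]
deu_texts = [d for e, d in pairs]

# Two separate character-level Tokenizers, one per language
eng_tokenizer = Tokenizer(char_level=True)
eng_tokenizer.fit_on_texts(eng_texts)

deu_tokenizer = Tokenizer(char_level=True)
deu_tokenizer.fit_on_texts(deu_texts)

print(eng_tokenizer.word_index)   # roughly: 26 letters plus the space character
print(deu_tokenizer.word_index)   # letters, space, plus '\t' (start) and '\n' (stop)
```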

After this, an English sentence becomes a list of characters, which is then mapped through its dictionary to a list of numbers. The same is done for German sentences.

After one-hot encoding, each character is represented by a one-hot vector, and a sentence is represented by a matrix. We stack these one-hot vectors into a matrix, which is the input to the RNN.
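
Continuing the sketch, the sentences can be mapped to integer sequences and then to one-hot tensors (the helper name and tensor shapes are my own, not from the lesson):

```python
import numpy as np

eng_seqs = eng_tokenizer.texts_to_sequences(eng_texts)
deu_seqs = deu_tokenizer.texts_to_sequences(deu_texts)

num_eng_tokens = len(eng_tokenizer.word_index) + 1   # +1: Keras indices start at 1
num_deu_tokens = len(deu_tokenizer.word_index) + 1
max_eng_len = max(len(s) for s in eng_seqs)
max_deu_len = max(len(s) for s in deu_seqs)

def to_one_hot(sequences, vocab_size, max_len):
    """Build a [num_sentences, max_len, vocab_size] tensor of one-hot character vectors."""
    x = np.zeros((len(sequences), max_len, vocab_size), dtype="float32")
    for i, seq in enumerate(sequences):
        for t, idx in enumerate(seq):
            x[i, t, idx] = 1.0
    return x

encoder_input = to_one_hot(eng_seqs, num_eng_tokens, max_eng_len)
decoder_full  = to_one_hot(deu_seqs, num_deu_tokens, max_deu_len)
```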

Seq2seq Model

Now that the data is ready, we need to build a Seq2Seq model and train it.

The Seq2Seq model consists of an Encoder and a Decoder. The Encoder is an RNN, LSTM, or similar network used to extract features from the input sentence. The last state of the Encoder is the feature extracted from the sentence and contains the sentence's information; the rest of the Encoder's states are useless and are thrown away. The output of the Encoder is the LSTM's last hidden state h and the cell state c (the "conveyor belt").

The Seq2Seq model also has a Decoder to generate the German. The Decoder is essentially the text generator described in the previous blog post. The only difference from that generator is the initial state: there it was an all-zero vector, while here it is the Encoder's last states h and c. The Encoder summarizes the English sentence in these vectors, and the Decoder uses them to know that the English sentence means "go away".

The first Decoder input is the start character \t, and the Decoder outputs a probability distribution, call it the vector p. The next character in the German sentence is the label; we take the label's one-hot vector y and compute the loss between p and y. We want p to be as close to y as possible, that is, the loss to be as small as possible. With the loss function we can backpropagate to compute gradients, which flow from the loss to the Decoder and then from the Decoder to the Encoder, and gradient descent is used to update the parameters of both the Encoder and the Decoder so that the loss goes down.

Then use \t followed by "m" as input and the character "a" as the label to continue computing the loss, and so on. Repeat the process until the last character is reached, using the whole German sentence as input and the stop character \n as the label. Repeat this over all the (English, German) pairs to train the model.
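
In code, this teacher-forcing setup just shifts the German one-hot tensor by one time step (decoder_full comes from the one-hot sketch above):

```python
# Decoder input: '\t' + characters (everything except the final '\n')
decoder_input = decoder_full[:, :-1, :]
# Decoder target: characters + '\n' (everything except the leading '\t'),
# i.e. the same sentence shifted left by one character
decoder_target = decoder_full[:, 1:, :]
```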

Its Keras structure is as follows. As shown, the input to the Encoder network is a one-hot encoding, represented as a matrix. The Encoder network has one or more LSTM layers used to extract features from the English sentence; the LSTM outputs are the final state h and the cell state c. These become the Decoder network's initial states, which is how the Encoder and Decoder are linked. The Decoder's input is the part of the German sentence seen so far; the Decoder outputs its current state h, and a fully connected layer then outputs the prediction for the next character.
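
A minimal sketch of such an Encoder-Decoder in Keras, in the spirit of the standard character-level Seq2Seq example (latent_dim and the training hyperparameters are arbitrary choices, not values from the lesson):

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim = 256   # size of the LSTM state

# Encoder: reads one-hot English characters, keeps only the final states h and c
encoder_inputs = Input(shape=(None, num_eng_tokens))
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: reads one-hot German characters, initialized with the Encoder states
decoder_inputs = Input(shape=(None, num_deu_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_deu_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.fit([encoder_input, decoder_input], decoder_target,
          batch_size=64, epochs=50, validation_split=0.2)
```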

Inference Using the Seq2seq Model

With the Seq2Seq model trained, we can use it to translate English into German. Feed each character of an English sentence into the Encoder. The Encoder accumulates information about the sentence in its states h and c, and outputs the final states, called h_0 and c_0, which are the features extracted from the sentence and are passed to the Decoder.

The Encoder outputs h_0 and c_0 are used as the Decoder's initial states. That way, the Decoder knows the sentence is "go away", and from here it works just like the text generator we talked about in class. The start character is fed into the Decoder. With this new input, the Decoder updates its state h and cell state c and predicts the next character. The Decoder's output is a probability value for each character, and we need to select a character based on these probabilities: either pick the character with the highest probability or sample randomly according to the probabilities. Let's say we get the character "m", so we write down "m".

Now the Decoder states are h_1 and c_1. With the character "m" as input, the LSTM predicts the next character: based on the states h_1 and c_1 and the new input "m", the LSTM updates the states to h_2 and c_2 and outputs a probability distribution. Sampling from this distribution, we might get the character "a". Write down the character "a".

Then the process continues: update the state, generate a new character, and use the newly generated character as the input for the next round.

At the end of the run, we can get the final output, which is the translated German.
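
Putting the inference steps together, here is a rough greedy-decoding sketch that reuses the layers defined in the training sketch above (translate and index_to_char are hypothetical helper names; random sampling could replace the argmax):

```python
import numpy as np

# Inference models: the Encoder returns (h0, c0); the Decoder runs one step at a time
encoder_model = Model(encoder_inputs, encoder_states)

dec_state_h = Input(shape=(latent_dim,))
dec_state_c = Input(shape=(latent_dim,))
dec_out, h, c = decoder_lstm(decoder_inputs, initial_state=[dec_state_h, dec_state_c])
dec_out = decoder_dense(dec_out)
decoder_model = Model([decoder_inputs, dec_state_h, dec_state_c], [dec_out, h, c])

index_to_char = {i: ch for ch, i in deu_tokenizer.word_index.items()}

def translate(one_hot_english):
    # h0 and c0 summarize the English sentence
    h, c = encoder_model.predict(one_hot_english[np.newaxis])
    # start with the '\t' character
    target = np.zeros((1, 1, num_deu_tokens))
    target[0, 0, deu_tokenizer.word_index["\t"]] = 1.0
    result = ""
    while True:
        probs, h, c = decoder_model.predict([target, h, c])
        idx = int(np.argmax(probs[0, -1]))          # greedy: most probable character
        ch = index_to_char.get(idx, "")
        if ch == "\n" or len(result) >= max_deu_len:
            break
        result += ch
        target = np.zeros((1, 1, num_deu_tokens))   # feed the new character back in
        target[0, 0, idx] = 1.0
    return result
```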

Summary

In this lesson we talked about using the Seq2Seq model to do machine translation. The model has an Encoder network and a Decoder network. In this lesson's example, the Encoder's input is an English sentence; for each character that comes in, the RNN updates its state, accumulating the sentence's information into the state. The last state of the Encoder is the feature extracted from the English sentence. The Encoder outputs only this last state and throws away all previous states. The last state is passed to the Decoder network and used as the Decoder network's initial state.

Once initialized this way, the Decoder knows the English sentence that was entered, and it acts as a text generator to generate a German sentence.

The start character is first fed into the Decoder RNN, which updates its state to s_1. The fully connected layer then outputs the predicted probability distribution, denoted p_1, and the next character z_1 is sampled from p_1.

The Decoder network then takes z_1 as input and updates its state to s_2. The fully connected layer outputs the predicted probability distribution, denoted p_2, and the next character z_2 is sampled from p_2.

This process is repeated until the stop character is output, at which point text generation ends and the generated sequence is returned. We have designed a Seq2Seq model whose Encoder and Decoder are LSTMs. So how can we improve this model?

The principle of the Seq2Seq model is as follows: the input is fed into the Encoder, and its information is compressed into the state vectors. The last state of the Encoder is a summary of the whole sentence; ideally, it contains the complete information of the entire English sentence. But if the English sentence is long, the information near the beginning is likely to be forgotten.

One obvious improvement is to use a bidirectional LSTM on the Encoder side. The Decoder is a text generator, so it must remain unidirectional.
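
A rough sketch of what that change could look like in the Keras model above (concatenating the forward and backward states is one common way to wire it; the Decoder's LSTM would then need 2 * latent_dim units to match):

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Concatenate

# Bidirectional Encoder: the forward and backward LSTMs each produce (h, c)
encoder_inputs_bi = Input(shape=(None, num_eng_tokens))
bi_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
_, fwd_h, fwd_c, bwd_h, bwd_c = bi_lstm(encoder_inputs_bi)
state_h = Concatenate()([fwd_h, bwd_h])
state_c = Concatenate()([fwd_c, bwd_c])
# The Decoder stays unidirectional and is initialized with [state_h, state_c]
```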

This lesson uses character-level tokenization for simplicity, but a better way is to use the word level. With an average of about 4.5 characters per word in English, the input sequence becomes roughly 4.5 times shorter, and shorter sequences are less likely to be forgotten.

However, word-level tokenization requires a large data set, and low-dimensional word vectors must be obtained with word embedding. The embedding layer has too many parameters to be trained well on a small data set; alternatively, the embedding layer can be pretrained.
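
For the word-level variant, the one-hot input would typically be replaced by an Embedding layer. A small sketch (vocab_size and embed_dim are made-up numbers; the Embedding weights could be loaded from pretrained word vectors and frozen when the data set is small):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM

vocab_size, embed_dim = 10000, 128      # assumed values for illustration
word_inputs = Input(shape=(None,), dtype="int32")                 # integer word ids
embedded = Embedding(vocab_size, embed_dim, mask_zero=True)(word_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(embedded)
# set trainable=False on the Embedding layer if loading pretrained word vectors
```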

Another approach is multi-task learning. For example, besides translating English into German, we can add a task of translating English back into English, which doubles the training data so the Encoder can be trained better. Many other tasks can be added as well, using data sets that translate English into other languages.

Since there is only one Encoder, it gets trained on many times more data. Although the Decoder itself is not improved, the overall translation quality still improves.

There is also a method called Attention, which is the most powerful of these improvements and helps machine translation a lot. We will talk about that next time.