
Multiple structures of RNN models

Depending on the forms of the input and output, models built with RNNs can be divided into several types of structures, and different structures suit different tasks.

As shown in the figure above, there are five types of structures:

  • One to one: the input is a single vector and the output is also a single vector. For example, the input is a picture and the output is a predicted category.
  • One to many: the input is a single vector and the output is a sequence of vectors. This structure can be used to describe the scene of an input image in words.
  • Many to one: the input is a sequence of vectors and the output is a single vector. This structure can be used in text sentiment analysis to judge whether the input text is negative or positive.
  • Many to many: the input is a sequence of vectors and the output is a sequence of vectors. This structure can be used in machine translation or chat-dialogue scenarios.
  • Synchronized many to many: similar to the previous structure, but it is the classic RNN structure in which the state at one step is carried to the next and every input has a corresponding output. We are most familiar with it from character prediction, but it can also be used for video classification and frame labeling; a minimal sketch of two of these structures is given below.
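
The sketch below is only an illustration of how two of these structures differ in code; it assumes TensorFlow/Keras, and the dimensions are made up, not taken from the article. The many-to-one and synchronized many-to-many cases differ only in whether the RNN returns an output at every time step:

```python
# Minimal sketch: many-to-one vs. synchronized many-to-many (hypothetical shapes).
from tensorflow.keras import layers, models

timesteps, feature_dim, num_classes = 20, 128, 10

# Many to one: read the whole sequence, output one vector (e.g. sentiment analysis).
many_to_one = models.Sequential([
    layers.Input(shape=(timesteps, feature_dim)),
    layers.LSTM(64),                       # only the final hidden state is returned
    layers.Dense(num_classes, activation="softmax"),
])

# Synchronized many to many: one output per input step (e.g. frame labeling).
many_to_many = models.Sequential([
    layers.Input(shape=(timesteps, feature_dim)),
    layers.LSTM(64, return_sequences=True),             # a hidden state at every step
    layers.TimeDistributed(layers.Dense(num_classes, activation="softmax")),
])
```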

Seq2Seq

Machine translation is a many-to-many problem. Take translating French into English as an example: the number of French words in the input is not fixed, and neither is the number of English words in the output. Although there are many machine translation models, the most classic is the Seq2Seq model, which is also a many-to-many structure.

The training process

The Seq2Seq model is mainly divided into two parts: an Encoder on the left and a Decoder on the right.

  • The Encoder is an LSTM (or another RNN variant) used to extract features from the input French sentence. Its output is the hidden state h and the cell state c of the LSTM at the final moment; the outputs at the other moments are not used, so they are not shown in the figure.

  • The Decoder is also an LSTM (or another RNN variant) and is used to generate the English translation. The Decoder works on basically the same principle as the text generation in the previous article, except that text generation there starts from an all-zero state, whereas the Decoder's initial state is the state h output by the Encoder at the final moment. The decoding process is as follows:

    A) By receiving the state h output by the Encoder at the final moment, the Decoder knows the content of the input text.

    B) Input the first word to start training. The first input to the Decoder must be the start symbol; we set it to “<START>” (any other string that does not appear in the current dictionary would also work). At this moment the Decoder has the initial state h and the current input “<START>”, and it outputs a probability distribution over predicted words. Since we know that the word at the next moment is “the”, we use it as the label, and we train the model so that the loss between the Decoder's current output and the label is as small as possible.

    C) With the loss function we can back-propagate the gradient: it flows from the loss function to the Decoder and then from the Decoder to the Encoder, and gradient descent is used to update the parameters of both the Encoder and the Decoder.

    D) Then input the second word to continue training. The input is now the two existing words “<START>” and “the”, and the Decoder outputs the probability distribution predicted for the second word. The next word “poor” is used as the label, and again we want the loss between the current Decoder output and the label to be as small as possible.

    E) Continue back-propagation and update the parameters of the Encoder and Decoder.

    F) Then input the third word; similarly, the input is now the three existing words “<START>”, “the” and “poor”, and the Decoder outputs the probability distribution predicted for the third word. We know that the word at the next moment is “don’t”, so we use it as the label and make the loss between the current Decoder output and the label as small as possible.

    G) Continue back-propagation and update the parameters of the Encoder and Decoder.

    H) Repeat this process until the last moment, when the entire sentence “The poor don’t have any money” has been taken as input. At this point there is no more content for the next moment, so we define the translation as finished and use “<END>” as the label, again making the loss between the current Decoder output and the label as small as possible.

    I) Continue back-propagation and update the parameters of the Encoder and Decoder.

    J) Use a large number of French–English sentence pairs to train this model. A minimal sketch of this training setup is given below.
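
The sketch below assumes TensorFlow/Keras; the vocabulary sizes, dimensions, and input names are made up for illustration and are not from the original article. It shows the teacher-forcing setup described in the steps above: the Decoder is fed the target sentence shifted right and is trained against the same sentence shifted left.

```python
# Minimal Seq2Seq training sketch with teacher forcing (hypothetical dimensions).
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, emb_dim, state_dim = 8000, 6000, 128, 256

# Encoder: consume the French sentence and keep only the final states h and c.
enc_inputs = layers.Input(shape=(None,), name="french_tokens")
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(state_dim, return_state=True)(enc_emb)

# Decoder: starts from the Encoder's states; during training it is fed the
# target sentence shifted right, i.e. "<START> the poor don't ..." (teacher forcing).
dec_inputs = layers.Input(shape=(None,), name="shifted_english_tokens")
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(state_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = models.Model([enc_inputs, dec_inputs], probs)
# The labels are the same English sentence shifted left, ending with "<END>",
# so at every step the loss compares the predicted distribution with the next word.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```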

The prediction process

Once the model is trained, we can use it to translate. Suppose we now have the French sentence “Les pauvres sont démunis”; the process of translating it into English is similar to the training process, as follows:

A) Input every word of the French sentence into the Encoder, accumulate the features into the state vector h of the final moment, and pass it to the Decoder as its initial state.

B) Predict the first word. The Decoder does the same work as the text generation described in the previous article. We input the initial state h and the start word “<START>” into the Decoder, and the Decoder outputs the state vector H0 at the current moment as well as a word probability distribution. At this point we can either choose the word with the highest probability or sample randomly according to the probabilities; either way, we now have the predicted word. If the model works well, the predicted word should be “the”; if the model is poor, another word may be predicted.

C) Predict the second word. We input the state vector H0 obtained at the previous moment and the predicted word “the” into the Decoder. The Decoder generates the current state vector H1 and a word probability distribution; we select the word with the highest probability or sample according to the probabilities as the current prediction. If the model works well it should predict “poor”; otherwise another word may be predicted.

D) Predict the third word. We input the state vector H1 obtained at the previous moment and the predicted word “poor” into the Decoder. The Decoder generates the current state vector H2 and a word probability distribution, and as above we select the word with the highest probability or sample according to the probabilities as the current prediction. If the model works well it should predict “don’t”; otherwise another word may be predicted.

E) Repeat the above process until finally we input the state vector H5 obtained at the previous moment and the predicted word “money” into the Decoder. The Decoder generates the current state vector H6 and a word probability distribution, and as above we select the word with the highest probability or sample according to the probabilities as the current prediction. If the model works well, the prediction should be “<END>”, which indicates the end of the translation process; “<END>” is only a symbol marking the end of the translation, so the final output is “The poor don’t have any money”. If the model does not work well, another word may be predicted and the translation continues. A sketch of this greedy decoding loop is given below.
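
The sketch below assumes the trained network has been split into an encoder_model (sentence → final states h, c) and a single-step decoder_model (previous word + states → word distribution and new states); these objects, and the word2id / id2word dictionaries, are hypothetical names introduced only for illustration.

```python
import numpy as np

def greedy_translate(encoder_model, decoder_model, src_ids, word2id, id2word, max_len=20):
    # Step A: run the French sentence through the Encoder to get the initial state [h, c].
    states = encoder_model.predict(np.array([src_ids]))

    # Step B onward: feed "<START>", then repeatedly feed the previous prediction.
    token = np.array([[word2id["<START>"]]])
    words = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([token] + list(states))
        next_id = int(np.argmax(probs[0, -1]))     # greedy: take the most probable word
        if id2word[next_id] == "<END>":            # "<END>" marks the end of translation
            break
        words.append(id2word[next_id])
        token = np.array([[next_id]])              # the prediction becomes the next input
        states = [h, c]                            # carry the state to the next step
    return " ".join(words)
```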

Encoder optimization tip 1

The main function of the Encoder is to extract features from the input. Sometimes the input is too long and the features are not extracted well. In this case we can use a bidirectional LSTM (Bi-LSTM) instead of a one-directional LSTM or RNN: such an Encoder extracts features of the input in both the forward and backward directions and accumulates them up to the final moment, so the features are richer.

This applies only to the Encoder; the Decoder can only translate one-directionally, from left to right.
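
A small sketch of such a bidirectional Encoder, assuming Keras and made-up dimensions: the forward and backward states are concatenated before being handed to the Decoder, so the Decoder's state size has to match.

```python
from tensorflow.keras import layers

# Bidirectional Encoder sketch (hypothetical dimensions).
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(8000, 128)(enc_inputs)
bi_lstm = layers.Bidirectional(layers.LSTM(256, return_state=True))
_, fwd_h, fwd_c, bwd_h, bwd_c = bi_lstm(enc_emb)      # forward and backward h / c
state_h = layers.Concatenate()([fwd_h, bwd_h])        # 512-dim state for the Decoder
state_c = layers.Concatenate()([fwd_c, bwd_c])
```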

Encoder optimization tip 2

In general, we use words rather than characters as the input, because the average word is about 4 characters long, so using words shortens the input sequence by nearly a factor of 4. The Encoder then has a shorter sequence to extract features from, and the earliest content is less likely to be forgotten.

However, if words are used as input and the vocabulary is small (fewer than about 100 words), one-hot vectors can be used to represent them. In practice there are usually thousands of commonly used words, in which case we should use word embedding to obtain low-dimensional word vectors.

The Embedding layer has a large number of parameters; if it is not trained on a large dataset it will overfit, so an alternative is to pre-train the Embedding layer.
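
A small sketch of initializing the Embedding layer with pre-trained word vectors, assuming Keras; the embedding_matrix here is only a random placeholder, whereas in practice it would be built from pre-trained vectors such as word2vec or GloVe.

```python
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.initializers import Constant

vocab_size, emb_dim = 8000, 128
embedding_matrix = np.random.normal(size=(vocab_size, emb_dim))  # placeholder for pre-trained vectors

embedding = layers.Embedding(
    vocab_size, emb_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,   # freeze the pre-trained vectors to avoid overfitting on a small dataset
)
```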

Encoder optimization tip 3

We can also improve the model by exploiting the correlation between translations into different languages, using multi-task learning. Here we only translate French into English, but we can also train French-to-German or French-to-Chinese tasks using the same Encoder: no matter how many tasks we add, the Encoder stays the same, while the amount of data it is trained on multiplies. After such training the Encoder has learned the language logic much better, so when it then performs the French-to-English translation task, the effect is significantly improved.
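
A minimal sketch of this multi-task setup, assuming Keras and made-up vocabulary sizes: the same Encoder states feed several Decoders, one per target language.

```python
from tensorflow.keras import layers, models

# Shared French Encoder (hypothetical dimensions).
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(8000, 128)(enc_in)
_, h, c = layers.LSTM(256, return_state=True)(enc_emb)

def make_decoder(vocab_size):
    # One Decoder head per target language, all initialized from the shared Encoder states.
    dec_in = layers.Input(shape=(None,))
    x = layers.Embedding(vocab_size, 128)(dec_in)
    x, _, _ = layers.LSTM(256, return_sequences=True, return_state=True)(
        x, initial_state=[h, c])
    return dec_in, layers.Dense(vocab_size, activation="softmax")(x)

en_in, en_out = make_decoder(6000)   # French -> English head
de_in, de_out = make_decoder(7000)   # French -> German head
model = models.Model([enc_in, en_in, de_in], [en_out, de_out])
```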

Encoder optimization tip 4

The biggest improvement to machine translation comes from the Attention mechanism, which we will discuss in detail in the next article.

Case

Here is a small case I implemented myself: translating a word into a word with the opposite meaning. The case is simple and is only meant to illustrate the idea of Seq2Seq; it has detailed comments, and if you find it helpful please leave a like: juejin.cn/post/694941…