
The shortcoming of Seq2Seq

Seq2Seq can be improved with many techniques, but it has one major defect: when the input sequence is too long, the final state vector hm can hardly retain the original content, or even some of the key content.

When Seq2Seq is used for machine translation, it works best when a sentence has around 20 words. Beyond 20 words, performance keeps declining, because the Encoder forgets part of the information. With Attention added to Seq2Seq, performance no longer degrades at 20 words or more.

Seq2Seq with Attention

Introducing the Attention mechanism into the Seq2Seq model greatly improves its performance, because at every decoding step the Decoder looks back at all of the features the Encoder extracted from the input. Attention also tells the Decoder which Encoder inputs and features to pay more attention to, which is where the name Attention comes from. This mechanism attends to the input in a way similar to humans: when we read a sentence, we focus directly on the key words instead of treating every character or word as equally important.

The only downside of Attention, which brings a huge performance boost, is that it requires a large amount of computation.

Attention principle

As shown in the figure, the left side is the Encoder process and the right side is the Decoder process. Both parts can use an RNN or one of its variants; here we use SimpleRNN to introduce the principle of Attention. The Encoder extracts the features of the input as usual and outputs a state vector hi at each time step, and the last state vector hm is used as the initial Decoder state s0. The Decoder process then runs as follows:
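To make the Encoder side concrete, here is a minimal NumPy sketch of a SimpleRNN encoder that produces all state vectors h1…hm and uses hm as s0. The sizes, names, and random weights are placeholders for illustration only; in a real model the weights would be trained.

```python
import numpy as np

def simple_rnn_encoder(X, Wx, Wh, b):
    """X: input sequence of shape (m, d); returns all states [h1..hm] and hm."""
    h = np.zeros(Wh.shape[0])               # initial hidden state
    states = []
    for x_t in X:                           # one step per input token
        h = np.tanh(Wx @ x_t + Wh @ h + b)  # SimpleRNN state update
        states.append(h)
    return np.stack(states), h              # (m, hidden) state matrix, and hm

m, d, hidden = 6, 8, 16                     # toy sizes (assumptions)
rng = np.random.default_rng(0)
X  = rng.normal(size=(m, d))
Wx = rng.normal(size=(hidden, d))
Wh = rng.normal(size=(hidden, hidden))
b  = np.zeros(hidden)

H, s0 = simple_rnn_encoder(X, Wx, Wh, b)    # H holds h1..hm, s0 = hm
```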

A) For the first decoding step, we compute a score between the Decoder state s0 and every Encoder state vector hi (the scoring methods are introduced below). Each state vector hi gets a score ai that measures how correlated hi is with s0. We then apply a Softmax transformation to all the scores [a1, a2 … am] to turn them into weights [a1, a2 … am]. With these weights and the corresponding state vectors [h1, h2 … hm] we compute the weighted average c0. Finally, we use x'1, c0 and s0 to compute s1 (a code sketch of this decode step appears after step C below). The formula is:

s1 = tanh( A * concat(x'1, c0, s0) + b)

c0 is a weighted sum of the Encoder state vectors from all time steps, so it carries the complete input information seen by the Encoder, which solves Seq2Seq's forgetting problem. Combined with the current input x'1 and the previous state s0, the state vector s1 at the current step can be computed.

B) The second decoding step is similar: we compute a score between the Decoder state s1 and every Encoder state vector hi. Each hi gets a score ai that measures how correlated hi is with s1. We then apply a Softmax transformation to all the scores [a1, a2 … am] to turn them into weights [a1, a2 … am]. With the new weights and the corresponding state vectors we compute the weighted average c1, and then use x'2, c1 and s1 to compute s2. The formula is:

s2 = tanh( A * concat(x'2, c1, s1) + b)

c1 is again a weighted sum of the Encoder state vectors from all time steps, so it carries the complete input information seen by the Encoder and avoids the forgetting problem. Combined with the current input x'2 and the previous state s1, the state vector s2 at the current step can be computed.

C) Repeat the above decoding process in the same way until the end of the sequence.
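Below is a minimal NumPy sketch of one such decode step, following the formula s_t = tanh(A * concat(x'_t, c, s_prev) + b). All names, shapes, and random values are illustrative assumptions; the scoring function is passed in as a placeholder (here a plain dot product), and the two trained scoring methods are sketched in the next section.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_decode_step(x_t, s_prev, H, score_fn, A, b):
    """x_t: current Decoder input, s_prev: previous Decoder state,
    H: all Encoder states (m, hidden), score_fn: scoring function."""
    scores = np.array([score_fn(h_i, s_prev) for h_i in H])  # one score per h_i
    alpha  = softmax(scores)                                 # weights a1..am
    c      = alpha @ H                                       # context vector (weighted average)
    s_new  = np.tanh(A @ np.concatenate([x_t, c, s_prev]) + b)
    return s_new, c, alpha

# toy usage with a simple dot-product score (placeholder choice)
hidden, d_in, m = 16, 8, 6
rng = np.random.default_rng(1)
H      = rng.normal(size=(m, hidden))
s_prev = rng.normal(size=hidden)
x_t    = rng.normal(size=d_in)
A      = rng.normal(size=(hidden, d_in + 2 * hidden))
b      = np.zeros(hidden)

s_new, c, alpha = attention_decode_step(x_t, s_prev, H, lambda h, s: h @ s, A, b)
```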

Two methods of weight calculation

In general, there are two ways to compute the weights between the Decoder state si and the Encoder state vectors hi.

The first is the method from the original paper, as shown in the figure below, using s0 as an example to compute weights against all Encoder state vectors hi. We concatenate hi with s0, multiply by a parameter matrix W, apply the nonlinear function tanh, and then multiply the result by a parameter vector vT to get ai. Since there are m inputs, the Encoder has m state vectors, so m scores need to be computed. Finally, [a1, a2 … am] is passed through Softmax to obtain the new weight parameters [a1, a2 … am]. W and vT here are parameters that need to be trained.
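A hedged NumPy sketch of this first scoring method, ai = vT * tanh(W * concat(hi, s0)). W and v are trainable in practice; here they are random placeholders chosen only to show the shapes.

```python
import numpy as np

def additive_scores(H, s, W, v):
    """H: (m, hidden) Encoder states, s: Decoder state; returns m raw scores."""
    return np.array([v @ np.tanh(W @ np.concatenate([h_i, s])) for h_i in H])

m, hidden, k = 6, 16, 32                      # toy sizes (assumptions)
rng = np.random.default_rng(2)
H  = rng.normal(size=(m, hidden))
s0 = rng.normal(size=hidden)
W  = rng.normal(size=(k, 2 * hidden))         # acts on concat(h_i, s0)
v  = rng.normal(size=k)

raw = additive_scores(H, s0, W, v)
weights = np.exp(raw - raw.max()) / np.exp(raw - raw.max()).sum()   # Softmax
```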

The second is the approach adopted by the Transformer model, as shown below, again using s0 and all Encoder state vectors hi as an example. We multiply WK by hi to get ki and multiply WQ by s0 to get q0, then take the inner product of kiT and q0 as the score ai. Since there are m inputs, the Encoder has m state vectors, so m scores need to be computed, and finally [a1, a2 … am] is passed through Softmax to obtain the new weight parameters [a1, a2 … am]. WK and WQ here are parameters that need to be trained.
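A hedged NumPy sketch of this second scoring method: ki = WK * hi, q0 = WQ * s0, ai = kiT * q0. WK and WQ are trainable in practice; random placeholders are used here only to show the shapes.

```python
import numpy as np

def dot_product_scores(H, s, WK, WQ):
    """H: (m, hidden) Encoder states, s: Decoder state; returns m raw scores."""
    K = H @ WK.T                 # each row is k_i = WK @ h_i
    q = WQ @ s                   # q_0
    return K @ q                 # inner products k_i^T q_0

m, hidden, dk = 6, 16, 24        # toy sizes (assumptions)
rng = np.random.default_rng(3)
H  = rng.normal(size=(m, hidden))
s0 = rng.normal(size=hidden)
WK = rng.normal(size=(dk, hidden))
WQ = rng.normal(size=(dk, hidden))

raw = dot_product_scores(H, s0, WK, WQ)
weights = np.exp(raw - raw.max()) / np.exp(raw - raw.max()).sum()   # Softmax
```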

Time complexity

Suppose the input length is m and the target length is t.

After the Attention mechanism is introduced, the Encoder produces m state vectors. In the Decoder process that follows, m weights are computed at each step, so after the Decoder runs t times a total of m*t weights have been computed. The time complexity is therefore O(m + m*t), i.e. O(m*t). Introducing Attention into Seq2Seq greatly improves performance and avoids the forgetting problem, but the cost is a huge amount of computation.

For Seq2Seq without the Attention mechanism, the Encoder only computes m state vectors and the Decoder decodes t times, so the time complexity is only O(m + t).
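A quick count makes the gap concrete; the numbers below are arbitrary toy values, not measurements.

```python
m, t = 30, 20                              # 30 source words, 20 target words (assumed)
with_attention    = m + m * t              # m Encoder states + m weights per decode step
without_attention = m + t                  # m Encoder states + t decode steps
print(with_attention, without_attention)   # 630 vs 50
```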

Weight visualization

Here is an example of translating English into French that intuitively explains, through visualization, what the weight parameters express. The thickness of the purple lines in the figure represents the size of the weights. When the Decoder translates the word zone, it computes a weight against each input in the Encoder. We can see that zone has a weight with every input word, but the weight with the word Area is clearly the largest; in other words, the word Area has the greatest influence on translating zone. In fact, the French word zone has a meaning close to the English word Area. This is where Attention gets its name. Similarly, when translating the French Européenne, special attention is paid to the English European.
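This kind of weight plot is easy to draw once the Softmax weights are available. The sketch below uses matplotlib with made-up word lists and a random weight matrix as placeholders; it is not the figure's real data, only a way such a visualization could be produced.

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["the", "area", "was", "european"]          # Encoder inputs (hypothetical)
tgt = ["la", "zone", "était", "européenne"]       # Decoder outputs (hypothetical)
weights = np.random.default_rng(4).random((len(tgt), len(src)))
weights /= weights.sum(axis=1, keepdims=True)     # each row sums to 1, like Softmax output

plt.imshow(weights, cmap="Purples")
plt.xticks(range(len(src)), src, rotation=45)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("Encoder input words")
plt.ylabel("Decoder output words")
plt.colorbar(label="attention weight")
plt.tight_layout()
plt.show()
```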

Case

I implemented an interesting small case that translates scribbled strings into English, with detailed comments and implementations of both weight-calculation methods; it is guaranteed to get you up to speed. Keeping up continuous posts is not easy, so if you find it helpful please leave a like: Juejin. Cn/post / 695060…