Portal: [NLP] Attention principle and source code analysis

Since the Attention mechanism was proposed, Seq2seq models equipped with attention have improved on every task, so today "Seq2seq model" usually refers to a model combining an RNN with Attention. For the underlying principle, please refer to the portal article above. Google then introduced the Transformer model to solve the sequence-to-sequence problem, replacing the LSTM with a pure attention structure and achieving better results on translation tasks. This article mainly walks through the paper "Attention Is All You Need". I still did not understand it on my first reading, so I hope my own interpretation helps you grasp the model faster.

1. Model structure

The model structure is as follows:


Like most Seq2seq models, the Transformer's structure is composed of an encoder and a decoder.

1.1 Encoder

The Encoder is made up of N = 6 identical layers; a layer is the unit on the left of the figure, and the "Nx" there means the block is stacked N (= 6) times. Each layer is composed of two sub-layers, namely a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalisation, so the output of a sub-layer can be expressed as:

LayerNorm(x + Sublayer(x))
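As a rough sketch (my own simplification, not the paper's code; a real layer norm also has learned gain and bias parameters), the residual-plus-normalisation wrapper around a sub-layer can be written as:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer normalisation: zero mean / unit variance per position
    # (the real LayerNorm also has a learned gain and bias).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_output(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection followed by layer normalisation.
    return layer_norm(x + sublayer(x))
```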

Next, the two sub-layers are explained in turn:

  • Multi-head self-attention

For those familiar with the principle of attention, it can be described as follows: an attention function maps a query Q and a set of key-value pairs (K, V) to an output, where the output is a weighted sum of the values and each weight is computed from the compatibility of the query with the corresponding key.

Multi-head attention projects Q, K and V through h different learned linear transformations, computes attention for each projection in parallel, and finally concatenates the results:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Self-attention simply takes Q, K and V to be the same (all coming from the output of the previous layer).

In addition, the attention here is computed as scaled dot-product attention, i.e.

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The author also mentions another attention function with similar theoretical complexity, additive attention. When d_k is small, the two perform similarly; when d_k is large, additive attention outperforms dot-product attention without scaling. However, dot-product attention is faster in practice, and scaling by 1/√d_k reduces the gap (for large d_k the dot products grow large and push the softmax into regions with extremely small gradients; see the footnote in the paper for details).
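To make the formulas above concrete, here is a minimal numpy sketch of scaled dot-product attention and multi-head attention. The function names and the way the per-head projection matrices are passed in (as lists `W_q`, `W_k`, `W_v` plus an output projection `W_o`) are my own illustration, not the paper's reference code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (len_q, len_k) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # Project Q, K, V with h different learned matrices, attend in parallel,
    # concatenate the heads and apply the output projection W_o.
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o
```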

  • Position-wise feed-forward networks

The second sub-layer is a fully connected feed-forward network. It is called position-wise because it is applied to the attention output at each position i separately and identically.
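The paper defines this sub-layer as FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal numpy sketch of applying it position-wise (the function name and argument layout are my own):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.
    # x: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model);
    # the paper uses d_model = 512 and d_ff = 2048.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```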

1.2 Decoder

The structure of the Decoder is similar to that of the Encoder, but with one extra attention sub-layer. Here we first define the input, output and decoding process of the Decoder:

  • Output: the probability distribution over output words at position i
  • Input: the output of the encoder & the decoder's output up to position i−1. So the attention in the middle is not self-attention: its K and V come from the encoder, while Q comes from the decoder's output at the previous position
  • Decoding: note that encoding can be computed in parallel, all at once, while decoding does not produce the whole sequence at once but one position at a time, like an RNN, because the output at the previous position is needed as the query for the attention

The figure at the top is easy to understand once the decoding process is clear. The main difference is that a mask is added to the newly added attention: because the decoder input during training is the ground truth, the mask ensures that the prediction at position i cannot access information from future positions.

The principle of masked attention is shown in the figure (combined with multi-head attention):
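Since the figure itself is not reproduced here, a minimal sketch of how such a mask can be built and used, reusing the `scaled_dot_product_attention` sketch above (the boolean-mask convention and the -1e9 fill value are implementation choices of mine, not prescribed by the paper):

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may only attend to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Usage with the attention sketch above, e.g. for a 4-token target sequence:
#   out = scaled_dot_product_attention(Q, K, V, mask=causal_mask(4))
```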


1.3 Positional Encoding

In addition to the main Encoder and Decoder, there is also a data preprocessing step. The Transformer abandons the RNN, and the biggest advantage of an RNN is precisely its ability to capture the sequential (time-series) structure of the data. The author therefore proposes Positional Encoding: the positional encoding is summed with the embedded input, adding relative position information to the data.

Here the author mentions two methods:

  1. Sine and cosine functions with different frequencies
  2. Learning a Positional embedding (reference)

After experiments, the two methods were found to give almost the same results, so the first method was finally chosen. The formula is as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
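A minimal numpy sketch of this encoding (the function name `positional_encoding` and the array layout are my own choices; it assumes d_model is even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even feature indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply summed with the token embeddings before the first layer:
#   x = token_embeddings + positional_encoding(seq_len, d_model)
```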

According to the author, there are two benefits of approach 1:

  1. For any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos). Recall the characteristics of trigonometric functions: sin(α + β) = sin α · cos β + cos α · sin β and cos(α + β) = cos α · cos β − sin α · sin β. The cos β and sin β terms depend only on the offset k, so PE(pos + k) is a linear combination of the components of PE(pos).

  2. A learned positional embedding, like a word vector, is limited by the dictionary size: you can only learn a fixed vector for each position that actually appears during training. Using the trigonometric formulas is obviously not limited by the sequence length, i.e. the model can represent sequences longer than any it encountered during training.


2. The advantages

The author mainly talks about the following three points (summarised in a table after this list):


  1. Total Complexity per layer

  2. Amount of computation that can be parallelized, as measured by the minimum number of sequential operations required

In other words, for a sequence x1, …, xn, self-attention can compute the output at every position directly in a single step, while an RNN has to compute sequentially from x1 all the way to xn.

  3. Path length between long-range dependencies in the network

Here path length refers to the number of steps a signal must travel through the network to relate any two positions in a sequence of length n. A CNN needs to stack convolutional layers to enlarge its receptive field; an RNN has to step through positions 1 to n one by one; self-attention needs only a single matrix operation. So self-attention handles long-range dependencies better than an RNN. Of course, if the amount of computation becomes too large, e.g. when the sequence length n exceeds the representation dimension d, self-attention can also be restricted to a local window to limit the number of comparisons.

  4. In addition, from the examples given by the author in the appendix, the self-attention model is also more interpretable: the distributions of the attention weights indicate that the model has learned some syntactic and semantic information.
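For reference, the comparison summarised in Table 1 of the paper, where n is the sequence length, d the representation dimension, k the convolution kernel size and r the neighbourhood size of restricted self-attention:

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| --- | --- | --- | --- |
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
| Self-Attention (restricted) | O(r · n · d) | O(1) | O(n / r) |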


3. The shortcomings

There are also two disadvantages, which are not mentioned in the original paper but were later pointed out in Universal Transformers:

  1. In practice: there are some problems that an RNN can solve easily but the Transformer cannot, such as copying a string, especially when the sequence is longer than the ones seen during training
  2. In theory: the Transformer is not computationally universal (not Turing-complete), because (in my view) it cannot implement a "while" loop

4. To summarize

The Transformer is one of the first models to rely purely on attention, achieving faster computation and better results on translation tasks. Google's production translation system is presumably based on it by now, but after asking one or two friends, the answer I got is that it mainly depends on the amount of data: with a large amount of data the Transformer is likely better, while with little data an RNN-based model may still be preferable.

That's all.

[References]:

  1. Attention Is All You Need
  2. Zhihu: Some questions about the Transformer model in Google's "Attention Is All You Need"
  3. Stack Overflow: positional embedding