Attention Is All You Need

Since the attention mechanism was proposed, Seq2Seq models with attention have improved results on a variety of tasks, so "Seq2Seq model" today usually means a model that combines an RNN with attention. Traditional RNN-based Seq2Seq models struggle with long sentences, parallelization, and alignment.

Therefore, the subsequent development of such models has mainly proceeded along three lines:

Input direction: unidirectional -> bidirectional

Depth: single layer -> multiple layers

Cell type: RNN -> LSTM / GRU

However, some constraints remain. The network must compress all the necessary information of the source sentence into a fixed-length vector, which makes it hard to cope with long sentences, especially ones longer than those in the training corpus. The output at each time step also depends on the output of the previous time step, so the model cannot be parallelized and is inefficient. Alignment remains a problem as well.

CNNs were later brought into deep NLP from computer vision. A CNN cannot directly handle variable-length sequences, but it does allow parallel computation. A Seq2Seq model built entirely on CNNs can run in parallel, yet it consumes a lot of memory, requires many tricks, and is hard to tune on large amounts of data.

The innovation of this paper is that it drops the CNN or RNN component that traditional encoder-decoder models are built around and uses attention alone. The main goal is to reduce the amount of computation and improve parallel efficiency without compromising the final results.

Model

1 Overall Framework

The overall framework is easy to understand, but the architecture figure in the paper looks complicated. Simplified:

On the left, the encoder reads the input; on the right, the decoder produces the output: a classic Seq2Seq layout.

At first glance at the block diagram in the paper, the question arises how the output of the encoder on the left is combined with the decoder on the right, since the decoder contains N stacked layers. Redrawn, it looks something like this:

That is, the Encoder's output is fed into every Decoder layer. Let's take one layer and show it in detail:

2 Attention Mechanism

2.1 Defining Attention

Attention computes a "degree of relevance". For example, during translation, each English word depends on the Chinese source words to different degrees. Attention can be described as mapping a query Q and a set of key-value pairs to an output, where the query, keys, and values are all vectors. The output is a weighted sum of the values in V, with each weight computed from the query and the corresponding key. The computation has three steps:

Step 1: Calculate the similarity between Q and each key K, denoted f:

Step 2: Apply a softmax to the similarities to normalize them into weights:

Step 3: Use the resulting weights to take a weighted sum over all values in V, which gives the attention vector:
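Putting the three steps together (using $s_i$ for the raw similarity and $a_i$ for the normalized weight; this notation is shorthand introduced here, not taken from the paper):

$$
s_i = f(Q, K_i), \qquad
a_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}, \qquad
\mathrm{Attention}(Q, K, V) = \sum_i a_i V_i
$$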

Note: The similarity f in the first step can be computed in any of the following four ways (written out after the list):

Dot product:

General weights:

Concat:

Perceptron:
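One common way to write these four variants (the matrices $W$, $W_a$, $U_a$ and the vector $v_a$ are learned parameters; the exact notation varies between papers and is assumed here):

$$
f(Q, K_i) =
\begin{cases}
Q^{\top} K_i & \text{dot product} \\
Q^{\top} W K_i & \text{general weights} \\
W\,[Q; K_i] & \text{concat} \\
v_a^{\top} \tanh(W_a Q + U_a K_i) & \text{perceptron}
\end{cases}
$$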

In the paper, attention is instantiated in two concrete forms, called Scaled Dot-Product Attention and Multi-Head Attention.

2.2 Scaled Dot-Product Attention

Its structure is shown as follows:

First Step

To understand where Q, K, and V come from in scaled dot-product attention: given an input X, we obtain Q, K, and V through three linear transformations of X.

Take two words, Thinking and Machines. The embedding layer turns them into two [1 x 4] vectors X1 and X2. Multiplying by the three [4 x 3] matrices Wq, Wk, and Wv yields six [1 x 3] vectors: {q1, q2}, {k1, k2}, {v1, v2}.
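A minimal numpy sketch of this projection step; the shapes match the example above, but the embeddings and weight matrices are random placeholders rather than the numbers from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two word embeddings (Thinking, Machines), each 1 x 4, stacked into X: [2 x 4]
X = rng.normal(size=(2, 4))

# Three learned projection matrices, each [4 x 3]
W_q = rng.normal(size=(4, 3))
W_k = rng.normal(size=(4, 3))
W_v = rng.normal(size=(4, 3))

# Linear transformations give Q, K, V, each [2 x 3]: rows are {q1, q2}, {k1, k2}, {v1, v2}
Q = X @ W_q
K = X @ W_k
V = X @ W_v
print(Q.shape, K.shape, V.shape)  # (2, 3) (2, 3) (2, 3)
```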

Second Step

The dot product of the vectors {q1,k1} gives a Score of 112, and the dot product of {q1,k2} gives a Score of 96.

Third and Fourth Steps

Scale the scores by dividing by 8, the square root of d_k = 64 used in the paper; the paper explains that this keeps gradients more stable. The scaled scores [14, 12] are then passed through a softmax to get the ratios [0.88, 0.12].

Fifth Step

Multiply the ratios [0.88, 0.12] by the values [v1, v2] and sum the weighted vectors to get z1, the output of this layer for "Thinking". In other words, Q and K are used to compute how much "Thinking" attends to "Thinking" and to "Machines", and those weights are applied to the corresponding V vectors.

Matrix representation

The previous example used individual vectors; here is the matrix version. The input is a [2 x 4] matrix of word embeddings, and each projection is a [4 x 3] matrix, producing Q, K, and V.

Q is multiplied by the transpose of K and divided by the square root of d_k; a softmax turns each row into weights that sum to 1, and the result is multiplied by V to get the output Z. Each row of Z has therefore taken the surrounding words (e.g., "Machines") into account.
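In formula form, this is the scaled dot-product attention defined in the paper:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$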

Looking at this formula, QK^T forms a word-to-word attention map (after the softmax, each row's weights sum to 1). For example, if the input is "I have a dream", four words in total, this yields a 4 x 4 attention map:

In this way, each word has a weight for every other word.

In the encoder this is called self-attention; in the decoder it is called masked self-attention.

“Masked” means that the future information is not shown to the model during language modelling (or translation).

Masking zeroes out the (gray) area above the diagonal, so that the model does not see future information.

Specifically, "I", as the first word, can only attend to "I". "Have", the second word, attends to "I" and "have". "A", the third word, attends to "I", "have", and "a". Only the last word, "dream", attends to the whole sentence.

So the softmax output looks like a lower-triangular map, with each row's weights summing to 1.
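A small numpy sketch of scaled dot-product attention with an optional causal mask; the four rows stand in for the tokens of "I have a dream", and all values are random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal (lower-triangular) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # [n x n] word-to-word score map
    if causal:
        # Block attention to future positions by setting their scores to -inf
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ V, weights

# Four tokens ("I have a dream"); d_k = 8 here is just for illustration
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))

_, w = scaled_dot_product_attention(Q, K, V, causal=True)
print(np.round(w, 2))   # lower-triangular 4 x 4 map: "I" attends only to "I", etc.
```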

2.3 Multi-Head Attention

Multi-head attention performs the scaled dot-product attention process h times and then combines the outputs Z. In the paper, its structure is shown as follows:

Walking through the figure above:

We repeat the same operation eight times (h = 8) and get eight Zi matrices.

To bring the output back to the input dimension, the eight Zi are concatenated and multiplied by a linear projection W0 to get the final Z.
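A compact numpy sketch of multi-head attention under the paper's settings (h = 8, d_model = 512, d_k = 64); the weight matrices are random placeholders, and W_O plays the role of the W0 mentioned above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, h=8):
    """Run scaled dot-product attention h times with separate projections, concat, apply W_O."""
    d_k = X.shape[-1] // h
    heads = []
    for W_q, W_k, W_v in params["heads"]:           # one (W_q, W_k, W_v) triple per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v         # each [n x d_k]
        Z_i = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaled dot-product attention
        heads.append(Z_i)
    Z = np.concatenate(heads, axis=-1)              # [n x (h * d_k)] = [n x d_model]
    return Z @ params["W_O"]                        # final linear projection back to d_model

# Toy setup: 2 tokens, d_model = 512, h = 8 heads (all weights are random placeholders)
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
params = {
    "heads": [(rng.normal(size=(d_model, d_k)),
               rng.normal(size=(d_model, d_k)),
               rng.normal(size=(d_model, d_k))) for _ in range(h)],
    "W_O": rng.normal(size=(h * d_k, d_model)),
}
X = rng.normal(size=(2, d_model))
print(multi_head_attention(X, params, h).shape)     # (2, 512)
```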

3 Transformer Architecture

Most sequence processing models use an encoder-decoder structure, in which the encoder maps an input sequence (x1, ..., xn) to a continuous representation z = (z1, ..., zn), and the decoder then generates an output sequence (y1, ..., ym) one element at a time. The framework diagram shows that the Transformer keeps this structure.

3.1 Position Embedding

Because the model contains no recurrence or convolution, it cannot capture the order of the sequence by itself: for example, if the rows of K and V are shuffled, the result after attention is the same. However, order information is very important and reflects the global structure, so the relative or absolute position information of the sequence must be injected.

The position embedding of each token has the same dimension d_model as the word embedding, and the input embedding and position embedding are summed to form the final embedding that is fed into the encoder/decoder. The position embedding is computed as follows:
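The sinusoidal formula from the paper:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$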

pos is the position index and i is the dimension index.

Position embedding by itself encodes absolute position, but relative position is also very important in language. An important reason why Google chose the formula above is that we have:
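The key fact is the pair of trigonometric addition identities:

$$
\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta, \qquad
\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta
$$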

This shows that the vector at position p + k can be represented as a linear transformation of the vector at position p, which opens the possibility of expressing relative position information.

Other NLP papers also use position embeddings, usually as trained vectors, but there they are only extra features: having the information helps, yet leaving it out does not hurt much, because RNNs and CNNs can capture position information on their own. In the Transformer, position embedding is the only source of position information, so it is a core component of the model rather than an auxiliary feature.

3.2 Position-wise Feed-Forward Networks

After the attention operation, each layer of the encoder and decoder contains a fully connected feed-forward network that applies the same transformation to every position vector: two linear transformations with a ReLU activation in between:
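As given in the paper:

$$
\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2
$$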

The parameters differ from layer to layer, although within a layer the same transformation is applied at every position.

3.3 Encoder

The Encoder has N = 6 layers, each containing two sub-layers:

The first sub-layer is multi-head self-attention, which computes self-attention over the input

The second sub-layer is a simple position-wise fully connected feed-forward network.

Each sub-layer is wrapped in a residual connection, so the output of each sub-layer is:
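Following the paper:

$$
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
$$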

Sublayer(x) denotes the function the sub-layer applies to its input x. To make these residual connections possible, all sub-layers and the embedding layers produce outputs of the same dimension, d_model = 512.

3.4 Decoder

The Decoder also has N = 6 layers, each containing three sub-layers:

The first sub-layer is masked multi-head self-attention, which again computes self-attention over the input. Because decoding is a generative process, at step i only the outputs of steps before i are available, so a mask is applied; this is the Mask operation described earlier.

The second sub-layer is multi-head attention over the encoder output, which lets every decoder position attend to the whole input sequence.

The third sub-layer is a fully connected feed-forward network, the same as in the Encoder.

3.5 The Final Linear and Softmax Layer

Take the output of the decoder stack as input, project it through a linear layer to the vocabulary size, and apply a softmax to predict the next word.
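A minimal sketch of this final projection (the vocabulary size and all weights are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10000

# Decoder stack output for the current position: a single d_model vector (placeholder values)
decoder_output = rng.normal(size=(d_model,))

# Linear layer projects to vocabulary logits; softmax turns them into word probabilities
W_proj = rng.normal(size=(d_model, vocab_size)) * 0.02
probs = softmax(decoder_output @ W_proj)
next_word_id = int(np.argmax(probs))   # greedy choice of the next word
print(probs.shape, next_word_id)
```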

3.6 The Decoder Side

Decoding proceeds step by step: at each step the decoder takes the encoder output together with the words generated so far and produces the next word, until an end-of-sequence symbol is emitted.

4 Experiment

As the results table in the paper shows, the Transformer achieves state-of-the-art results while using the least training resources.

Several of the model's hyperparameters were also varied one at a time to see which ones have the greatest impact on performance.