1. What is the Transformer?

Attention Is All You Need is a Google paper that takes attention to the extreme. It proposes a new model, the Transformer, which abandons the CNNs and RNNs used in earlier deep learning tasks. The popular BERT is also built on the Transformer, which is widely used in NLP tasks such as machine translation, question answering, text summarization, and speech recognition.

2. The structure of the Transformer

2.1 Overall Structure

Like the Attention model, the Transformer uses an Encoder-Decoder architecture, but its structure is more complex than plain Attention. In the paper, the Encoder stack consists of 6 Encoder layers, and the Decoder stack likewise consists of 6 Decoder layers.

For those of you unfamiliar with the Attention model, please refer back to the previous article on Attention

The internal structure of each Encoder and decoder is shown below:

  • The Encoder contains two sub-layers: a self-attention layer and a feed-forward neural network. Self-attention lets the current node attend not only to the current word but also to the surrounding words, so that it captures the semantics of the context.
  • The Decoder contains the same two sub-layers as the Encoder, but with an encoder-decoder attention layer in between that helps the current node focus on the parts of the input it needs to pay attention to.

2.2 Encoder layer structure

The model first performs an embedding operation on the input (combined with the positional encoding described below). The result is passed into the Encoder layer: self-attention processes the data and hands it to the feed-forward neural network, whose computation can be done in parallel, and the resulting output is fed into the next Encoder.

2.2.1 Positional Encoding

The Transformer, unlike recurrent sequence models, has no built-in way to represent the order of words in the input sequence. To solve this, the Transformer adds an extra vector, the Positional Encoding, to the input of the Encoder and Decoder layers. Its dimension is the same as that of the embedding, and it follows a specific pattern that lets the model learn positional information: it reflects the position of the current word and the distance between different words in the sentence. There are many ways to compute this position vector; the paper computes it as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where pos is the position of the current word in the sentence and i is the index of each dimension of the vector. As can be seen, sine is used for the even dimensions and cosine for the odd dimensions.

Finally, this Positional Encoding is added to the embedding, and the sum serves as the input to the next layer.
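
To make the formula above concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding; the function name and the use of NumPy are illustrative choices, not taken from the paper's reference code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the sinusoidal positional encoding matrix of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, np.newaxis]            # (max_len, 1) word positions
    i = np.arange(d_model)[np.newaxis, :]               # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                # sine on even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])                # cosine on odd dimensions
    return pe

# The encoding is simply added to the word embeddings:
# embeddings = embeddings + positional_encoding(seq_len, 512)
```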

2.2.2 Self-Attention

The idea of self-attention is similar to that of attention: it is how the Transformer folds its "understanding" of other relevant words into the word currently being processed. Here is an example:

The animal didn’t cross the street because it was too tired

It is easy for us to determine that "it" here refers to the animal rather than the street, but it is hard for a machine. Self-attention lets the machine associate "it" with "animal". Let's walk through the detailed process.

  1. First, self-attention computes three new vectors for each word (in the paper, the embedding dimension is 512). We call these three vectors the Query, Key, and Value. They are obtained by multiplying the embedding vector by three randomly initialized matrices; note that the second dimension of each matrix must match the embedding dimension. These matrices are updated continuously during back-propagation, and the resulting three vectors have dimension 64.

  2. Calculate the self-attention score, which determines how much attention we pay to the rest of the input sentence when encoding a word at a given position. The score is computed from the Query and Key: for example, to score the word "Thinking", we first take the dot product of its Query with its own Key (q1·k1), and then with the second word's Key (q1·k2).

  3. Next, divide each score by a constant, here 8, which is the square root of the Key dimension mentioned above (√64 = 8); other values could also be used. Then apply a softmax to the results, which tells us how relevant each word is to the word at the current position; the word itself will of course be highly relevant.

  4. Finally, multiply each Value vector by its softmax weight and sum them up. The result is the output of self-attention at the current node.

In practice, to speed up the computation, we use matrices directly: multiply the embedding matrix by the three weight matrices to obtain the Query, Key, and Value matrices, multiply Q by the transpose of K, divide by the scaling constant, apply the softmax, and finally multiply by the V matrix.

This method of determining the weight distribution of value by the similarity of query and key is called scaled dot-product attention.
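
Steps 2–4 above can be written compactly in matrix form. Below is a minimal NumPy sketch of scaled dot-product attention; the helper names are illustrative and this is not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) matrices. Returns the attention output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Q times K^T, divided by sqrt(d_k) (= 8 when d_k = 64)
    if mask is not None:
        scores = scores + mask          # masked positions hold a large negative value
    weights = softmax(scores, axis=-1)  # how relevant each word is to the current position
    return weights @ V                  # weighted sum of the Value vectors

# Q, K, V are obtained by multiplying the embedding matrix X with the weight matrices:
# Q, K, V = X @ W_Q, X @ W_K, X @ W_V   (X: (seq_len, 512), W_*: (512, 64))
```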

2.2.3 Multi-Headed Attention

What is really neat about this paper is that it adds another mechanism on top of self-attention, called "multi-headed" attention, which is simple to understand: instead of initializing just one set of Q, K, and V matrices, we initialize several sets. The Transformer uses 8 heads, so we end up with 8 output matrices.

2.2.4 Layer normalization

In the Transformer, each sub-layer (self-attention, feed-forward neural network) is wrapped with a residual connection and followed by layer normalization.

There are many kinds of normalization, but they all share the same goal: transform the input into data with zero mean and unit variance. We normalize the data before feeding it into the activation function because we do not want the inputs to fall into the saturated region of the activation function.

Batch Normalization

The main idea of BN is to normalize each batch of data at each layer. We may normalize the input data, but after it passes through a network layer the data is no longer normalized. As this effect accumulates, the deviation of the data grows larger and larger, and back-propagation has to account for these large deviations, which forces us to use a smaller learning rate to prevent vanishing or exploding gradients. Concretely, BN normalizes each mini-batch of data along the batch dimension.

Layer normalization

It is also a way of normalizing the data, but LN computes the mean and variance over each individual sample (its feature dimension), whereas BN computes them over the batch dimension. The formula is as follows:

$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2, \qquad \mathrm{LN}(x) = g \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + b$$

where H is the number of hidden units in the layer, and g and b are learned gain and bias parameters.

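As a concrete illustration of the difference (my own sketch, for illustration only), the following NumPy code normalizes over the batch axis for BN and over each sample's own feature axis for LN:

```python
import numpy as np

def batch_norm(x, eps=1e-6):
    """x: (batch, features). Normalize each feature across the batch dimension (BN)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """x: (batch, features). Normalize each sample across its own feature dimension (LN)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

In the Transformer, each sub-layer output is then combined with the residual connection, roughly as layer_norm(x + sublayer(x)).
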
2.2.5 Feed Forward Neural Network

This leaves us with a small problem: the feed-forward neural network cannot take eight matrices as input, so we need a way to reduce the eight matrices to one. First we concatenate the eight matrices into one big matrix, then we multiply it by an additional randomly initialized weight matrix, which gives us the final matrix.
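
Putting Sections 2.2.3 and 2.2.5 together, here is a minimal sketch of how the eight heads are computed, concatenated, and projected back. It reuses the scaled_dot_product_attention sketch from Section 2.2.2; the weight shapes (8 heads of dimension 64 for a 512-dimensional model) follow the paper, while the function and variable names are my own.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=8):
    """X: (seq_len, d_model). W_Q/W_K/W_V: lists of num_heads matrices of shape (d_model, 64).
    W_O: (num_heads * 64, d_model), the extra matrix applied to the concatenated heads."""
    heads = []
    for h in range(num_heads):
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]        # one set of Q, K, V per head
        heads.append(scaled_dot_product_attention(Q, K, V))  # (seq_len, 64) per head
    concat = np.concatenate(heads, axis=-1)                  # join the 8 matrices into one big matrix
    return concat @ W_O                                      # project back to (seq_len, d_model)
```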

2.3 Decoder layer structure

As you can see from the overall structure diagram above, the Decoder part is basically the same as the Encoder part. It also starts by adding a Positional Encoding vector, as in Section 2.2.1. Next comes masked multi-head attention, which is another key technique in the Transformer and is introduced below.

The other layers are the same as Encoder, please refer to Encoder layer structure.

2.3.1 Masked Multi-Head Attention

A mask hides certain values so that they have no effect when the parameters are updated. The Transformer involves two kinds of masks: the padding mask and the sequence mask. The padding mask is used in every scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention.

  1. Padding mask

    What is a padding mask? The input sequences in a batch have different lengths, so we need to align them; specifically, we pad the shorter sequences with zeros. If an input sequence is too long, we keep the left part and discard the excess. Because these padded positions carry no meaning, the attention mechanism should not focus on them, so we need some special handling.

    To do this, we add a very large negative number (effectively negative infinity) to the scores at these positions, so that after the softmax their probabilities approach 0.

    Our padding mask is actually a Boolean tensor, where the positions with the value False are the ones we need to mask out.

  2. Sequence mask

    As mentioned earlier, the sequence mask prevents the decoder from seeing future information. In other words, for a sequence, the decoding output at time step t should depend only on the outputs before t, not on the outputs after t. So we need a way to hide the information after t.

    How is that done? It is also very simple: generate a matrix whose upper triangle (above the diagonal) is entirely masked out, and apply it to each sequence; the future positions are then hidden.

  • For the decoder's self-attention, the scaled dot-product attention needs both the padding mask and the sequence mask as attn_mask; in the implementation, the two masks are added together to form attn_mask (see the sketch below).
  • In all other cases, attn_mask is simply the padding mask.
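
Here is a minimal NumPy sketch of both masks, written as additive masks (a large negative number at the masked positions, as described above); the helper names and the pad_id convention are illustrative assumptions, not the reference implementation.

```python
import numpy as np

NEG_INF = -1e9  # "negative infinity": softmax sends these positions to ~0

def padding_mask(seq, pad_id=0):
    """seq: (batch, seq_len) token ids. Returns a (batch, 1, seq_len) additive mask."""
    mask = (seq == pad_id).astype(np.float32) * NEG_INF
    return mask[:, np.newaxis, :]          # broadcast over the query positions

def sequence_mask(seq_len):
    """Additive mask that hides future positions (used in decoder self-attention)."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s strictly above the diagonal
    return future * NEG_INF                              # large negative value on future positions

# Decoder self-attention: attn_mask = padding_mask(tgt) + sequence_mask(tgt.shape[1])
# All other attention layers: attn_mask = padding_mask(src)
```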

2.3.2 The Output Layer

After the decoder stack has finished, how do we map the resulting vector to the words we need? It is very simple: just add a fully connected layer and a softmax layer at the end. If our vocabulary contains 10,000 words, the final softmax outputs a probability for each of the 10,000 words, and the word with the highest probability is our final result.
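
A minimal sketch of this final projection (the vocabulary size of 10,000 follows the example above; the names and shapes are illustrative):

```python
import numpy as np

def output_layer(decoder_out, W_vocab, b_vocab):
    """decoder_out: (seq_len, d_model). W_vocab: (d_model, vocab_size), e.g. vocab_size = 10000."""
    logits = decoder_out @ W_vocab + b_vocab                  # fully connected layer
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)             # softmax over the vocabulary
    return probs.argmax(axis=-1)                              # id of the most probable word
```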

2.4 Dynamic Flow Chart

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors containing K (the key vectors) and V (the value vectors); this step can be parallelized. These vectors are used by each decoder in its own encoder-decoder attention layer, which helps the decoder focus on the appropriate positions in the input sequence:

After the encoding phase is complete, the decoding phase begins. Each step of the decoding phase outputs one element of the output sequence (in this example, the English translation).

This process repeats until a special termination symbol is produced, indicating that the Transformer decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders pass their decoding results upward just as the encoders did.
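
The loop described above can be sketched as greedy decoding; encode, decode_step, the BOS/EOS ids, and max_len below are placeholder names for illustration, not part of any specific implementation.

```python
def greedy_decode(encode, decode_step, src_ids, bos_id, eos_id, max_len=50):
    """encode: returns the encoder memory (used as K and V by the decoder).
    decode_step: given the memory and the tokens produced so far, returns the next token id."""
    memory = encode(src_ids)                  # run the encoder once over the whole input
    out = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(memory, out)    # each step attends to the encoder's K and V
        out.append(next_id)
        if next_id == eos_id:                 # stop at the special termination symbol
            break
    return out[1:]
```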

3. Why does the Transformer need multi-head attention?

According to the original paper, the reason for multi-head attention is to split the model into multiple heads that form multiple subspaces, so that the model can attend to different aspects of the information and finally combine them. Intuitively, if you were designing such a model yourself, you would not compute attention only once: combining the results of several attention computations at least strengthens the model, and it is similar to using multiple convolutional kernels at the same time in a CNN. Intuitively, multiple heads help the network capture richer features and information.

4. What are the advantages of Transformer over RNN/LSTM? Why is that?

  1. RNN series models have poor parallel computing capability. The problem lies in their sequential dependence: the computation at time step t depends on the hidden-state result at time step t-1, which in turn depends on the result at time step t-2, and so on.

  2. Transformer has better feature extraction capability than RNN series models.

    Transformer: A comparison of three feature extractors for Natural Language Processing (CNN/RNN/TF)

    However, it is worth noting that the Transformer cannot completely replace RNN series models. Every model has its own scope of application, and for many tasks RNN series models are still the better choice. The key is to quickly analyze which model fits the task and how to use it well.

5. Why can the Transformer replace seq2seq?

Seq2seq's weakness: the biggest problem with seq2seq is that it compresses all the information on the Encoder side into a fixed-length vector and uses it as the Decoder's first hidden-state input to predict the first word (token) on the Decoder side. When the input sequence is long, this obviously loses a lot of the Encoder-side information; moreover, the fixed vector is handed to the Decoder all at once, so the Decoder cannot focus on the information it actually needs.

The Transformer's advantages: the Transformer not only improves on both of these points of the seq2seq model, but also introduces the self-attention module, which lets the source sequence and the target sequence first "relate to themselves". The embedding representations of the source and target sequences therefore contain richer information, and the subsequent FFN layer further enhances the expressive power of the model. Moreover, the Transformer's parallel computing capability is far better than that of seq2seq-style models. This is why I think the Transformer is superior to the seq2seq model.

6. Code implementation

Address: github.com/Kyubyong/tr…

Code interpretation: Transformer parsing and Tensorflow code interpretation


7. References

  • Transformer model details
  • Illustrated Transformer (full version)
  • Record several questions about Transformer

Author: @mantchs

GitHub: github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]