### 1 Introduction

Dear friends, welcome to Moon Lai Inn. Today we're talking about a paper called “Attention Is All You Need”, published by Google in 2017. Plenty of analyses of this paper are already available online, but a good meal is never too late, so here I simply share my own understanding of it and how I use it. The paper will be covered in about seven articles: ① the ideas and principles of the Transformer's multi-head attention mechanism; ② the Transformer's positional encoding and its encoding-decoding process; ③ the Transformer's network structure and the implementation of the self-attention mechanism; ④ the implementation of the Transformer; ⑤ a translation model based on the Transformer; ⑥ a text classification model based on the Transformer; ⑦ a couplet generation model based on the Transformer.

I hope this series of seven articles gives you a clear understanding of the Transformer. Now, let's formally begin our reading of the paper. Reply “paper” in the official account backend to get the download link!

### 2 Motivation

#### 2.1 The Problem

As we always do when reading a paper, let's first look at why the authors proposed the Transformer model. What problem does it need to solve? What are the drawbacks of the existing models?

In the abstract, the authors note that the mainstream sequence models at the time were all Encoder-Decoder models built on complex recurrent or convolutional neural networks, and that even the best-performing sequence models were Encoder-Decoder architectures combined with an attention mechanism. Why do the authors keep referring to these traditional Encoder-Decoder models? Because, as the introduction goes on to explain, in a traditional (recurrent) Encoder-Decoder architecture the computation at each time step depends on the output of the previous time step, and this inherent property prevents the model from computing in parallel, as shown in Figure 1.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

Figure 1. Encoding diagram of a recurrent neural network

The authors then state that although recent work has improved the computational efficiency of traditional recurrent neural networks, the fundamental problem remains unsolved.
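To make the sequential dependency concrete, here is a minimal NumPy sketch of a plain recurrent encoder. It is only an illustration, not code from the paper: the sizes, the `tanh` update rule, and the names `W_x` and `W_h` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 4, 3            # hypothetical sizes
X = rng.normal(size=(seq_len, d_in))         # one input vector per time step
W_x = rng.normal(size=(d_in, d_hidden))      # input-to-hidden weights (assumed)
W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights (assumed)

h = np.zeros(d_hidden)                       # initial hidden state
hidden_states = []
for t in range(seq_len):
    # Step t needs the hidden state produced by step t-1,
    # so the iterations of this loop cannot run in parallel.
    h = np.tanh(X[t] @ W_x + h @ W_h)
    hidden_states.append(h)

H = np.stack(hidden_states)                  # shape: (seq_len, d_hidden)
```

The loop over `t` is the bottleneck: no matter how fast each step is, the steps within one sequence have to be executed one after another.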

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

#### 2.2 The Solution

To solve this problem, the authors propose a new architecture, the Transformer. Its appeal is that it does away with the traditional recurrent structure altogether and instead computes the hidden representations of the model's inputs and outputs using only an attention mechanism, known as self-attention.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

In general, the self-attention mechanism directly computes, through some operation, an attention weight for every position of the sentence during encoding, and then computes the hidden vector representation of the whole sentence as a weighted sum. The Transformer architecture is an Encoder-Decoder model built on this self-attention mechanism.

### 3 Technical Approach

Having covered the background of the paper, let's first see what the self-attention mechanism really looks like, and then explore the overall network architecture.

#### 3.1 Self-Attention

The first thing to realize is that this self-attention mechanism is what the paper calls “Scaled Dot-Product Attention.” In the paper, the authors describe an attention function as mapping a query and a set of key-value pairs to an output, where the output is a weighted sum of the values and the weights are computed from the query and the keys.

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

However, to fully understand the meaning of query, key and value, you need to look at them together with the decoding process of the Transformer, which will be explained in a later article. The structure of the self-attention mechanism is shown in Figure 2.

Figure 2. Structure of the self-attention mechanism

As Figure 2 shows, the core of the self-attention mechanism is to compute the attention weights from Q and K, and then apply them to V to obtain the weighted sum, which is the output. Specifically, for inputs Q, K and V, the output is calculated as follows:

$$ \text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V\; \; \; \; \; (1) $$

Here Q, K and V are three matrices whose (second) dimensions are $d_q$, $d_k$ and $d_v$ respectively (the product $QK^T$ requires $d_q = d_k$, as the calculation below also shows). Dividing by $\sqrt{d_k}$ in formula $(1)$ is the Scale step shown in Figure 2.

The reason for scaling is that, as the authors suspect, for large $d_k$ the entries of $QK^T$ grow large in magnitude, which pushes the softmax into regions where its gradient is extremely small and makes the network harder to train.

We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
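Formula $(1)$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `softmax` and `scaled_dot_product_attention` and the random inputs are assumptions, and the optional Mask step from Figure 2 is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys), the Scale step
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the rows of V

# Hypothetical inputs: 3 positions, d_q = d_k = 4, d_v = 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Note that the output has one row per query and $d_v$ columns, while the weight matrix is square here only because the number of queries equals the number of keys.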
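The effect of the scaling can be checked numerically. The sketch below assumes, as an idealization, that the components of each query and key are independent with mean 0 and variance 1; under that assumption the dot product has variance $d_k$, so its typical magnitude grows like $\sqrt{d_k}$. The function name and sample size are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    # Draw random queries and keys with unit-variance components (assumption).
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)                       # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()   # raw vs. scaled spread

raw_small, scaled_small = dot_product_std(4)
raw_large, scaled_large = dot_product_std(512)
print(f"d_k=4:   raw std {raw_small:.2f}, scaled std {scaled_small:.2f}")
print(f"d_k=512: raw std {raw_large:.2f}, scaled std {scaled_large:.2f}")
```

Without scaling, the spread of the scores grows with $d_k$, so the softmax becomes increasingly peaked; after dividing by $\sqrt{d_k}$ the spread stays close to 1 regardless of the dimension.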

If we only look at the structure in Figure 2 and the calculation in formula $(1)$, the meaning of the self-attention mechanism is not easy to grasp. For example, one of the most puzzling questions for beginners is: where do Q, K and V in Figure 2 come from? Let's look at a concrete example. Suppose the input sequence is “Who am I” (three tokens, in the source order “I”, “am”, “who”) and that it has somehow been represented as a matrix of shape $3\times4$. Then Q, K and V can be calculated by the process shown in Figure 3 [2].

As the calculation in Figure 3 shows, Q, K and V are obtained by multiplying the input X by three different matrices. (This holds only where the Encoder and the Decoder each apply self-attention to their own inputs; the Q, K and V used in the Encoder-Decoder interaction are a separate matter.) You can think of Q, K and V as three different linear transformations of the same input, one for each of three roles. Once Q, K and V are computed, the attention weights can be calculated, as shown in Figure 4.

Figure 4. Calculation of the attention weights (after Scale and Softmax)

With the attention weight matrix of Figure 4 in hand, a natural question is: what exactly do these weights represent? In the first row of the weight matrix, 0.7 is the attention of “I” on “I”, 0.2 is the attention of “I” on “am”, and 0.1 is the attention of “I” on “who”. In other words, when encoding the “I” in the sequence, the model places 0.7 of its attention on “I”, 0.2 on “am”, and 0.1 on “who”.

Similarly, row 3 of the weight matrix says that when encoding the “who” in the sequence, 0.2 of the attention goes to “I”, 0.1 to “am”, and 0.7 to “who”. This shows that, through the weight matrix, the model knows how to distribute its attention over the different positions when encoding the vector at each position.

However, the result above also shows that when the model encodes the information at the current position, it tends to focus too much on its own position (which is intuitive) and may neglect the other positions [2]. One remedy the authors adopt is multi-head attention (MultiHeadAttention), which we will see later.
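The projection step can be sketched as follows. The $3\times4$ input shape comes from the example above; the projection width of 2 and the random values of `X`, `W_Q`, `W_K` and `W_V` are assumptions made only for illustration (in a real model these matrices are learned).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # 3 tokens, each a 4-dimensional vector

d_k = d_v = 2                    # hypothetical projection sizes
W_Q = rng.normal(size=(4, d_k))  # three different projection matrices
W_K = rng.normal(size=(4, d_k))
W_V = rng.normal(size=(4, d_v))

# The same input X is transformed three times, once per role.
Q = X @ W_Q                      # shape: (3, d_k)
K = X @ W_K                      # shape: (3, d_k)
V = X @ W_V                      # shape: (3, d_v)
```

Each row of Q, K and V still corresponds to one token of the input, so row `i` of all three matrices describes the same position in three different roles.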

It expands the model's ability to focus on different positions. z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.

After the weight matrix is calculated through the process shown in Figure 4, it can be applied to V to obtain the final encoded output. The calculation is shown in Figure 5.

Figure 5. Weighted sum and encoded output

Following the process shown in Figure 5, we get the final encoded output vectors. We can also look at this process from another angle, as shown in Figure 6.

As Figure 6 shows, the final encoded vector for “am” is simply the weighted sum of the three vectors of the original “I”, “am” and “who”, which reflects the whole process of allocating attention weights when encoding “am”.

The entire process in Figures 4 and 5 can also be represented by the process in Figure 7.

We can see that the self-attention mechanism does solve the problem the authors raise at the beginning of the paper, namely that traditional sequence models must encode a sequence step by step. With self-attention, a few matrix transformations of the original input are enough to obtain, for every position, an encoded vector that carries its own attention information.

That covers the core of the self-attention mechanism, but many details have not been introduced yet. For example, how do the Encoder and the Decoder interact to obtain Q, K and V? What does the Mask operation labeled in Figure 2 mean, and when is it used? These will be covered in detail later. Next, let's continue with the MultiHeadAttention mechanism.
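The weighted-sum view can be checked with a tiny example. In the sketch below, the first and third rows of the weight matrix are the values quoted in the text; the middle row and the whole value matrix `V` are made-up numbers, used only to show that one row of the matrix product equals a weighted sum of the rows of `V`.

```python
import numpy as np

# Rows 1 and 3 are taken from the text; row 2 is a hypothetical placeholder.
A = np.array([
    [0.7, 0.2, 0.1],   # weights used when encoding the first token
    [0.2, 0.6, 0.2],   # weights for the second token (assumed values)
    [0.2, 0.1, 0.7],   # weights used when encoding the third token
])

# Hypothetical value vectors, one row per token.
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

Z = A @ V              # encoded output for all three positions at once

# The second row of Z, written out as an explicit weighted sum of the rows of V.
z_second = A[1, 0] * V[0] + A[1, 1] * V[1] + A[1, 2] * V[2]
```

Here `Z[1]` and `z_second` are the same vector: the matrix product is just a compact way of computing all three weighted sums in one step.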

#### 3.2 MultiHeadAttention

The discussion above should have given us a reasonably clear understanding of the self-attention mechanism. We also mentioned one of its shortcomings: when encoding the information at the current position, the model tends to focus too much on its own position. The authors therefore propose the multi-head attention mechanism to address this problem. In addition, multi-head attention lets the output of the attention layer contain encoded representations from different subspaces, which increases the expressive power of the model.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Now that we have explained why multi-head attention is needed and what its benefits are, let's take a look at what the multi-head attention mechanism actually is.

Figure 8. Structure of the multi-head attention mechanism

As Figure 8 shows, the multi-head attention mechanism simply runs several groups of self-attention on the original input sequence, then concatenates the result of each group and applies a linear transformation to obtain the final output. Specifically, it is calculated as follows:

$$ \begin{aligned} \text{MultiHead}(Q,K,V) &= \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O \\ \text{where}\;\; \text{head}_i &= \text{Attention}(QW_i^Q,KW_i^K,VW_i^V) \end{aligned} $$

where

$$ W_i^Q \in \mathbb{R}^{d_{model} \times d_k},\; W_i^K \in \mathbb{R}^{d_{model} \times d_k},\; W_i^V \in \mathbb{R}^{d_{model} \times d_v},\; W^O \in \mathbb{R}^{hd_v \times d_{model}} $$

In the paper, the authors use $h=8$ parallel self-attention modules (8 heads) to build one attention layer, and for each module they set $d_k = d_v = d_{model}/h = 64$. This shows that the multi-head attention used in the paper effectively splits one large, high-dimensional single head into $h$ smaller heads. The whole calculation of the multi-head attention mechanism can therefore be represented by the process shown in Figure 9.

As shown in Figure 9, from the input sequence X and $W^Q_1,W^K_1,W^V_1$ we can calculate $Q_1$, $K_1$ and $V_1$, and then, using formula $(1)$, obtain the output $Z_1$ of a single self-attention module. Similarly, from X and $W^Q_2,W^K_2,W^V_2$ we obtain the output $Z_2$ of another self-attention module. Finally, $Z_1$ and $Z_2$ are concatenated horizontally to form $Z$, and $Z$ is multiplied by $W^O$ to obtain the output of the entire multi-head attention layer. From the calculation in Figure 8, you can also see that $d_q=d_k=d_v$.

This concludes the principle of the multi-head attention mechanism, which is the core of the Transformer.
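The idea of splitting one large head into $h$ smaller ones can be illustrated with a short NumPy sketch. The values $h=8$ and $d_k=64$ come from the paper, which together imply $d_{model}=512$; the sequence length of 3, the single large matrix `W_Q_big`, and the random data are assumptions for the illustration. The sketch only shows the query projection; keys and values work the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h                                  # 64 dimensions per head

X = rng.normal(size=(3, d_model))                   # 3 tokens (hypothetical input)
W_Q_big = rng.normal(size=(d_model, d_model))       # one large projection (assumed)

# Project once, then split the last dimension into h heads of size d_k.
Q_big = X @ W_Q_big                                      # shape: (3, 512)
Q_heads = Q_big.reshape(3, h, d_k).transpose(1, 0, 2)    # shape: (8, 3, 64)

# Equivalent per-head view: head 1 uses the first d_k columns of W_Q_big.
W_Q_1 = W_Q_big[:, :d_k]                            # shape: (512, 64)
Q_1 = X @ W_Q_1                                     # shape: (3, 64)
```

`Q_heads[0]` and `Q_1` hold the same numbers, which is why implementations often store all the per-head matrices $W_i^Q$ side by side in one large matrix and split the result afterwards.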
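The two-head calculation described above can be sketched end to end. This is an illustrative sketch under stated assumptions, not the paper's implementation: the sizes ($d_{model}=4$, two heads of width 2), the helper names, and the random weights are all made up, and masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, formula (1), without masking.
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(X, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) tuples, one tuple per head.
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    Z = np.concatenate(outputs, axis=-1)      # stack Z_1, Z_2 horizontally
    return Z @ W_O                            # final linear transformation

rng = np.random.default_rng(0)
seq_len, d_model, h = 3, 4, 2                 # hypothetical sizes
d_k = d_v = d_model // h
X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(h)
]
W_O = rng.normal(size=(h * d_v, d_model))
output = multi_head_attention(X, heads, W_O)  # shape: (seq_len, d_model)
```

Because $h \cdot d_v = d_{model}$, the output has the same shape as the input, which is what allows such layers to be stacked.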

### 4 Summary

In this article, we first introduced the motivation of the paper, including the problems faced by traditional network structures and the solution proposed by the authors. We then introduced the self-attention mechanism and the principle behind it. Finally, we introduced the multi-head attention mechanism and the benefits of using it. The key point to understand in this part is how the self-attention mechanism is calculated and why. In the next article, I will look in detail at the positional encoding and the encoding-decoding process of the Transformer.

That's all for this article, thank you for reading! If you found it helpful, please share it with a friend! If you have any questions or suggestions, please add the author on WeChat ('nulls8') or join the group to discuss. The green mountains stay and the rivers keep flowing; see you at the inn next time!

### References

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need

[2] The Illustrated Transformer http://jalammar.github.io/ill…

[3] LANGUAGE TRANSLATION WITH TRANSFORMER https://pytorch.org/tutorials…

[4] The Annotated Transformer http://nlp.seas.harvard.edu/2…

[5] SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT https://pytorch.org/tutorials…