The Transformer model is based on the paper Attention Is All You Need and is used to generate contextual encodings of text. Traditionally, this context encoding has mostly been done with RNNs. However, RNNs have two disadvantages:

First, the computation is carried out sequentially and cannot be parallelized. For example, for a text sequence of 10 words, if we want the processing result of the last word, we must first compute the results of the previous 9 words.

Second, because RNN computation is sequential, information attenuates over distance, so it is difficult to model the relationship between two words that are far apart. For this reason, RNNs are usually used in conjunction with attention.

Attention Is All You Need proposes the Transformer model to address RNN's shortcomings. The Transformer consists of multi-head attention, positional encoding, layer normalization and position-wise feed-forward neural networks.

Rather than saying more, it is clearer to look directly at the Transformer's network diagram.

The Transformer, like the seq2seq model, is composed of an Encoder and a Decoder. The one on the left is the Encoder and the one on the right is the Decoder. The model is mainly built from multi-head attention and feed-forward neural networks.

There is a key point here: **positional encoding**. When processing with an RNN, each input has a natural positional attribute, although that sequential processing is also one of RNN's drawbacks. The order of words is closely related to the semantics of the text, so word position is a very important feature. Therefore, the Transformer model adds positional encoding to retain the position information of the words.

Here, we explain each part of the Transformer:

Encoder part:

1. Multi-head attention

Multi-head attention is an important part of the Transformer model. Let's look at its structure through the figure below.

Multi-head attention is an extension of self-attention. Compared with self-attention, multi-head attention improves on it in two ways:

1. It extends the model's ability to attend to different positions in the text sequence. With plain self-attention, the final output at each position contains little information about other words, because its attention weight is mainly dominated by the word itself, i.e. the query.

2. It gives attention multiple representation subspaces. With multi-head attention we have multiple sets of Q, K, V parameter matrices (the Transformer uses eight heads, so there are eight sets of parameter matrices).

Let's take a look at how the Transformer's multi-head attention mechanism works.

First, the computation is split into 8 heads, giving 8 sets of Q, K and V parameter matrices.

Then we calculate the attention weights. Since there are 8 sets of matrices, the attention output also consists of 8 matrices.
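As a concrete illustration, here is a minimal numpy sketch of the scaled dot-product attention computed inside each head. The shapes and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k) for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

# toy example: a 10-word sequence, d_k = 64, i.e. one of the 8 heads
Q = np.random.randn(10, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (10, 64)
```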

Here a problem arises: the next layer is a feed-forward neural network whose input is a single matrix, but our multi-head attention mechanism outputs eight. So we need to compress these eight matrices into one.

How is that done? The solution is simple and elegant; hats off to the authors.

It takes two steps:

  1. Concatenate the eight matrices to get one new matrix.
  2. Multiply the concatenated matrix by an additional parameter weight matrix to get the final output.

Finally, a figure summarizes the calculation process of multi-head attention.
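Putting the pieces together, here is a minimal numpy sketch of the full multi-head attention computation, including the two steps above. The matrix names (W_Q, W_K, W_V, W_O) and random initialization are illustrative assumptions for the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

num_heads, d_model = 8, 512
d_k = d_model // num_heads                  # 64 dimensions per head
seq_len = 10
x = np.random.randn(seq_len, d_model)       # word embeddings (+ positional encoding)

# one set of Q/K/V projection matrices per head, plus the output projection W_O
W_Q = [np.random.randn(d_model, d_k) for _ in range(num_heads)]
W_K = [np.random.randn(d_model, d_k) for _ in range(num_heads)]
W_V = [np.random.randn(d_model, d_k) for _ in range(num_heads)]
W_O = np.random.randn(num_heads * d_k, d_model)

heads = []
for h in range(num_heads):
    Q, K, V = x @ W_Q[h], x @ W_K[h], x @ W_V[h]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights for this head
    heads.append(weights @ V)                   # (seq_len, d_k)

Z = np.concatenate(heads, axis=-1)   # step 1: join the 8 matrices -> (seq_len, 512)
output = Z @ W_O                     # step 2: multiply by the weight matrix
print(output.shape)                  # (10, 512), ready for the feed-forward layer
```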

2. Positional encoding

As mentioned above, the Transformer uses positional encoding to retain the positional information of words, and each position is assigned a unique positional encoding. The positional encoding is defined with trigonometric functions as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, $PE_{(pos,\,i)}$ is the value of the $i$-th dimension of the positional encoding for the word at position $pos$ in the text sequence: $pos$ is the position of the current word in the sentence, and $i$ is the index of each dimension of the vector.

Assuming the vector dimension of the word is 4, the actual position encoding looks like this:

That is, the positional encoding is calculated according to the formula, and its value is added to the word embedding; the sum is used as the input to the next layer.
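Here is a minimal numpy sketch of computing the sinusoidal positional encoding and adding it to the embeddings; the 4-dimensional setting follows the toy example above, and the variable names are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                 # word positions 0..max_len-1
    i = np.arange(d_model)[None, :]                   # dimension indices 0..d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dimensions use sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dimensions use cos
    return pe

d_model, seq_len = 4, 3
embeddings = np.random.randn(seq_len, d_model)          # word embeddings
x = embeddings + positional_encoding(seq_len, d_model)  # input to the next layer
print(positional_encoding(seq_len, d_model))
```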

3. Layer normalization

The training of neural networks has a high degree of computational complexity. In order to reduce the time cost of training and speed up the convergence of the network, layer normalization can be used.

The layer normalization operation takes the input of a layer, $a$, and normalizes it to obtain $h$. The layer normalization formula is as follows:

$$\mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i - \mu\right)^2}$$

$$h = f\!\left(\frac{g}{\sigma} \odot (a - \mu) + b\right)$$

Here, $g$ and $b$ are learnable parameters and $f$ is a nonlinear function.
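In the Transformer's sublayers the normalization itself is applied without an extra nonlinearity; a minimal numpy sketch (with the gain $g$ initialized to ones and the bias $b$ to zeros) looks like this:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-6):
    """Normalize each row of x to zero mean and unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return g * (x - mu) / (sigma + eps) + b

d_model = 512
x = np.random.randn(10, d_model)   # output of a sublayer
g = np.ones(d_model)               # learnable gain
b = np.zeros(d_model)              # learnable bias
y = layer_norm(x, g, b)
print(y.mean(axis=-1)[:3], y.std(axis=-1)[:3])  # roughly 0 and 1 per row
```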

There is one more detail we need to pay attention to in the Transformer Encoder's architecture: each sublayer of the Encoder (self-attention, FFNN) has a residual connection around it, which adds the input of the network layer to its output.
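In code, each sublayer therefore follows the pattern "add the input back, then normalize". A minimal sketch, reusing the layer_norm function from the sketch above and standing in a random linear map for the real sublayer, might look like this:

```python
import numpy as np

def residual_sublayer(x, sublayer, g, b):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x), g, b)   # add input and output, then normalize

# toy usage: the "sublayer" here is just a random linear map standing in for
# self-attention or the feed-forward network
d_model = 512
W = np.random.randn(d_model, d_model) * 0.01
x = np.random.randn(10, d_model)
out = residual_sublayer(x, lambda t: t @ W, np.ones(d_model), np.zeros(d_model))
print(out.shape)  # (10, 512)
```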

The calculation details of self-attention, layer normalization and the residual connection, and how they relate to each other, are further shown in the figure below:

The sub-network layers inside the Decoder are likewise composed of self-attention, layer normalization and residual connections, and the Transformer as a whole is composed of the Encoder and Decoder parts. The internal details of the Transformer therefore look like this:

Decoder part:

Now that we know the components of the Encoder and the related concepts, and the components of the Decoder are similar, we basically already know what the Decoder is made of. Let's take a look at the overall workflow.

After embedding and positional encoding, the input text is fed into the Encoder for processing. At the end, the top Encoder outputs a set of attention vectors, which are used in the Decoder to help it focus on the appropriate positions of the input text.

The decoding steps are shown below, and they are repeated until a terminator is produced. During decoding, the output of each step is used as input to the next step. Just as in the Encoder, where the positional encoding is added to the word embedding as the input to the next layer, in the Decoder the positional encoding of each word is added to the corresponding embedding to form the Decoder input.
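The loop can be sketched as follows; decode_step is a hypothetical helper standing in for one pass through the Decoder plus the Linear and Softmax layers described below, and start_id / end_id are assumed token ids:

```python
def greedy_decode(encoder_output, decode_step, start_id, end_id, max_len=50):
    """Feed each decoded word back in as input until the terminator appears."""
    output_ids = [start_id]
    for _ in range(max_len):
        # decode_step returns a probability distribution over the vocabulary
        probs = decode_step(encoder_output, output_ids)
        next_id = int(probs.argmax())      # take the most probable word
        output_ids.append(next_id)
        if next_id == end_id:              # stop at the terminator
            break
    return output_ids
```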

Mask

Mask means masking: some values are covered up so that they do not take part in the computation and have no effect on parameter updates. When the Decoder computes attention, future information needs to be hidden so that the Decoder cannot see it. For example, at time step t, the Decoder should rely only on the outputs before t, not on the outputs after t. So we need a way to hide the information after t; that is what the mask is for.

The solution is to add a very large negative number (effectively -inf) to each of these positions before computing the softmax in the attention step, so that after the softmax their probability is close to zero.
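A minimal numpy sketch of this masking step; the sequence length and score values are illustrative:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)       # raw attention scores Q K^T / sqrt(d_k)

# upper-triangular mask: position t must not attend to positions > t
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores = np.where(mask, -1e9, scores)            # "-inf" for future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax; masked weights are ~0
print(np.round(weights, 3))                      # lower-triangular weight matrix
```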

Linear and softmax

The Decoder's final output is a vector of floating-point numbers. How do we convert this vector into a word? That is what the Linear and Softmax layers do.

The Linear layer is a fully connected neural network whose number of output neurons usually equals the size of our vocabulary. The Decoder's output is fed into the Linear layer and then through a softmax, giving a vector the size of the vocabulary in which each value is the probability that the corresponding word is the Decoder's current output. We simply take the word with the largest probability as the Decoder's result for the current step.
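A minimal numpy sketch of this final projection, with an illustrative vocabulary size and randomly initialized weights:

```python
import numpy as np

d_model, vocab_size = 512, 10000
decoder_output = np.random.randn(d_model)        # final Decoder vector for one step

W = np.random.randn(d_model, vocab_size) * 0.01  # Linear layer (fully connected)
b = np.zeros(vocab_size)
logits = decoder_output @ W + b                  # one score per vocabulary word

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over the vocabulary
predicted_id = int(probs.argmax())               # index of the most probable word
print(predicted_id, probs[predicted_id])
```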

That covers the Transformer's main modules and how they work; I hope it is helpful to you. Chairman Mao said that practice is the only criterion for testing truth, and (as I like to say) code is the best material for learning an algorithm. In the next chapter, we will carefully analyze the implementation code of the Transformer model and use it to further understand the principles of the model.

Writing this up was not easy; if it helped you, I hope you will give it a like to encourage me ~^_^
