
Introduction

The Transformer model has many internal details. This article focuses only on the Attention and Self-Attention parts; if you are interested in the rest, you can check the original paper [1].

What is the Transformer

  • Transformer is a Seq2Seq model that is ideal for machine translation tasks. If you are not familiar with the Seq2Seq model, please refer to my previous article “Seq2Seq Training and Prediction details and Optimization Techniques”.

  • It is not a recurrent neural network; it is a network built entirely from Attention, Self-Attention, and fully connected layers.

  • In evaluations, the Transformer clearly outperforms the best RNN + Attention models. In industry, RNN-based models have largely given way to Transformer-based models such as the BERT + Transformer combination.

Review of the RNN + Attention structure

As shown in the figure, the model is composed of an RNN plus Attention. During decoding, the context vector c:j is computed as follows:

A) Multiply the Decoder's state vector s_j at the j-th step by W_Q to get the query q:j.

B) Multiply each Encoder hidden state h_i by W_K to get k:i. Since there are m inputs, there are m vectors k:i, collected as the matrix K.

C) Multiplying K^T by q:j gives an m-dimensional vector, and applying Softmax to it gives the m weights a_1j, ..., a_mj.

q:j is called the Query and k:i is called the Key. The Query's role is to match the Keys, and the Keys' role is to be matched by the Query; the computed weight a_ij represents how well the Query matches each Key. The better the match, the larger a_ij. One way to understand this: the Query captures the features of the Decoder state s_j, the Keys capture the features of the m Encoder outputs h_i, and a_ij represents the correlation between s_j and each h_i.

D) Multiply each Encoder hidden state h_i by W_V to get v:i. Since there are m inputs, there are m vectors v:i, collected as the matrix V.

E) With the above, the context vector c:j at the j-th Decoder step can be computed: it is the weighted average of the m value vectors v:i, weighted by the corresponding m weights a_ij.

【Note】The three parameter matrices W_Q, W_K and W_V are learned from the training data.
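For concreteness, here is a minimal NumPy sketch of steps A–E. All names and shapes (d_state, d_attn, the random matrices and states) are toy assumptions for illustration, not values from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions, assumed only for illustration.
d_state, d_attn, m = 8, 4, 5       # state size, attention size, number of Encoder steps
rng = np.random.default_rng(0)

# The three parameter matrices; in practice they are learned from the training data.
W_Q = rng.normal(size=(d_attn, d_state))
W_K = rng.normal(size=(d_attn, d_state))
W_V = rng.normal(size=(d_attn, d_state))

s_j = rng.normal(size=d_state)          # Decoder state s_j at step j
H   = rng.normal(size=(m, d_state))     # Encoder hidden states h_1 ... h_m

q_j = W_Q @ s_j                 # step A: query q:j
K   = W_K @ H.T                 # step B: keys k:1..k:m, shape (d_attn, m)
a_j = softmax(K.T @ q_j)        # step C: m weights a_1j ... a_mj
V   = W_V @ H.T                 # step D: values v:1..v:m, shape (d_attn, m)
c_j = V @ a_j                   # step E: weighted average of the values -> context vector c:j
print(c_j.shape)                # (4,)
```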

Attention in the Transformer

In the Transformer, the RNN structure is removed and only the Attention structure is kept. As the following figure shows, the m key vectors k:i and m value vectors v:i are computed from W_K, W_V and the Encoder's m inputs x_i, while the t query vectors q:j are computed from W_Q and the Decoder's t inputs x'_j.

As shown below, to compute the weights for the Decoder's first step, multiply K^T by q:1; after the Softmax, this gives m weights, denoted a:1.

As shown in the figure below, the context vector c:1 at the Decoder's first step is then computed by multiplying the m weights in a:1 with the m value vectors v:i and summing, i.e. c:1 is their weighted average.

Similarly, the context vector at every Decoder step can be computed in the same way. To be clear, c:j depends on the current Decoder input x'_j and on all Encoder inputs [x_1, ..., x_m].

In the following figure, the Encoder's input is the sequence X, the Decoder's input is the sequence X', and the context vectors C are a function of both X and X'. The three parameter matrices W_Q, W_K and W_V are all learned from the training data.
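Translating the figure into code: a minimal NumPy sketch (the helper `attention` and all shapes are assumptions made here for illustration) of how the context vectors C are computed from the Encoder inputs X and the Decoder inputs X' without any RNN:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X_enc, X_dec, W_Q, W_K, W_V):
    """Attention without an RNN: keys/values come from the Encoder inputs,
    queries come from the Decoder inputs."""
    K = X_enc @ W_K.T               # (m, d_attn): k:1 .. k:m
    V = X_enc @ W_V.T               # (m, d_attn): v:1 .. v:m
    Q = X_dec @ W_Q.T               # (t, d_attn): q:1 .. q:t
    A = softmax(Q @ K.T, axis=-1)   # (t, m): row j holds the weights a:j
    return A @ V                    # (t, d_attn): row j is the context vector c:j

# Toy shapes (assumed): m Encoder inputs, t Decoder inputs, embedding size d.
m, t, d, d_attn = 6, 3, 8, 4
rng = np.random.default_rng(1)
X_enc = rng.normal(size=(m, d))     # x_1 ... x_m
X_dec = rng.normal(size=(t, d))     # x'_1 ... x'_t
W_Q, W_K, W_V = [rng.normal(size=(d_attn, d)) for _ in range(3)]
C = attention(X_enc, X_dec, W_Q, W_K, W_V)
print(C.shape)                      # (3, 4): one context vector per Decoder step
```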

Unlike the RNN model, which feeds the Decoder's state vector h:j into the Softmax classifier, the Transformer's Attention feeds the context vector c:j into the Softmax classifier; the next word can then be predicted by random sampling from the resulting distribution.
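A small sketch of this prediction step follows. The classifier matrix W_out, the vocabulary size, and the random stand-in for c:j are assumptions for illustration; in a real model they come from training and from the attention layer:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d_attn = 100, 4                    # assumed toy sizes
c_j = rng.normal(size=d_attn)                  # stands in for the context vector c:j
W_out = rng.normal(size=(vocab_size, d_attn))  # classifier weights; learned in practice

logits = W_out @ c_j                           # scores over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # Softmax gives a probability distribution
next_word = rng.choice(vocab_size, p=probs)    # random sampling picks the next word id
print(next_word)
```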

Self-Attention in the Transformer

Self-Attention needs only a single input sequence X. Here W_Q, W_K and W_V are applied to each input x_i to compute the m vectors q:i, k:i and v:i. The weights at the j-th position are computed in the same way as above: Softmax(K^T * q:j) gives the m weights of x_j with respect to all inputs X, and multiplying these m weights with the m value vectors v:i and summing gives the context vector c:j.

Similarly, c:j at every position can be computed in the same way.

In summary, the input is the single sequence X, and the context vectors C are a function of X and X itself: computing the context vector c:j of x_j takes both x_j and all of X into account. Again, the three parameter matrices W_Q, W_K and W_V are learned from the training data.
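Below is a minimal Self-Attention sketch under the same toy assumptions (a single sequence X supplies the queries, keys, and values; all names and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Self-Attention: queries, keys, and values all come from the same sequence X."""
    Q = X @ W_Q.T                   # (m, d_attn): q:1 .. q:m
    K = X @ W_K.T                   # (m, d_attn): k:1 .. k:m
    V = X @ W_V.T                   # (m, d_attn): v:1 .. v:m
    A = softmax(Q @ K.T, axis=-1)   # (m, m): row j = weights of x_j w.r.t. all inputs
    return A @ V                    # (m, d_attn): row j is the context vector c:j

m, d, d_attn = 5, 8, 4                  # assumed toy sizes
rng = np.random.default_rng(3)
X = rng.normal(size=(m, d))             # the single input sequence x_1 .. x_m
W_Q, W_K, W_V = [rng.normal(size=(d_attn, d)) for _ in range(3)]
C = self_attention(X, W_Q, W_K, W_V)
print(C.shape)                          # (5, 4): one context vector per input position
```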

References

[1] Vaswani A., Shazeer N., Parmar N., et al. Attention Is All You Need. arXiv, 2017.