Author | Adherer    Editor | NewBeeNLP

Interview knowledge-organizing series, continuously updated

Previously published:

What do interviewers ask about ELMo

Foreword

Some time ago, after finishing my thesis, I started reading the Transformer, GPT and BERT series of papers carefully. I then collected the related interview questions posted on the site, but found that few of them had answers there, so for each question I gathered some material from the Internet and organized it. For some questions I also wrote down my own views, which may contain mistakes or even outright errors, so please feel free to point them out.

Model Overview:

Transformer model overview

1. What is the structure of Transformer?

Transformer itself is a typical encoder-decoder model. At the model level, Transformer is essentially a seq2seq-with-attention model. The following outlines the overall structure of Transformer and the components of each module.

1.1 Encoder side & Decoder side overview

  • The Encoder side is a stack of N (N=6 in the original paper) identical large modules, each composed of two sub-modules: a multi-head self-attention module and a feedforward neural network module.
    • Note that each large module on the Encoder side receives a different input: the first (bottom) module receives the embedding of the input sequence (which may be pre-trained, e.g. with Word2vec), while every other module receives the output of the module below it. The output of the last module is the output of the whole Encoder side.
  • The Decoder side is likewise a stack of N (N=6 in the original paper) identical large modules, each composed of three sub-modules: a masked multi-head self-attention module, a multi-head Encoder-Decoder attention interaction module, and a feedforward neural network module.

    • Also note that each large module on the Decoder side receives a different input. The first (bottom) module receives different inputs during training and testing, and even during training the input may differ slightly between implementations (this is the "shifted right" input in the model overview figure, explained below). The remaining modules receive the output of the module below them, and the output of the last module serves as the output of the whole Decoder side.

    • For the first large module, in a nutshell, the inputs received during training and testing are as follows:

      • During training, the input at each time step is the previous input plus the ground truth shifted one position further back (i.e., each shift reveals one new word of the target sequence). In particular, when the Decoder's time step is 1 (i.e., for the first input it receives), the input is a special token, which may be the start token of the target sequence, the end token of the source sequence, or some other task-dependent token; different implementations differ slightly here. The objective is to predict the next word (token): at time step 1 the model predicts the first word (token) of the target sequence, and so on. See the short sketch after this list.
        • Note that in actual implementations, the embedding of the target sequence is not fed in dynamically step by step; instead, the whole embedding is fed into the first large module at once, and the future positions are masked inside the masked multi-head self-attention module.
      • During testing, the first input is the special token; the output predicted at the first position is then appended to the input sequence for the second prediction, and so on until decoding finishes.
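
To make the "shifted right" input concrete, here is a tiny illustrative sketch of how the Decoder input and labels are typically built from the target sequence during training. The token ids and the BOS/EOS names are made up for illustration and are not taken from any particular implementation:

```python
# Hypothetical token ids; BOS marks the start and EOS the end of the target sequence.
BOS, EOS = 1, 2
target = [BOS, 37, 152, 84, EOS]      # ground-truth target sequence

decoder_input  = target[:-1]          # [BOS, 37, 152, 84]  -> fed to the Decoder ("shifted right")
decoder_labels = target[1:]           # [37, 152, 84, EOS]  -> what it must predict

# At time step t the Decoder may only "see" decoder_input[:t+1] (the rest is masked)
# and is trained to predict decoder_labels[t].
```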

1.2 Sub-modules of the Encoder side

1.2.1 Multi-head self-attention module

Before introducing the multi-head self-attention module, let's first introduce self-attention itself, as shown below:

self-attention

The attention above can be described as "mapping a query and a set of key-value pairs to an output", where the query, keys, values and output are all vectors. The query and keys have dimension $d_k$, and the values have dimension $d_v$ (in the paper $d_k = d_v = d_{model}/h = 64$). The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from a similarity function between the query and the corresponding key. This form of attention is called "scaled dot-product attention", and corresponds to the formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
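
As a concrete illustration of the formula above, here is a minimal NumPy sketch of scaled dot-product attention (toy shapes, a single sequence, no batching or masking; written for this article, not taken from the original implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of values

# toy example: 3 tokens, d_k = d_v = 4
Q = K = V = np.random.randn(3, 4)
print(scaled_dot_product_attention(Q, K, V).shape)             # (3, 4)
```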

For the multi-head self-attention module, $Q$, $K$ and $V$ are first mapped through parameter matrices (i.e., each is passed through a linear/fully connected layer) and scaled dot-product attention is then performed; this process is repeated $h$ times ($h = 8$ in the original paper). Finally, all the results are concatenated and passed through another fully connected layer, as shown below:

multi-head attention

The corresponding formula is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$

where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
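
The following minimal NumPy sketch illustrates the multi-head computation (toy shapes; the per-head projections are taken as column slices of single large matrices purely for brevity, which is one common way to implement it, not necessarily how the original code does it):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention, same as the sketch in 1.2.1
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Project X h times, attend per head, concatenate, then apply the output projection W_o."""
    d_k = X.shape[-1] // h
    heads = [attention(X @ W_q[:, i*d_k:(i+1)*d_k],
                       X @ W_k[:, i*d_k:(i+1)*d_k],
                       X @ W_v[:, i*d_k:(i+1)*d_k]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o                # back to (seq_len, d_model)

X = np.random.randn(5, 64)                                     # 5 tokens, d_model = 64
W_q, W_k, W_v, W_o = (np.random.randn(64, 64) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)       # (5, 64)
```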

1.2.2 Feedforward neural network module

The feedforward neural network module (Feed Forward in the figure) consists of two linear transformations with a ReLU activation in between, corresponding to the formula:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$

In the paper, the input and output dimension of the feedforward module is $d_{model} = 512$, and its inner-layer dimension is $d_{ff} = 2048$.
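
A minimal NumPy sketch of the position-wise FFN with the paper's dimensions, d_model = 512 and d_ff = 2048 (random weights, shown only to illustrate the shapes):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, 512)                          # 10 positions, d_model = 512
W1, b1 = np.random.randn(512, 2048), np.zeros(2048)   # inner layer d_ff = 2048
W2, b2 = np.random.randn(2048, 512), np.zeros(512)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)     # (10, 512)
```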

1.3 Sub-modules of the Decoder side

1.3.1 Masked multi-head self-attention module

The Decoder-side multi-head self-attention module is the same as the one on the Encoder side, but note that it needs a mask: when predicting, the Decoder must not "see the future", so for the token currently being predicted, that token's future positions (i.e., all tokens that follow it) are masked out.
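
A minimal NumPy sketch of such a causal ("look-ahead") mask: positions after the current one get a score of minus infinity before the softmax, so they receive (essentially) zero attention weight. This is illustrative code, not the original implementation:

```python
import numpy as np

def causal_mask(seq_len):
    """Entries above the diagonal are -inf: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # masked entries get weight 0
    return weights @ V

Q = K = V = np.random.randn(4, 8)
print(causal_mask(4))                                          # -inf strictly above the diagonal
print(masked_self_attention(Q, K, V).shape)                    # (4, 8)
```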

1.3.2 Multi-head Encoder-Decoder attention interaction module

The multi-head Encoder-Decoder attention interaction module has the same form as the multi-head self-attention module; the only difference is where its matrices come from. The $Q$ matrix comes from the output of the sub-module below (i.e., the output of the masked multi-head self-attention module after Add & Norm in the figure), while the $K$ and $V$ matrices come from the output of the whole Encoder side. Thinking about it carefully, this interaction module works just like the attention mechanism in seq2seq with attention: its purpose is to let each word (token) on the Decoder side put more attention weight on the corresponding words (tokens) on the Encoder side.
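
A minimal NumPy sketch of this interaction: Q is projected from the Decoder states, while K and V are projected from the Encoder output (toy shapes and random weights, for illustration only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_states, enc_output, W_q, W_k, W_v):
    Q = dec_states @ W_q                                   # from the Decoder (tgt_len, d)
    K = enc_output @ W_k                                   # from the Encoder (src_len, d)
    V = enc_output @ W_v                                   # from the Encoder (src_len, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))      # (tgt_len, src_len)
    return weights @ V                                     # each target position mixes source info

enc = np.random.randn(7, 64)                               # source length 7, d_model = 64
dec = np.random.randn(5, 64)                               # target length 5
W_q, W_k, W_v = (np.random.randn(64, 64) for _ in range(3))
print(encoder_decoder_attention(dec, enc, W_q, W_k, W_v).shape)   # (5, 64)
```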

“1.3.3 Feedforward Neural Network Module”

This module is the same as on the Encoder side.

1.4 Other Modules

1.4.1 Add & Norm module

An Add & Norm module follows each sub-module on the Encoder and Decoder sides, where Add denotes a residual connection and Norm denotes LayerNorm. The residual connection comes from Deep Residual Learning for Image Recognition[1], and LayerNorm comes from Layer Normalization[2]. So the actual output of each sub-module on the Encoder and Decoder sides is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the output of the sub-module itself.
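
A minimal NumPy sketch of the Add & Norm step (the learnable gain and bias of LayerNorm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the feature dimension (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """LayerNorm(x + Sublayer(x)): residual connection followed by LayerNorm."""
    return layer_norm(x + sublayer_output)

x = np.random.randn(5, 512)                  # input to the sub-module
sub = np.random.randn(5, 512)                # e.g. output of self-attention or the FFN
print(add_and_norm(x, sub).shape)            # (5, 512)
```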

1.4.2 Positional Encoding

Positional Encoding is added to the input embeddings at the bottoms of the Encoder and Decoder stacks. The Positional Encoding has the same dimension $d_{model}$ as the embeddings, so the two can be summed.

To do this, sine and cosine functions of different frequencies are used, as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos$ is the position and $i$ is the dimension. This function is chosen because, for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, thanks mainly to the trigonometric identities:

$$\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$$
$$\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$
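
A minimal NumPy sketch that computes the sinusoidal positional encoding from the formula above (illustrative only):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)       # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# the encoding is simply added to the input embeddings of the same shape
print(pe.shape)                                        # (50, 512)
```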

It should be noted that the Positional Encoding in Transformer is not learned by the network but computed directly from the formula above. The paper also experimented with learned positional encodings and found the results to be basically the same. The sine/cosine version was chosen "because the trigonometric formulas are not limited by sequence length, i.e., they can represent sequences longer than any seen during training".

2. What is the specific input of Transformer Decoder?

See the Encoder side & Decoder side overview above for a detailed analysis of the input to the Decoder side

3. What is the self-attention that Transformer keeps emphasizing? How is self-attention computed? Why does it work so well? Why does self-attention use Q, K and V, rather than just Q and V, K and V, or V alone?

3.1 What is self-attention?

"Self-attention", also called "intra-attention", is an attention mechanism that relates a sequence to itself in order to obtain a better representation of that sequence. Self-attention can be regarded as a special case of general attention: in self-attention, attention is computed between each word (token) in the sequence and all the other words (tokens) in the same sequence. Its key characteristic is that it computes dependencies directly, ignoring the distance between tokens, and thereby learns the internal structure of the sequence; it is also relatively easy to implement. It is worth noting that in some subsequent papers, self-attention is treated as a layer in its own right, used alongside or in place of RNN and CNN layers, and has been successfully applied to other NLP tasks.

3.2 Calculation process of self-attention?

A detailed answer is given in Question 1

3.3 Why is self-attention so effective?

As mentioned in the introduction to self-attention above, self-attention is an attention mechanism that relates a sequence to itself and can, in most cases, obtain a better representation of that sequence. However, whether Transformer's remarkable performance and powerful feature-extraction ability are entirely due to its self-attention module is debatable; see the paper How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures[3]. The following examples give a rough sense of what self-attention does:

Figure 1: A visualization example of self-attention

Figure 2: A visualization example of self-attention

From these two figures (Figures 1 and 2), it can be seen that self-attention can capture syntactic features between words in the same sentence (such as the phrase structure spanning a certain distance shown in Figure 1) as well as semantic features (such as the referent of "its", namely "Law", shown in Figure 2).

Clearly, introducing self-attention makes it much easier to capture long-distance dependencies between words in a sentence. With an RNN or LSTM, computation proceeds sequentially, and long-distance dependencies can only be connected by accumulating information over many time steps; the farther apart two words are, the less likely their dependency is to be captured effectively.

Self-attention, by contrast, directly connects any two words in the sentence in a single computation step, so the distance between long-range dependent features is drastically shortened, which helps the model use these features effectively. In addition, self-attention directly increases the parallelism of the computation. These are the main reasons why self-attention has become so widely used.

3.4 Why does self-attention use Q, K and V? Why not just Q and V, K and V, or V alone?

To me this question is not particularly important. Self-attention uses Q, K and V as three separate sets of parameters, so the model's expressive power and flexibility are obviously better than with only Q and V, or V alone. Of course, there are many mainstream attention variants; for example, in seq2seq with attention, only the hidden states are used to compute the similarity. Different tasks handle attention slightly differently, but the main idea is the same. I am not sure whether any paper has studied this question in detail; interested readers can look into it in their spare time.

Actually, there is one small detail: since self-attention attends to the token itself as well (the same holds for masked self-attention), the mechanism needs at least a "query" and something to be queried, i.e., at least the form Q and V, or K and V. Personally, using all three of Q, K and V seems reasonable.

4. Why does Transformer need multi-head attention? What are its benefits? How is multi-head attention computed? What do related papers say about it?

4.1 Why multi-head attention

According to the original paper, the motivation for multi-head attention is to split the model into multiple heads that form multiple subspaces, so that the model can attend to different aspects of the information and finally combine them. Intuitively, if we were designing such a model ourselves, we would not compute attention just once; combining the results of several attention computations can at least make the model more robust. It is also analogous to using multiple convolution kernels simultaneously in a CNN: multiple attention heads help the network capture richer features/information.

4.2 Calculation process of multi-head Attention

A detailed description is given in Question 1. Note, however, that the paper gives no strong theoretical justification for multi-head attention, so a number of later papers discuss its mechanism; some of them are listed below (I have not read them all yet; they are collected here for reference).

4.3 Papers related to the multi-head attention mechanism:

A Structured Self-attentive Sentence Embedding[4]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned[5]

Are Sixteen Heads Really Better than One? [6]

What Does BERT Look At? An Analysis of BERT’s Attention[7]

A Multiscale Visualization of Attention in the Transformer Model[8]

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention[9]

5. What advantages does Transformer have over RNN/LSTM? Why is that?

5.1 RNN series models have poor parallel computing capability

The hidden state of an RNN-series model at time step $t$ depends on two inputs: the input word $x_t$ at time $t$ and the hidden state $h_{t-1}$ from time $t-1$. This is the defining characteristic of RNNs, and it is through this channel that historical information is passed along. The parallelization problem of RNNs lies exactly here: the computation at time $t$ depends on the hidden-state result at time $t-1$, which in turn depends on the result at time $t-2$, and so on, forming the so-called sequential dependency.
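
A minimal NumPy sketch of a vanilla RNN forward pass, showing why the loop over time steps is inherently sequential (illustrative toy code, not any particular library's implementation):

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, h0):
    """h_t = tanh(x_t W_x + h_{t-1} W_h): each step needs the previous step's
    hidden state, so the loop over time cannot be parallelized."""
    h, states = h0, []
    for x_t in x_seq:                          # strictly sequential over time steps
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

x_seq = np.random.randn(10, 32)                # 10 time steps, input dim 32
W_x, W_h = np.random.randn(32, 64), np.random.randn(64, 64)
print(rnn_forward(x_seq, W_x, W_h, np.zeros(64)).shape)   # (10, 64)
```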

5.2 Transformer has better feature extraction capability than RNN series models

This conclusion is supported by a number of mainstream experiments rather than by rigorous theoretical proof. For concrete experimental comparisons, see:

Abandon Fantasy and Fully Embrace Transformer: A Comparison of Three Feature Extractors (CNN/RNN/Transformer) for NLP [10]

It is worth noting, however, that Transformer cannot completely replace RNN-series models; every model has its own scope of application, and for many tasks an RNN-series model is still the better choice. The key skill is to quickly analyze, for a given task, which model to use and how to use it well.

6. How is Transformer trained? What happens at test time?

6.1 Training

Transformer's training process is similar to that of seq2seq. First, the Encoder side produces an encoded representation of the input and feeds it to the Decoder side for interactive attention. The Decoder side receives its own input (see the detailed analysis in Question 1), passes it through the masked multi-head self-attention module, combines it with the Encoder output through the Encoder-Decoder attention module, and then goes through the FFN to produce the Decoder-side output. Finally, a linear fully connected layer followed by softmax predicts the next word (token), and the loss is backpropagated according to the softmax multi-class loss. As a whole, the Transformer training process is therefore equivalent to a supervised multi-class classification problem.

  • Note that the Encoder side can compute in parallel and encode the entire input sequence at once, whereas the Decoder side does not predict all words (tokens) at once but, like seq2seq, predicts them one by one.

6.2 Testing

For the test stage, the only difference from the training stage is the input at the bottom of the Decoder side; see Question 1 for a detailed analysis.
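
A minimal sketch of greedy autoregressive decoding at test time. `model.encode` / `model.decode` and the special-token ids are hypothetical stand-ins for whatever interface a real implementation exposes:

```python
def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    """Greedy decoding: feed the growing output prefix back into the Decoder."""
    encoder_output = model.encode(src_tokens)            # encode the source once
    output = [bos_id]                                    # start with the special token
    for _ in range(max_len):
        logits = model.decode(output, encoder_output)    # re-run the Decoder on the prefix
        next_token = int(logits[-1].argmax())            # most probable next token
        output.append(next_token)                        # it becomes part of the next input
        if next_token == eos_id:
            break
    return output[1:]                                    # drop the initial special token
```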

7. How do Add & Norm modules work in Transformer?

See the "Other modules" part of Question 1 (Section 1.4) for a detailed analysis of the Add & Norm module.

8. Why can Transformer replace seq2seq?

The biggest problem with the vanilla seq2seq model is that it compresses all the information on the Encoder side into a fixed-length vector and uses it as the initial hidden state of the Decoder side to predict the first word (token) of the target. When the input sequence is long, this obviously loses a lot of Encoder-side information, and because the fixed vector is handed to the Decoder all at once, the Decoder cannot focus on the information it actually needs at each step.

These two points are the shortcomings of the seq2seq model, and later papers made improvements, notably the famous Neural Machine Translation by Jointly Learning to Align and Translate[11]. Although the seq2seq model was substantially improved, its parallelism is still limited because the backbone is still an RNN (LSTM)-series model. Transformer not only improves on both of these points but also introduces the self-attention module, which lets the source sequence and target sequence first "relate to themselves", so that their representations carry richer information; the subsequent FFN layers also strengthen the model's expressive power (see How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures[12]). Moreover, Transformer's parallel computing capability is far better than that of seq2seq-series models. In my view, this is where Transformer is superior to the seq2seq model.

9. What is the encoder representation for sentences in Transformer? How is word order information added?

The Transformer Encoder side produces an encoded representation of the whole input sequence; most importantly, the self-attention module enriches the representation of the input sequence, and word-order information is added using sine and cosine functions of different frequencies, as described in Question 1.

10. How does Transformer parallelize?

I think Transformer's parallelization shows up mainly in the self-attention module. On the Encoder side, Transformer can process the whole sequence in parallel and obtain the output of the entire input sequence through the Encoder. Within the self-attention module, for a given sequence, the dot-product results for all positions can be computed directly at once, whereas an RNN-series model must compute sequentially from the first position to the last.

11. What is the purpose of the scaling (dividing by $\sqrt{d_k}$) in the self-attention formula?

First, the reason for the scaling: as $d_k$ grows, the magnitude of the dot products grows as well, which pushes the softmax function into regions where its gradient is extremely small, making convergence difficult (the gradient may vanish).

To see why the dot product grows, suppose the components of $q$ and $k$ are independent random variables with mean 0 and variance 1; then their dot product $q \cdot k$ has mean 0 and variance $d_k$. To counteract this effect, the dot product is scaled by $1/\sqrt{d_k}$. For a more detailed analysis, see (to be summarized later): Why scaled attention in Transformer? [13]
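
A quick empirical check of this claim (illustrative only): with i.i.d. standard-normal components, the variance of the raw dot product grows roughly like d_k, and dividing by sqrt(d_k) brings it back to about 1:

```python
import numpy as np

for d_k in (8, 64, 512):
    q = np.random.randn(100000, d_k)
    k = np.random.randn(100000, d_k)
    dots = (q * k).sum(axis=1)                           # 100k sample dot products
    print(d_k,
          round(float(dots.var()), 1),                   # ~ d_k
          round(float((dots / np.sqrt(d_k)).var()), 2))  # ~ 1 after scaling
```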

Closing thoughts

The Transformer model proposed in 2017 did cause quite a sensation at the time. Looking back now, Transformer is indeed very powerful, but I don't think it is quite what the paper's title claims, "Attention is All You Need". On the contrary, I think the paper's biggest contribution is that it was the first to achieve good results in NLP tasks by stacking network depth, and machine translation happens to be a task with abundant data and relatively limited difficulty, which plays to Transformer's strengths. Moreover, self-attention is not the whole story of Transformer; the FFN, borrowed from deep CNN networks, may in fact be more important. So take a rational view of Transformer, and for each task choose the model that suits it best ~[14][15][16][17][18][19][20][21][22][23][24]

This article is a bit long and may not be comfortable to read on your mobile phone, so I have prepared a PDF version for you

References for this article

[1]

Deep Residual Learning for Image Recognition: arxiv.org/abs/1512.03…

[2]

Layer Normalization: arxiv.org/abs/1607.06…

[3]

How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures: aclweb.org/anthology/P…

[4]

A Structured Self-attentive Sentence Embedding: arxiv.org/abs/1703.03…

[5]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned: arxiv.org/abs/1905.09…

[6]

Are Sixteen Heads Really Better than One? : arxiv.org/abs/1905.10…

[7]

What Does BERT Look At? An Analysis of BERT’s Attention: arxiv.org/abs/1906.04…

[8]

A Multiscale Visualization of Attention in the Transformer Model: arxiv.org/abs/1906.05…

[9]

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention: arxiv.org/abs/1908.11…

[10]

Abandon Fantasy and Fully Embrace Transformer: A Comparison of Three Feature Extractors (CNN/RNN/Transformer) for NLP: zhuanlan.zhihu.com/p/54743941

[11]

Neural Machine Translation by Jointly Learning to Align and Translate: arxiv.org/abs/1409.04…

[12]

How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures: aclweb.org/anthology/P…

[13]

Why scaled attention in Transformer? : www.zhihu.com/question/33…

[14]

The Illustrated Transformer: jalammar.github.io/illustrated…

[15]

The Annotated Transformer: nlp.seas.harvard.edu/2018/04/03/…

[16]

BERT Is on Fire but You Still Don't Understand Transformer? This Article Is Enough: zhuanlan.zhihu.com/p/54356280

[17]

Abandon Fantasy and Fully Embrace Transformer: A Comparison of Three Feature Extractors (CNN/RNN/Transformer) for NLP: zhuanlan.zhihu.com/p/54743941

[18]

Why does the Transformer need multi-head Attention? : www.zhihu.com/question/34…

[19]

Why scaled attention in Transformer? : www.zhihu.com/question/33…

[20]

[NLP] Transformer (Part 1): zhuanlan.zhihu.com/p/44121378

[21]

How does Transformer compare to LSTM? : www.zhihu.com/question/31…

[22]

What are the mainstream attention methods? : www.zhihu.com/question/68…

[23]

Questions about the Transformer model in Google’s Attention is All You Need? : www.zhihu.com/question/26…

[24]

Attention is All You Need: spaces.ac.cn/archives/47…

– END
