I. Preface

This article was submitted to the "New Creator Ceremony" event, starting the road of creation on Juejin together. There are already plenty of good articles online about this topic, each covering the details in its own way. The reasons I am writing this post are: first, to deepen my own understanding of this material; second, to pay more attention to the details that newcomers like me find hard to follow in other posts; and third, to write down some of my own thinking, in the hope of exchanging ideas with more people. The goal of this article is not an exhaustive description of the principles, but to help readers who only vaguely understand them form a clear overall picture. The content therefore leans toward parts that may look simple to experienced readers but may not be obvious to those who are new to NLP. If you spot mistakes, please do not hesitate to point them out.

II. Detailed explanation of the Attention principle

1. Overview

Before starting with Attention, I hope you are already familiar with the RNN family of network structures. If not, you can check my earlier blog post on the principles of recurrent neural networks (RNN, LSTM and GRU), which describes the network structure and the forward and backward propagation of RNNs simply and clearly. Although the Attention mechanism is applied not only in the NLP field but also in CV and other fields, this post explains it in the NLP setting because that really makes it easier to understand. As far as I know, Attention was first proposed to improve machine translation under the seq2seq architecture, and the seq2seq encoder and decoder themselves use RNN-family algorithms. A clear understanding of the RNN family is therefore of great significance for understanding the principle of Attention, which in turn is very helpful for understanding the Transformer family. The reason many newcomers still feel confused after reading other people's articles usually lies in a gap in this prerequisite knowledge.

2. The Attention structure and principle

Firstly, a paper that describes Attention in detail (please click). The main reason I recommend it is that it contains detailed calculation formulas; it also proposes several improvements to the Attention mechanism and is worth a careful read if you have the time. Another paper was published at ICLR (please click); it appeared earlier than the previous one, so it should be the first paper to propose the Attention mechanism, and it is likewise worth reading carefully.

Before officially starting, let me say a few more words. It may look like rambling, but I think it is necessary. Those who have a clear understanding of the RNN family know that one serious drawback of RNNs is the long-term dependency problem. Although LSTM and a series of later variants alleviate it, when the sequence is very long the decoder still cannot decode the earliest parts of the input sequence well. The Attention mechanism was proposed to address exactly this. If RNNs are still unclear to you, I suggest taking a look at my earlier blog posts; then let's get to the topic.

The papers describe the algorithm through the application of Attention to seq2seq, so some understanding of seq2seq is necessary. The seq2seq algorithm itself is actually very simple. First, let's look at the seq2seq model without Attention. Both the encoder and the decoder use RNNs. At each time step the encoder receives, in addition to its previous hidden state and output, the current input token, while the decoder has no such input. A straightforward approach is therefore to take the encoding vector obtained at the end of encoding and feed it as an input feature to the decoder at every time step (if the encoder is an LSTM, the final semantic vector c can be used directly as this input). As can be seen from the figure, the encoding vector is fed in with equal weight, that is, the encoding vector input at every decoding moment is the same.

Here is the seq2seq model with Attention applied. From this figure we can see that the encoding vector fed to the decoder is now a weighted input, and its weights change dynamically with the decoding moment. Besides the encoding vector, the decoder also receives the hidden state of the previous decoding moment $s_{t-1}$ and the previous output $Y_{t-1}$, where $Y_{t-1}$ needs to be embedded; during training it is the ground-truth token and during prediction it is the predicted token. (This can also be improved; the TensorFlow source code adopts the scheme of the linked paper. That touches on seq2seq details, so I will not expand on it here.)

So far the principle of Attention is very simple: during decoding, it produces a specific encoding vector $c_i$ for each decoding moment. The most important quantity in this process is the weight $a_{ij}$ of each encoder hidden state $h_j$. Let's look at how it is implemented. First, the formula for the context vector:

$$c_i=\sum^{T_x}_{j=1} a_{ij}h_j$$
where $i$ denotes the $i$-th moment of the decoding stage, $j$ denotes the $j$-th moment of the encoding stage, and $T_x$ is the number of time steps of the encoding stage. Then $a_{ij}$ is the weight of the $j$-th encoder hidden state at decoding moment $i$. This formula describes how the encoding vector fed in at decoding moment $i$ is obtained, and you can also see that the key point is how $a_{ij}$ itself is obtained:

$$a_{ij}=\cfrac{exp(e_{ij})}{\sum^{T_x}_{k=1}exp(e_{ik})}$$
where $e_{ij}$ is the energy function (or score function) corresponding to the weight $a_{ij}$, and $e_{ik}$ is the energy function corresponding to the $k$-th weight. As we can see from the formula, the weight $a_{ij}$ is simply the result of softmax normalization over the energy functions, so the focus of the problem shifts to computing the energy function. Note that the attention weight vector $a_t$ finally obtained at a decoding moment has the same length as the number of encoder time steps.

$$e_{ij}={v_a}^T tanh(W_a[s_{i-1};h_j])$$
where $s_{i-1}$ is the hidden state of the decoder at the previous moment. The paper I mentioned above actually gives three ways to compute this score, shown in the figure; in fact, the TensorFlow source code uses the third one. At this point we have the core Attention weight $a_{ij}$.

Now let's go through the whole process, shown in the figure. The gray box on the right is the computation of the Attention mechanism; it should be clear if you trace it against the formulas above. Note that the figure only shows the decoding stage. We should not regard the Attention mechanism as a single fixed algorithm, but as a method: once you grasp the essence of Attention, you can use your imagination to adapt it and achieve the effect you want.

Finally, a word on the difference between soft attention and hard attention. Soft attention is the mechanism described above: at each decoding moment the encoder outputs are weighted and averaged, so the gradient can be computed directly. Hard attention instead selects one encoder output state by Monte Carlo sampling. If you need this part, you can look into it in more detail.
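To make the formulas above concrete, here is a minimal NumPy sketch of one decoding step of additive (Bahdanau-style) attention. The variable names and shapes (enc_states, s_prev, W_a, v_a) are my own illustrative assumptions, not code from the papers:

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # numerical stability
    e = np.exp(x)
    return e / e.sum()

def additive_attention(enc_states, s_prev, W_a, v_a):
    """One decoding step of additive attention.

    enc_states: (T_x, h_enc)  encoder hidden states h_j
    s_prev:     (h_dec,)      previous decoder hidden state s_{i-1}
    W_a:        (d_a, h_dec + h_enc)
    v_a:        (d_a,)
    Returns the context vector c_i and the weights a_ij.
    """
    T_x = enc_states.shape[0]
    # energy e_ij = v_a^T tanh(W_a [s_{i-1}; h_j]) for every encoder step j
    e = np.array([
        v_a @ np.tanh(W_a @ np.concatenate([s_prev, enc_states[j]]))
        for j in range(T_x)
    ])
    a = softmax(e)                                 # a_ij, length T_x
    c = (a[:, None] * enc_states).sum(axis=0)      # c_i = sum_j a_ij h_j
    return c, a

# toy usage
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))      # 5 encoder steps, hidden size 8
s = rng.normal(size=(6,))          # decoder hidden size 6
W = rng.normal(size=(10, 6 + 8))
v = rng.normal(size=(10,))
c_i, a_i = additive_attention(enc, s, W, v)
print(a_i.sum())                   # the weights sum to 1
```

The softmax over the energies gives the weights $a_{ij}$, and the weighted sum of the encoder states gives the context vector $c_i$ fed to the decoder at moment $i$.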

3. Personal understanding of the Attention mechanism

First of all, this mechanism goes by two names: the attention mechanism and the alignment model. I actually prefer to call it soft feature extraction, because its essence is to select, at each decoding moment, the features that matter most for the current output. I think this view of the attention mechanism is very helpful when we apply it to other NLP tasks or to other fields such as CV. I do not actually know how attention is used in the CV field, but based on this understanding I can guess at the application: to tell whether a picture is a cat or a dog, for example, after extracting features with convolutions we can use attention to focus on the features unique to a cat. This not only captures the more valuable feature information, but can also reduce the influence of the remaining features, and even some of the image's own noise, on the classification. Based on this we could even truncate, at inference time, the parameters whose weights fall below a threshold, which in some scenarios also reduces inference time. Of course, this should be distinguished from the over-fitting story in deep learning; they differ to some extent. Secondly, looking at how the energy function is solved, I wonder whether a similar method could be adopted when matching texts for similarity. These are only my own musings and must not be taken as established knowledge; I just want to use this example to show newcomers that the mechanism can be understood from more than one angle, and as the saying goes, different people will see it differently.

III. Detailed explanation of the Transformer principle

1. Overview

The inherently sequential nature of RNNs means the model cannot be parallelized; by "parallel" here I mean that a single sample cannot be processed all at once, because its features depend on each other. But it is clear that the attention mechanism works, so is there a better way to remove this flaw while keeping the advantages? Thus the Transformer was born, followed by BERT, and at that point NLP entered its own ImageNet era (the era of transfer learning), just as CV had. A more powerful feature extractor provides powerful semantic vector representations for complex NLP tasks and pushes NLP further into everyday life. Of course, attention is only one part of the Transformer's model structure, but it is also the most critical part. I have translated the Transformer paper before; if you need it, you can check my earlier translation of "Attention Is All You Need".

2. Transformer structure principle

Since the Transformer is an upgrade of attention, it must share attention's essential features, so before introducing the Transformer we need to explain three parameters:

- Q (Query): corresponds to the decoder's previous hidden state $s_{t-1}$ in the attention mechanism above.
- K (Key): corresponds to the encoder hidden states in the attention mechanism above.
- V (Value): corresponds to the encoding vector c in the attention mechanism above.

Allow me a small complaint: in most Transformer articles online, the explanations of these three parameters are not friendly to newcomers, and if these three are not clear, all the brilliant explanations that follow are in vain. With the correspondence above in mind, let's revisit the usual explanation: take the query vector Q to the keys K and look up the corresponding V. That is, according to the hidden-state information Q from the previous moment, the model weights the encoder hidden outputs K (that is, attention), and the weighted average gives the encoding vector V for the current moment. I believe that explains what Q, K and V are. Also note that under the attention mechanism described above, Q, K and V are produced autoregressively (the output at the current moment depends on the output at the previous moment, AR for short), whereas in self-attention Q, K and V are obtained by multiplying the model input by the corresponding matrices. That is why the paper treats self-attention as auto-encoding (AE): there is no need to generate step by step as in an RNN, which is exactly the parallelizability of the Transformer I mentioned earlier.
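As a concrete illustration of how Q, K and V interact in self-attention, here is a minimal NumPy sketch of scaled dot-product attention. The projection matrices W_q, W_k, W_v and the mask argument are illustrative assumptions of this sketch, not an official API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    mask: optional (T_q, T_k) boolean array, True = position is blocked.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)       # blocked positions get ~zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (T_q, d_v)

# self-attention: Q, K, V all come from the same input X via learned projections
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))                        # 4 tokens, model dimension 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                                    # (4, 16)
```

Because all queries attend to all keys in a single matrix product, the whole sequence is processed at once, which is the parallelism discussed above.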

The overall structure

Firstly, the overall model structure from the paper is given. After all, it is a variant of attention and does not escape the end-to-end (encoder-decoder) framework (this does not mean the self-attention mechanism can only be used in an end-to-end framework; you can use it wherever you want a feature extractor). In the paper there are 6 Encoder layers on the left and 6 Decoder layers on the right, and the first sub-layer of the Decoder is a masked multi-head attention layer. The main reason for the mask is that our language modeling is done at the word level: to generate a sentence, we can only decode word by word from left to right, so future information must be masked out. As for why we cannot predict a whole sentence directly, chances are there is still more information between the words than a single shot could capture.
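The masking itself is easy to picture: a look-ahead mask blocks every position from attending to positions after it. A minimal sketch follows; it can be passed as the hypothetical mask argument of the attention sketch above:

```python
import numpy as np

def look_ahead_mask(T):
    """Boolean (T, T) mask: True above the diagonal means a future position is blocked,
    so token i may only attend to tokens 0..i."""
    return np.triu(np.ones((T, T), dtype=bool), k=1)

print(look_ahead_mask(4).astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```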

Input Embedding

Those who are familiar with the Transformer know that position information is embedded here. The figure shows the positional encoding formula given in the paper. Here pos is the position of each token in the sequence, and 2i and 2i+1 index the even and odd dimensions of that token's position vector, with all indices starting from 0; $d_{model}$ is the dimension of the word vector and also the dimension of the positional encoding. The paper says that for a fixed offset k, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$; in other words, the positional relation between two tokens at distance k is linear. Next, let's derive this to see how. For simplicity of writing, we simplify the formula above as:

$$PE_{(p,t)}=\begin{cases} sin(w_k\cdot p)& \text{t=2i}\\ cos(w_k\cdot p)& \text{t=2i+1} \end{cases}$$
Then:

$$PE_{(p+k,t)}=\begin{cases} sin(w_k\cdot (p+k))& \text{t=2i}\\ cos(w_k\cdot (p+k))& \text{t=2i+1} \end{cases}$$
Since trigonometric functions satisfy:

$$sin(\alpha +\beta)=sin\alpha cos\beta+cos\alpha sin\beta$$

$$cos(\alpha +\beta)=cos\alpha cos\beta-sin\alpha sin\beta$$

we have:

$$PE_{(p+k,t)}=\begin{cases} sin(w_k\cdot p)cos(w_k\cdot k)+cos(w_k\cdot p)sin(w_k\cdot k)& \text{t=2i}\\ cos(w_k\cdot p)cos(w_k\cdot k)-sin(w_k\cdot p)sin(w_k\cdot k)& \text{t=2i+1} \end{cases}$$
Since the offset k is a fixed value, the terms involving k above are constants, so $PE_{(p+k)}$ is indeed a linear combination of the components of $PE_{pos}$. Of course, for an auto-encoding model like the Transformer, the position information of the tokens would otherwise be missing, and position information is crucial in sequence learning tasks, which also means that the quality of the positional encoding directly affects how well position information is represented. Many improvements to the positional encoding part have since been proposed; I will look at them later and write a separate blog post.
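As a quick sanity check of the formula, here is a minimal NumPy sketch of the sinusoidal positional encoding. The function name and shapes are my own, not taken from any official implementation, and d_model is assumed to be even:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # w_k * pos for each dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
```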

The Transformer layer

I originally wrote some of this part myself, but later found an article that explains it very well, including the feed-forward network and the residual connection I wanted to cover, so I will simply attach the link to that article. With an in-depth understanding of RNNs and attention, it is no longer difficult to read the articles, papers and source code online, especially those about the improvements NLP went through along the way, and you will no longer be stuck on questions like what was wrong with the previous model, why it needed to change, and what the advantages of the change are.

Residual connection

The structure of a residual network is simple: skip a few layers and add x back. Everyone says the residual network effectively solves the vanishing gradient problem and the network degradation problem, but why is it effective? Let's look at the effectiveness of the residual through the formulas. First, I drew a simple computation graph of fully connected layers; to make the figure look clean I simplified it a bit (it is admittedly a little ugly). Let's take it as the example:

When we do not use the residual connection, that is, when the dashed line is removed, the forward propagation is as follows. (By the way, if the principle of gradient descent is not clear to you, please refer to my previous blog post, "The principle of the gradient descent algorithm and its calculation process".)
$$h_1=f(w_1\cdot x)$$

$$h_2=f(w_2\cdot h_1)$$

$$L=\cfrac1 2(y-h_2)^2$$

Then the gradient is:

$$\nabla=\left<\cfrac{d(L)}{d(y-h_2)}\cdot\cfrac{d(y-h_2)}{d(h_2)}\cdot\cfrac{d(h_2)}{d(w_2\cdot h_1)}\cdot\cfrac{d(w_2\cdot h_1)}{d(h_1)}\cdot\cfrac{d(h_1)}{d(w_1\cdot x)}\cdot\cfrac{d(w_1\cdot x)}{d(w_1)},\ \cfrac{d(L)}{d(y-h_2)}\cdot\cfrac{d(y-h_2)}{d(h_2)}\cdot\cfrac{d(h_2)}{d(w_2\cdot h_1)}\cdot\cfrac{d(w_2\cdot h_1)}{d(w_2)}\right>$$

where $f$ is the activation function, $h$ is the hidden layer output, $L$ is the loss function, and $\nabla$ is the gradient. After the residual connection is adopted, the forward propagation becomes:

$$h_1=f(w_1\cdot x)$$

$$h_2=f(w_2\cdot h_1+h_1)$$

$$L=\cfrac1 2(y-h_2)^2$$

Then the gradient is:

$$\nabla=\left<\cfrac{d(L)}{d(y-h_2)}\cdot\cfrac{d(y-h_2)}{d(h_2)}\cdot\cfrac{d(h_2)}{d(w_2\cdot h_1+h_1)}\cdot\cfrac{d(w_2\cdot h_1+h_1)}{d(h_1)}\cdot\cfrac{d(h_1)}{d(w_1\cdot x)}\cdot\cfrac{d(w_1\cdot x)}{d(w_1)},\ \cfrac{d(L)}{d(y-h_2)}\cdot\cfrac{d(y-h_2)}{d(h_2)}\cdot\cfrac{d(h_2)}{d(w_2\cdot h_1+h_1)}\cdot\cfrac{d(w_2\cdot h_1+h_1)}{d(w_2)}\right>$$

where

$$\cfrac{d(x+y)}{dx}=1+\cfrac{dy}{dx}$$
Comparing the two cases: in the first case, if each partial derivative is small, the product of gradients in a deep network becomes tiny, the gradient vanishes, and the parameters can no longer be updated. With the residual connection, no matter how deep the network is, each partial derivative along the skip path carries a "+1" term, so the error can still propagate back, which effectively alleviates the vanishing gradient problem. As for network degradation, we should first be clear about the concept: degradation happens when the vast majority of parameters stop updating during back-propagation, so the network loses the ability to keep learning; since the parameters are not updated, new training samples have no effect and the network cannot learn their features. I have not studied network degradation in depth; I think the reason parameters stop updating needs to be analyzed from the rank characteristics of the weight matrices.
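To make the "skip a few layers and add x back" structure concrete, here is a minimal NumPy sketch of a residual block matching the toy forward pass above; f is taken to be tanh and all names are illustrative:

```python
import numpy as np

def residual_block(x, w1, w2, f=np.tanh):
    """h1 = f(w1·x); h2 = f(w2·h1 + h1) -- the skip connection adds h1 back."""
    h1 = f(w1 @ x)
    h2 = f(w2 @ h1 + h1)   # without the residual this would just be f(w2 @ h1)
    return h2

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))
w1 = rng.normal(size=(8, 8))
w2 = rng.normal(size=(8, 8))
print(residual_block(x, w1, w2).shape)   # (8,)
```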

3. Personal understanding

For the Transformer, the main improvements are parallelization and the ability to capture long-range semantic information. Parallelization is possible because the Q, K and V needed by self-attention are obtained by three linear transformations of the input, rather than stepping through all time steps as in a traditional RNN. Longer-range semantic information can be captured because in self-attention every Q is compared directly with every original K to produce the corresponding weighted V, whereas in traditional attention Q and K are propagated forward through the RNN, so information transmitted over long distances gets lost. Later, of course, Transformer-XL was proposed, and recently Google proposed Synthesizer as a new kernel to replace the attention in the Transformer. Taking a close look at these works and understanding them deeply will give us many other inspirations, such as the positional encoding method in the Transformer: since deep learning now has very powerful feature extraction capabilities, adding more types of features to our data becomes especially important.

IV. Detailed explanation of the BERT principle

1. Overview

Having come all the way here, BERT should seem less mysterious than before. BERT uses the encoder of the Transformer; if you need it, you can look at my earlier translation of the BERT paper. BERT does not contain many new ideas in its model structure, yet it achieves excellent results. Personally, I think this relies first on the Transformer, a powerful feature extractor, and second on two self-supervised language-model tasks. BERT opened the ImageNet era of NLP: pre-train the network on a large-scale corpus to initialize the parameters, then fine-tune it on a small amount of domain-specific corpus to achieve considerable results. First, let's look at BERT's overall structure: as can be seen from the figure, BERT is a stack of Transformer blocks; each column performs the same operation on the complete input sequence, which is not much different from the Transformer itself. Below, we mainly look at the language-model tasks used in BERT and how some specific downstream tasks are handled, which should inspire more solutions in our own algorithm work.

2. Language model in BERT

MLM language model

In BERT, to pre-train the parameters, self-supervised pre-training is performed on a large-scale corpus. At the word level the MLM (Masked LM) task is used; the main procedure is as follows. 1. Randomly mask 15% of the tokens in the input sequence with the [MASK] token from the vocabulary; the training objective is to predict these tokens. 2. The paper also points out that the [MASK] token never appears at prediction time, so something is needed to reduce the mismatch between pre-training and prediction. My personal understanding of this mismatch is that although only 15% of each sequence is masked, over a massive corpus that small percentage adds up to a base that cannot be ignored; the whole point of the MASK operation is to let the model train itself in a self-supervised way, and masking too little would be meaningless. So what does the author do to alleviate the problem? The answer is to apply the following strategy to the 15% of tokens selected for masking: (1) 80% of the time replace the token with [MASK]; (2) 10% of the time replace it with a random token; (3) 10% of the time keep the original token. In other words, within the 15% of selected positions, the probability of actually seeing [MASK] is reduced by 20%. If the remaining 20% all used random tokens, it would lead the model to learn from wrong inputs, because the sample handed to the network is wrong: you are supposed to predict the token "day" but you feed it a random token "ground", an artificially wrong sample. That is why 10% keep the original token. Moreover, the error introduced by random replacement is only 15% * 10% = 1.5%, which is really not a big problem for the model; it is like the Python you use at work while your knowledge of Java happens to be wrong, it does not affect you. You might then ask: is keeping the original token any different from not selecting it in the first place? The answer is yes, because the model has no idea whether the input token it sees is real or not, so it can only predict from the context, which is exactly the purpose of training. It is like an unreliable friend telling you he found $100 on the street today: you do not know whether he is telling the truth, so you have to rely on other clues to judge how reliable the story is. That is the word level; now let's move on to the sentence level.
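Here is a minimal sketch of the 80/10/10 strategy described above. The toy vocabulary and token strings are my own; a real implementation works on token ids and also handles special tokens:

```python
import random

MASK = "[MASK]"
VOCAB = ["day", "ground", "cat", "dog", "run", "sky"]

def mlm_mask(tokens, mask_prob=0.15):
    """Apply the 80/10/10 masking strategy; returns (inputs, labels).
    labels[i] is None where no prediction is required."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:       # select ~15% of positions
            labels[i] = tok                   # the model must predict the original token
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                inputs[i] = MASK
            elif r < 0.9:                     # 10%: replace with a random token
                inputs[i] = random.choice(VOCAB)
            # else 10%: keep the original token unchanged
    return inputs, labels

print(mlm_mask(["the", "cat", "sat", "on", "the", "mat"]))
```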

NSP language model

For sentence-level tasks, BERT also proposes an effective method (a later paper showed that the NSP task is not that essential, but I will not discuss that here): predict whether the next sentence really follows the previous one. Why propose this pre-training task? Mainly because many tasks, such as question answering and inference, depend on learning the relationship between sentences, which a language model alone cannot do, since a language model predicts token by token and learns within a sentence; NSP is there to capture dependencies between sentences. The NSP task can be trained self-supervised on a single corpus: with 50% probability the next sentence is the true next sentence, and with 50% probability it is randomly drawn from the corpus. Of course, as the paper mentions, using a document-level corpus is far better than using shuffled sentences, which is obvious. Much like the MLM language model, self-supervision is its biggest advantage.
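A minimal sketch of how NSP training pairs could be constructed under the 50/50 scheme. The document structure and names are illustrative; each document is assumed to have at least two sentences, and a real implementation would also make sure the randomly drawn sentence is not actually the true next one:

```python
import random

def make_nsp_pair(documents):
    """Build one (sentence_a, sentence_b, is_next) training example.
    documents: list of documents, each a list of sentences."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:                 # 50%: the real next sentence
        return sentence_a, doc[idx + 1], True
    other = random.choice(documents)          # 50%: a random sentence from the corpus
    return sentence_a, random.choice(other), False

docs = [["I went home.", "Then I cooked dinner.", "It was great."],
        ["The model trains fast.", "Loss goes down."]]
print(make_nsp_pair(docs))
```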

fine-tuning

BERT can be thought of as a semi-supervised process: there is the self-supervised pre-training stage, and in actual use we still need our own domain data to fine-tune the model parameters so that the network better fits our task. Below is a brief introduction to the tasks that appear in the paper; first, the overall task figure is given.

- MNLI (Multi-Genre Natural Language Inference): given a pair of sentences, predict whether the second sentence is entailed by, unrelated to, or contradicts the first.
- QQP (Quora Question Pairs): determine whether two questions have the same meaning.
- QNLI (Question Natural Language Inference): determine whether a sentence from a text contains the answer to a question.
- STS-B (Semantic Textual Similarity Benchmark): given a pair of sentences, rate their semantic similarity on a scale of 1 to 5.
- MRPC (Microsoft Research Paraphrase Corpus): determine whether a pair of sentences are semantically equivalent.
- RTE: a binary classification problem similar to MNLI, but with much less data.
- SST-2 (The Stanford Sentiment Treebank): binary classification of single sentences; the sentences come from movie reviews and the goal is to judge their sentiment.
- CoLA (The Corpus of Linguistic Acceptability): binary classification of single sentences; determine whether an English sentence is grammatically acceptable.
- SQuAD: given a question and a Wikipedia paragraph containing the answer, predict the location of the answer span in the paragraph.
- CoNLL-2003 NER: the named entity recognition task, probably the most familiar one; predict the label of each word.

For the SQuAD task, the paper gives a detailed explanation; you can read my translation of the paper directly. In general, for the tasks in (a) and (b) we take the output of the final [CLS] token. The task in (c) mainly predicts the positions with the highest start and end scores, with the start index required to be smaller than the end index, where start is the starting position of the answer in the paragraph and end is its ending position. For NER tasks we naturally take the output at every position, because each output corresponds to one token. BERT's source code also provides a binary classification example that we can modify to train our own tasks.
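For the SQuAD-style span prediction just described, here is a minimal NumPy sketch of selecting the best start/end pair; the start and end vectors and the brute-force search are my own illustration, not BERT's actual code:

```python
import numpy as np

def best_answer_span(H, s_vec, e_vec):
    """H: (T, d) final hidden states of the paragraph tokens;
    s_vec, e_vec: (d,) learned start/end vectors.
    Returns the most likely (start, end) span with start <= end."""
    start_logits = H @ s_vec                 # score of each token as the span start
    end_logits = H @ e_vec                   # score of each token as the span end
    best, best_score = (0, 0), -np.inf
    T = H.shape[0]
    for i in range(T):
        for j in range(i, T):                # enforce start <= end
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 16))                # 12 paragraph tokens, hidden size 16
print(best_answer_span(H, rng.normal(size=16), rng.normal(size=16)))
```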

3. Personal understanding

The most important inspiration BERT gave me is self-supervision. Most deep learning today is supervised, so the data cost is very high, which constrains model training. With self-supervised learning, when the requirements of a task are not too high, we can easily collect data and complete a basic model training, which can be of great help to our main task. Recently GPT-3 has also become extremely popular; it claims to dispense with fine-tuning entirely, its output features can be used directly, and it uses a total of 175 billion parameters, which frankly shocked me. I personally put it in the same line as big models such as XLNet, RoBERTa and ALBERT; there are already very well written articles about them online, and since my own knowledge is limited I will just attach the links. This blog post has taken me quite some time: working during the day and staying up late at night to write, my head is a little dizzy, so I will stop here for the moment without further polishing. I hope it helps you, and remember to give it a thumbs up.