
Introduction:

This is a repost of my earlier blog post on CSDN: NLP Learning Notes (9) Self-attention mechanism.

Self-attention

In this lesson, we will study the self-attention mechanism.

Self-attention

The first attention paper, published in 2015, greatly alleviated the Seq2seq model's problem of forgetting long sentences. Attention is not limited to Seq2seq models; it can be applied to any RNN. Next we introduce self-attention, published at EMNLP 2016.

In the original paper, self-attention was applied to an LSTM. In this lesson, we simplify the content of the paper by replacing the LSTM with a Simple RNN, which is easier to understand.

Initially, the state vector $h_0$ and the context vector $c_0$ are both zero vectors. The RNN reads the first input $x_1$ and needs to update the state, compressing the information from $x_1$ into the new state vector $h_1$. A standard Simple RNN relies on the input $x_1$ and the old state $h_0$ to compute $h_1$. The calculation formula is shown below:
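The formula image from the original post is not reproduced here; as a sketch, the standard Simple RNN update has the familiar tanh form below, where the parameter names $A$ (weight matrix) and $b$ (bias) are my own notation rather than taken from the slides:

$$h_1 = \tanh\!\left(A \cdot \begin{bmatrix} x_1 \\ h_0 \end{bmatrix} + b\right)$$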

With self-attention, the formula changes as follows:
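Again as a sketch under the same assumed notation, the only change is that the boxed $c_0$ takes the place of $h_0$ inside the concatenation:

$$h_1 = \tanh\!\left(A \cdot \begin{bmatrix} x_1 \\ c_0 \end{bmatrix} + b\right)$$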

In the formula, the boxed $c_0$ replaces $h_0$. You can also update the state in other ways, for example by concatenating $x_1$, $c_0$, and $h_0$. After computing the new state $h_1$, the next step is to compute the new context vector $c_1$. The new context vector $c_1$ is a weighted average of the existing states $h$.
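In general, the context vector at step $t$ can be written as the weighted average below. This is a sketch based on the description above, using the align function from the attention lesson, with weights that sum to one:

$$\alpha_i = \operatorname{align}(h_i, h_t) \quad \text{for } i = 1, \dots, t, \qquad c_t = \sum_{i=1}^{t} \alpha_i\, h_i$$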

Since the initial state $h_0$ is all zeros, we ignore $h_0$. The weighted average of the existing states is then just $h_1$, so the new context vector $c_1$ equals $h_1$.
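Concretely, with $h_0$ ignored there is only one state $h_1$ to average over, so its weight is $1$:

$$c_1 = \alpha_1 h_1 = 1 \cdot h_1 = h_1$$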

Then compute $h_2$; the formula is the same as above, with the boxed $c_1$ again taking the place of the previous state. Next, compute the new context vector $c_2$. To compute $c_2$, we first need the weights $\alpha$: compare the current state with the two existing state vectors $h_1$ and $h_2$ (including $h_2$ itself), then take their weighted average to obtain $c_2$.
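Putting the second step together (again a sketch using the assumed notation $A$, $b$, and the align function from the attention lesson):

$$h_2 = \tanh\!\left(A \cdot \begin{bmatrix} x_2 \\ c_1 \end{bmatrix} + b\right), \qquad \alpha_i = \operatorname{align}(h_i, h_2)\ \ (i = 1, 2), \qquad c_2 = \alpha_1 h_1 + \alpha_2 h_2$$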

And so on for $c_3, \dots, c_n$. By repeating these steps, we obtain the context vector at every position. Through this computation, self-attention captures the importance of each token in the input sequence and how closely each token is connected to the other tokens in the same sequence, which helps the model with modeling and analysis.
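The whole procedure can be sketched in a few lines of NumPy. This is a minimal illustration of the recurrence described above, not the original paper's LSTM formulation; the parameter shapes, the dot-product-plus-softmax scoring used in place of the align function, and the random example inputs are my own simplifying assumptions.

```python
import numpy as np

def self_attention_rnn(xs, A, b):
    """Simple RNN whose update uses the context vector c instead of the
    previous state h, as described above.

    xs : list of input vectors, each of shape (d_in,)
    A  : weight matrix of shape (d_h, d_in + d_h)
    b  : bias vector of shape (d_h,)
    """
    d_h = b.shape[0]
    c = np.zeros(d_h)          # initial context vector c_0
    states = []                # existing states h_1, ..., h_t

    for x in xs:
        # New state: concatenate the input x_t with the context c_{t-1}
        # (instead of h_{t-1}) before the tanh update.
        h = np.tanh(A @ np.concatenate([x, c]) + b)
        states.append(h)

        # Attention weights: compare h_t with every existing state,
        # including h_t itself. Dot-product scoring + softmax is an
        # assumed stand-in for the align function from the lecture.
        H = np.stack(states)                 # shape (t, d_h)
        scores = H @ h                       # shape (t,)
        alphas = np.exp(scores - scores.max())
        alphas /= alphas.sum()

        # New context vector: weighted average of the existing states.
        c = alphas @ H

    return states, c


# Tiny usage example with random inputs and parameters.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
xs = [rng.normal(size=d_in) for _ in range(5)]
A = rng.normal(size=(d_h, d_in + d_h))
b = np.zeros(d_h)
states, c_final = self_attention_rnn(xs, A, b)
print(len(states), c_final.shape)   # 5 (3,)
```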

Summary

  • All RNNs have the forgetting problem, which can be alleviated with self-attention: every time before updating the context vector $c$, the model reviews all of the previous state information again.

  • The principle of self-attention is the same as that of attention, but self-attention is not limited to the Seq2seq model; it can be applied to any RNN. In addition to avoiding forgetting, self-attention helps the RNN focus on relevant information.

  • In the illustration above (from the original lecture), the RNN reads a sentence from left to right. The current input is highlighted in red, and the weights $\alpha$ with high values indicate which words in the preceding text are most relevant. In general, each word has a large weight with the words just before it. At the same time, pairs such as "is" and "a" also carry large weights, which is consistent with English grammar.