
Introduction

Previously, we introduced how to use Attention to improve the performance of Seq2Seq by combining Attention with the Seq2Seq Encoder and Decoder. In this article, we introduce Self-Attention, which allows Attention to be used on its own, independent of the Encoder-Decoder structure.

Self-Attention can be applied to LSTM, but to keep things simple I will introduce the idea using SimpleRNN instead of LSTM.

SimpleRNN + Self-attention

【 How SimpleRNN computes hi 】

In SimpleRNN, we compute hi with the following formula:

hi = tanh(A * concat(xi, hi-1)+b)

Note that the hidden state at the current time step depends on the current input xi and the hidden state hi-1 from the previous time step.
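
To make the recurrence concrete, here is a minimal NumPy sketch of a single SimpleRNN step. The dimensions and the random parameters A and b are illustrative assumptions, not values from the article.

```python
import numpy as np

# Minimal sketch of one SimpleRNN step: hi = tanh(A * concat(xi, hi-1) + b).
# input_dim, hidden_dim, A, and b are illustrative assumptions.
input_dim, hidden_dim = 4, 8
A = np.random.randn(hidden_dim, input_dim + hidden_dim)  # parameter matrix
b = np.zeros(hidden_dim)                                 # bias vector

def simple_rnn_step(x_i, h_prev):
    """Compute hi from the current input xi and the previous hidden state hi-1."""
    concat = np.concatenate([x_i, h_prev])
    return np.tanh(A @ concat + b)

x_i = np.random.randn(input_dim)
h_prev = np.zeros(hidden_dim)   # initial hidden state h0
h_i = simple_rnn_step(x_i, h_prev)
```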

【 How SimpleRNN + Self-Attention computes hi 】

When Self-Attention is introduced, the way SimpleRNN computes hi changes to the following formula:

hi = tanh(A * concat(xi, ci-1)+b)

The example in the figure shows that the hidden state h3 at time t3 depends on the current input x3 and the context vector c2 from the previous time step.

ci is obtained by first computing attention weights between the new output hi and the existing hidden states h1, ..., hi, giving the weights a1, ..., ai; these hidden states are then combined with their corresponding weights as a weighted average to obtain ci. The specific weight calculation is the same as the method described in the Attention article, so it is not repeated here.

In the figure's example, c3 is the weighted average of h1, h2, and h3 with their respective weights a1, a2, and a3.
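
The following sketch puts the two pieces together: ci-1 replaces hi-1 in the recurrence, and ci is the weighted average of h1, ..., hi. The softmax over dot-product scores used to produce the weights a1, ..., ai is an assumed choice here; the article defers to the earlier Attention post for the exact weight calculation.

```python
import numpy as np

# Sketch of SimpleRNN + Self-Attention: hi = tanh(A * concat(xi, ci-1) + b),
# where ci is the weighted average of h1, ..., hi.
# Dimensions, parameters, and the dot-product + softmax scoring are assumptions.
input_dim, hidden_dim = 4, 8
A = np.random.randn(hidden_dim, input_dim + hidden_dim)
b = np.zeros(hidden_dim)

def attention_context(h_list):
    """Weighted average ci of all hidden states h1..hi seen so far."""
    H = np.stack(h_list)                              # shape: (i, hidden_dim)
    scores = H @ h_list[-1]                           # scores of h1..hi against hi
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> a1..ai
    return weights @ H                                # ci = sum_j aj * hj

def rnn_self_attention(xs):
    """Run the sequence, feeding ci-1 (instead of hi-1) back into the recurrence."""
    h_list, c_prev = [], np.zeros(hidden_dim)          # c0 = 0
    for x_i in xs:
        h_i = np.tanh(A @ np.concatenate([x_i, c_prev]) + b)
        h_list.append(h_i)
        c_prev = attention_context(h_list)             # ci for the next step
    return h_list

hs = rnn_self_attention([np.random.randn(input_dim) for _ in range(3)])
```

In this sketch, h3 is computed from x3 and c2, and c3 is the weighted average of h1, h2, and h3, matching the figure's example.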

In addition, a more elaborate calculation can be considered; everything else in the process stays the same:

hi = tanh(A * concat(xi, ci-1, hi-1) + b)
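
A short sketch of this variant, reusing the attention step above; only the recurrence widens to take both ci-1 and hi-1 (dimensions and parameters remain illustrative assumptions):

```python
import numpy as np

# Variant recurrence: hi = tanh(A * concat(xi, ci-1, hi-1) + b).
# The attention step producing ci is unchanged from the sketch above.
input_dim, hidden_dim = 4, 8
A = np.random.randn(hidden_dim, input_dim + 2 * hidden_dim)  # wider matrix
b = np.zeros(hidden_dim)

def step(x_i, c_prev, h_prev):
    """One step using both the context vector ci-1 and the hidden state hi-1."""
    return np.tanh(A @ np.concatenate([x_i, c_prev, h_prev]) + b)
```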

Conclusion

  • Like Attention, Self-Attention solves the forgetting problem of RNN-class models. Every time we compute the current hidden output hi, we use ci-1 to review the earlier information, so that earlier inputs are not forgotten. However, in Self-Attention ci is computed within a single RNN structure, unlike in Seq2Seq, where Attention spans the Decoder and Encoder RNN structures, i.e. the Decoder's ci depends on all of the Encoder's hidden outputs.

  • Self-Attention can be applied to any RNN-class model, such as LSTM, to improve performance.

  • Self-Attention also helps the RNN focus on relevant information, as shown in the figure below, where the red word is the current input and the blue words are the words related to the current input.

Reference

Cheng J., Dong L., Lapata M. Long Short-Term Memory-Networks for Machine Reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.