Fully Illustrated: RNN, RNN Variants, Seq2Seq, and the Attention Mechanism

This is a very thorough article from an expert in the field. The underlined parts are my own understanding and supplements, offered for reference.

 

Contents

1. Start from the single-layer network

2. Classical RNN structure (N vs N)

3. N vs 1

4. 1 vs N

5. N vs M (Seq2Seq)

6. Attention Mechanism

7. Summary


This article uses pictures to introduce the classical RNN, several important RNN variants, the Seq2Seq model, and the Attention mechanism in detail. I hope it can offer a new perspective and help beginners get off to a better start.

1. Start from the single-layer network

Before learning RNN, we should first understand the most basic single-layer network, whose structure is shown as follows:

The input is x; applying the transformation Wx + b and the activation function f gives the output y = f(Wx + b). You are probably already familiar with this.
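As a minimal sketch of this single-layer network (the sizes and the tanh activation are illustrative assumptions, not taken from the article):

```python
import numpy as np

# Single-layer network: y = f(Wx + b)
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))    # weight matrix (2 outputs, 3 inputs; illustrative sizes)
b = np.zeros(2)                # bias

def f(z):
    return np.tanh(z)          # activation function (assumed tanh)

x = rng.normal(size=3)         # input vector
y = f(W @ x + b)               # output vector
print(y.shape)                 # (2,)
```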

2. Classical RNN structure (N vs N)

In practical applications, we will also encounter a lot of sequential data:

 

 

Such as:

  • Natural language processing: x1 could be the first word in a sentence, x2 the second word, and so on.
  • Speech processing: x1, x2, x3, … are the audio signals of successive frames.
  • Time series: for example, daily stock prices.

Sequential data is hard to handle with the plain feed-forward network above. To model sequence problems, the RNN introduces the concept of a hidden state h, which extracts features from the sequential data and is then transformed into the outputs. Let's start with the computation of h1:

 

The meanings of the symbols in the drawings are:

  • Circles or squares represent vectors.
  • An arrow represents a transformation applied to the vector it starts from. In the figure above, h0 and x1 each have an arrow pointing to h1, indicating that a transformation is applied to each of them.

Similar notations will appear in many papers. It is easy to get confused at the beginning, but as long as you grasp the above two points, you can easily understand the meaning behind the diagrams.

h2 is calculated in the same way as h1. Note that the parameters U, W, and b are the same at every step; in other words, the parameters are shared across steps. This is an important feature of RNNs and must be kept in mind.

 

The remaining hidden states are computed in turn (using the same parameters U, W, and b):

 

We only draw the case of length 4 for convenience; in fact, this computation can continue indefinitely.
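As a concrete sketch of this recurrence, assuming from the figures that each step computes h_t = f(U·x_t + W·h_{t-1} + b) with tanh as f (the sizes are illustrative):

```python
import numpy as np

# One step of the RNN hidden-state recurrence: h_t = f(U @ x_t + W @ h_prev + b).
# U, W, and b are shared across all time steps.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    return np.tanh(U @ x_t + W @ h_prev + b)

h = np.zeros(hidden_size)                      # h0
for x_t in rng.normal(size=(4, input_size)):   # a length-4 input sequence
    h = rnn_step(x_t, h)                       # the same U, W, b at every step
```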

So far our RNN has produced no outputs. The outputs are obtained by computing directly from the hidden states h:

As noted before, an arrow represents a transformation of the corresponding vector, of the form f(Wx + b); the arrow here represents a transformation of h1, which yields the output y1.

The remaining outputs are computed similarly (using the same parameters V and c as for y1):

Done! This is the classic RNN structure, which we have assembled like building blocks. Its inputs are x1, x2, …, xn and its outputs are y1, y2, …, yn; that is, the input and output sequences must have the same length.
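Putting the two parts together, here is a minimal sketch of the full N vs N forward pass, assuming the output transform is y_t = softmax(V·h_t + c) (the softmax and all sizes are illustrative choices, not specified by the article):

```python
import numpy as np

# N vs N RNN: one output per input step, all parameters shared across steps.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 5
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
V = rng.normal(size=(output_size, hidden_size))
c = np.zeros(output_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    """xs: a sequence of input vectors; returns one output vector per step."""
    h, ys = np.zeros(hidden_size), []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)   # shared U, W, b
        ys.append(softmax(V @ h + c))      # shared V, c
    return ys

ys = rnn_forward(rng.normal(size=(4, input_size)))
print(len(ys), ys[0].shape)                # 4 (5,)
```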

Because of this limitation, the classical RNN has a relatively narrow range of applications, but some problems are well suited to this structure, such as:

  • Computing a classification label for each frame of a video. Since every frame gets a label, the input and output sequences have equal length.
  • Taking a character as input and outputting the probability distribution of the next character. This is the famous Char RNN (see "The Unreasonable Effectiveness of Recurrent Neural Networks"), which can be used to generate articles, poems, and even code, which is very interesting.

3. N vs 1

Sometimes the problem we face has a sequence as input but a single value rather than a sequence as output. How do we model that? We simply apply the output transformation only to the last hidden state h:

 

This structure is usually used for sequence classification problems: for example, classifying a piece of text, determining the sentiment of a sentence, classifying a video, and so on.
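A minimal sketch of the N vs 1 case, reusing the same recurrence but applying the output transform only to the last hidden state (the number of classes is an illustrative assumption):

```python
import numpy as np

# N vs 1: consume a whole input sequence, emit a single class distribution.
rng = np.random.default_rng(0)
input_size, hidden_size, num_classes = 3, 4, 2
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
V = rng.normal(size=(num_classes, hidden_size))
c = np.zeros(num_classes)

def classify(xs):
    h = np.zeros(hidden_size)
    for x_t in xs:                       # run the shared recurrence over the sequence
        h = np.tanh(U @ x_t + W @ h + b)
    logits = V @ h + c                   # output transform on the last h only
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # class probabilities

print(classify(rng.normal(size=(6, input_size))))
```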

4. 1 vs N

What if the input is not a sequence but the output is? We can feed the input only at the first step of the sequence:

 

There is also a variant that feeds the input information X into every stage:

 

The following figure, which omits some X circles, is an equivalent representation:

This 1 VS N structure can handle the following problems:

  • Generating a caption for an image, where the input X is the image features and the output sequence y is a sentence.
  • Generating speech or music from a category label.
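A minimal sketch of the second 1 vs N variant, where the same input x is fed into every stage (the output length and all sizes are illustrative assumptions):

```python
import numpy as np

# 1 vs N: a single input vector x drives a whole output sequence.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, steps = 3, 4, 5, 6
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
V = rng.normal(size=(output_size, hidden_size))
c = np.zeros(output_size)

def generate(x):
    h, ys = np.zeros(hidden_size), []
    for _ in range(steps):
        h = np.tanh(U @ x + W @ h + b)   # the same x is used at every stage
        ys.append(V @ h + c)
    return ys

print(len(generate(rng.normal(size=input_size))))   # 6
```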

5. N vs M (Seq2Seq)

Next, let's introduce one of the most important RNN variants: N vs M. This structure is also called the Encoder-Decoder model, or the Seq2Seq model.

The original N vs N RNN requires the input and output sequences to have equal length. However, in most problems we encounter the lengths differ; for example, in machine translation the source-language and target-language sentences are rarely the same length.

To handle this, the Encoder-Decoder structure first encodes the input data into a context vector c:

 

There are many ways to obtain c. The simplest is to take the Encoder's last hidden state as c; you can also apply a transformation to the last hidden state, or transform all of the hidden states.

After obtaining c, we decode it with another RNN, called the Decoder. One way to do this is to feed c into the Decoder as its initial state h0:

 

Another option is to use c as the input at every step:

 

Because the Encoder-Decoder structure does not constrain the input and output sequence lengths, it is widely used, for example in:

  • Machine translation: the most classic application of Encoder-Decoder; in fact, this structure was first proposed in the machine-translation field.
  • Text summarization: the input is a text sequence, and the output is a summary of that text.
  • Reading comprehension: encode the input passage and the question separately, then decode to obtain the answer.
  • Speech recognition: the input is a sequence of speech signals, and the output is a text sequence.
  • and so on.
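Here is a minimal sketch of the Encoder-Decoder structure described above, using the variant in which c serves as the Decoder's initial state. Feeding the previous output back in as the Decoder's input, as well as all sizes and the output length, are illustrative assumptions:

```python
import numpy as np

# Minimal Seq2Seq (Encoder-Decoder) sketch: encode a length-N sequence into c,
# then decode c into a length-M sequence with a second RNN.
rng = np.random.default_rng(0)
in_size, hid_size, out_size = 3, 4, 5
# Encoder parameters
Ue = rng.normal(size=(hid_size, in_size))
We = rng.normal(size=(hid_size, hid_size))
be = np.zeros(hid_size)
# Decoder parameters
Ud = rng.normal(size=(hid_size, out_size))
Wd = rng.normal(size=(hid_size, hid_size))
bd = np.zeros(hid_size)
V = rng.normal(size=(out_size, hid_size))
c_out = np.zeros(out_size)

def encode(xs):
    h = np.zeros(hid_size)
    for x_t in xs:
        h = np.tanh(Ue @ x_t + We @ h + be)
    return h                                 # simplest choice: c = the last hidden state

def decode(c, steps=4):
    h, y, ys = c, np.zeros(out_size), []     # c is used as the Decoder's initial state
    for _ in range(steps):
        h = np.tanh(Ud @ y + Wd @ h + bd)    # assumed: previous output fed back as input
        y = V @ h + c_out
        ys.append(y)
    return ys

ys = decode(encode(rng.normal(size=(7, in_size))))   # input length 7, output length 4
print(len(ys))
```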

6. Attention Mechanism

In the Encoder-Decoder structure, the Encoder compresses the entire input sequence into a single semantic vector c, which the Decoder then decodes. This means c must contain all of the information in the original sequence, and its fixed length becomes a bottleneck that limits the model's performance. For example, when the sentence to be translated is long, a single c may not be able to hold enough information, and translation quality declines.

The Attention mechanism solves this problem by providing a different c at each time step. Here is a Decoder with the Attention mechanism:

 

Each c automatically selects the context information most relevant to the y currently being output. Specifically, we use a_ij to measure the correlation between h_j (stage j of the Encoder) and stage i of the decoding; the context c_i fed to stage i of the Decoder is then the weighted sum of all the h_j with weights a_ij, that is, c_i = a_i1·h1 + a_i2·h2 + … + a_iN·hN.

Take machine translation (translating Chinese into English) as an example:

 

 

The input sequence is "我爱中国" ("I love China"), so h1, h2, h3, and h4 in the Encoder can be regarded as the information representing the characters "我" (I), "爱" (ai), "中" (Zhong), and "国" (Guo) respectively. When translating into English, the first context c1 should be most relevant to the character "我", so the corresponding a_11 is relatively large while a_12, a_13, and a_14 are small. c2 should be most related to "爱", so the corresponding a_22 is larger. Finally, c3 is most correlated with h3 and h4, so the values of a_33 and a_34 are larger.

That leaves us with one last question about the Attention model: where do the weights a_ij come from?

In fact, the a_ij are also learned by the model; each a_ij depends on the hidden state of the Decoder at stage i-1 and the hidden state of the Encoder at stage j.

Again using the machine translation example above, here is the computation of the a_1j (the arrows now indicate that h' and h_j are transformed together). (How to compute a_ij is the key difficulty.)

 

 

The calculation of the a_2j:

 

The calculation of the a_3j:

This is the whole computation process of the Encoder-Decoder model with Attention.
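As a minimal sketch of this mechanism: the dot-product score and softmax normalization below are common choices I am assuming here, since the article only states that a_ij depends on the Decoder's previous hidden state and the Encoder's hidden states:

```python
import numpy as np

# Attention sketch: at decoding stage i, compute weights a_ij over all encoder
# hidden states h_j, then form the context c_i as their weighted sum.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_h_prev, encoder_hs):
    """decoder_h_prev: (hid,), encoder_hs: (N, hid) -> (a_i, c_i)."""
    scores = encoder_hs @ decoder_h_prev   # assumed dot-product score against each h_j
    a = softmax(scores)                    # a_ij: one weight per encoder step, summing to 1
    c_i = a @ encoder_hs                   # c_i = sum_j a_ij * h_j
    return a, c_i

rng = np.random.default_rng(0)
encoder_hs = rng.normal(size=(4, 5))       # h1..h4, e.g. one per character of the source
a, c1 = attention_context(rng.normal(size=5), encoder_hs)
print(a.round(2), c1.shape)                # four weights summing to 1, context of shape (5,)
```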

7. Summary

This article introduced the four classic RNN structures N vs N, N vs 1, 1 vs N, and N vs M, as well as how the Attention mechanism is used. I hope it has been helpful.

From the outside, an LSTM looks the same as a plain RNN, so all of the structures above apply equally to LSTMs.