1. Written at the front

This article is part of the "New Person Creation Ceremony" activity, my first step on the road of content creation on Juejin. Recurrent networks should be among the most basic parts of the work I have done in recent years, yet when I revisited them I found I had almost forgotten many of the details. So I am writing this post, hoping to lay the topic out with a simpler and clearer line of thought and help more readers. Wherever my understanding is off, please leave your valuable comments, and don't forget to give it a thumbs up. Readers with the time can also look at this paper, which describes the computation of RNN, LSTM and GRU in detail; I personally think it is well worth reading for anyone heading in an academic direction, because it not only walks through the computations inside recurrent neural networks in detail but also experiments with many methods, and the TensorFlow implementation is based on it.

2. Recurrent neural network (RNN)

1. Neural networks

Before starting with RNNs, let's briefly review the ordinary neural network. Here is an example of a simple one:

  • Input: the input layer. The input $(x_1, x_2, \dots, x_n)$ is the vector representation of one sample, such as a character, word, or sentence.
  • Hidden: the hidden layer. Each circle is a neuron; in general, the result of a linear transformation of $(x_1, x_2, \dots, x_n)$ is put through a nonlinear transformation inside each neuron, i.e. each neuron applies an activation function to its input.
  • Output: the output layer. A softmax classifier is generally used here; in essence, it takes a linear combination of the features extracted by the hidden layer and outputs values between 0 and 1, and the larger the value, the higher the probability that the data belongs to that class.
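To make the three layers concrete, here is a minimal NumPy sketch of one hidden layer followed by a softmax output. This is not code from the article; all sizes and random weights are made up purely for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 1))                          # one sample: a 4-dimensional column vector
W1 = rng.normal(size=(5, 4)); b1 = np.zeros((5, 1))   # input -> hidden (5 neurons)
W2 = rng.normal(size=(3, 5)); b2 = np.zeros((3, 1))   # hidden -> output (3 classes)

h = np.tanh(W1 @ x + b1)             # hidden layer: linear transform + activation
p = softmax(W2 @ h + b2)             # output layer: class probabilities in (0, 1)
print(p.ravel(), p.sum())            # the probabilities sum to 1
```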

2. Recurrent neural network

Why do we need RNNs at all? In the neural network above, the features of each sample are processed independently of one another in the hidden layer, but in many tasks the feature values within a sample are related, so these relationships have to be taken into account. This need arises mainly in NLP tasks, so the RNN adds a kind of 'memory' to carry the related information between the feature values of a sample. The network structure of an RNN is shown below.

Before getting into the topic, let's first look at an example of the input to avoid misunderstandings for readers who are new to this. Suppose the sample is the sentence 'I love you'. After turning each word into a word vector, it becomes:
$((x_1^1, x_1^2, x_1^3),\ (x_2^1, x_2^2, x_2^3),\ (x_3^1, x_3^2, x_3^3))$
Each word is a three-dimensional vector, and the Input in the figure above is the three-dimensional vector of one of the words; the number of such vectors is the number of moments, or time steps, in the RNN. By the way, the actual input at each step is a matrix: dimension 0 (the number of rows) is your word-vector dimension, and dimension 1 (the number of columns) is your batch_size.
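To make those shapes concrete, here is a tiny sketch using the 'I love you' example; the word-vector values below are made-up numbers purely for illustration.

```python
import numpy as np

# The article's example: "I love you" -> three words, each a 3-dimensional word vector.
sentence = np.array([
    [0.1, 0.2, 0.3],   # x_1 ("I")
    [0.4, 0.5, 0.6],   # x_2 ("love")
    [0.7, 0.8, 0.9],   # x_3 ("you")
])                      # shape (time_steps, input_dim) = (3, 3)

batch_size = 1
# What the RNN sees at one time step: rows = word-vector dimension, columns = batch_size.
x_t = sentence[0].reshape(-1, batch_size)
print(x_t.shape)        # (3, 1)
```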

  • Input: the input layer. In this example it is a three-dimensional vector, i.e. one feature vector of the sample. Unlike the traditional neural network above, the input here is a single feature vector, whereas above it was the vector of a whole sample.

  • Hidden: the hidden layer, which plays the same role as in the neural network above.

  • Bias: a bias node (think of the intercept of the linear functions learned in high school), used when computing the hidden and output layers.

  • Output: the output layer, the same as the output layer of an ordinary neural network.

In order to explain the forward and backward propagation process of RNN more clearly, the following parameters are explained in detail:

  • $x_t$: the input at time $t$, a vector.
  • $h_t$: the hidden state at time $t$, i.e. the output value of each neuron of the hidden layer in the figure above.
  • $o_t$: the output at time $t$, i.e. the output probability values, not the label values.
  • U: the weights between the input layer and the hidden layer, which turn our raw input into the hidden layer's input; the grey dotted lines in the figure above.
  • V: the weights between the hidden layer and the output layer, which turn the representation learned by the hidden layer into the final output; the red dotted lines in the figure above.
  • W: the weights between the hidden layer and itself across time steps; all such hidden-to-hidden weights are denoted by W, the solid red lines in the figure above.

To help you keep an image in mind while reading on, I drew a simple picture for you to imagine, as follows. By the way, the circle in the picture is what is called a cell in the RNN, that is, the whole hidden layer; I hope readers who are new to RNNs will keep this in mind.

And then we do forward propagation

At time t = 1, U, V and W are randomly initialised, $h_0$ is usually initialised to 0, and then the following computation is performed:

  • $h_1 = f(Ux_1 + Wh_0 + b_1)$
  • $o_1 = g(Vh_1 + b_2)$ (the final output; $g$ is softmax or similar)

At time t = 2, $h_1$ serves as the memory of time 1 and takes part in the prediction at this moment; the computation is as follows:


  • $h_2 = f(Ux_2 + Wh_1 + b_1)$

  • $o_2 = g(Vh_2 + b_2)$

Continuing in this way, we obtain the general formulas:


  • $h_t = f(Ux_t + Wh_{t-1} + b_1)$

  • $o_t = g(Vh_t + b_2)$

By the way, after checking the TensorFlow source code, the answer is as follows: in most basic RNNs the memory information is simply the output of the hidden layer. Since I could not find the original RNN paper, I have no way to verify whether the earliest RNN was defined in exactly the same way, and from the source code it is clear that the RNN has been improved a great deal, for example in how the input and the hidden-layer neurons are used. If you are interested, look at the TensorFlow source; the RNN part is in fact very well encapsulated, as long as you understand the principle.

Here $f$ can be tanh, ReLU, logistic or another activation function, and $g$ is usually softmax or a similar function. We say that the RNN has memory, and this ability comes from summarising the previous input states through W as auxiliary input for the next time step. The hidden state can be understood as: h = f(current input + previous memory).
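Here is a minimal NumPy sketch of this forward pass, with parameter names following the article (U, V, W, $b_1$, $b_2$), f = tanh and g = softmax; the sizes and random weights are purely illustrative, not the article's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

input_dim, hidden_dim, output_dim, T = 3, 4, 2, 3
rng = np.random.default_rng(0)

U  = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # input  -> hidden
W  = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden
V  = rng.normal(scale=0.1, size=(output_dim, hidden_dim))   # hidden -> output
b1 = np.zeros((hidden_dim, 1))
b2 = np.zeros((output_dim, 1))

xs = [rng.normal(size=(input_dim, 1)) for _ in range(T)]    # x_1 .. x_T
h  = np.zeros((hidden_dim, 1))                              # h_0 initialised to 0

for t, x in enumerate(xs, start=1):
    h = np.tanh(U @ x + W @ h + b1)        # h_t = f(U x_t + W h_{t-1} + b_1)
    o = softmax(V @ h + b2)                # o_t = g(V h_t + b_2)
    print(f"t={t}, o_t={o.ravel()}")
```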

And then we do back propagation.

Back propagation means taking the partial derivatives of the total error at the output layer with respect to the weights, i.e. the gradients $\nabla U$, $\nabla V$, $\nabla W$, and then updating each parameter by gradient descent. Readers who are not familiar with gradient descent can refer to my earlier blog post on the principle and computation of the gradient descent algorithm. For the output $o_t$ at each moment there is some error $e_t$, and the loss can be the cross-entropy loss, the squared error, and so on, so the total error is $E=\sum_1^t e_t$. Our final goal is to compute:


  1. $\nabla U=\cfrac{\partial E}{\partial U}=\sum_1^t\cfrac{\partial e_t}{\partial U}$

  2. $\nabla V=\cfrac{\partial E}{\partial V}=\sum_1^t\cfrac{\partial e_t}{\partial V}$

  3. $\nabla W=\cfrac{\partial E}{\partial W}=\sum_1^t\cfrac{\partial e_t}{\partial W}$

To make things easy to follow, the squared loss is used here: $L(\theta)=\cfrac12(\hat y - y)^2$. The formulas may look intimidating, but the derivation below is probably the simplest you will ever see; as long as you understand the chain rule and gradient descent, even a primary school student could follow it (corrections welcome). First, an example of the chain rule: let $f(x)=2x+1$ and $g(u)=u^2$; then for the composite function $g(f(x))=(2x+1)^2$ the derivative is

$\cfrac{\partial g}{\partial x}=\cfrac{\partial g}{\partial u}\cdot\cfrac{\partial u}{\partial x}=2u\cdot2=4u=8x+4$

OK, let's start the BPTT. Define:

  • $s_t = Ux_t + Wh_{t-1} + b_1$: this just splits $h_1=f(Ux_1+Wh_0+b_1)$ into two steps.
  • $h_t = f(s_t)$: $f$ is the activation function.
  • $o_t = g(Vh_t + b_2)$
  • $e_t = \cfrac12(o_t - y)^2$: the error at each moment.
  • $E = \sum_1^t e_t$: the total error.

The parameters are updated by gradient descent; if this is hazy, review the gradient descent algorithm again. Because the parameters of an RNN are shared, although the computation is unrolled over many moments the parameters are only $U, V, W, b_1, b_2$, and it is exactly these that the error must be propagated back to update. The key step is to compute the gradient of the loss $E=\sum_1^t e_t$. Denote the gradient by $\Delta$; then:

$\Delta = \left<\cfrac{\partial E}{\partial U},\ \cfrac{\partial E}{\partial V},\ \cfrac{\partial E}{\partial W},\ \cfrac{\partial E}{\partial b_1},\ \cfrac{\partial E}{\partial b_2}\right>$

Now take each partial derivative with the chain rule, for example:

$\cfrac{\partial E}{\partial U}=\sum_1^t\cfrac{\partial e_t}{\partial U}=\sum_1^t\cfrac{\partial e_t}{\partial (o_t-y)}\cdot\cfrac{\partial (o_t-y)}{\partial o_t}\cdot\cfrac{\partial o_t}{\partial (Vh_t+b_2)}\cdot\cfrac{\partial (Vh_t+b_2)}{\partial h_t}\cdot\cfrac{\partial h_t}{\partial s_t}\cdot\cfrac{\partial s_t}{\partial U}=\sum_1^t(o_t-y)\cdot1\cdot g'(Vh_t+b_2)\cdot V\cdot f'(s_t)\cdot x_t$

(Strictly speaking, $h_{t-1}$ inside $s_t$ also depends on $U$ and $W$, which is why each step depends not only on the current step but also on the network states of several earlier steps; that is exactly where the name BPTT, back-propagation through time, comes from.)

The rest is left to you: by analogy you can derive the remaining components of $\Delta$ and use them in the back-propagation update that minimises the loss. In practice these are matrix computations and somewhat brain-burning, so dig deeper if you are interested. The BP algorithm is important, but since the frameworks have already packaged it we do not need to implement the BP procedure ourselves. Once you have mastered the chain rule and gradient descent, you have not only mastered BPTT; it is also not hard to see the root cause of exploding and vanishing gradients. The parameters are updated along the negative gradient direction, and if the gradient is a product of many factors much smaller (or much larger) than 1, it vanishes (or explodes). LSTM and GRU alleviate the vanishing-gradient problem, while gradient explosion is handled by clipping: given a threshold $c$, gradient values greater than $c$ or less than $-c$ are set to $c$ or $-c$. Exploding and vanishing gradients really deserve a separate post; a tiny clipping sketch follows below, and after that we enter the gate of LSTM.
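As a small illustration of the clipping trick just mentioned, here is a minimal NumPy sketch of element-wise clipping with a threshold c; the gradient values are made up, and norm-based clipping is another common variant not shown here.

```python
import numpy as np

def clip_gradient(grad, c=5.0):
    """Element-wise clipping: anything above c becomes c, anything below -c becomes -c."""
    return np.clip(grad, -c, c)

grad_W = np.array([[12.0, -0.3],
                   [-40.0, 2.1]])          # a made-up exploding gradient
print(clip_gradient(grad_W, c=5.0))        # [[ 5.  -0.3] [-5.   2.1]]
```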

3. LSTM principle in detail

If you are familiar with LSTM, you know that it appeared to alleviate one of the biggest weaknesses of the RNN: the problem of long-term dependencies. As the sequence length grows, earlier information gets lost. So let's first see why the RNN has this problem by looking at the following computation (for simplicity the bias terms are omitted):
$h_0 = 0$

$h_1 = f(ux_1 + wh_0)$

$h_2 = f(ux_2 + wh_1)$

$h_3 = f(ux_3 + wh_2)$

$\cdots$

$h_t = f(ux_t + wh_{t-1})$
From this simple chain we can see that the transmission of memory information depends on the parameter $w$. But as time goes on, $w$ is multiplied in again and again: if $w$ is too small, the contribution of the earlier inputs shrinks towards zero and the earlier information is lost, and if it is too large, that contribution blows up instead.
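A tiny numeric illustration of that point; the values of w and the 50-step horizon are purely illustrative.

```python
# Repeatedly multiplying by a small w wipes out the contribution of early inputs,
# while a large w blows it up; 50 steps is an arbitrary illustrative sequence length.
for w in (0.5, 1.5):
    print(w, w ** 50)   # 0.5 -> ~8.9e-16 (vanishes), 1.5 -> ~6.4e+08 (explodes)
```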
Because of this, LSTM improves the traditional RNN by adding a cell state mechanism: of the previous memory, the parts that matter for the final output keep being passed on, while the unimportant parts are truncated. Next, let's analyse the structure of LSTM. The figure used here is the standard one found everywhere online; many people say it is not very friendly to newcomers, but I do not know which tool could draw a better one (if anyone knows, please tell me). I will therefore explain every detail of this figure and every easily confused point very carefully, to make up for not having an original figure of my own.

3.1 LSTM structure details

This figure shows one whole hidden layer of the recurrent network. Remember: it is the whole hidden layer, not a single neuron, which is what many newcomers think. The symbols in the figure are:

1. Neural network layer: equivalent to a fully connected layer containing many neural units; this is the units parameter in LSTM code.
2. Pointwise operation: multiplication or addition; the multiplication here is ordinary multiplication of the corresponding numbers.
3. Vector transfer: the arrows indicating where the data flows next.
4. Concat: concatenation by column, i.e. two matrices stacked along the column dimension.
5. Copy: the data flowing in both directions is the same.

The three parts framed in the figure above are the familiar gates, which are also easy for newcomers to LSTM to confuse. Next we will analyse how these three gates work and get a deep understanding of the principle. First, one concept to keep in mind:
in an LSTM you can think of $h_t$ as the memory of the current moment, and $C_t$ as the sum of the information remembered at the current moment and before; that is exactly why it is called a long short-term memory network. From left to right, the gates are:

1. Forget gate: first, the data flow of the first box. The hidden output of the previous moment $h_{t-1}$ and the current input $x_t$ are concatenated (concat) and fed into the yellow $\sigma$ neural network layer; after this fully connected layer, a sigmoid activation outputs a probability value between 0 and 1, denoted $f_t$:

$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$

It describes how much of each component is allowed to pass: 0 lets nothing through and 1 lets everything through. The output is then multiplied element-wise with the cell state of the previous moment, $C_{t-1}$, and passed to the right, into the next box. That is the end of the verbal description; I think careful readers can already follow it, but let's go through a concrete example of the data shapes in the computation so you can be sure you fully understand the forget gate. Note: for convenience of writing and understanding, the data in the computation below is described by matrix shape, the input dimension is set to 10, the batch size is set to 1 (i.e. only one sequence is fed in at a time), and units = 20, i.e. the fully connected layer has 20 neurons. The shape of each quantity is as follows:
  • $x_t = 10\times1$: the shape depends on your word-vector dimension.
  • $h_{t-1} = 20\times1$: the shape depends on how many hidden neurons you have.
  • $[h_{t-1}, x_t] = 30\times1$
  • $W_f\cdot[h_{t-1},x_t] = 20\times30 \cdot 30\times1 = 20\times1$
  • $W_f\cdot[h_{t-1},x_t]+b_f = 20\times1 + 20\times1 = 20\times1$
  • $\sigma(W_f\cdot[h_{t-1},x_t]+b_f) = 20\times1$
  • $f_t\times C_{t-1} = 20\times1 \times 20\times1 = 20\times1$ (element-wise)
As a reminder, in the TensorFlow implementation the weights for $h_{t-1}$ and $x_t$ are kept separate and can use different initialisation and regularisation methods. This article focuses on the principle, so the description may differ from the actual code implementation; interested readers can look at the source directly.
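The shape bookkeeping above can be checked with a minimal NumPy sketch, using the article's sizes (input dimension 10, units = 20, batch size 1 written as column vectors); the values are random and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

units, input_dim = 20, 10                       # the article's example sizes
rng = np.random.default_rng(0)

x_t    = rng.normal(size=(input_dim, 1))        # 10 x 1
h_prev = rng.normal(size=(units, 1))            # 20 x 1 (h_{t-1})
C_prev = rng.normal(size=(units, 1))            # 20 x 1 (C_{t-1})

concat = np.vstack([h_prev, x_t])               # [h_{t-1}, x_t] : 30 x 1
W_f = rng.normal(scale=0.1, size=(units, units + input_dim))   # 20 x 30
b_f = np.zeros((units, 1))                                     # 20 x 1

f_t  = sigmoid(W_f @ concat + b_f)              # 20 x 1, values in (0, 1)
kept = f_t * C_prev                             # element-wise: what survives in the cell state
print(f_t.shape, kept.shape)                    # (20, 1) (20, 1)
```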
Now that I’m sure you understand the forgetting gate completely, let’s think about why cell state C is able to achieve what memories to remember and what memories to forget. For a simple example, suppose:
C t 1 = [ 0.8 . 1.2 . 3.2 ] C_ {} t – 1 = [0.8, 1.2, 3.2]

f t = [ 0 . 0.5 . 1 ] ,0.5 f_t = [0, 1]
then
f t x C t 1 = [ 0 . 0.6 . 3.2 ] F_t \ times C_ {} t – 1 = 3.2] [0,0.6,
Is it equivalent to
f t f_t
The 0 position was discarded, half of the 0.5 position was retained, and all of the 1 position was retained and passed down. In summary, through this gate, we can decide what information is discarded in cell state C. So it’s called the forgetting gate or the forgetting gate. 2. Information adding gate:With the foundation of forgetting gate, it is natural to have a good understanding of information adding gate. As the name implies, we know what information should be discarded through the forgetting gate model. Then we need to determine what new information needs to be added into the cell state C through the information adding gate, which is also the function of adding gate. The formula is as follows:
$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$

$\tilde C_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$

$C_t=f_t\cdot C_{t-1}+i_t\cdot\tilde C_t$
It can be seen that the value range of C is relatively large (it is an accumulated sum, not squashed into a fixed interval by an activation). The sigmoid layer determines which values need to be updated, and the tanh layer creates a new candidate vector $\tilde C_t$. Computing $i_t\times\tilde C_t$ then gives the information to be added at the current moment, and adding it to $f_t\cdot C_{t-1}$ gives $C_t$. To sum up, after passing through these two gates we have determined what to delete from and what to add to the transmitted information, i.e. the update of the cell state.

3. The output: this part produces the output based on the cell state. The formulas are as follows:
$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$

$h_t=o_t\times \tanh(C_t)$
First, a sigmoid layer determines which part of the cell state will be output; then the tanh function processes the cell state into values between -1 and 1, which are multiplied by the sigmoid output to give the current output $h_t$. That completes the explanation of the LSTM structure; now let's look at the forward and backward propagation of LSTM.

3.2 BPTT for LSTM

A note first: once you understand the structure of LSTM, I don't think mastering the BPTT procedure is all that important; after all, the various frameworks have packaged it, so we don't need to implement it ourselves. For the sake of completeness, the BPTT of LSTM is described here to help you understand the structural principle. It is similar to that of the RNN and not complicated or brain-burning at all.

1. Forward propagation: forward propagation is relatively simple; it is just the combination of the gate formulas above (a small code sketch follows the formula list):
$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$

$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$

$\tilde C_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$

$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$

$C_t=f_t\cdot C_{t-1}+i_t\cdot\tilde C_t$

$h_t=o_t\times \tanh(C_t)$
$\widehat{y}_t=S(W_y\cdot h_t+b_y)$: the output probability value at each moment; S is the classifier, e.g. softmax.

$e_t=\cfrac12(\widehat{y}_t-y)^2$: the error at each moment (written with $\widehat{y}_t$ rather than $o_t$, since $o_t$ here denotes the output gate).

$E=\sum_1^te_t$: the total error.
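To tie the formulas above together, here is a minimal NumPy sketch of one LSTM step in the concatenated-weight form used in those formulas. It is not the TensorFlow implementation; the sizes reuse the earlier illustrative 10/20 example and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the equations above (concatenated-weight form)."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.vstack([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t     = sigmoid(W_f @ z + b_f)             # forget gate
    i_t     = sigmoid(W_i @ z + b_i)             # information-adding (input) gate
    C_tilde = np.tanh(W_c @ z + b_c)             # candidate cell state
    C_t     = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t     = sigmoid(W_o @ z + b_o)             # output gate
    h_t     = o_t * np.tanh(C_t)                 # hidden state / output
    return h_t, C_t

units, input_dim = 20, 10
rng = np.random.default_rng(0)
params = []
for _ in range(4):                               # (W_f, b_f), (W_i, b_i), (W_c, b_c), (W_o, b_o)
    params += [rng.normal(scale=0.1, size=(units, units + input_dim)),
               np.zeros((units, 1))]

h, C = np.zeros((units, 1)), np.zeros((units, 1))
for t in range(3):                               # a toy 3-step sequence
    h, C = lstm_step(rng.normal(size=(input_dim, 1)), h, C, params)
print(h.shape, C.shape)                          # (20, 1) (20, 1)
```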
2. Back propagation: we already derived this once in the RNN section, and the principle is the same. In the LSTM above, the parameter groups are $W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o$, plus $W_y, b_y$ for the output part, ten groups in total. To clarify, some articles say there are eight groups of parameters; that count covers only the gate weights and biases, whereas here I simply add an output part at each moment. And as mentioned earlier, in real code the W weights for $x_t$ and $h_{t-1}$ are kept separate. In the code you can also set the relevant options to decide whether to return the output of every moment. Back propagation means obtaining the gradients from the loss function E and then updating the parameters by gradient descent. The process is straightforward, and I'm a little tired of writing it out, but if you know the chain rule and the gradient descent algorithm you can derive it yourself without pressure, so I'll leave it to you if you need it. As a small concrete aid, a sketch of the parameter shapes follows, and after that we move on to another important variant of the RNN, the GRU.
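Here is a small sketch of the parameter-group shapes for the article's example sizes (input_dim = 10, units = 20), in the concatenated-weight form used above; output_dim is a made-up value, and real frameworks often pack the four gate matrices together or split the $x_t$ / $h_{t-1}$ weights.

```python
# Parameter shapes and counts for the illustrative sizes input_dim = 10, units = 20.
input_dim, units, output_dim = 10, 20, 2

gate_shapes   = {name: ((units, units + input_dim), (units, 1))    # (W, b) per gate
                 for name in ("f", "i", "c", "o")}
output_shapes = {"y": ((output_dim, units), (output_dim, 1))}       # (W_y, b_y)

n_gate   = 4 * (units * (units + input_dim) + units)   # all gate weights and biases
n_output = output_dim * units + output_dim             # the per-step output part
print(gate_shapes, output_shapes, n_gate + n_output)
```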

4. GRU structure and principle

Before starting with the GRU, note that based on the LSTM structure we can make changes to suit the actual situation of our business, so in fact there are many variants. Let's list three famous ones and describe one of them, the GRU, in detail.

The first adds peephole connections to the structure, so that each gate also receives the cell state C as input.

The second couples the forget gate and the information-adding gate (the first and second gates): instead of deciding separately what to forget and what to add, they are decided together. This works because the first gate controls what information is forgotten and the second decides what information is added, so the two can be combined.

The Gated Recurrent Unit (GRU) was proposed in 2014 and is in effect a combination of the previous two ideas. It drops the LSTM's notion of a cell state, merges the forget gate and the information-adding gate into an update gate, and merges the cell state and the hidden state, so the structure is simpler than LSTM. The formulas are as follows, with a small code sketch after them:
$r_t=\sigma(W_r\cdot[h_{t-1},x_t])$

$z_t=\sigma(W_z\cdot[h_{t-1},x_t])$

$\hat h_t=\tanh(W\cdot[r_t*h_{t-1},x_t])$

$h_t=(1-z_t)*h_{t-1}+z_t*\hat h_t$
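A minimal NumPy sketch of one GRU step following the formulas above (biases omitted, as in the formulas); the sizes and random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU step following the formulas above (no bias terms, as written)."""
    z_in  = np.vstack([h_prev, x_t])
    r_t   = sigmoid(W_r @ z_in)                           # reset gate
    z_t   = sigmoid(W_z @ z_in)                           # update gate
    h_hat = np.tanh(W @ np.vstack([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat             # new hidden state

units, input_dim = 20, 10
rng = np.random.default_rng(0)
W_r = rng.normal(scale=0.1, size=(units, units + input_dim))
W_z = rng.normal(scale=0.1, size=(units, units + input_dim))
W   = rng.normal(scale=0.1, size=(units, units + input_dim))

h = np.zeros((units, 1))
for t in range(3):                                        # a toy 3-step sequence
    h = gru_step(rng.normal(size=(input_dim, 1)), h, W_r, W_z, W)
print(h.shape)                                            # (20, 1)
```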
With the understanding built up earlier, the GRU is now easy to follow: once we have seen through the gating idea, none of these structures can escape our critical eye, ha ha. On this basis you can also design your own network structure, one more relevant to your actual business scenario, and perhaps get a pleasant surprise.

Write at the end

That's it for the RNN part. Writing this is also a way to record what I have learned, to help more newcomers, and to exchange ideas with more people and find out where my understanding is inaccurate. If you found it useful, remember to give it a like. By the way, I will also post a paper by Google; the LSTM in that paper is a little different from the structure we usually see, and if you have the time you can take a look at it as well at the paper's address.