Unlike traditional feedforward neural networks and convolutional neural networks (CNNs), recurrent neural networks (RNNs) are models that excel at processing sequential data such as text, time series, and stock prices. This article introduces the development and structural differences of several important models, RNN, LSTM and GRU, and derives in detail the causes of exploding and vanishing gradients in the RNN.

1. Background of recurrent neural networks

Feedforward neural networks and CNNs have achieved good results on many tasks. However, these architectures are usually better suited to data without temporal or sequential dependence: each input is processed independently of the inputs that came before it.

But sequence data is different: there is an order among the inputs, and the meaning of the current input usually depends on the inputs before it. For example, consider a sentence built from the four words "I", "go", "mall" and "taxi". In different orders these words mean different things: "I take a taxi to the mall" versus "I go to the mall to take a taxi". So we usually need to read a sentence in order to understand its meaning.

In this case, we need a recurrent neural network, which processes the inputs in order. At each time step t there is a vector h_t that stores the information relevant to time t (this can be information from before time t or, in bidirectional models, from after it). From the hidden vector h and the input vector x, the result at the current step can be predicted. We use the following notation:

x_t is the input vector at time t (e.g. the word vector of the t-th word);

h_t is the hidden vector at time t (containing relevant information from the beginning of the sequence up to time t);

y_t is the output vector at time t (usually the prediction at that step).

2. RNN

2.1 RNN structure

RNN is an early recurrent neural network with a relatively simple structure, as shown in the figure below.

In the figure, x, h and y respectively denote the input, hidden state and output of an RNN neuron.

U, W and V are weight matrices: U transforms the input x, W transforms the hidden state h, and V maps the hidden state to the output y.

In an RNN, the same neuron (the same set of weights) is shared at every time step; unrolled over time, it looks like the figure below.

It can be seen that at time t the RNN neuron receives the current input x_t and the hidden state h_{t-1} from the previous step, and outputs the hidden state h_t and the output y_t for the current step.

Therefore, the input x_t only carries information about time step t, not about the whole sequence. h_t is computed from x_t and h_{t-1}, so it combines historical information with the current input. h_t and y_t are computed as follows: tanh is usually used as the activation function for h_t, and softmax is usually applied to produce y_t (for classification).
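Concretely, a standard formulation consistent with the description above is (the bias terms b_h and b_y are added here for completeness):

h_t = \tanh(U x_t + W h_{t-1} + b_h)

y_t = \mathrm{softmax}(V h_t + b_y)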

2.2 Defects of the RNN (vanishing and exploding gradients)

Let's first look at a sequence with only three inputs, as shown in the figure above. The hidden states h_1, h_2, h_3 and the outputs y_1, y_2, y_3 are computed as follows:
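Unrolling the update from the previous section (bias terms omitted for brevity, as is usual in this derivation):

h_1 = \tanh(U x_1 + W h_0), \quad y_1 = \mathrm{softmax}(V h_1)

h_2 = \tanh(U x_2 + W h_1), \quad y_2 = \mathrm{softmax}(V h_2)

h_3 = \tanh(U x_3 + W h_2), \quad y_3 = \mathrm{softmax}(V h_3)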

The loss of the RNN at time t is L_t, and the total loss is L = L_1 + L_2 + L_3.

The gradients of the loss L_3 at t = 3 with respect to the network parameters V, W and U are as follows:
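Applying the chain rule to the unrolled equations above gives the standard back-propagation-through-time expressions:

\frac{\partial L_3}{\partial V} = \frac{\partial L_3}{\partial y_3}\,\frac{\partial y_3}{\partial V}

\frac{\partial L_3}{\partial W} = \sum_{k=1}^{3} \frac{\partial L_3}{\partial y_3}\,\frac{\partial y_3}{\partial h_3}\left(\prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W}

\frac{\partial L_3}{\partial U} = \sum_{k=1}^{3} \frac{\partial L_3}{\partial y_3}\,\frac{\partial y_3}{\partial h_3}\left(\prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial U}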

It can be seen that the gradient with respect to V (the matrix that produces the output y_t) has no long-term dependence; it only involves the state at t = 3. However, the gradients with respect to U (applied to the input x_t) and W (applied to the hidden state h_t) both have long-term dependencies, since they involve the earlier hidden states h_1 and h_2. In general, the gradient of the loss L_t at time t with respect to U and W can be written as follows:
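Generalizing the t = 3 case above to an arbitrary time step t:

\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W}

and the same form holds for U, with \partial h_k/\partial W replaced by \partial h_k/\partial U.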

The product term \prod_{j=k+1}^{t} \partial h_j/\partial h_{j-1} is the culprit behind the vanishing and exploding gradients of the RNN. It can be rewritten as follows:
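Substituting h_j = \tanh(U x_j + W h_{j-1}) and treating the state as a scalar for intuition (in the vector case \tanh' becomes a diagonal Jacobian):

\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \tanh'\!\left(U x_j + W h_{j-1}\right)\, W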

Here tanh' is the derivative of tanh; since tanh'(x) = 1 − tanh²(x), its value lies in (0, 1]. So when the RNN backpropagates, it repeatedly multiplies by tanh' · W. When |tanh' · W| > 1, the repeated multiplication easily leads to exploding gradients; when |tanh' · W| < 1, the gradient easily vanishes.

LSTM (Long Short-Term Memory) networks emerged because of this problem with RNN gradients. The LSTM outperforms the plain RNN in many respects and can alleviate vanishing and exploding gradients.

3. LSTM

The LSTM alleviates the RNN's vanishing gradient problem. Let's first understand its structure.

3.1 LSTM structure

The figure above comes from Colah's blog; it shows that the neuron structure of the LSTM is quite different from that of the RNN. A conventional RNN neuron accepts the previous hidden state h_{t-1} and the current input x_t.

On top of this, an LSTM neuron also takes a cell state c_{t-1} as input. The cell state c is similar to the hidden state h in the RNN: both carry historical information forward, from c_{t-2} to c_{t-1} to c_t. In the LSTM, c plays the role that h plays in the RNN, namely storing historical state information, while h in the LSTM mainly carries the output information of the previous step.

In addition, the internal computation of the LSTM is more involved, with a forget gate, an input gate and an output gate. The function of each gate is introduced below.

Forget gate: In the figure above, the red box marks the forget gate of the LSTM, which determines which information in the cell state c_{t-1} should be discarded. σ denotes the sigmoid activation function. The inputs h_{t-1} and x_t are passed through the sigmoid to obtain f_t, each value of which lies in the range [0, 1]. The closer a value of f_t is to 1, the more the value at the corresponding position in c_{t-1} should be remembered; the closer it is to 0, the more that position should be forgotten. Multiplying f_t and c_{t-1} element-wise yields c'_{t-1}, the cell state after the useless information has been forgotten.

Input gate: The input gate in the red box determines which new information should be added to c'_{t-1}. σ again denotes the sigmoid function. From h_{t-1} and x_t, a tanh activation produces the candidate information c̃_t (drawn with a tilde in the figure), but not all of it is useful, so h_{t-1} and x_t are also passed through a sigmoid to obtain i_t, which indicates which parts of the new information are useful. The element-wise product of these two vectors is added to c'_{t-1}, giving the cell state c_t at time t.

Output gate: The output gate in the red box determines which information should be output as h_t. The cell state c_t is passed through tanh, and h_{t-1} and x_t are passed through a sigmoid to obtain a vector o_t. Each dimension of o_t lies in [0, 1], indicating which positions should be suppressed and which retained. Multiplying these two vectors element-wise gives h_t.
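To make the gate computations concrete, here is a minimal NumPy sketch of a single LSTM step following the description above (the parameter names W_f, W_i, W_c, W_o and the bias terms are illustrative, not taken from the figures):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; returns (h_t, c_t). Each gate sees [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])    # forget gate: what to drop from c_{t-1}
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])    # input gate: which new information is useful
    c_hat = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate cell state (the c with a tilde)
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])    # output gate: what to expose as h_t
    c_t = f_t * c_prev + i_t * c_hat          # forget old information, then add the new
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example: hidden size 4, input size 3.
rng = np.random.default_rng(0)
hidden, inp = 4, 3
p = {k: rng.standard_normal((hidden, hidden + inp)) * 0.1 for k in ("W_f", "W_i", "W_c", "W_o")}
p.update({k: np.zeros(hidden) for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, p)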

3.2 How the LSTM alleviates vanishing and exploding gradients

As shown in the previous section, the main reason for the vanishing gradient in the RNN is that the gradient contains a long product term. If that product term can be controlled, the vanishing gradient problem can be overcome. How do we control it? By making each factor of the product approximately equal to 0 or approximately equal to 1.

In the LSTM, the gates make it possible for each factor to be approximately 0 or 1. First, let's look at the formulas for c_t and h_t in the LSTM:
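Using the gates from Section 3.1 (⊙ denotes element-wise multiplication; the weight and bias names are illustrative):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)

\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

h_t = o_t \odot \tanh(c_t)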

In these formulas, f_t and o_t are produced by the sigmoid function, and in practice their values tend to saturate close to 0 or close to 1. Therefore, the product term in the LSTM becomes:
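Differentiating the cell-state update with respect to the previous cell state, and ignoring the dependence of the gates on c_{t-1} as the informal argument here does:

\frac{\partial c_t}{\partial c_{t-1}} \approx f_t, \qquad \prod_{j=k+1}^{t}\frac{\partial c_j}{\partial c_{j-1}} \approx \prod_{j=k+1}^{t} f_j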

Therefore, when the forget gate is close to 1, each factor of the product is close to 1, so the gradient is transmitted well through the LSTM and does not vanish.

When the forget gate is close to 0, it means that the information from the previous step has no effect on the current step, and there is no need to propagate the gradient back through it.

This is why the LSTM can alleviate vanishing and exploding gradients.

4. GRU

The GRU is a variant of the LSTM with a simpler structure. The LSTM has three gates (forget, input, output), while the GRU has only two (update, reset). In addition, the GRU does not have the separate cell state c of the LSTM.

z_t and r_t in the figure denote the update gate (red) and the reset gate (blue), respectively. The reset gate r_t controls how much of the previous state h_{t-1} enters the candidate state h̃_t (drawn with a tilde in the figure): the smaller r_t is, the smaller its product with h_{t-1} is, and the less information from h_{t-1} is added to the candidate state. The update gate controls how much of the previous state h_{t-1} is retained in the new state h_t: the larger (1 − z_t) is, the more information is retained.
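Written out, a common GRU formulation that matches the convention above (where a larger (1 − z_t) keeps more of the old state; the weight names are illustrative):

r_t = \sigma(W_r [h_{t-1}, x_t]), \quad z_t = \sigma(W_z [h_{t-1}, x_t])

\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t])

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t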

5. Summary

Recurrent neural networks are well suited to sequence data and are essential models when learning NLP; many NLP applications and algorithms use them.

The traditional recurrent neural network (RNN) is prone to vanishing and exploding gradients, so the LSTM and its variants are more commonly used at present.

In practice, the recurrent network can also be made deeper, i.e. a multi-layer recurrent neural network; it is also possible to add a reverse direction, as in the biLSTM, which takes advantage of both forward and backward information (see the sketch below).
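For example, a stacked bidirectional LSTM can be built in a few lines with PyTorch (a minimal sketch; the sizes and layer count are arbitrary choices for illustration):

import torch
import torch.nn as nn

# Two-layer bidirectional LSTM: input vectors of size 100, hidden size 64.
lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 100)   # a batch of 8 sequences, 20 time steps each
output, (h_n, c_n) = lstm(x)
print(output.shape)           # torch.Size([8, 20, 128]): forward and backward states concatenated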

6. References

Christopher Olah, "Understanding LSTM Networks", https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Causes of gradient disappearance and gradient explosion in RNN