In the last article, we reviewed the RNN (recurrent neural network) and summarized the RNN model. Since the RNN suffers from the vanishing gradient problem, it has difficulty handling long sequence data. Researchers therefore improved the RNN and obtained a special case of it, the LSTM (Long Short-Term Memory), which avoids the vanishing gradient problem of conventional RNNs and has therefore been widely adopted in industry. The following is a summary of the LSTM model.

Long Short-Term Memory networks (LSTMs) are a special kind of RNN designed to solve the long-term dependency problem. They were introduced by Hochreiter & Schmidhuber (1997), and many others have since improved and popularized them. They have been used to solve all kinds of problems and are still widely used today.

1. From RNN to LSTM

In the RNN model, we learned that an RNN has the following structure: each sequence index position t has a corresponding hidden state h_t.

If we omit the details of the layers, the RNN model can be simplified into the form shown below:

All recurrent neural networks have the form of a chain of repeating neural network modules. In a standard RNN, the repeating module has a very simple structure, such as a single tanh layer.

You can clearly see in the figure that the hidden state h_t is obtained from x_t and h_{t-1}. Because of the vanishing gradient problem of RNNs, researchers improved the hidden structure at each sequence index position t: through some techniques the hidden structure is made more complex in order to avoid the vanishing gradient problem. This special kind of RNN is our LSTM.

LSTMs also have this kind of chain structure, but the repeating module is different: instead of the single network layer of a standard RNN, it has four internal network layers. Since there are many variants of the LSTM, we take the most common LSTM as an example. The structure of LSTMs is shown below.

It can be seen that the structure of the LSTM is much more complex than that of the RNN. I really admire how people came up with such a structure, and that it manages to solve the vanishing gradient problem of RNNs.

Before explaining the detailed structure of LSTMs, let's first define the meaning of each symbol used in the figures below:

In the figure above, a yellow box is a neural network layer, a pink circle represents a pointwise operation such as vector addition or multiplication, a single arrow represents data flow, merging arrows represent concatenation, and a forking arrow represents a vector being copied.

2. Core ideas of LSTM

At the heart of LSTMs is the cell state, which is represented by the horizontal line running through the cell.

The cell state is a bit like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions, so it is easy for information to flow along it unchanged. The cell state is shown below.

The LSTM does have the ability to remove information from or add information to the cell state, and this is carefully regulated by structures called gates.

A gate is a way of selectively letting information through. It consists of a sigmoid network layer and a pointwise multiplication operation.

Since the sigmoid layer outputs values between 0 and 1, it describes how much of each component should be let through: 0 means "let nothing through," and 1 means "let everything through."
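As a rough sketch (the notation here is illustrative, not tied to a particular figure): a gate applies a sigmoid layer to its input z and multiplies the result elementwise with the vector v it is controlling:

```latex
\sigma(a) = \frac{1}{1 + e^{-a}} \in (0, 1), \qquad \text{gated output} = \sigma(W z + b) * v
```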

An LSTM contains three such gates to control the cell state.

3. Understand LSTM step by step

As mentioned earlier, the LSTM is controlled by three gates, called the forget gate, the input gate, and the output gate. Let's go through them one by one.

3.1 Forget Gate

The first step in the LSTM is to decide what information the cell state should discard. This decision is made by a sigmoid unit called the forget gate. It looks at h_{t-1} and x_t and outputs a vector of values between 0 and 1, one for each number in the cell state C_{t-1}, where the value indicates how much of that information is kept: 0 means completely discard, and 1 means completely keep. The forget gate is shown below.
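Written as a formula, in the common formulation where W_f and b_f are the forget gate's weights and bias and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input:

```latex
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```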

3.2 Input Gate

To update the cell state, we need the input gate. First, we pass the previous hidden state h_{t-1} and the current input x_t to a sigmoid function. This decides which values will be updated by converting them to values between 0 and 1, where 0 means unimportant and 1 means important. We also pass the hidden state and the current input to a tanh function, which squashes the values between -1 and 1 to help regulate the network and produces a vector of candidate values. The sigmoid output is then multiplied with the tanh output.
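In formulas, using the same conventions as for the forget gate, the sigmoid part i_t and the tanh candidate are:

```latex
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
```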

3.3 Cell State

Now we have enough information to compute the new cell state. First, the old cell state is multiplied pointwise by the forget vector; values multiplied by something close to zero are effectively dropped from the cell state. We then take the output of the input gate and add it pointwise, updating the cell state with the new values the network has found relevant. This gives us our new cell state.
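As a formula, with * denoting pointwise multiplication, the cell state update is:

```latex
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
```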

3.4 Output Gate

Finally, we have the output gate. The output gate determines what the next hidden state will be. Remember that the hidden state contains information about previous inputs, and it is also used for predictions. First, we pass the previous hidden state h_{t-1} and the current input x_t to a sigmoid function. We then pass the new cell state to a tanh function and multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The result is the new hidden state. The new cell state and the new hidden state are then passed on to the next time step.
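In formulas, the output gate o_t and the new hidden state h_t are:

```latex
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t * \tanh(C_t)
```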

The forget gate determines what information from previous time steps is relevant to keep.

The input gate determines what information is added from the current time step.

The output gate determines what the next hidden state should be.
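Putting the three gates together, here is a minimal NumPy sketch of a single LSTM time step following the equations above; the parameter names, weight shapes, and toy data are assumptions for illustration, not code from any particular library:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    x_t:    current input, shape (input_size,)
    h_prev: previous hidden state, shape (hidden_size,)
    c_prev: previous cell state, shape (hidden_size,)
    params: dict with weights W_f, W_i, W_c, W_o of shape
            (hidden_size, input_size + hidden_size) and biases b_f, b_i, b_c, b_o.
    """
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]

    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate

    c_t = f * c_prev + i * c_tilde                         # new cell state
    h_t = o * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

# Tiny usage example with random weights (sizes chosen only for illustration).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {name: rng.standard_normal((hidden_size, input_size + hidden_size)) * 0.1
          for name in ["W_f", "W_i", "W_c", "W_o"]}
params.update({name: np.zeros(hidden_size) for name in ["b_f", "b_i", "b_c", "b_o"]})

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):             # a toy sequence of length 5
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)                                     # (4,) (4,)
```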

4. LSTM variants

The LSTM structure described above is the most common one. In practice, papers use many variants of the LSTM structure; although the differences are small, some of them are worth mentioning.

One popular variant, proposed by Gers & Schmidhuber (2000), adds "peephole connections" to the LSTM structure. Peephole connections allow each gate layer to look at the cell state, as shown in the figure below.

The figure above adds peephole connections to all the gates, but many papers add them to only some of the gates.
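In the commonly cited peephole formulation, the gates take the cell state as an additional input, for example:

```latex
f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right), \quad
i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right), \quad
o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
```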

Another variant couples the forget gate and the input gate. In the standard LSTM structure these two gates make their decisions independently; in this variant, new information is added exactly where old information is forgotten, and old information is only forgotten where new information will be written. The structure is shown below.
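In formula form, the coupled variant reuses (1 - f_t) in place of a separate input gate, so the cell state update becomes:

```latex
C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t
```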

A variant that differs more significantly from the others is the Gated Recurrent Unit (GRU), proposed by Cho, et al. (2014). It combines the forget gate and the input gate into a single new gate called the update gate. The GRU also has a gate called the reset gate, as shown in the figure below.
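For reference, the standard GRU equations are (z_t is the update gate and r_t is the reset gate; biases are omitted for brevity):

```latex
z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \quad
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right), \quad
\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right), \quad
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
```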

5. Summary

As mentioned before, RNNs have achieved good results, and many of those results are actually based on LSTMs, which shows that LSTMs are suitable for most sequence tasks. Many write-ups pile up a stack of formulas that can be intimidating; I hope this step-by-step breakdown helps you understand them. Moving from RNNs to LSTMs was a big step forward. So the question remains: is there more progress to be made? For many researchers the answer is definitely yes: the next step is attention. The idea behind attention is to let the RNN pick out the useful information from a larger set of information at each step. For example, when using an RNN to generate a caption for an image, it can select the part of the image that is useful for each word it outputs, producing better results. In fact, Xu et al. (2015) have already done this. If you want to learn more about attention, this is a good place to start. There is still exciting research happening in the attention direction, and much more to explore...

6. Reference links

  • colah.github.io/posts/2015-…
  • zhuanlan.zhihu.com/p/81549798