Author | Renu Khandelwal  Compiled by | VK  Source | Medium

Let’s start with the following questions:

  • What problems in artificial neural networks and convolutional neural networks can recurrent neural networks solve?
  • Where can RNNs be used?
  • What is an RNN and how does it work?
  • The challenges of RNNs: vanishing and exploding gradients
  • How do LSTM and GRU address these challenges?

Suppose we are writing a message, “Let’s meet for ___”, and we need to predict what the next word will be. The next word could be lunch, dinner, breakfast or coffee. It is easier to make the inference from context. Assuming we know that we are meeting in the afternoon and that this information remains in our memory, we can easily predict that we are likely to meet for lunch.

When we need to process sequence data that spans multiple time steps, we use a recurrent neural network (RNN).

Traditional neural networks and CNNs require a fixed-size input vector and apply activation functions across a fixed set of layers to produce a fixed-size output.

For example, we use an input image of size 128×128 to predict whether the image is a dog, a cat, or a car. We cannot make predictions on images of variable size.

Now, what if we need to operate on sequence data whose processing depends on previous input states (such as a message), or where the sequence can appear in the input, the output, or both? That is where we use RNNs.

In an RNN, we share the weights and feed the output back into the recurrent input, which is what makes it useful for processing sequence data.

An RNN uses sequential data to infer who is speaking, what is being said, what the next word might be, and so on.

An RNN is a neural network with loops that allow it to store information. RNNs are called recurrent because they perform the same task on each element of the sequence, and the output for each element depends on the previous elements or states. This is how an RNN persists information and uses context to make inferences.

Where are RNNs used?

The RNN described above can have one or more inputs and one or more outputs, i.e., variable-length inputs and variable-length outputs.

RNNs can be used for:

  • Image classification
  • Image captioning
  • Machine translation
  • Video classification
  • Sentiment analysis

How does an RNN work?

First, let’s explain the notation:

  • h is the hidden state
  • x is the input
  • y is the output
  • W is the weight matrix
  • t is the time step

When we work with sequence data, the RNN takes an input x_t at time step t. To compute the hidden state h_t at time step t, the RNN combines x_t with the hidden state h_(t-1) from time step t-1 and applies a tanh activation function: h_t = tanh(W_hh·h_(t-1) + W_xh·x_t). We use tanh or ReLU as the non-linearity, and the output at time step t is y_t = W_hy·h_t.
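To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN forward step. The weight names (W_xh, W_hh, W_hy), sizes, and the toy sequence are illustrative assumptions, not code from the original article.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of a vanilla RNN.
    h_t depends on the current input x_t and the previous hidden state h_prev."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # non-linearity (tanh)
    y_t = W_hy @ h_t + b_y                           # output at time step t
    return h_t, y_t

# Illustrative sizes: 10-dimensional input, 16-dimensional hidden state, 5 output classes
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(16, 10))
W_hh = rng.normal(scale=0.1, size=(16, 16))
W_hy = rng.normal(scale=0.1, size=(5, 16))
b_h, b_y = np.zeros(16), np.zeros(5)

h = np.zeros(16)                      # initial hidden state
for x in rng.normal(size=(4, 10)):    # a toy sequence of 4 time steps
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)  # the same weights are reused at every step
```

Note that the same W_xh, W_hh, and W_hy are reused at every time step; this is the weight sharing described next.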

The RNN unrolled over four time steps looks like a four-layer neural network, with every step sharing the same weight matrix W.

The hidden state carries information from the previous state and thus acts as the memory of the RNN. The output at any time step depends on the current input as well as the previous state.

Unlike other deep neural networks, which use different parameters for each hidden layer, an RNN shares the same weight parameters at every step.

We randomly initialize the weight matrix, and during training we need to find the values of the matrix that give us the desired behavior, so we calculate a loss function L. The loss function L measures the difference between the actual and predicted outputs. We calculate L using the cross-entropy function.

In an RNN, the total loss L is the sum of the losses at each time step.
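A minimal sketch of that total loss, assuming a softmax output and an integer class label at each time step (the function names and toy values are mine, not the article’s):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sequence_loss(logits, targets):
    """Total loss L = sum over time steps of the cross-entropy at each step."""
    total = 0.0
    for y_t, target_t in zip(logits, targets):     # one (prediction, label) pair per time step
        probs = softmax(y_t)
        total += -np.log(probs[target_t] + 1e-12)  # cross-entropy for this step
    return total

# Toy example: 3 time steps, 5 classes
logits = np.random.default_rng(1).normal(size=(3, 5))
targets = [2, 0, 4]                  # correct class index at each time step
L = sequence_loss(logits, targets)
```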

To reduce the loss, we use backpropagation, but unlike a traditional neural network, an RNN shares weights across multiple layers, in other words, it shares weights across all time steps. Thus, the error gradient at each step also depends on the losses at the previous steps.

In the example above, to calculate the gradient at time step 4, we need to add the loss at time step 4 to the losses at the earlier time steps. This is called backpropagation through time (BPTT).

We calculate the gradient of the error with respect to the weights so that we learn the correct weights and obtain the desired output.

Because W is used at every step up to the final output, we backpropagate from t = 4 all the way back to t = 0. In a traditional neural network we do not share weights, so we do not need to sum the gradients; in an RNN we share W, so we need to sum its gradient over every time step.
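A small scalar sketch (a toy example of my own, not the article’s code) that shows the gradient of the shared weight being summed over the time steps and checked against a numerical gradient:

```python
import numpy as np

def forward(w, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + x_t), with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + x))
    return hs

def loss(hs, targets):
    # L is the sum of the per-step squared errors
    return 0.5 * sum((h - t) ** 2 for h, t in zip(hs[1:], targets))

def bptt_grad(w, xs, targets):
    """Backpropagation through time: dL/dw is the SUM of the
    per-time-step contributions, because the same w is reused everywhere."""
    hs = forward(w, xs)
    dw, dh_next = 0.0, 0.0
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - targets[t]) + dh_next   # loss at step t plus gradient flowing back from later steps
        da = dh * (1.0 - hs[t + 1] ** 2)          # through the tanh non-linearity
        dw += da * hs[t]                          # this step's contribution to dL/dw
        dh_next = da * w                          # propagate to the previous hidden state
    return dw

xs, targets, w = [0.5, -0.3, 0.8, 0.1], [0.2, 0.1, 0.4, 0.0], 0.9
analytic = bptt_grad(w, xs, targets)
eps = 1e-6                                        # numerical check of the summed gradient
numeric = (loss(forward(w + eps, xs), targets) - loss(forward(w - eps, xs), targets)) / (2 * eps)
print(analytic, numeric)                          # the two values should agree closely
```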

Computing the gradient of h at time step t = 0 involves many factors of W, since we have to backpropagate through every RNN cell. Even if we set the weight matrix aside and simply multiply the same scalar value by itself over and over, this becomes a problem when the number of time steps is very large, say 100 time steps.

If the largest singular value is greater than 1, the gradient explodes; this is known as the exploding gradient problem.

If the largest singular value is less than 1, the gradient shrinks toward zero; this is known as the vanishing gradient problem.

Because the weights are shared across all layers (time steps), the gradient can explode or vanish.
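The effect is easy to see numerically. This toy sketch (my own illustration of repeated multiplication, not the article’s code) multiplies a gradient by the same factor up to 100 times:

```python
import numpy as np

grad = 1.0
for factor in (1.1, 0.9):                       # largest singular value > 1 vs. < 1
    g = grad * factor ** np.arange(0, 101, 25)  # gradient after 0, 25, 50, 75, 100 steps
    print(factor, g)
# factor 1.1 -> the gradient explodes (about 13781 after 100 steps)
# factor 0.9 -> the gradient vanishes (about 0.0000266 after 100 steps)
```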

For the exploding gradient problem, we can use gradient clipping: we set a threshold in advance, and if the gradient exceeds the threshold, we clip it.
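A minimal sketch of clipping by norm (the threshold value is arbitrary, and deep-learning frameworks ship equivalent built-in utilities):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If the gradient's norm exceeds the threshold, rescale it so its norm equals the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])        # norm = 5
print(clip_by_norm(g, 1.0))     # rescaled to norm 1 -> [0.6, 0.8]
print(clip_by_norm(g, 10.0))    # unchanged, norm is already below the threshold
```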

To solve the vanishing gradient problem, a common approach is to use a long short-term memory (LSTM) network or a gated recurrent unit (GRU).

In our message example, to predict the next word, we may need to go back several time steps to learn about the earlier words. There can be a large gap between two relevant pieces of information, and as this gap widens, RNNs have a hard time learning and connecting the information. This is where the power of the LSTM comes in.

Long Short-Term Memory (LSTM)

LSTMs are able to learn long-term dependencies. They can learn to bridge intervals of more than 1,000 time steps, which is achieved through an efficient gradient-based algorithm.

To predict the next word in the message, the LSTM can store the context from the beginning of the message so that it has the right context. This is very much how our own memory works.

Let’s take a closer look at the LSTM architecture and see how it works.

An LSTM works by remembering information over long periods of time, so it needs to know what to remember and what to forget.

An LSTM uses four gates with which it can decide whether it needs to remember the previous state. The cell state plays a key role in the LSTM. The LSTM can use its four regulating gates to decide whether to add information to or remove information from the cell state.

These gates act like faucets, determining how much information should pass through.

  1. The first step in an LSTM is to decide whether to keep or forget the cell state. The forget gate uses a sigmoid activation, whose output lies between 0 and 1. An output close to 1 from the forget gate tells us to keep the value, and an output close to 0 tells us to forget it.

  2. The second step decides what new information we will store in the cell state. This has two parts: the input gate, which uses a sigmoid function to decide whether to write to the cell state, and a tanh layer, which creates the candidate values to be added.

  3. In the next step, we form the new cell state by combining the outputs of step 1 and step 2; the output gate’s output is then multiplied by the cell state of the current time step after a tanh activation has been applied to it. The tanh activation function gives outputs in the range -1 to +1.

  4. The cell state is the internal memory of the cell. It is obtained by multiplying the previous cell state by the forget gate and then adding the output of the input gate i multiplied by the newly computed candidate state g.

Finally, the output (the new hidden state) is based on the cell state.
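Putting the four steps together, here is a minimal NumPy sketch of one LSTM forward step. The weight shapes and names are illustrative assumptions following the standard formulation, not code from the original article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four gates i, f, o, g."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    i = sigmoid(z[0 * H:1 * H])   # input gate: whether to write new information
    f = sigmoid(z[1 * H:2 * H])   # forget gate: whether to keep the old cell state
    o = sigmoid(z[2 * H:3 * H])   # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * H:4 * H])   # candidate values, in the range -1 to +1
    c_t = f * c_prev + i * g      # step 4: new cell state (internal memory)
    h_t = o * np.tanh(c_t)        # step 3: output / hidden state based on the cell state
    return h_t, c_t

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden/cell state
rng = np.random.default_rng(0)
H, D = 16, 8
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):        # a toy sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, b)
```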

Backpropagation from the current cell state to the previous cell state involves only an elementwise multiplication by the forget gate, with no matrix multiplication by W, and this use of the cell state alleviates the vanishing and exploding gradient problems.

The LSTM decides when and how to transform its memory at each time step by deciding what to forget, what to remember, and which information to update. This is how LSTMs help store long-term memories.

This is how an LSTM can predict the next word in our message.

GRU, a variant of the LSTM

The GRU uses two gates, a reset gate and an update gate, as opposed to the three gates in an LSTM. The GRU has no internal memory (no separate cell state).

The reset gate determines how to combine the new input with the memory of the previous time step.

The update gate determines how much of the old memory should be kept. The update gate combines the input gate and the forget gate as we know them from the LSTM.

The GRU is a simpler variant of the LSTM that addresses the vanishing gradient problem.
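For comparison, here is a minimal NumPy sketch of one GRU step with its reset gate r and update gate z. This is again an illustrative sketch of the standard formulation, not the article’s code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: no separate cell state, only the hidden state h."""
    xh = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ xh + b_z)                   # update gate: how much old memory to keep
    r = sigmoid(W_r @ xh + b_r)                   # reset gate: how to combine new input with old memory
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde       # blend old memory and the candidate

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden state
rng = np.random.default_rng(0)
H, D = 16, 8
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3))
b_z, b_r, b_h = np.zeros(H), np.zeros(H), np.zeros(H)

h = np.zeros(H)
for x in rng.normal(size=(5, D)):    # a toy sequence of 5 time steps
    h = gru_step(x, h, W_z, W_r, W_h, b_z, b_r, b_h)
```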

Original link: medium.com/datadriveni…
