1. Gradient explosion problem

Recently I have been studying multi-layer LSTMs for sequential business scenarios. With a Keras implementation, using ReLU as the activation function, training is relatively fast and the results are good. However, with a TensorFlow implementation, changing the activation function from the default Tanh to ReLU caused the following problem during training: partway through training, the cost suddenly becomes very large; it may drop back after a few fluctuations, but most of the time it keeps rising sharply until "nan" appears.

    Cost: 0.00532 ... Cost: 1097.2125 ... Cost: nan ... Cost: nan

The activation function is set as follows:

        # Replace the default tanh activation function with ReLU
        cell_list = tf.contrib.rnn.BasicLSTMCell(self.cell_size,
                                                 forget_bias=1.0,
                                                 state_is_tuple=True,
                                                 activation=tf.nn.relu)

The model's weight initialization:

    def _weight_variable(self, shape, name='weights'):
        initializer = tf.random_normal_initializer(mean=0., stddev=1.0)
        return tf.get_variable(shape=shape, initializer=initializer, name=name)

    def _bias_variable(self, shape, name='biases'):
        initializer = tf.constant_initializer(0.1)
        return tf.get_variable(name=name, shape=shape, initializer=initializer)

This is in fact a typical gradient explosion problem in deep learning training.

2. Solutions

2.1. Swap back the Tanh activation function?

The activation function at issue in this practical case is ReLU:


$$relu(x) = \max(x, 0) = \begin{cases} x, & x \geqslant 0 \\ 0, & x < 0 \end{cases}$$

The graph of the function is shown in the figure below. What are the advantages of the ReLU function?

  • There is no saturation region, so there is no vanishing gradient problem.
  • There is no costly exponential operation; the computation is simple and efficient.
  • In practice it converges much faster than Sigmoid/Tanh.
  • It is closer to the biological neural activation mechanism than Sigmoid.

ReLU disadvantages:

  • During training, ReLU units are fragile and can "die". When a large gradient flows through a ReLU neuron, the update may push its weights into a state where the neuron is never activated again by any data point. From then on, every gradient that flows through this neuron is zero, so the unit dies irreversibly during training and the diversity of the learned representation is lost.

  • If the learning rate is set too high, you may find that as much as 40% of the neurons in the network die (they are never activated over the entire training set). Setting the learning rate reasonably reduces the probability of this happening; a minimal numerical sketch of the dying-ReLU effect follows this list.
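As a minimal numerical sketch of the "dying ReLU" effect (my own illustration, not part of the model code in this article), consider a unit whose pre-activation has been pushed strongly negative, for example by one oversized update: its ReLU output and gradient are zero for every input, so gradient descent can never revive it.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # derivative of ReLU: 1 where z > 0, 0 elsewhere
    return (z > 0).astype(float)

# Hypothetical unit: weight w and a strongly negative bias b
w, b = 0.5, -100.0
x = np.random.uniform(0.0, 1.0, size=1000)   # inputs in [0, 1], as in this article
z = w * x + b                                # pre-activation is negative for every input

print("active inputs:", int(np.sum(z > 0)))                          # 0 -> the unit never fires
print("max |gradient| signal w.r.t. w:", np.max(np.abs(relu_grad(z) * x)))   # 0.0 -> no update signal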

For the hidden layers of an ordinary neural network, ReLU is usually the best choice of activation function. In an LSTM the default activation function is Tanh, and if the gradient explosion problem cannot be solved otherwise, the safest option is to switch back to that default.


$$tanh(x) = \frac{2}{1+e^{-2x}} - 1$$

The graph is shown in the figure below. Comparing Sigmoid and Tanh:

  • When the input of Sigmoid lies roughly within [-1, 1], the function value is sensitive to changes; once the input approaches or leaves this interval, the function saturates and loses sensitivity, which hurts the accuracy of the network's predictions.
  • Tanh has a wider sensitive interval and approaches its asymptotes more gradually, which matches the saturation behavior of biological neurons and delays saturation compared with Sigmoid.
  • Near the origin, Tanh is close to the identity function y = x, so when activations are small the computation is almost linear and training is relatively easy.
  • Both Tanh and Sigmoid are "fully activated" (every unit fires), which makes the network heavy.

ReLU converges faster and is often more accurate than Tanh.
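Since the original figures are not reproduced here, the following matplotlib sketch (my own) plots the two activation functions side by side, which makes the saturation regions of Tanh and the unbounded positive half of ReLU easy to compare.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
relu = np.maximum(x, 0)                      # relu(x) = max(x, 0)
tanh = 2.0 / (1.0 + np.exp(-2 * x)) - 1.0    # same form as the tanh formula above

plt.plot(x, relu, label='ReLU')
plt.plot(x, tanh, label='Tanh')
plt.grid()
plt.legend()
plt.show()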

2.2. Optimize the initialization weight

We know that the basic building block of a neural network can be described by the linear function $f(x) = wx + b$. In the weighted sum $\sum_{i}^{m} w_i x_i + b$, if the weights are too large, the loss can become very large and a gradient explosion may follow.

Based on the characteristics of the input and output (here the normalized input lies between 0 and 1), the parameters are set as follows:

    def _weight_variable(self, shape, name='weights'):
        initializer = tf.random_normal_initializer(mean=0., stddev=0.1)
        return tf.get_variable(shape=shape, initializer=initializer, name=name)

    def _bias_variable(self, shape, name='biases'):
        initializer = tf.constant_initializer(0.01)
        return tf.get_variable(name=name, shape=shape, initializer=initializer)

Using a smaller standard deviation and bias reduces gradient explosions, but does not prevent them entirely.

The initializer's signature is tf.random_normal_initializer(mean=0.0, stddev=0.1, seed=3), where:

  • mean: the mean of the normal distribution of the random values to generate; a Python scalar or a scalar tensor, default 0.
  • stddev: the standard deviation of the normal distribution of the random values to generate; a Python scalar or a scalar tensor, default 1.
  • seed: a Python integer used to create a random seed; the same seed value generates the same data. See the behavior of tf.set_random_seed.
  • dtype: the data type; only floating-point types are supported.

A smaller standard deviation means that the sampled values lie closer to the mean, which generally makes the weighted sums more stable. The following code simulates the effect of the standard deviation on a neuron's weighted sum.

import matplotlib.pyplot as plt
import numpy as np

def testInitWeight():
    # 30*25 inputs uniformly distributed in [0, 1], like the normalized data above
    x = np.random.uniform(0., 1., 30*25)
    t = 10000
    z_lst = np.empty(t)
    mu = [0., 0., 0.]
    sigma = [1.0, 0.1, 0.01]
    for j in range(3):
        for i in range(t):
            w = np.random.normal(mu[j], sigma[j], 30*25)
            b = 0
            # z is the weighted sum
            z = np.sum(x * w) + b
            z_lst[i] = z

        print('z mean: ', np.mean(z_lst))
        print('z variance: ', np.var(z_lst))
        plt.subplot(1, 3, j + 1)
        plt.grid()

        plt.hist(z_lst, bins=10)

    plt.show()

if __name__ == '__main__':
    testInitWeight()

In practice, using the smaller standard deviation (stddev=0.1, as above) makes training noticeably more stable and reduces the occurrence of gradient explosions.

2.3. Gradient clipping

However, the focus of this article is solving the gradient explosion in TensorFlow, and the core method is gradient clipping. The principle is very simple: clip any gradient component larger than 1 down to 1 (and any component smaller than -1 up to -1).

TensorFlow's clipping function is tf.clip_by_value(t, clip_value_min, clip_value_max): given a tensor t, it compresses the value of every element into the interval [clip_value_min, clip_value_max]. Values below the minimum are set to the minimum, and values above the maximum are set to the maximum.
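As a minimal standalone illustration of the clipping behavior (my own sketch, written in the same TF 1.x session style as the rest of the code in this article):

import tensorflow as tf

a = tf.constant([-2.5, -0.3, 0.8, 3.0])
clipped = tf.clip_by_value(a, -1.0, 1.0)   # elements below -1 become -1, above 1 become 1

with tf.Session() as sess:
    print(sess.run(clipped))   # [-1.  -0.3  0.8  1. ]

Applied to the LSTM model in this article, the clipping step sits between compute_gradients and apply_gradients: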

    def compute_cost(self):
        losses = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [tf.reshape(self.pred, [-1], name='reshape_pred')],
            [tf.reshape(self.ys, [-1], name='reshape_target')],
            [tf.ones([self.batch_size * self.n_steps * self.output_size], dtype=tf.float32)],
            average_across_timesteps=True,
            softmax_loss_function=self.ms_error,
            name='losses'
        )

        with tf.name_scope('average_cost'):
            self.cost = tf.div(
                tf.reduce_sum(losses, name='losses_sum'),
                self.batch_size,
                name='average_cost')
            tf.summary.scalar('cost', self.cost)

    def train_optimizer(self):
        # Use the Adam optimizer
        optimizer = tf.train.AdamOptimizer(LR)
        # Compute the gradients of the cost (the loss function) w.r.t. all variables
        gradients = optimizer.compute_gradients(self.cost)
        # Clip each gradient into the range [-1, 1]
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                            for grad, var in gradients if grad is not None]
        # Apply the clipped gradients to the LSTM's variables
        self.train_op = optimizer.apply_gradients(capped_gradients)

In Keras, the same idea is available through the optimizer arguments clipnorm and clipvalue, with typical settings such as clipnorm=1.0 or clipvalue=0.5.
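A hedged sketch of what this looks like on the Keras side (assuming the tf.keras optimizer API, which accepts these keyword arguments; the settings shown are examples, not values from the model above):

from tensorflow import keras

# Clip the L2 norm of each gradient to 1.0
adam_clipnorm = keras.optimizers.Adam(clipnorm=1.0)

# Or clip every gradient element into [-0.5, 0.5]
adam_clipvalue = keras.optimizers.Adam(clipvalue=0.5)

# model.compile(optimizer=adam_clipnorm, loss='mse')   # used like any other optimizer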

3. Principle description

3.1. What is gradient explosion

Gradient explosion refers to large error gradients accumulating during neural network training, causing very large updates to the model weights; the model becomes unstable and can no longer learn from the training data.

The error gradient is the direction and magnitude computed during training and used to update the network weights in the right direction by the right amount. In a deep network or RNN, error gradients can accumulate during updates and eventually become very large. This leads to large weight updates and thus to an unstable network. In extreme cases the weight values can grow large enough to overflow and produce NaN values.

Gradient explosion arises from the exponential growth caused by repeatedly multiplying gradients (with values greater than 1.0) across network layers. In deep multi-layer perceptron networks it makes the network unstable; at best the network cannot learn from the training data, at worst the weights become NaN and can no longer be updated.

How to determine if a gradient explosion has occurred?

  • The model cannot obtain improvements (such as a lower loss) from the training data.
  • The model is unstable, and the loss changes dramatically between updates.
  • During training, the model loss becomes NaN.

In essence, the chain rule of backpropagation multiplies the layer-wise partial derivatives together, which amounts to raising the weight matrix to a high power.

Why does an RNN produce NaN values? A gradient explosion prevents convergence. When gradients are too large, we can reduce the learning rate (directly shrinking the weight change), reduce the batch size (shrinking the accumulated gradient), and normalize the features (avoiding a suddenly large input).
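For the last point, here is a minimal min-max normalization sketch (plain NumPy, my own illustration, with made-up numbers) that scales each feature column into [0, 1] before the data is fed to the LSTM:

import numpy as np

def min_max_normalize(data, eps=1e-8):
    # Scale each feature column of a 2-D array into [0, 1]
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min + eps)

raw = np.array([[10.0, 200.0],
                [20.0, 800.0],
                [15.0, 500.0]])
print(min_max_normalize(raw))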

3.2. Chain rule

The chain rule is widely used, and the backpropagation algorithm in neural networks is based on it. The chain rule of calculus (not to be confused with the chain rule of probability) computes the derivative of a composite function; backpropagation is an algorithm that applies the chain rule in a specific, highly efficient order of operations.

Let $x$ be a real number, and let $f$ and $g$ be functions mapping real numbers to real numbers. Suppose $y = g(x)$ and $z = f(g(x)) = f(y)$. Then the chain rule states:


$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$

Or in another form:

$$\frac{dz}{dx} = f'(g(x))\,g'(x)$$

That is, for the composition of $f$ and $g$, the derivative equals the derivative of the outer function evaluated at the inner function, times the derivative of the inner function.
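As a quick numerical check of the rule (an illustration only; f and g are chosen arbitrarily here as the sine and squaring functions):

import numpy as np

g  = lambda x: x ** 2        # inner function g(x)
f  = lambda y: np.sin(y)     # outer function f(y)
dg = lambda x: 2 * x         # g'(x)
df = lambda y: np.cos(y)     # f'(y)

x = 1.3
# chain rule: dz/dx = f'(g(x)) * g'(x)
analytic = df(g(x)) * dg(x)

# numerical derivative of z = f(g(x)) for comparison
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(analytic, numeric)     # the two values agree to many decimal places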

Hinton's IRNN paper (arXiv:1504.00941, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units") is explicit about this issue.

In short, simply changing the activation function of an RNN to ReLU can produce extremely large output values.

First, in the forward pass the computed result becomes a product of many W factors:

Assume the activation function of a vanilla RNN is replaced by ReLU, and that the ReLU is always in its active region (i.e., its input is greater than 0). Then:

$$f(x) = x, \qquad net_t = Ux_t + W(Ux_{t-1} + Wh_{t-2})$$

Expanding further, $net_t$ eventually contains t factors of W multiplied together. If W is not the identity matrix, $net_t$ will tend toward either zero or infinity, causing serious numerical problems.

Similarly, assume the ReLU activation function is used and all neurons are active from the start. After the gradient has passed through n layers, we have:


$$\frac{\partial net_t}{\partial net_{t-n}} = W^n$$

As you can see, as long as W is not the identity matrix, the gradient will either vanish or explode.
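A small NumPy sketch (my own, not from the paper) makes this concrete: raising a weight matrix to higher and higher powers either explodes or vanishes depending on how far its scale is from 1.

import numpy as np

np.random.seed(0)

def norm_of_power(W, steps=30):
    # Frobenius norm of W^steps, built up by repeated multiplication
    P = np.eye(W.shape[0])
    for _ in range(steps):
        P = P @ W
    return np.linalg.norm(P)

W_big   = np.random.normal(0.0, 1.0, (30, 30))   # large-scale weights (stddev 1.0)
W_small = 0.01 * W_big                           # the same matrix scaled down (stddev 0.01)

print("||W^30||, stddev 1.0 :", norm_of_power(W_big))    # astronomically large
print("||W^30||, stddev 0.01:", norm_of_power(W_small))  # essentially zero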

To sum up, when ReLU is used as the activation function, a better effect can be achieved only when W is near the identity matrix.

3.3. Gradient clipping principle

The intuition behind gradient clipping: in a network with only one hidden node, the loss function together with the weight w and bias b forms an error surface that contains what looks like a steep wall, as shown below. Each iteration of gradient descent takes a small step, but when the descent reaches this wall and the gradient is computed at a point on the wall, the gradient suddenly becomes huge and points toward an undesirable location. If we rescale the gradient instead, the misstep stays within acceptable limits, as indicated by the dashed arrow.
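The rescaling suggested by the dashed arrow can be written in a few lines (a generic sketch of clipping by gradient norm, not the exact tf.clip_by_value code used above):

import numpy as np

def clip_by_norm(grad, threshold=1.0):
    # Rescale the gradient so its L2 norm never exceeds `threshold`,
    # keeping its direction but limiting the step size near the "wall"
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([80.0, -60.0])             # a sudden huge gradient at the cliff
print(clip_by_norm(g, threshold=1.0))   # same direction, norm reduced to 1.0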

3.4. Regularization of weights

Objective function = loss function + regularization term. The regularization term "penalizes" overly large weights so that the weights do not grow too big, which alleviates the gradient explosion problem.

What is regularization?

In one sentence: regularization adds a weight penalty term to the loss function so that the network tends to learn smaller weights, which suppresses overfitting and improves the model's generalization. Common regularization methods include L1 regularization, L2 regularization, and Dropout. The L2-regularized loss is:

$$L = L_0 + \frac{\lambda}{2}\sum_{i=1}^{n} w^2$$

where $L_0$ is the original loss function, $\frac{\lambda}{2}\sum_{i=1}^{n} w^2$ is the L2 regularization term, $\lambda$ is the weighting factor, and $w$ are the weights.

In TensorFlow, the graph's tensors, variables, and resources are managed through collections (see the sketch after this list):

  • tf.add_to_collection adds a resource to a specific collection
  • tf.get_collection retrieves the resources stored in a specific collection
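A hedged TF 1.x sketch of accumulating the L2 term from the formula above through these collections (the variable names and the 0.001 weighting factor are illustrative, not taken from the original model):

import tensorflow as tf

LAMBDA = 0.001   # illustrative weighting factor (the lambda in the formula above)

def weight_with_l2(shape, name):
    w = tf.get_variable(name, shape=shape,
                        initializer=tf.random_normal_initializer(mean=0., stddev=0.1))
    # tf.nn.l2_loss(w) computes sum(w^2)/2, so this adds (lambda/2) * sum(w^2)
    tf.add_to_collection('losses', LAMBDA * tf.nn.l2_loss(w))
    return w

w1 = weight_with_l2([30, 25], 'w1')
w2 = weight_with_l2([25, 1], 'w2')

data_loss = tf.constant(0.0)   # stands in for the original loss L0
# objective = loss function + regularization term
total_loss = data_loss + tf.add_n(tf.get_collection('losses'))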

4. Summary

In general, when an LSTM network has only a few layers, the default Tanh activation function works well and is usually better than ReLU.

In recent years the ReLU function has been widely used in convolutional neural networks, where it was found to solve the vanishing gradient problem of deep networks. In LSTMs, as the stacked network gets deeper, continuing to use Tanh carries a risk of vanishing gradients, which makes training linger around one point and fail to find the optimum. In that case the ReLU function can be used instead, with an adjusted learning rate; note that the learning rate needs to be smaller to avoid creating dead neurons.

To solve the gradient explosion that appears in the TensorFlow-based multi-layer LSTM model when the activation function is ReLU, gradient clipping is adopted as the core solution, and the network weights are initialized from a normal distribution with stddev=0.1 to further mitigate the explosion.

The training procedure also matters: choosing an appropriate batch size and learning rate is another starting point for addressing gradient explosion.

References:

"Deep Learning: Solving Gradient Explosion (with TensorFlow code)", CSDN blog, Super Cool Jay Wen, June 2018

"Activation Functions: Sigmoid, Tanh, ReLU", Jianshu, SpikeKing, January 2019

"A Detailed Analysis of a TensorFlow LSTM Implementation for Multi-dimensional Input and Output Prediction", Juejin, Xiao Yongwei, June 2021

"TensorFlow-based Regularization Implementation", AlexChung16, May 2020