When designing and training a TensorFlow neural network, whether a simple BP network, a convolutional network, or a recurrent network, we have to face problems such as gradient explosion, gradient vanishing, and numerical overflow, which stem from both finite computer precision and the underlying mathematics.

NaN losses and gradients often appear during model training, especially in non-image models. Let’s discuss the situations encountered in this practice and their solutions.

1. Phenomenon

During TensorFlow model training, it is common for the loss to become NaN after some period of training. If you are lucky the NaN is only transient and normal training resumes; most of the time, however, the training simply fails.

If you also log the gradients, you will often see the gradients become NaN before the loss does.

2. Causes

2.1. The basic principle

Generally speaking, the most basic unit of a neural network, the neuron, is expressed mathematically as $y_i = f(w_i x_i + b_i)$. How each node affects the output is usually tracked by differentiating in reverse. This is error-driven Back Propagation, whose goal is to obtain the optimal global parameter (weight) matrix.

When the neural network has more layers and more neuron nodes, its numerical stability tends to deteriorate. Consider a network with $L$ layers, whose layer-$l$ output is $H_l$ and whose layer-$l$ weight parameter is $W_l$; the bias parameters $b$ are omitted for convenience. For a given input $X$, the layer-$l$ output is $H_l = X W_1 W_2 \cdots W_l$. Written as this product, $H_l$ is prone to decay or explosion.

For example, if each layer scales its input by a factor of 0.2, the output of layer 30 is on the order of $0.2^{30} \approx 1 \times 10^{-21}$.

When we train the neural network, we treat the loss as a function of the weight parameters. The NaN phenomenon is therefore caused by $\sum w_i x_i + b_i$ going out of numerical range.

So what we need to do is prevent $\sum w_i x_i + b_i$ from overflowing.

Back propagation adjusts the parameters in the direction of the negative gradient of the objective. The parameters are updated as $w_i \leftarrow w_i + \Delta w$.

Given the learning rate $\alpha$, we get: $\Delta w = -\alpha \frac{\partial Loss}{\partial w}$

Because a deep network is a stack of many nonlinear layers, the whole network can be regarded as a composite nonlinear multivariate function (each layer’s activation function being one factor). The loss function therefore has partial derivatives with respect to the weights of every layer, and gradient descent computes them by applying the chain rule. The chain rule is a continued product, so as the layers get deeper the gradient propagates multiplicatively and can change exponentially.

If the derivative of the activation function near the output layer is greater than 1, the final gradient easily grows exponentially as the number of layers increases: gradient explosion. If, on the other hand, it is less than 1, the continued product of the chain rule easily decays toward 0: the gradient vanishes.
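
To make the continued-product effect concrete, here is a toy sketch (the per-layer factors 1.5 and 0.2 and the depth of 30 are illustrative assumptions, not values from a real network):

        # The chain rule multiplies one factor per layer, so the product
        # grows or shrinks exponentially with depth.
        layers = 30
        for factor in (1.5, 0.2):
            grad = 1.0
            for _ in range(layers):
                grad *= factor
            print("factor %.1f after %d layers -> %.3e" % (factor, layers, grad))
        # factor 1.5 -> ~1.9e+05 (explodes); factor 0.2 -> ~1.1e-21 (vanishes)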

2.2. Why NaN appears

The causes of NaN during neural network training come down to the following: either a node value goes out of numerical range in the forward pass, or a gradient value goes out of range in back propagation. In either direction, essentially only three operations lead to numerical overflow:

  1. A node’s weight parameter or gradient grows step by step until it overflows.
  2. A division by zero occurs, including 0 divided by 0, which is common in trend-regression forecasting; or the cross entropy takes the log of zero or of a negative number.
  3. The input data is abnormal: an extremely large or extremely small input produces NaN instantly.
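
To find out which of these cases is occurring, one option in TF 1.x graph mode is tf.add_check_numerics_ops(), which attaches a numeric check to every float tensor and raises on the first NaN/Inf. A minimal sketch; the names train_step, x, y_ and the feed batches are assumptions:

        import tensorflow as tf

        # ... build the model, loss and train_step here ...

        # Attach an assert to every floating-point op in the graph
        check_op = tf.add_check_numerics_ops()

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            # Running the check op with the train step reports the first bad tensor
            sess.run([train_step, check_op], feed_dict={x: batch_x, y_: batch_y})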

3. Solutions

3.1. Reduce the learning rate

Reducing the learning_rate is the most direct and easiest approach.

A high learning rate has a large direct influence on each update, so every step is large. In general, too high a learning rate prevents a smooth descent to the minimum, and one misstep can jump out of the controllable region, at which point the loss multiplies across several orders of magnitude.

During training we can try reducing the learning rate by 10 times, 100 times, or even more; this solves most of these problems. (Do not make the learning rate too small, though, or convergence becomes too slow and wastes time; balance speed against stability.)
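
As a sketch in the same TF 1.x style as the snippets later in this article (cross_entropy stands for the loss tensor and is an assumed name):

        import tensorflow as tf

        # Cut the learning rate by one or two orders of magnitude and retrain
        learning_rate = 1e-4   # e.g. 10x smaller than an original 1e-3; try 1e-5 if the loss still hits NaN

        lr = tf.Variable(learning_rate, dtype=tf.float32)
        train_step = tf.train.AdamOptimizer(lr).minimize(cross_entropy)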

3.2. Weight parameter initialization

The training process of a neural network is essentially the process of automatically adjusting the network’s parameters. At the beginning of training the parameters must start from some state, and setting that initial state is the initialization of the neural network.

A suitable initial value gives gradient descent a good “starting point” from which to find the optimum.

In general, the most common scheme is random normal initialization: weights and biases are drawn from a standard normal distribution with mean 0 and standard deviation 1. But is that the best initialization strategy?

        import tensorflow as tf

        # Truncated normal distribution with standard deviation 0.1
        def weight_variable(shape, name=None):
            initial = tf.truncated_normal(shape, stddev=0.1)
            return tf.Variable(initial, name=name)

        # 0.1 constant bias, to avoid dead nodes
        def bias_variable(shape, name=None):
            initial = tf.constant(0.1, shape=shape)
            return tf.Variable(initial, name=name)

If ReLU is used, it is best to use variance scaling initialization (He initialization).

In TensorFlow the variance scaling method is written tf.contrib.layers.variance_scaling_initializer(). According to our experiments, this initialization generalizes/scales better than regular Gaussian initialization, truncated Gaussian initialization, and Xavier initialization.

In layman’s terms, variance scaling initialization adjusts the variance of the initial random weights according to the number of inputs or outputs at each layer (TensorFlow defaults to the number of inputs), which helps the signal propagate deeper through the network without extra techniques such as gradient clipping or batch normalization. Xavier is similar to variance scaling initialization, except that in Xavier the variance of every layer is almost the same; if the layer sizes of the network differ greatly (common in convolutional neural networks), the network may not handle the same variance in every layer well.

        # Xavier alternative (requires: from tensorflow.contrib.layers import xavier_initializer)
        # def weight_variable(shape, name=None):
        #     return tf.get_variable(name, shape, tf.float32, xavier_initializer())

        def weight_variable(shape, name=None):
            return tf.get_variable(name, shape, tf.float32, tf.variance_scaling_initializer())

tf.variance_scaling_initializer()

Its parameters are (scale=1.0, mode="fan_in", distribution="normal", seed=None, dtype=dtypes.float32).
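
For example, He initialization for ReLU layers corresponds roughly to scale=2.0 with mode="fan_in". A hedged sketch of the default call and an explicit-parameter call (the shapes are placeholder examples):

        import tensorflow as tf

        # Default parameters: scale=1.0, mode="fan_in", normal distribution
        w1 = tf.get_variable("w1", [784, 256], tf.float32,
                             tf.variance_scaling_initializer())

        # He-style initialization for ReLU layers: variance roughly 2/fan_in
        w2 = tf.get_variable("w2", [256, 10], tf.float32,
                             tf.variance_scaling_initializer(scale=2.0, mode="fan_in"))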

With this initializer the training results improved:

        step: 11800, total_loss: 161.1809539794922, accuracy: 0.78125
        step: 11850, total_loss: 46.051700592041016, accuracy: 0.9375
        step: 11900, total_loss: 92.10340118408203, accuracy: 0.875
        step: 11950, total_loss: 23.025850296020508, accuracy: 0.96875

If the activation function is sigmoid or tanh, it is best to use Xavier.

The xavier_initializer() method: this initializer keeps the gradient scale of each layer roughly the same.


$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{j}+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_{j}+n_{j+1}}}\right]$

Common initialization methods:

  1. tf.constant_initializer() constant initialization
  2. tf.ones_initializer() all-ones initialization
  3. tf.zeros_initializer() all-zeros initialization
  4. tf.random_uniform_initializer() uniform distribution initialization
  5. tf.random_normal_initializer() normal distribution initialization
  6. tf.truncated_normal_initializer() truncated normal distribution initialization
  7. tf.uniform_unit_scaling_initializer() keeps the input variance constant
  8. tf.variance_scaling_initializer() adaptive (variance scaling) initialization
  9. tf.orthogonal_initializer() generates an orthogonal matrix

Sometimes it is enough to keep the initial weights small, so that $|w|$ is far less than 1; tf.truncated_normal_initializer(stddev=0.02) is generally recommended for this.
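
A short sketch of using a few of these initializers with tf.get_variable (the shapes are placeholders):

        import tensorflow as tf

        # Small truncated normal, as recommended above: |w| stays well below 1
        w = tf.get_variable("w", [128, 64], tf.float32,
                            tf.truncated_normal_initializer(stddev=0.02))

        # Zero-initialized bias (an alternative to the 0.1 constant used earlier)
        b = tf.get_variable("b", [64], tf.float32, tf.zeros_initializer())

        # Uniform initialization in a small symmetric range
        w_u = tf.get_variable("w_u", [64, 10], tf.float32,
                              tf.random_uniform_initializer(-0.05, 0.05))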

3.3. Clipping the prediction output

NaN values can appear in the cross-entropy loss calculation. Look at the cross entropy: $Loss = -\sum_{i=1}^{n} y_{i} \log(\hat{y}_{i})$

Although $\hat{y}$ is produced by the tf.nn.sigmoid function, for very large or very small inputs the output saturates to exactly the boundary values 1 or 0, and $\log(0)$ is not defined.

So, change the following code: cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv), name='cost_func') to: cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)), name='cost_func'), which clips the softmax output into the range 1e-10 ~ 1.0 before taking the log.

            # The second fully connected layer, the output layer, uses softmax as the activation function
            y_conv = tf.nn.softmax(tf.matmul(h_fc2_drop, W_fc3) + b_fc3, name='y_')

            # Clip the predictions before the log so that log(0) cannot produce NaN
            cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)), name='cost_func')

3.4. Gradient clipping

To solve the gradient explosion problem in TensorFlow, the principle is simple gradient clipping: derivatives whose magnitude exceeds a threshold (for example 1) are clipped back to that threshold.

TensorFlow’s value-clipping function is tf.clip_by_value(A, min, max): given a tensor A, it compresses every element of A into the range between min and max. Values less than min become min, and values greater than max become max.

            # Use the optimizer with per-value gradient clipping
            lr = tf.Variable(learning_rate, dtype=tf.float32)
            # train_step = tf.train.AdamOptimizer(lr).minimize(cross_entropy)
            optimizer = tf.train.AdamOptimizer(lr)
            gradients = optimizer.compute_gradients(cross_entropy)
            # Clip each gradient into the range [-0.5, 0.5]
            capped_gradients = [(tf.clip_by_value(grad, -0.5, 0.5), var) for grad, var in gradients if grad is not None]
            train_step = optimizer.apply_gradients(capped_gradients)

In addition, tf.clip_by_global_norm is often recommended online:

tf.clip_by_global_norm rescales a list of tensors so that the global norm of all their values does not exceed a threshold. It is applied to all gradients at once rather than to each gradient individually: if clipping is needed, all gradients are scaled by the same factor, otherwise none are. This is even better, because it preserves the balance between the different gradients.
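
A minimal sketch of global-norm clipping in the same TF 1.x style as the per-value snippet above (the clip_norm value of 5.0 is an arbitrary assumption; cross_entropy and learning_rate are assumed to be defined as earlier):

        import tensorflow as tf

        optimizer = tf.train.AdamOptimizer(learning_rate)
        grads_and_vars = optimizer.compute_gradients(cross_entropy)
        grads, variables = zip(*[(g, v) for g, v in grads_and_vars if g is not None])

        # Rescale all gradients together so that their global norm is at most 5.0
        capped_grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)
        train_step = optimizer.apply_gradients(zip(capped_grads, variables))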

3.5. Other

  1. Reduce the batch size

Reducing the batch_size is equivalent to reducing the input per step, which helps keep the weighted sums smaller. However, training may become slower.

  2. Input normalization

For input data, images are the easiest to handle: normalize the 0-255 pixel values to 0~1. For other data, analyze the distribution and apply an appropriate normalization scheme, for example:
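
A sketch of the two schemes just mentioned (the array names images and train_x are placeholders):

        import numpy as np

        # Image data: scale 0-255 pixel values into the 0~1 range
        images = images.astype(np.float32) / 255.0

        # Other numeric data: z-score standardization using training-set statistics
        mean = train_x.mean(axis=0)
        std = train_x.std(axis=0) + 1e-8   # small epsilon avoids division by zero
        train_x = (train_x - mean) / std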

  3. Adjust the activation function

Neural networks now commonly use ReLU as the activation function, but it is prone to NaN problems; the activation function can be changed according to the actual situation, for example:
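
Swapping ReLU for a leaky variant or tanh is a one-line change; in this sketch x, W1 and b1 are placeholders for an existing layer’s input and parameters:

        import tensorflow as tf

        h = tf.matmul(x, W1) + b1

        # a1 = tf.nn.relu(h)                  # original ReLU
        a1 = tf.nn.leaky_relu(h, alpha=0.2)   # leaky variant; or tf.nn.tanh(h)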

  4. Batch normalization

BN improves the network’s generalization ability. With BN, you can either remove the dropout and L2 regularization terms added for overfitting, or use smaller L2 regularization parameters.
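
A minimal TF 1.x sketch of inserting batch normalization before the activation; tf.layers.batch_normalization needs its update ops attached to the train step, and x and cross_entropy are assumed names:

        import tensorflow as tf

        is_training = tf.placeholder(tf.bool, name="is_training")

        h = tf.layers.dense(x, 256, activation=None)
        h = tf.layers.batch_normalization(h, training=is_training)  # BN before the activation
        h = tf.nn.relu(h)

        # BN's moving mean/variance are updated via ops in the UPDATE_OPS collection
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)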

4. Summary

TensorFlow neural network training is prone to NaN and similar problems. Practical analysis shows that real applications usually need a comprehensive solution that combines the methods described above.

Everything is hard at the beginning: start by optimizing the learning rate, the initial weights, and the activation function, and try to reduce the need for mid-training interventions such as clipping.

Even then, NaN and similar problems may still appear. What then? Interrupt the training and start a new round.

The author’s level is limited; feedback and discussion are welcome.
