Let’s first think about which factors the gradient depends on in a simple DNN.

Let’s start with a simple three-layer neural network

The output of the first layer is


$$h_i = \sigma(u_i) = \sigma\left(\sum_{k=1}^{K} w_{ki} x_k + b_i\right)$$

The output of the second layer is


$$h_j' = \sigma(u_j') = \sigma\left(\sum_{i=1}^{I} w_{ij}' h_i + b_j\right)$$

The loss function is


$$E = \mathrm{CE}(y_j; h_j') = y_j \log(h_j') + (1 - y_j)\log(1 - h_j')$$
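To make the setup concrete, here is a minimal numpy sketch of this forward pass and loss. The sizes (`K = 4` inputs, `I = 3` hidden units, a single output) and the array names `W1`, `b1`, `W2`, `b2` are assumptions for illustration only; the loss is coded with the same sign convention as the formula above.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Assumed toy sizes: K inputs, I hidden units, one output unit j.
K, I = 4, 3
rng = np.random.default_rng(0)
x  = rng.normal(size=K)          # input vector x_k
W1 = rng.normal(size=(K, I))     # w_ki
b1 = rng.normal(size=I)          # b_i
W2 = rng.normal(size=I)          # w'_ij (single output, so a vector)
b2 = rng.normal()                # b_j
y  = 1.0                         # label y_j

h  = sigmoid(x @ W1 + b1)        # first-layer output h_i
hp = sigmoid(h @ W2 + b2)        # second-layer output h'_j
E  = y * np.log(hp) + (1 - y) * np.log(1 - hp)   # loss, exactly as written above
```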

Derivation:

Start with the derivative with respect to the second layer's weights:


$$\frac{\partial E}{\partial w_{ij}'} = \frac{\partial E}{\partial h_j'}\,\frac{\partial h_j'}{\partial u_j'}\,\frac{\partial u_j'}{\partial w_{ij}'}$$

There are three parts; take them one at a time.

Cross entropy derivative


$$\frac{\partial E}{\partial h_j'} = \frac{y_j}{h_j'} - \frac{1 - y_j}{1 - h_j'} = \frac{y_j - h_j'}{h_j'(1 - h_j')}$$
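If you want to double-check that simplification, a quick symbolic sanity check (sympy is used here purely as an illustration) confirms it:

```python
import sympy as sp

y, h = sp.symbols('y h', positive=True)
E = y * sp.log(h) + (1 - y) * sp.log(1 - h)          # loss with the sign convention above
dE_dh = sp.diff(E, h)
# The difference simplifies to 0, confirming dE/dh = (y - h) / (h (1 - h)).
print(sp.simplify(dE_dh - (y - h) / (h * (1 - h))))  # -> 0
```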

Take the derivative of the activation function (assume a sigmoid):


$$\frac{\partial h_j'}{\partial u_j'} = h_j'(1 - h_j')$$
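A small numeric sketch confirms this identity against a finite difference, and also shows the fact used later on: the sigmoid's derivative peaks at only 0.25, at $u = 0$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-6, 6, 1001)
h = sigmoid(u)
dh_du = h * (1 - h)                      # h'(1 - h'), the identity above
print(dh_du.max())                       # ~0.25, reached at u = 0

# Compare against a central finite difference as a sanity check.
eps = 1e-5
numeric = (sigmoid(u + eps) - sigmoid(u - eps)) / (2 * eps)
print(np.max(np.abs(numeric - dh_du)))   # ~1e-11
```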

And then we take the derivative with respect to the weights


$$\frac{\partial u_j'}{\partial w_{ij}'} = h_i$$

Multiplying the three parts together gives the gradient for the second layer's weights:


$$\frac{\partial E}{\partial w_{ij}'} = (y_j - h_j')\,h_i$$
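As a sanity check on this closed form, here is a self-contained numpy sketch (the toy sizes and names are assumptions) that compares $(y_j - h_j')\,h_i$ against a finite-difference estimate of $\partial E/\partial w_{ij}'$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy setup; sizes and names are assumptions for illustration.
rng = np.random.default_rng(0)
K, I = 4, 3
x, y = rng.normal(size=K), 1.0
W1, b1 = rng.normal(size=(K, I)), rng.normal(size=I)
W2, b2 = rng.normal(size=I), rng.normal()

def forward(W2_):
    h = sigmoid(x @ W1 + b1)                       # first-layer output h_i
    hp = sigmoid(h @ W2_ + b2)                     # second-layer output h'_j
    E = y * np.log(hp) + (1 - y) * np.log(1 - hp)  # loss with the sign convention above
    return h, hp, E

h, hp, _ = forward(W2)
grad_W2 = (y - hp) * h                             # closed form: (y_j - h'_j) * h_i

# Central finite differences, one weight w'_ij at a time.
eps = 1e-6
fd = np.array([(forward(W2 + eps * np.eye(I)[i])[2] -
                forward(W2 - eps * np.eye(I)[i])[2]) / (2 * eps) for i in range(I)])
print(np.max(np.abs(fd - grad_W2)))                # tiny: the formula checks out
```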

You can see that the gradient for this layer's weights is related to the input value $h_i$, the output value $h_j'$, and the label $y_j$.

Take the derivative of the first layer

The process is similar, and the result is:


$$\frac{\partial E}{\partial w_{ki}} = (y_j - h_j')\,w_{ij}'\,h_i(1 - h_i)\,x_k$$
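The same kind of finite-difference check works for the first layer. The sketch below (again with assumed toy sizes and names) compares the closed form $(y_j - h_j')\,w_{ij}'\,h_i(1-h_i)\,x_k$ against numerical derivatives of the loss with respect to each $w_{ki}$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Same kind of toy setup as before (assumed sizes/names).
rng = np.random.default_rng(0)
K, I = 4, 3
x, y = rng.normal(size=K), 1.0
W1, b1 = rng.normal(size=(K, I)), rng.normal(size=I)
W2, b2 = rng.normal(size=I), rng.normal()

def loss(W1_):
    h = sigmoid(x @ W1_ + b1)
    hp = sigmoid(h @ W2 + b2)
    return y * np.log(hp) + (1 - y) * np.log(1 - hp)

h = sigmoid(x @ W1 + b1)
hp = sigmoid(h @ W2 + b2)
# Closed form: dE/dw_ki = (y_j - h'_j) * w'_ij * h_i (1 - h_i) * x_k
grad_W1 = np.outer(x, (y - hp) * W2 * h * (1 - h))

# Finite-difference check, entry by entry.
eps = 1e-6
fd = np.zeros_like(W1)
for k in range(K):
    for i in range(I):
        d = np.zeros_like(W1)
        d[k, i] = eps
        fd[k, i] = (loss(W1 + d) - loss(W1 - d)) / (2 * eps)
print(np.max(np.abs(fd - grad_W1)))          # tiny: the closed form matches
```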

You can see that the gradient of a given layer is related to:

  1. The output value of the later layer
  2. The label
  3. The weights of the later layer
  4. The derivative of the activation function, $h_i(1-h_i)$ (the derivative of the sigmoid)
  5. The input values

Here is why gradient explosion / vanishing occurs.

Explosion:

  1. Unreasonable weight initialization. During backpropagation, the weights of the later layers are multiplied into the gradient of the current layer, so if those weights are on average greater than 1, the repeated multiplication makes the gradient grow with depth and explode.

Vanishing:

  1. The derivative of the activation function. For the sigmoid, the derivative is largest near 0, but even that maximum is only 0.25, so as it is multiplied in layer after layer, the backpropagated gradient shrinks toward nothing (the sketch below illustrates both regimes).
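A back-of-the-envelope sketch (the depth and weight magnitudes below are made-up illustration values) shows both effects: during backprop the gradient is multiplied, layer after layer, by roughly (later-layer weight) × (activation derivative), and for the sigmoid that derivative is capped at 0.25, so the product either decays toward zero or, once the weights are large enough, blows up.

```python
# Rough model of backprop through a deep stack of sigmoid layers:
# each layer multiplies the gradient by about |w| * sigma'(u),
# and sigma'(u) = h(1 - h) is at most 0.25.
depth = 30                      # assumed depth, for illustration
sigma_prime_max = 0.25

for w in (1.0, 2.0, 8.0):       # assumed typical magnitudes of the later-layer weights
    per_layer = w * sigma_prime_max
    print(f"|w| = {w}: per-layer factor {per_layer:.2f}, "
          f"after {depth} layers ~ {per_layer ** depth:.2e}")
# |w| = 1 or 2 -> the product shrinks toward zero (vanishing gradient);
# |w| = 8     -> the product grows without bound (exploding gradient).
```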