Let’s first think about which factors the gradient depends on in a simple DNN.

Let’s start with a simple three-layer neural network

The output of the first layer is


$$h_i = \sigma(u_i) = \sigma\left(\sum_{k=1}^{K} w_{ki} x_k + b_i\right)$$

The output of the second layer is


$$h_j' = \sigma(u_j') = \sigma\left(\sum_{i=1}^{I} w_{ij}' h_i + b_j\right)$$

The loss function is


$$E = \mathrm{CE}(y_j; h_j') = y_j \log(h_j') + (1 - y_j)\log(1 - h_j')$$
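To make the setup concrete, here is a minimal numpy sketch of this forward pass and loss. The sizes (`K = 4` inputs, `I = 3` hidden units, a single output) and the array names `W1`, `b1`, `W2`, `b2` are assumptions for illustration only; the loss is coded with the same sign convention as the formula above.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Assumed toy sizes: K inputs, I hidden units, one output unit j.
K, I = 4, 3
rng = np.random.default_rng(0)
x  = rng.normal(size=K)          # input vector x_k
W1 = rng.normal(size=(K, I))     # w_ki
b1 = rng.normal(size=I)          # b_i
W2 = rng.normal(size=I)          # w'_ij (single output, so a vector)
b2 = rng.normal()                # b_j
y  = 1.0                         # label y_j

h  = sigmoid(x @ W1 + b1)        # first-layer output h_i
hp = sigmoid(h @ W2 + b2)        # second-layer output h'_j
E  = y * np.log(hp) + (1 - y) * np.log(1 - hp)   # loss, exactly as written above
```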

Derivation:

Start with the derivative with respect to the second layer's weights:


$$\frac{\partial E}{\partial w_{ij}'} = \frac{\partial E}{\partial h_j'}\,\frac{\partial h_j'}{\partial u_j'}\,\frac{\partial u_j'}{\partial w_{ij}'}$$

There are three parts; take them one at a time.

Cross entropy derivative


$$\frac{\partial E}{\partial h_j'} = \frac{y_j}{h_j'} - \frac{1 - y_j}{1 - h_j'} = \frac{y_j - h_j'}{h_j'(1 - h_j')}$$
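If you want to double-check that simplification, a quick symbolic sanity check (sympy is used here purely as an illustration) confirms it:

```python
import sympy as sp

y, h = sp.symbols('y h', positive=True)
E = y * sp.log(h) + (1 - y) * sp.log(1 - h)          # loss with the sign convention above
dE_dh = sp.diff(E, h)
# The difference simplifies to 0, confirming dE/dh = (y - h) / (h (1 - h)).
print(sp.simplify(dE_dh - (y - h) / (h * (1 - h))))  # -> 0
```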

Take the derivative of the activation function (assume a sigmoid):


$$\frac{\partial h_j'}{\partial u_j'} = h_j'(1 - h_j')$$
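A small numeric sketch confirms this identity against a finite difference, and also shows the fact used later on: the sigmoid's derivative peaks at only 0.25, at $u = 0$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-6, 6, 1001)
h = sigmoid(u)
dh_du = h * (1 - h)                      # h'(1 - h'), the identity above
print(dh_du.max())                       # ~0.25, reached at u = 0

# Compare against a central finite difference as a sanity check.
eps = 1e-5
numeric = (sigmoid(u + eps) - sigmoid(u - eps)) / (2 * eps)
print(np.max(np.abs(numeric - dh_du)))   # ~1e-11
```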

And then we take the derivative with respect to the weights


$$\frac{\partial u_j'}{\partial w_{ij}'} = h_i$$

Multiplying the three parts together gives the gradient for the second layer's weights:


$$\frac{\partial E}{\partial w_{ij}'} = (y_j - h_j')\,h_i$$
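As a sanity check on this closed form, here is a self-contained numpy sketch (the toy sizes and names are assumptions) that compares $(y_j - h_j')\,h_i$ against a finite-difference estimate of $\partial E/\partial w_{ij}'$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy setup; sizes and names are assumptions for illustration.
rng = np.random.default_rng(0)
K, I = 4, 3
x, y = rng.normal(size=K), 1.0
W1, b1 = rng.normal(size=(K, I)), rng.normal(size=I)
W2, b2 = rng.normal(size=I), rng.normal()

def forward(W2_):
    h = sigmoid(x @ W1 + b1)                       # first-layer output h_i
    hp = sigmoid(h @ W2_ + b2)                     # second-layer output h'_j
    E = y * np.log(hp) + (1 - y) * np.log(1 - hp)  # loss with the sign convention above
    return h, hp, E

h, hp, _ = forward(W2)
grad_W2 = (y - hp) * h                             # closed form: (y_j - h'_j) * h_i

# Central finite differences, one weight w'_ij at a time.
eps = 1e-6
fd = np.array([(forward(W2 + eps * np.eye(I)[i])[2] -
                forward(W2 - eps * np.eye(I)[i])[2]) / (2 * eps) for i in range(I)])
print(np.max(np.abs(fd - grad_W2)))                # tiny: the formula checks out
```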

You can see that the gradient for this layer's weights is related to the input value $h_i$, the output value $h_j'$, and the label $y_j$.

Take the derivative of the first layer

The process is similar, and the result is:


$$\frac{\partial E}{\partial w_{ki}} = (y_j - h_j')\,w_{ij}'\,h_i(1 - h_i)\,x_k$$
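The same kind of finite-difference check works for the first layer. The sketch below (again with assumed toy sizes and names) compares the closed form $(y_j - h_j')\,w_{ij}'\,h_i(1-h_i)\,x_k$ against numerical derivatives of the loss with respect to each $w_{ki}$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Same kind of toy setup as before (assumed sizes/names).
rng = np.random.default_rng(0)
K, I = 4, 3
x, y = rng.normal(size=K), 1.0
W1, b1 = rng.normal(size=(K, I)), rng.normal(size=I)
W2, b2 = rng.normal(size=I), rng.normal()

def loss(W1_):
    h = sigmoid(x @ W1_ + b1)
    hp = sigmoid(h @ W2 + b2)
    return y * np.log(hp) + (1 - y) * np.log(1 - hp)

h = sigmoid(x @ W1 + b1)
hp = sigmoid(h @ W2 + b2)
# Closed form: dE/dw_ki = (y_j - h'_j) * w'_ij * h_i (1 - h_i) * x_k
grad_W1 = np.outer(x, (y - hp) * W2 * h * (1 - h))

# Finite-difference check, entry by entry.
eps = 1e-6
fd = np.zeros_like(W1)
for k in range(K):
    for i in range(I):
        d = np.zeros_like(W1)
        d[k, i] = eps
        fd[k, i] = (loss(W1 + d) - loss(W1 - d)) / (2 * eps)
print(np.max(np.abs(fd - grad_W1)))          # tiny: the closed form matches
```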

You can see that the gradient of a given layer is related to:

  1. The output value of the later layer
  2. The label
  3. The weights of the later layer
  4. The derivative of the activation function, $h_i(1-h_i)$ (the derivative of the sigmoid)
  5. The input values

Here is why gradient explosion / vanishing occurs.

Explosion:

  1. Unreasonable weight initialization. During backpropagation, the weights of the later layers are multiplied into the gradient of the current layer, so if those weights are on average greater than 1, the repeated multiplication makes the gradient grow with depth and explode.

Vanishing:

  1. The derivative of the activation function. For the sigmoid, the derivative is largest near 0, but even that maximum is only 0.25, so as it is multiplied in layer after layer, the backpropagated gradient shrinks toward nothing (the sketch below illustrates both regimes).
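A back-of-the-envelope sketch (the depth and weight magnitudes below are made-up illustration values) shows both effects: during backprop the gradient is multiplied, layer after layer, by roughly (later-layer weight) × (activation derivative), and for the sigmoid that derivative is capped at 0.25, so the product either decays toward zero or, once the weights are large enough, blows up.

```python
# Rough model of backprop through a deep stack of sigmoid layers:
# each layer multiplies the gradient by about |w| * sigma'(u),
# and sigma'(u) = h(1 - h) is at most 0.25.
depth = 30                      # assumed depth, for illustration
sigma_prime_max = 0.25

for w in (1.0, 2.0, 8.0):       # assumed typical magnitudes of the later-layer weights
    per_layer = w * sigma_prime_max
    print(f"|w| = {w}: per-layer factor {per_layer:.2f}, "
          f"after {depth} layers ~ {per_layer ** depth:.2e}")
# |w| = 1 or 2 -> the product shrinks toward zero (vanishing gradient);
# |w| = 8     -> the product grows without bound (exploding gradient).
```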