
Automatic Differentiation (AD)

Automatic differentiation decomposes a composite function into an output (the root node), a series of inputs (leaf nodes), and basic functions (intermediate nodes), forming a computational graph on which the gradient between any two nodes can be computed (a small worked illustration follows the list below):

  • Addition rule: the gradient between any two nodes is the sum of partial derivatives of all paths between them;
  • Chain rule: The partial derivative of a path is the serial product of partial derivatives between adjacent nodes on the path.
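
As a small worked illustration of the two rules (my own example, not from the text that follows): decompose $v = x + x^2$ into the nodes $u = x^2$ and $v = x + u$. Two paths connect $x$ to the root $v$, the direct edge $x \to v$ and the path $x \to u \to v$, so

$$\frac{\mathrm{d} v}{\mathrm{d} x} = \underbrace{\frac{\partial v}{\partial x}}_{x \to v} + \underbrace{\frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial x}}_{x \to u \to v} = 1 + 1 \cdot 2x = 1 + 2x.$$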

A case in point

Take an example from Qiu Xipeng’s Neural Networks and Deep Learning: the function $f(x; w, b) = \frac{1}{\exp(-(wx + b)) + 1}$. When $(x, w, b) = (1, 0, 0)$, the computational graph is as follows:

Each edge is marked with the result of the forward computation (red number) and with the partial derivative of its end node (the upper node) with respect to its head node (the lower node).

Since the gradient between any two nodes is the sum of the partial derivatives over all paths between them (multiple paths do not appear in this example), and the partial derivative along each path is the product of the partial derivatives of the edges on that path, we have:

$$\frac{\partial f}{\partial (wx + b)} = f \cdot (1 - f) = 0.5 \times (1 - 0.5) = 0.25$$

$$\frac{\partial f}{\partial w} = \frac{\partial f}{\partial (wx + b)} \cdot \frac{\partial (wx + b)}{\partial w} = 0.25 \times x = 0.25$$

$$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial (wx + b)} \cdot \frac{\partial (wx + b)}{\partial b} = 0.25 \times 1 = 0.25$$
The latter two equations reuse the result of the first, which can also be seen on the computational graph: when computing the gradient of the root node with respect to the other nodes, the gradients along the upper edges of the paths are reused most often. Therefore, if the graph is complex (many paths) or many variables need partial derivatives, the amount of computation can be reduced by computing the gradients of the upper-layer edges first and caching them to assist the computation of the root node's partial derivatives with respect to the lower nodes. This is the back propagation algorithm.
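
As a minimal sketch of this caching idea in code (my own illustration of the sigmoid example above; the function and variable names are made up, not from the book):

```python
import math

# Reverse-mode AD on f(x; w, b) = 1 / (exp(-(w*x + b)) + 1): the forward pass
# records intermediate values, the backward pass reuses cached upstream gradients.
def forward_backward(x, w, b):
    a = w * x                          # multiplication node
    z = a + b                          # addition node
    f = 1.0 / (math.exp(-z) + 1.0)     # sigmoid node (the root)

    df_dz = f * (1.0 - f)              # upper-edge gradient, computed once
    df_dw = df_dz * x                  # reuses the cached df_dz
    df_db = df_dz * 1.0                # reuses the cached df_dz
    df_dx = df_dz * w
    return f, df_dw, df_db, df_dx

print(forward_backward(x=1.0, w=0.0, b=0.0))   # (0.5, 0.25, 0.25, 0.0)
```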

Back Propagation (BP) algorithm

In a neural network, as long as every component and the loss function are differentiable, the loss is a differentiable composite function of the inputs and parameters. Its gradient can therefore be computed by automatic differentiation, and the error can be reduced with the gradient descent algorithm.

Therefore, the BP algorithm is really just automatic differentiation in which the root node is the loss function measuring the error; the partial derivatives between nodes are accordingly called error terms.
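
As a sketch of what this looks like for the single sigmoid unit above, with an assumed squared-error loss L = (f − y)² attached as the root (my own example, not the book's code):

```python
import math

# One gradient-descent step: backpropagate the error terms from the loss L
# down to w and b, then move against the gradient.
def step(x, y, w, b, lr=0.5):
    z = w * x + b
    f = 1.0 / (math.exp(-z) + 1.0)
    dL_df = 2.0 * (f - y)              # error term of the prediction node
    dL_dz = dL_df * f * (1.0 - f)      # error term of the pre-activation node
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return w - lr * dL_dw, b - lr * dL_db

w, b = 0.0, 0.0
for _ in range(100):
    w, b = step(x=1.0, y=1.0, w=w, b=b)
print(w, b)   # both increase so that the prediction approaches y = 1
```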

Fully connected neural network FNN

digraph {
  label="fully connected network";
  rankdir=LR;
  node [shape=none];
  x; w [label="W_i"]; b [label="b_i"]; y1 [label="y'"]; y; e [label="Δ"];
  node [shape=circle];
  { rank=same; y -> L; }
  subgraph cluster_hidden_layers {
    label="hidden layer ×N";
    h1 [label="×"];
    h2 [label="+"];
    h3 [label="σ"];
    { rank=same; w -> h1; }
    { rank=same; b -> h2; }
  }
  x -> h1 -> h2 -> h3 -> y1 -> L -> e;
}

In the figure, x, b_i, y' and y are vectors, W_i is a matrix, × is matrix multiplication, σ is the activation function, y is the actual (target) result, and L is a differentiable loss function.

The full connection is reflected in the matrix multiplication with W_i.
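
A minimal numpy sketch of one such layer's forward and backward pass (my own illustration; the shapes, the sigmoid choice for σ and the helper names are assumptions, not from the figure):

```python
import numpy as np

# One fully connected layer x -> W·x -> +b -> σ, with its backward pass.
def fc_forward(x, W, b):
    z = W @ x + b                       # matrix multiplication: the "full connection"
    h = 1.0 / (1.0 + np.exp(-z))        # σ = sigmoid
    return h, (x, h)

def fc_backward(delta_h, cache, W):
    # delta_h is the error term of the layer output (gradient of the loss).
    x, h = cache
    delta_z = delta_h * h * (1.0 - h)   # through σ, element-wise
    dW = np.outer(delta_z, x)           # gradient for the weight matrix
    db = delta_z                        # gradient for the bias
    delta_x = W.T @ delta_z             # error term handed to the layer below
    return dW, db, delta_x

x = np.array([1.0, 2.0])
W, b = np.zeros((3, 2)), np.zeros(3)
h, cache = fc_forward(x, W, b)
dW, db, dx = fc_backward(np.ones(3), cache, W)
```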

Convolutional neural network CNN

Convolutional neural networks add convolution layers and pooling layers to the fully connected network.

Convolution Layer

The convolution operation ⊗ is a composition of multiplications and additions. Examining the partial derivatives of each group of components one by one, for Y = W ⊗ X and a scalar function f one obtains:

$$\frac{\partial f(Y)}{\partial W} = \frac{\partial f(Y)}{\partial Y} \otimes X, \qquad \frac{\partial f(Y)}{\partial X} = W \,\tilde{\otimes}\, \frac{\partial f(Y)}{\partial Y}$$

where ⊗ is convolution, ⊗̃ is the flipped wide (full) convolution, and f is a scalar function.

One thing to note: the derivative ∂f(Y)/∂Y above is not matrix calculus but a matrix of element-wise partial derivatives.
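
A small 1-D numerical check of these two formulas (my own sketch under the cross-correlation convention for ⊗; the array sizes, the linear test function f and the finite-difference helper are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=8)                      # input signal
W = rng.normal(size=3)                      # convolution kernel

def conv(w, x):
    # "valid" cross-correlation: y_i = sum_u w_u * x_{i+u}
    return np.correlate(x, w, mode="valid")

delta = rng.normal(size=conv(W, X).shape)   # plays the role of df/dY

def f():
    return np.sum(delta * conv(W, X))       # scalar function f(Y) = <delta, Y>

# Gradients according to the formulas above.
dW = np.correlate(X, delta, mode="valid")   # df/dW = df/dY (conv) X
dX = np.convolve(delta, W, mode="full")     # df/dX = W (flipped wide conv) df/dY

def num_grad(a, eps=1e-6):
    # Central finite differences, perturbing `a` in place around each entry.
    g = np.zeros_like(a)
    for i in range(a.size):
        a[i] += eps
        fp = f()
        a[i] -= 2 * eps
        fm = f()
        a[i] += eps
        g[i] = (fp - fm) / (2 * eps)
    return g

print(np.allclose(dW, num_grad(W)), np.allclose(dX, num_grad(X)))  # True True
```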

So convolution can be viewed as a kind of basic function that you can put into a graph and differentiate automatically.

digraph {
  label="convolution";
  rankdir=LR;
  conv [label="⊗", shape=circle];
  node [shape=none];
  x -> conv -> y;
  { rank=same; "convolution kernel W" -> conv; }
}

In addition, if the convolution operation is regarded as a kind of multiplication, the computational graph of a CNN is the same as that of an FNN.

In the more general case, x and W are tensors (W is the convolution kernel), and the matrix multiplication is carried out with convolution as the product of elements (the elements themselves being matrices).

Pooling Layer

  • Max Pooling: the gradient of the upper node with respect to the largest element in the pooling region is 1, and with respect to the other elements it is 0.
  • Mean Pooling: if the pooling region contains N elements, the gradient of the upper node with respect to each element is 1/N.

The derivative of the Max operation is not continuous, so the gradient here is only an approximation.
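
A minimal sketch of what these two backward rules look like over a single 1-D pooling region (my own illustration; the helper names are made up):

```python
import numpy as np

def max_pool_backward(region, upstream_grad):
    # Route the entire upstream gradient to the position of the maximum.
    grad = np.zeros_like(region)
    grad[np.argmax(region)] = upstream_grad
    return grad

def mean_pool_backward(region, upstream_grad):
    # Spread the upstream gradient evenly: 1/N per element.
    return np.full_like(region, upstream_grad / region.size)

region = np.array([0.3, 1.7, 0.2, 0.9])
print(max_pool_backward(region, 1.0))    # [0. 1. 0. 0.]
print(mean_pool_backward(region, 1.0))   # [0.25 0.25 0.25 0.25]
```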

Recurrent neural network RNN

Let the RNN model be $h_t = \sigma(U \cdot h_{t-1} + W \cdot x_t + b)$. The computational graph is as follows:

digraph {
  label="computational graph of a recurrent neural network";
  node [shape=none];
  x1; w1 [label="W"]; b1 [label="b"]; h1;
  x2; w2 [label="W"]; b2 [label="b"]; u2 [label="U"]; h2;
  u3 [label="U"]; etc [label="..."];
  node [shape=circle];
  node [label="×"]; op11; op20; op21; op30;
  node [label="+"]; op12; op22 [style=bold, xlabel="z2"];
  node [label="σ"]; op13; op23;
  edge [arrowhead=open];
  { rank=same; x1 -> op11; op11 -> op12 -> op13 [arrowhead=normal, style=bold, color=red]; op13 -> h1; }
  op11 -> w1 [dir=back, arrowhead=normal, style=bold, color=red];
  op12 -> b1 [dir=back];
  op13 -> op20 [arrowhead=normal, style=bold, color=red];
  { rank=same; op20 -> u2 [dir=back]; }
  op20 -> op22 [arrowhead=normal, style=bold, color=red];
  { rank=same; x2 -> op21; op21 -> op22 -> op23 -> h2 [arrowhead=normal, style=bold, color=red]; }
  op21 -> w2 [dir=back, arrowhead=normal, style=bold, color=red];
  op22 -> b2 [dir=back];
  op23 -> op30;
  { rank=same; op30 -> u3 [dir=back]; }
  op30 -> etc;
}

All of the W nodes in the figure are the same matrix, which can be regarded as a single node distributed to each position through an identity transformation; the same holds for b and U.

h_t is just an intermediate result; the output can be produced from it by further differentiable transformations, and a loss function is then attached to obtain the error, exactly as in the FNN, so that part is omitted here.
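
A minimal numpy sketch of the forward recursion, caching the hidden states a backward pass would need (my own illustration; sigmoid for σ and all names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, U, W, b, h0):
    # xs is the list of inputs x_1..x_T; returns all hidden states h_0..h_T.
    hs = [h0]
    for x_t in xs:
        hs.append(sigmoid(U @ hs[-1] + W @ x_t + b))
    return hs

T, d_in, d_h = 4, 3, 2
rng = np.random.default_rng(0)
xs = [rng.normal(size=d_in) for _ in range(T)]
U, W, b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h)
hs = rnn_forward(xs, U, W, b, h0=np.zeros(d_h))
```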

Gradient calculation

For example, to calculate the gradient of the loss with respect to W, the gradients along all paths connecting W and the loss are summed by propagating the error term backwards from the root through every earlier time step; this is the Back Propagation Through Time (BPTT) algorithm.
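
A sketch of BPTT continuing the rnn_forward sketch above, under the assumption (mine, for illustration) that the loss depends only on the final hidden state, with dL/dh_T given as delta_T:

```python
import numpy as np

def rnn_bptt(delta_T, hs, xs, U, W):
    # hs = [h_0, ..., h_T] and xs = [x_1, ..., x_T] from the forward sketch.
    dU, dW, db = np.zeros_like(U), np.zeros_like(W), np.zeros(U.shape[0])
    delta_h = delta_T
    for t in reversed(range(len(xs))):         # walk backwards through time
        h_t, h_prev, x_t = hs[t + 1], hs[t], xs[t]
        delta_z = delta_h * h_t * (1.0 - h_t)  # through the sigmoid at step t+1
        dU += np.outer(delta_z, h_prev)        # every step contributes to the shared U
        dW += np.outer(delta_z, x_t)           # ... and to the shared W
        db += delta_z
        delta_h = U.T @ delta_z                # error term handed to the previous step
    return dU, dW, db
```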

In fact, if the gradients that will be needed are cached during the forward pass, the error term only needs to be propagated back as far as the cache. For example, in the figure, the gradient of z_2 with respect to W can be computed at the same time as z_2 itself; when the gradient of a later loss with respect to W is then calculated, there is no need to backtrack through every previous layer as BPTT does. This is the Real-Time Recurrent Learning (RTRL) algorithm.
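
A sketch of this forward accumulation (my own illustration, not the book's code): at every step keep the sensitivity tensor P[i, k, l] = ∂h_t[i]/∂W[k, l] up to date, so the gradient with respect to W is available immediately from the current error term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rtrl_step(h_prev, P_prev, x_t, U, W, b):
    # P_prev[i, k, l] = dh_{t-1}[i] / dW[k, l], carried forward in time.
    d_h = W.shape[0]
    h_t = sigmoid(U @ h_prev + W @ x_t + b)
    sp = h_t * (1.0 - h_t)                           # sigma'(z_t)
    P_t = np.einsum("ij,jkl->ikl", U, P_prev)        # recurrent contribution
    P_t += np.einsum("ik,l->ikl", np.eye(d_h), x_t)  # direct contribution of W
    P_t *= sp[:, None, None]
    return h_t, P_t

d_h, d_in = 2, 3
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h)
h, P = np.zeros(d_h), np.zeros((d_h, d_h, d_in))
for _ in range(4):
    h, P = rtrl_step(h, P, rng.normal(size=d_in), U, W, b)
# dL/dW at the current step is then np.einsum("i,ikl->kl", dL_dh, P).
```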