1. Foreword

Back propagation is a procedure driven by the error signal, whose aim is to obtain the optimal global parameter matrices so that a multilayer neural network can then be applied to classification or regression tasks.

The input signal is transmitted forward until an error appears at the output, and the error information is then transmitted backward to update the weight matrices. These two sentences describe the flow of information well: the weights are optimized in this two-way flow, which reminds me of Beijing at night, with its endless streams of traffic going back and forth (* ॑꒳ ॑*)⋆*.

As for why the back-propagation algorithm is needed at all: can't we just apply gradient descent directly? You have surely had this question. The answer is no. Gradient descent, powerful as it is, is not a panacea. It can handle cases with an explicit derivative, or cases where the error can be computed directly, such as logistic regression, which we can think of as a network with no hidden layers. But in a neural network with hidden layers, only the output layer has a directly computable error with which to update its parameters; the hidden layers have no such error, so gradient descent cannot be applied to them directly. We must first propagate the error backward from the last layer to the hidden layers and then apply gradient descent, and transferring the error from layer to layer requires the help of the chain rule. So the back-propagation algorithm can be said to be gradient descent applied through the chain rule.

2. An example

To help you understand the concept of backpropagation and build some intuition for it, let's take a number-guessing game as an example.

2.1 Two people guess the number

This process is similar to a neural network without hidden layers, such as logistic regression. The little yellow hat represents a node in the output layer: its left side receives the input signal and its right side produces the output. The little blue cat represents the error, which guides the parameters to adjust in a better direction. Since the little blue cat can feed the error directly back to the little yellow hat, and only one parameter matrix is directly connected to the little yellow hat, that matrix can be optimized directly through the error (the solid vertical line). After several iterations, the error is reduced to a minimum.


2.2 Three people guess the number

This process is similar to a three-layer neural network with one hidden layer. The little girl represents a node of the hidden layer, and the little yellow hat still represents a node of the output layer: the little girl receives the input signal on the left, and the output is produced through the output-layer node. The little blue cat again represents the error and guides the parameters to adjust in a better direction. Since the little blue cat can feed the error directly back to the little yellow hat, the parameter matrix directly connected to the little yellow hat can be optimized directly through the error (the solid vertical line). The parameter matrix directly connected to the little girl, however, cannot be optimized directly, because it gets no direct feedback from the little blue cat (the dashed brown line). But because back propagation allows the little blue cat's feedback to be passed on to the little girl, producing an indirect error, the weight matrix directly connected to the little girl can still be updated through this indirect error. After several iterations, the error is reduced to a minimum.


3. The complete process

The example above explains back propagation from an intuitive point of view. Next, the two processes of forward propagation and back propagation will be described in detail, but before that let us first fix the notation.

3.1 Mathematical notation
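The original notation table was an image and is not reproduced here. So that the formulas added below are unambiguous, they assume the following symbols, which are my own choice but are kept consistent with the text: input nodes a and b, hidden nodes c and d, output nodes e and f, the sigmoid activation, and the squared error as the loss.

$$
\begin{aligned}
&\text{inputs: } x_a,\ x_b; \qquad \text{weighted input of a node: } z; \qquad \text{activation: } \mathrm{out}=\sigma(z),\ \ \sigma(z)=\frac{1}{1+e^{-z}};\\
&W^{(1)}=(w_{ij})\ \text{(input}\to\text{hidden)}; \qquad W^{(2)}=(w'_{ij})\ \text{(hidden}\to\text{output)};\\
&\text{targets: } t_e,\ t_f; \qquad E=\tfrac{1}{2}(t_e-\mathrm{out}_e)^2+\tfrac{1}{2}(t_f-\mathrm{out}_f)^2.
\end{aligned}
$$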




3.2 Forward propagation

How is the signal of the input layer transmitted to the hidden layer? Take hidden-layer node c as an example. Standing at node c and looking backward (toward the input layer), you can see two arrows pointing at node c, so the information of nodes a and b is passed to c, and each arrow carries a certain weight. The input signal of node c is therefore:
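The original formula image is not shown here; under the notation assumed in 3.1, a plausible reconstruction is:

$$ z_c = w_{11}\,x_a + w_{12}\,x_b $$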





Similarly, the input signal of node d is:
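Reconstructed under the same assumptions:

$$ z_d = w_{21}\,x_a + w_{22}\,x_b $$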





Since computers excel at this kind of repetitive, loop-like computation, we can express it as a matrix multiplication:
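In matrix form, consistent with the two node equations above (again a reconstruction, not the original image):

$$
\begin{bmatrix} z_c \\ z_d \end{bmatrix}
=
\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}
\begin{bmatrix} x_a \\ x_b \end{bmatrix}
= W^{(1)}\,\mathbf{x}
$$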





Therefore, the output of the hidden-layer nodes after the nonlinear transformation is:
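Assuming the sigmoid activation fixed in 3.1:

$$ \mathrm{out}_c = \sigma(z_c), \qquad \mathrm{out}_d = \sigma(z_d) $$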





Similarly, the input signal of the output layer is expressed as the weight matrix multiplied by the output of the previous layer:
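Under the same assumptions:

$$
\begin{bmatrix} z_e \\ z_f \end{bmatrix}
=
\begin{bmatrix} w'_{11} & w'_{12} \\ w'_{21} & w'_{22} \end{bmatrix}
\begin{bmatrix} \mathrm{out}_c \\ \mathrm{out}_d \end{bmatrix}
= W^{(2)}\,\mathbf{out}_{\text{hidden}}
$$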





Similarly, the final output of the output-layer nodes after the nonlinear mapping is:
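That is, with the same sigmoid:

$$ \mathrm{out}_e = \sigma(z_e), \qquad \mathrm{out}_f = \sigma(z_f) $$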





With the help of the weight matrices, the input signal produces the output of each layer and finally reaches the output layer. It is clear that the weight matrices act as porters during the forward transmission of the signal, linking each layer to the next.
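The forward pass just described can be sketched in a few lines of NumPy. This is only an illustrative sketch: the 2-2-2 layer sizes match the network discussed in the text, but the numerical values are made up, since the article's own figure values are not shown here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example values (the article's figure values are not available).
x  = np.array([[0.35], [0.90]])      # input column vector [x_a, x_b]
W1 = np.array([[0.10, 0.80],         # input -> hidden weights, one row per hidden node (c, d)
               [0.40, 0.60]])
W2 = np.array([[0.30, 0.90],         # hidden -> output weights, one row per output node (e, f)
               [0.50, 0.20]])

z_hidden   = W1 @ x                  # weighted inputs of hidden nodes c, d
out_hidden = sigmoid(z_hidden)       # hidden-layer activations
z_output   = W2 @ out_hidden         # weighted inputs of output nodes e, f
out_output = sigmoid(z_output)       # final network output
print(out_output)
```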

3.3 Back Propagation

Since gradient descent requires a clear error for each layer to update the parameters, the next focus is on how to propagate the error from the output layer back to the hidden layer.





The errors of the output-layer and hidden-layer nodes are shown in the figure. The errors of the output layer are known, so next we analyze the error of the first hidden-layer node c. Standing at node c again, but this time looking forward (toward the output layer), we can see that the two thick blue arrows pointing at node c start from nodes e and f, so the error of node c must be related to output-layer nodes e and f.

It is not hard to see that output-layer node e has arrows pointing back to both hidden nodes c and d, so node c cannot greedily claim the whole error of node e for itself; the error must be shared according to the principle of "distribution according to contribution" (i.e., in proportion to the weights). The error of node f obeys the same principle. The error of hidden-layer node c is therefore:
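The original formula image is missing; applying the weight-proportion idea just described (my reconstruction, using the notation of 3.1):

$$
e_c = \frac{w'_{11}}{w'_{11}+w'_{12}}\,e_e \;+\; \frac{w'_{21}}{w'_{21}+w'_{22}}\,e_f
$$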





Similarly, the error of hidden-layer node d is:
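By the same rule:

$$
e_d = \frac{w'_{12}}{w'_{11}+w'_{12}}\,e_e \;+\; \frac{w'_{22}}{w'_{21}+w'_{22}}\,e_f
$$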





To reduce the workload, we would of course like to write this as a matrix multiplication:
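Collecting the two equations above into matrix form:

$$
\begin{bmatrix} e_c \\ e_d \end{bmatrix}
=
\begin{bmatrix}
\dfrac{w'_{11}}{w'_{11}+w'_{12}} & \dfrac{w'_{21}}{w'_{21}+w'_{22}} \\[2ex]
\dfrac{w'_{12}}{w'_{11}+w'_{12}} & \dfrac{w'_{22}}{w'_{21}+w'_{22}}
\end{bmatrix}
\begin{bmatrix} e_e \\ e_f \end{bmatrix}
$$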





You will find this matrix a bit cumbersome; it would be nice to simplify it into a form like the one used in forward propagation. We can in fact do so: as long as the proportions are preserved, the denominators can be dropped, and the matrix can be rewritten as:
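Dropping the denominators while keeping the proportions gives the simplified form:

$$
\begin{bmatrix} e_c \\ e_d \end{bmatrix}
\approx
\begin{bmatrix} w'_{11} & w'_{21} \\ w'_{12} & w'_{22} \end{bmatrix}
\begin{bmatrix} e_e \\ e_f \end{bmatrix}
$$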





If you look closely, you will see that this matrix is simply the transpose of the weight matrix W used in forward propagation, so we can abbreviate it as:
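That is, in shorthand:

$$ \mathbf{e}_{\text{hidden}} = \bigl(W^{(2)}\bigr)^{T}\,\mathbf{e}_{\text{output}} $$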





It is not hard to see that the error of the output layer is carried back to the hidden layer by the transposed weight matrix, and we can then use this indirect error to update the weight matrix connected to the hidden layer. So the weight matrix also acts as a porter during back propagation, except that this time its cargo is the output error rather than the input signal (we do not produce the error, we merely deliver it).

4. The chain rule

Part 3 introduced the forward propagation of the input information and the backward propagation of the output error; next, the parameters are updated according to the errors obtained.





First, let us update the hidden-layer parameter w11. Before updating, we deduce from back to front until we meet w11:
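The original derivation image is missing. Assuming that the "hidden-layer w11" here is the hidden-to-output weight connecting node c to node e ($w'_{11}$ in the notation of 3.1), the dependency chain is:

$$ E \;\rightarrow\; \mathrm{out}_e \;\rightarrow\; z_e \;\rightarrow\; w'_{11} $$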





Therefore, the partial derivative of the error with respect to W11 is as follows:
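Under the squared-error and sigmoid assumptions above, the chain rule gives:

$$
\frac{\partial E}{\partial w'_{11}}
= \frac{\partial E}{\partial \mathrm{out}_e}\cdot
  \frac{\partial \mathrm{out}_e}{\partial z_e}\cdot
  \frac{\partial z_e}{\partial w'_{11}}
$$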





The derivation yields the following formula (all the quantities in it are known):
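Evaluating each factor (every quantity on the right-hand side is known from the forward pass):

$$
\frac{\partial E}{\partial w'_{11}}
= -(t_e - \mathrm{out}_e)\cdot \mathrm{out}_e\,(1-\mathrm{out}_e)\cdot \mathrm{out}_c
$$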





Similarly, the partial derivative of the error with respect to W12 is as follows:





Similarly, the derivation gives the evaluated formula for w12:
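By the same reasoning, with $\mathrm{out}_d$ taking the place of $\mathrm{out}_c$:

$$
\frac{\partial E}{\partial w'_{12}}
= -(t_e - \mathrm{out}_e)\cdot \mathrm{out}_e\,(1-\mathrm{out}_e)\cdot \mathrm{out}_d
$$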





Similarly, the partial derivative of error with respect to bias is as follows:





Substitute into the above formula:
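Assuming $b_e$ denotes the bias of output node e, the last factor of the chain becomes 1, so:

$$
\frac{\partial E}{\partial b_e}
= -(t_e - \mathrm{out}_e)\cdot \mathrm{out}_e\,(1-\mathrm{out}_e)\cdot 1
$$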





Next we update the input-layer parameter w11, again working backward from the error until we meet the w11 of the first layer (only this time the chain to trace is a bit longer):
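The longer dependency chain for the input-layer w11 (assumed here to connect input a to hidden node c) passes through hidden node c and, from there, through both output nodes:

$$
E \;\rightarrow\; (\mathrm{out}_e,\ \mathrm{out}_f) \;\rightarrow\; (z_e,\ z_f) \;\rightarrow\; \mathrm{out}_c \;\rightarrow\; z_c \;\rightarrow\; w_{11}
$$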





Therefore, the partial derivative of the error with respect to the input-layer w11 is as follows:
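Writing out the chain rule with the contributions of both output nodes summed (same assumptions as above):

$$
\frac{\partial E}{\partial w_{11}}
= \Bigl[ -(t_e-\mathrm{out}_e)\,\mathrm{out}_e(1-\mathrm{out}_e)\,w'_{11}
        \;-\;(t_f-\mathrm{out}_f)\,\mathrm{out}_f(1-\mathrm{out}_f)\,w'_{21} \Bigr]
  \cdot \mathrm{out}_c(1-\mathrm{out}_c)\cdot x_a
$$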










The partial derivatives of the other three input-layer parameters can be obtained in the same way, so they are not spelled out here.

Once the partial derivative of each parameter is known, it can simply be substituted into the gradient descent formula (not dwelt on here):
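For reference, the standard gradient-descent update meant here is, with learning rate $\eta$:

$$ w \leftarrow w - \eta\,\frac{\partial E}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial E}{\partial b} $$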





At this point, the parameters of every layer have been updated with the help of the chain rule.

5. Introducing delta

Using the chain rule to update the weights, you will find that it is actually a very simple method, but the calculation is rather long-winded. Since the updating process above can be viewed as going from the front of the network to the back, the error terms of the nodes have to be recomputed at each update, which leads to unnecessary repeated calculation. In fact, quantities that have already been computed for the later layers can be reused directly, so we can look at the problem the other way around and update from back to front: the later weights are updated first, and the intermediate values produced while updating them are then reused to update the earlier parameters. This intermediate variable is the delta introduced below, which both simplifies the formulas and reduces the amount of computation; it is a bit like dynamic programming.

Let the facts speak. Look carefully at Part 4: the partial derivatives of the error with respect to the hidden-layer w11, the hidden-layer w12, and the bias share a common part, and the derivation of the input-layer parameters' partial derivatives reuses part of the formulas derived for the output-layer parameters. This shared part is exactly why the intermediate variable delta is introduced (the formula in the red box is the definition of delta).





Take a look at the classic book Neural Networks and Deep Learning: delta is described as the error of the j-th neuron in the l-th layer and is defined as the partial derivative of the error with respect to that neuron's weighted input. The mathematical formula is as follows:
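In the book's notation, $z^l_j$ is the weighted input of the $j$-th neuron in layer $l$ and $C$ is the cost, so the definition reads:

$$ \delta^l_j \;\equiv\; \frac{\partial C}{\partial z^l_j} $$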





Therefore, the error of the output layer can be expressed as (the red-box formula in the figure above):
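This is equation BP1 in the book; written componentwise, and specialized to our toy network with squared error and sigmoid, it reads:

$$
\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j),
\qquad\text{e.g.}\quad
\delta_e = -(t_e-\mathrm{out}_e)\,\mathrm{out}_e(1-\mathrm{out}_e)
$$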





The error of the hidden layer can be expressed as (blue box formula in the figure above) :
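This is BP2: the next layer's delta is pulled back through the transposed weight matrix and multiplied elementwise by the local derivative of the activation:

$$ \delta^l = \Bigl(\bigl(w^{l+1}\bigr)^{T}\delta^{l+1}\Bigr)\odot \sigma'(z^l) $$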





Meanwhile, the weight update can be expressed as (green box formula in the figure above) :
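This corresponds to the book's weight-gradient equation (BP4 in its numbering): the gradient of a weight is the activation feeding into it times the delta of the neuron it feeds into:

$$ \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j $$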





Likewise, the bias update can be expressed as (the red-box formula above):
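And the bias-gradient equation (BP3 in the book's numbering):

$$ \frac{\partial C}{\partial b^l_j} = \delta^l_j $$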





The four formulas above are exactly the four back-propagation equations given in the book Neural Networks and Deep Learning (refer to the book for detailed derivations):





Careful observation will show you that BP1 and BP2 work best in combination: the error of any layer can be computed by first using BP1 to obtain the error of the output layer and then using BP2 to pass it back layer by layer. This is exactly why the algorithm is called error back propagation. Meanwhile, the gradients of the weights w and biases b are obtained from BP3 and BP4.
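To tie BP1-BP4 together, here is a minimal sketch of one training step on the 2-2-2 sigmoid network with squared error that this post has assumed throughout. The function and variable names are mine, not the article's or the book's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One gradient-descent step on a 2-2-2 sigmoid network with squared error."""
    # Forward pass.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)                      # hidden activations (out_c, out_d)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)                      # output activations (out_e, out_f)

    # BP1: output-layer delta = dC/da * sigma'(z).
    delta2 = (a2 - t) * a2 * (1 - a2)
    # BP2: pull the delta back through the transposed weights.
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)

    # BP4 / BP3: weight and bias gradients, then a gradient-descent update.
    W2 -= lr * (delta2 @ a1.T)
    b2 -= lr * delta2
    W1 -= lr * (delta1 @ x.T)
    b1 -= lr * delta1
    return W1, b1, W2, b2

# Example usage with made-up numbers:
# x  = np.array([[0.35], [0.90]]); t = np.array([[0.5], [0.5]])
# W1 = np.random.rand(2, 2); b1 = np.zeros((2, 1))
# W2 = np.random.rand(2, 2); b2 = np.zeros((2, 1))
# W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)
```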

So far we have covered the essentials of back propagation. When I first read materials on back propagation, they always felt disconnected from one another: one textbook says it this way, another blog says it in a different way. Here we started from the overall process to explain where back propagation comes from and why it is needed, then used the chain rule to compute the partial derivatives of the weights and biases, and finally connected the results to the conclusions of the classic book. I hope this more detailed account has some reference value for beginners and is helpful to you.

For more articles, see my Zhihu column Zhang Xiaolei or my GitHub Zhang Xiaolei.

You are welcome to follow my official account [Machine Learning Travel Notes]; more interesting articles will be published there later on. Thanks.



======== This is a nice dividing line ======

Typing the words, drawing the figures, and inserting the formulas is not easy, so if you like this post, don't forget to upvote it ೖ(⑅σ̑ᴗσ̑)⑅

Nielsen, M. A. Neural Networks and Deep Learning. 2015.

Rashid, T. Make Your Own Neural Network. CreateSpace Independent Publishing Platform, 2016.