Backpropagation is a common method for training neural networks. In the Neural Network Basics section, we mentioned that once the weight and bias of each node are fixed, the output of the network can be computed by feedforward, but that only gives the predicted value. We also need to work backwards from the prediction to figure out how the weights and biases should be adjusted so that the error between the predicted value and the expected value becomes smaller. That is what backpropagation does.

To understand how backpropagation works, a concrete worked example is the simplest and most direct approach. Suppose we have a 3-layer neural network as follows:

Suppose we have a training sample whose inputs are 0.1 and 0.5 and whose expected outputs are 0.9 and 0.1. The initial weights and biases are as follows:

We will then adjust these initial weights and biases by training on this sample.

Feedforward

First we use feedforward to determine the output of the neural network, i.e., the predicted value. Recall the input and output of a single node: its input, denoted z, is the weighted sum of the outputs of the nodes in the previous layer plus a bias; its output, denoted a, is z passed through the activation function. The general formulas are:

$$z = \sum_i w_i a_i + b$$

$$a = \sigma(z)$$
where the activation function $\sigma$ is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Don't worry too much about what each symbol in the formulas means; you only need to know how the calculation is done.
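As a minimal sketch of these two formulas in Python (the function names are illustrative, not from the original text):

```python
import math

def sigmoid(z):
    # Activation function: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias):
    # z is the weighted sum of the previous layer's outputs plus the bias;
    # the node's output is a = sigma(z).
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)
```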

Substituting the values, the hidden layer (taking node $h_1$ as an example; $h_2$ is calculated the same way) gives:

$$z_{h1} = w_1 x_1 + w_2 x_2 + b_1$$

$$a_{h1} = \sigma(z_{h1})$$
Then calculate the output layer in the same way (taking node $o_1$ as an example):

$$z_{o1} = w_5 a_{h1} + w_6 a_{h2} + b_3$$

$$a_{o1} = \sigma(z_{o1})$$
In this way, we complete one feedforward calculation and obtain the two predicted values (outputs) of the neural network, but they still differ from the expected values (targets) of the sample. We use the squared error function to calculate the error C; the general formula is:

$$C = \frac{1}{2}\sum_j \left(y_j - a_j\right)^2$$

where $y_j$ is the expected value and $a_j$ is the predicted value.
Substituting the values of this example:

$$C = \frac{1}{2}(0.9 - a_{o1})^2 + \frac{1}{2}(0.1 - a_{o2})^2 = 0.248076212$$
In other words, the predicted values of the neural network differ from the expected values by an error of 0.248076212. Our goal is to make this error smaller by adjusting the weights and biases.
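As a rough check, here is how the feedforward pass and the error calculation fit together in code, reusing `sigmoid` and `node_output` from the sketch above. The initial weight and bias values below are placeholders (the real ones are given in the figure), and the wiring follows the w1..w8, b1..b4 numbering assumed in the formulas above:

```python
# Placeholder initial values -- replace with the ones given in the figure.
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden weights (assumed)
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output weights (assumed)
b1, b2, b3, b4 = 0.35, 0.35, 0.60, 0.60   # per-node biases (assumed)

x1, x2 = 0.1, 0.5   # inputs of the training sample
y1, y2 = 0.9, 0.1   # expected outputs (targets)

# Feedforward: hidden layer first, then output layer
a_h1 = node_output([x1, x2], [w1, w2], b1)
a_h2 = node_output([x1, x2], [w3, w4], b2)
a_o1 = node_output([a_h1, a_h2], [w5, w6], b3)
a_o2 = node_output([a_h1, a_h2], [w7, w8], b4)

# Squared error: C = 1/2 * sum((target - output)^2)
C = 0.5 * ((y1 - a_o1) ** 2 + (y2 - a_o2) ** 2)
# With the actual initial weights from the figure, C = 0.248076212
```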

Backpropagation

Backpropagation is essentially a process of taking derivatives. If we want to know how to change a weight so that the total error becomes smaller, we are really asking for the partial derivative of the total error C with respect to that weight. Take $w_5$, the weight from hidden node $h_1$ to output node $o_1$ in the diagram above, as an example. By the chain rule:

$$\frac{\partial C}{\partial w_5} = \frac{\partial C}{\partial a_{o1}} \cdot \frac{\partial a_{o1}}{\partial z_{o1}} \cdot \frac{\partial z_{o1}}{\partial w_5}$$

What this formula says is how C changes when $w_5$ changes, broken into the three factors on the right-hand side, which we now work out one by one.

First, because

$$C = \frac{1}{2}(y_1 - a_{o1})^2 + \frac{1}{2}(y_2 - a_{o2})^2$$

the second term does not involve $a_{o1}$ and contributes nothing. So taking the partial derivative of C with respect to $a_{o1}$:

$$\frac{\partial C}{\partial a_{o1}} = -(y_1 - a_{o1}) = a_{o1} - y_1$$
Next, consider $\frac{\partial a_{o1}}{\partial z_{o1}}$. Since

$$a_{o1} = \sigma(z_{o1})$$

and the derivative of the sigmoid function is

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

substituting gives

$$\frac{\partial a_{o1}}{\partial z_{o1}} = a_{o1}\left(1 - a_{o1}\right)$$
Finally, let's look at $\frac{\partial z_{o1}}{\partial w_5}$. Because

$$z_{o1} = w_5 a_{h1} + w_6 a_{h2} + b_3$$

and only the first term involves $w_5$, we have

$$\frac{\partial z_{o1}}{\partial w_5} = a_{h1}$$
In conclusion, multiplying the three factors together:

$$\frac{\partial C}{\partial w_5} = (a_{o1} - y_1) \cdot a_{o1}(1 - a_{o1}) \cdot a_{h1}$$
To make the error smaller, we subtract this gradient from $w_5$. A learning rate $\eta$ is introduced here to control the speed of gradient descent; taking $\eta = 0.5$, the update is:

$$w_5^{\text{new}} = w_5 - \eta \cdot \frac{\partial C}{\partial w_5}$$
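In code, the gradient and the update for w5 might look like this, continuing the sketch above:

```python
eta = 0.5  # learning rate

dC_da_o1 = a_o1 - y1              # dC/da_o1
da_o1_dz = a_o1 * (1.0 - a_o1)    # da_o1/dz_o1, the sigmoid derivative
dz_dw5   = a_h1                   # dz_o1/dw5

dC_dw5 = dC_da_o1 * da_o1_dz * dz_dw5
w5_new = w5 - eta * dC_dw5        # gradient-descent update
```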


The bias is calculated in the same way; let's take $b_3$, the bias of output node $o_1$, as an example. By the chain rule:

$$\frac{\partial C}{\partial b_3} = \frac{\partial C}{\partial a_{o1}} \cdot \frac{\partial a_{o1}}{\partial z_{o1}} \cdot \frac{\partial z_{o1}}{\partial b_3}$$

Since

$$z_{o1} = w_5 a_{h1} + w_6 a_{h2} + b_3$$

we have

$$\frac{\partial z_{o1}}{\partial b_3} = 1$$

The first two factors are the same as before, so

$$\frac{\partial C}{\partial b_3} = (a_{o1} - y_1) \cdot a_{o1}(1 - a_{o1})$$

and the update is

$$b_3^{\text{new}} = b_3 - \eta \cdot \frac{\partial C}{\partial b_3}$$
In the same way, we can calculate the new $w_6$, $w_7$, $w_8$, and $b_4$, and then, applying the same formulas one layer back, the new $w_1$, $w_2$, $w_3$, $w_4$, $b_1$, and $b_2$. With that, we have completed one round of backpropagation.
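To round out the example, here is a sketch of the bias update and the remaining parameter updates, assuming the same 2-2-2 wiring as above (so $z_{o2} = w_7 a_{h1} + w_8 a_{h2} + b_4$); the delta variables simply group the first two chain-rule factors of each node:

```python
# Output-layer deltas: (a - y) * a * (1 - a) for each output node
delta_o1 = (a_o1 - y1) * a_o1 * (1 - a_o1)
delta_o2 = (a_o2 - y2) * a_o2 * (1 - a_o2)

# Bias of o1: dz_o1/db3 = 1, so the gradient is just delta_o1
b3_new = b3 - eta * delta_o1

# Remaining output-layer parameters
w6_new = w6 - eta * delta_o1 * a_h2
w7_new = w7 - eta * delta_o2 * a_h1
w8_new = w8 - eta * delta_o2 * a_h2
b4_new = b4 - eta * delta_o2

# Hidden-layer deltas: propagate the error back through w5..w8
delta_h1 = (delta_o1 * w5 + delta_o2 * w7) * a_h1 * (1 - a_h1)
delta_h2 = (delta_o1 * w6 + delta_o2 * w8) * a_h2 * (1 - a_h2)

# Hidden-layer updates
w1_new = w1 - eta * delta_h1 * x1
w2_new = w2 - eta * delta_h1 * x2
w3_new = w3 - eta * delta_h2 * x1
w4_new = w4 - eta * delta_h2 * x2
b1_new = b1 - eta * delta_h1
b2_new = b2 - eta * delta_h2
```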