Build back propagation from scratch

In forward propagation, we connect the input layer to the hidden layer, and the hidden layer to the output layer. In backpropagation, we work in the reverse direction: each weight in the neural network is changed by a small amount, one at a time. A change in a weight value affects the final loss value (the loss either increases or decreases), and we need to update each weight in the direction that reduces the loss. By slightly updating each weight and measuring the change in error that the update causes, we can:

  • Determine the direction of the weight update
  • Determine the magnitude of the weight update

Before implementing backpropagation, let's understand another important concept in neural networks: the learning rate. The learning rate helps us build more stable algorithms. For example, when deciding on the size of a weight update, we don't change the weight by a large amount all at once; instead, we take a more cautious approach and update it slowly. This makes the model more stable; later on, we will also look at how the learning rate helps improve stability. The whole process of updating the weights to reduce the error is called gradient descent, and stochastic gradient descent is one technique for minimizing the error. More intuitively, "gradient" refers to the difference (that is, the difference between the actual and predicted values), "descent" means reducing that difference, and "stochastic" means that a random sample is selected for training and the update is based on it. Besides stochastic gradient descent, there are many other optimization techniques that can be used to reduce the loss value; we will discuss different optimization techniques later.
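As a minimal sketch of a single gradient-descent step (the toy loss function and variable names below are for illustration only and are not taken from this article), the update scales the derivative of the loss by the learning rate:

# One gradient-descent step on the toy loss f(w) = (w - 3) ** 2, whose derivative is 2 * (w - 3)
w_toy = 0.5                 # current weight value
lr = 0.01                   # learning rate: keeps each step small and stable
grad = 2 * (w_toy - 3)      # derivative of the loss with respect to the weight
w_toy = w_toy - lr * grad   # step in the direction that reduces the loss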

Backpropagation works as follows:

  • Use forward propagation to calculate the loss value.
  • Change each weight slightly.
  • Calculate the effect of the weight change on the loss function.
  • Update the weight in the direction that reduces the loss, depending on whether the weight change increased or decreased the loss value.

Performing one training pass (forward propagation + backpropagation) over all the data in the dataset is called an epoch. To further consolidate our understanding of backpropagation in neural networks, let's fit a simple function that we already know and see how the weights can be derived. Assuming the fitting function is y = 3x, we expect to recover the weight value and bias value (3 and 0, respectively) from the following data:

x: 1, 3, 4, 8, 10
y: 3, 9, 12, 24, 30

The above dataset can be represented by the linear regression y = ax + b, and we will try to calculate the values of a and b (although we know they are 3 and 0 respectively, our purpose here is to study how these values can be obtained using gradient descent). The parameters a and b are randomly initialized to 2.269 and 1.01. Next, we will build the backpropagation algorithm from scratch so that we can clearly see how the weights are calculated in a neural network. For simplicity, we will build a simple neural network with no hidden layers.

  1. Import NumPy and initialize the dataset as follows:
import numpy as np

x = np.array([[1], [3], [4], [8], [10]])
y = np.array([[3], [9], [12], [24], [30]])
  2. Randomly initialize the weight and bias values (only one weight and one bias value are needed, since we are trying to determine the optimal values of a and b in the equation y = ax + b):
w = np.array([[[2.269]], [[1.01]]])
  3. Define the neural network and calculate the squared error loss value:
def feed_forward(inputs, outputs, weights):
    # Prediction: matrix-multiply the inputs by the weight and add the bias
    out = np.dot(inputs, weights[0]) + weights[1]
    # Squared error between the predictions and the actual outputs
    squared_error = np.square(out - outputs)
    return squared_error

In the code above, the input is matrix-multiplied with the randomly initialized weight value and then added to the randomly initialized bias value. Once we have the output value, we can calculate the squared error between the actual and predicted values.
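As a quick sanity check (this snippet is an illustrative addition, not part of the original walkthrough), we can look at the loss produced by the randomly initialized parameters:

# Mean squared error with the randomly initialized parameters (illustrative check)
print(feed_forward(x, y, w).mean())   # roughly 13.65, since (2.269, 1.01) is still far from (3, 0)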

  4. Increase each weight and bias value slightly, and calculate one squared-error loss value for each weight and bias update.

If the squared error loss decreases as a weight increases, the weight value should be increased, and the amount by which it should increase is proportional to how much the loss is reduced by the weight change; vice versa if the loss increases. In addition, the learning rate ensures that the weight update is smaller than the change in loss caused by the weight change, which makes the loss decrease more smoothly. Next, create a function called update_weights that performs the backpropagation procedure to update the weights over a given number of epochs:

from copy import deepcopy
def update_weights(inputs, outputs, weights, epochs):
    for epoch in range(epochs):
  5. Pass the input through the neural network and calculate the loss with the weights unchanged:
        org_loss = feed_forward(inputs, outputs, weights)
  6. Make deep copies of the weight list, since the weights will be manipulated in subsequent steps; deep copying ensures that changes to the copies do not affect the original variable:
        wts_tmp = deepcopy(weights)
        wts_tmp2 = deepcopy(weights)
  7. Iterate over all the weights, then change each one slightly (+0.0001):
        for ix, wt in enumerate(weights): 
            wts_tmp[ix] += 0.0001
  8. Calculate the forward-propagation loss with the modified weight, then compute the change in loss caused by the small change to the weight. Since we are calculating the mean squared error over all input samples, divide the change in loss by the number of input data points:
            loss = feed_forward(inputs, outputs, wts_tmp)
            del_loss = np.sum(org_loss - loss)/(0.0001*len(inputs))

Updating a weight by a small amount and then measuring its effect on the loss value is equivalent to computing the derivative of the loss with respect to that weight (that is, the gradient that backpropagation computes).
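To make this concrete, del_loss is a finite-difference estimate of the negative derivative of the mean squared error with respect to the perturbed parameter. The following check (an illustrative addition, not part of the original code) computes the analytic gradients for our single-weight model y_hat = w*x + b for comparison:

# Illustrative check: analytic gradients of the mean squared error for y_hat = w*x + b
pred = np.dot(x, w[0]) + w[1]          # predictions with the current parameters
grad_w = 2 * np.mean((pred - y) * x)   # dMSE/dw
grad_b = 2 * np.mean(pred - y)         # dMSE/db
# del_loss approximates -grad_w (or -grad_b for the bias), so adding
# del_loss * learning_rate to a parameter moves it down the gradient.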

  9. Update the weight according to the change in loss. The weight is updated slowly by multiplying the change in loss by a small number (0.01), the learning rate parameter:
            wts_tmp2[ix] += del_loss*0.01
            # Reset the perturbed copy so that only one parameter is changed at a time
            wts_tmp = deepcopy(weights)
  10. At the end of each epoch, replace the weights with their updated values; once all epochs are finished, return the updated weights and biases:
        weights = deepcopy(wts_tmp2)
    return wts_tmp2

The overall update_weights() function looks like this:

from copy import deepcopy

def update_weights(inputs, outputs, weights, epochs):
    for epoch in range(epochs):
        # Loss with the current (unmodified) parameters
        org_loss = feed_forward(inputs, outputs, weights)
        wts_tmp = deepcopy(weights)
        wts_tmp2 = deepcopy(weights)
        for ix, wt in enumerate(weights):
            # Perturb one parameter slightly and measure the resulting change in loss
            wts_tmp[ix] += 0.0001
            loss = feed_forward(inputs, outputs, wts_tmp)
            del_loss = np.sum(org_loss - loss)/(0.0001*len(inputs))
            # Update the parameter in proportion to the loss change, scaled by the learning rate (0.01)
            wts_tmp2[ix] += del_loss*0.01
            # Reset the perturbed copy so that only one parameter is changed at a time
            wts_tmp = deepcopy(weights)
        weights = deepcopy(wts_tmp2)
    return wts_tmp2

Train the network by updating the weights 1,000 times, and check the resulting weight and bias values:

weights = update_weights(x, y, w, 1000)
print(weights)

The printed weights are shown below; you can see that they are very close to the expected results (w = 3.0, b = 0.0):

[[[2.99929065]]
 [[0.00478785]]]
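With these learned parameters we can verify the fit by generating predictions (a quick check added for illustration; it is not part of the original article):

# Predictions with the trained parameters (illustrative check)
preds = np.dot(x, weights[0]) + weights[1]
print(np.round(preds, 2))   # approximately [[3.], [9.], [12.], [24.], [30.]]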

Another important parameter to consider when calculating the loss value is the batch size. In the example above, we calculated the loss over all the data at once. However, when we have thousands of data points, computing the loss over such a large amount of data in one go makes training difficult and may even exceed memory limits, making the computation impossible. Therefore, the data is usually divided into multiple batches that are fed to the network within a single epoch. Batch sizes commonly used in modeling range from 16 to 512.
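As a rough sketch of how the same update loop could be run batch by batch (the batch size of 2 and the slicing below are assumptions for illustration, not code from the original article):

# Illustrative mini-batch training loop (assumed batch size of 2 for this tiny dataset)
batch_size = 2
weights_mb = deepcopy(w)   # start again from the randomly initialized parameters
for epoch in range(1000):
    for i in range(0, len(x), batch_size):
        x_batch = x[i:i + batch_size]
        y_batch = y[i:i + batch_size]
        # One weight update per mini-batch instead of one per full dataset
        weights_mb = update_weights(x_batch, y_batch, weights_mb, 1)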