Original author: iamtrask

Bottom line: Code is the most effective way to learn. This tutorial explains the back propagation (BP) algorithm through a very simple example, implemented in a short piece of Python code.

The code is as follows:

import numpy as np

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in xrange(60000):
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)

Of course, this code is probably a bit too terse, so I will break it down into several parts for discussion.


Part one: a concise neural network

A neural network trained with backpropagation tries to predict outputs from inputs.

Consider the following situation: given three columns of input, try to predict the corresponding column of output. We could solve this problem by simply measuring the statistical relationship between the input values and the output values. Doing so, we would see that the leftmost input column correlates perfectly with the output. Intuitively, backpropagation is a way of measuring statistics like this to build a model. Let's get down to business and try it.

Two-layer neural network:

import numpy as np

# sigmoid function
def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

# input dataset
X = np.array([  [0,0,1],
                [0,1,1],
                [1,0,1],
                [1,1,1] ])

# output dataset
y = np.array([[0,0,1,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1

for iter in xrange(10000):

    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1,True)

    # update weights
    syn0 += np.dot(l0.T,l1_delta)

print "Output After Training:"
print l1

Output After Training:

[[ 0.00966449]
 [ 0.00786506]
 [ 0.99358898]
 [ 0.99211957]]

Variable definitions:

X          Input dataset matrix where each row is a training example.
y          Output dataset matrix where each row is a training example.
l0         First layer of the network, i.e. the network's input layer.
l1         Second layer of the network, otherwise known as the hidden layer.
syn0       First layer of weights, Synapse 0, connecting l0 to l1.
*          Elementwise multiplication: two vectors of equal size multiply corresponding values 1-to-1 to produce a final vector of the same size.
-          Elementwise subtraction: two vectors of equal size subtract corresponding values 1-to-1 to produce a final vector of the same size.
x.dot(y)   If x and y are vectors, this is a dot product. If both are matrices, it is a matrix-matrix multiplication. If only one is a matrix, it is a vector-matrix multiplication.
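As a quick sanity check on the three operations in the table above, here is a minimal numpy sketch of my own (the array values are arbitrary and not part of the original tutorial):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 0.5, 0.5])

print(a * b)      # elementwise multiplication -> [0.5 1.  1.5]
print(a - b)      # elementwise subtraction    -> [0.5 1.5 2.5]
print(a.dot(b))   # dot product of two vectors -> 3.0

M = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])   # a (3 x 2) matrix
print(a.dot(M))   # vector times matrix -> a vector of length 2: [4. 7.]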

As you can see in the "Output After Training" section, the program executes correctly! Before describing the process, I recommend that the reader try to understand and run the code to get an intuitive feel for how the algorithm works. It is best to run all of this in an IPython notebook (you can use a plain script if you must, but I highly recommend the notebook). Here are a few key points to help you understand the program:

  • Compare l1 at the first iteration and at the last iteration.
  • Look closely at the "nonlin" function; it is what gives us a probability value as output.
  • Watch carefully how l1_error changes during the iterations.
  • Break apart the expression on line 36; most of the secret sauce is in there.
  • Read line 39 carefully. Everything in the network prepares for this operation.

Now, let’s walk through the code line by line.

Tip: Open this blog post on two screens so you can read the article against the code. That’s exactly what I do when I write my blog. 🙂

Line 1: This imports numpy, a linear algebra library, which is the only external dependency in this program.

Line 4: Here is our "nonlinearity". Although it could be any of several kinds of function, in this case the nonlinearity is the sigmoid function. The sigmoid function maps any value to a value between 0 and 1, which lets us convert real numbers into probabilities. The sigmoid function also has several other properties that make it convenient for training neural networks.

Line 5: Note that the "nonlin" function can also return the derivative of the sigmoid (when the parameter deriv is True). One of the desirable properties of the sigmoid function is that its derivative can be computed from its output alone. If the sigmoid's output is stored in a variable out, its derivative is simply out*(1-out), which is very efficient.

If you're not familiar with derivatives, think of a derivative as the slope of the sigmoid curve at a given point (as the figure above shows, the slope differs at different points on the curve). For more on derivatives, check out Khan Academy's tutorial on derivatives.
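To see the out*(1-out) shortcut concretely, here is a small numerical check of my own (not part of the original code); the point x = 0.7 is an arbitrary choice:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7                      # an arbitrary point
out = sigmoid(x)

analytic = out * (1 - out)   # derivative computed from the output alone
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central-difference estimate

print(analytic)   # ~0.2217
print(numeric)    # agrees with the analytic value to several decimal places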

Line 10: This line initializes our input dataset as a numpy matrix. Each row is a single training example, and each column corresponds to one input node. Thus, our neural network has three input nodes and four training examples.

Line 16: This line initializes the output dataset. In this case, to save space, I generated the dataset horizontally (as 1 row with 4 columns). .T is the transpose operation; after the transpose, the y matrix has 4 rows and 1 column. Consistent with our input, each row is a training example and each column (only one here) is an output node. So our network has three inputs and one output.

Line 20: It's good practice to seed your random number generator. The initial weights are still randomly distributed, but they are distributed in exactly the same way every time you train. This makes it easier to observe how your changes affect the network's training.

Line 23: This line initializes the network's weight matrix. syn0 stands for "synapse zero", our first layer of weights. Since our neural network has only two layers (input and output), only one weight matrix is needed to connect them. Its dimension is (3,1) because the network has three inputs and one output. In other words, l0 has size 3 and l1 has size 1, so to connect every node in l0 to every node in l1 we need a matrix of dimension (3,1). 🙂

Also, note that the weight matrix is initialized randomly with a mean of zero. There is quite a bit of theory behind weight initialization, but since we are just practicing for now, setting the mean to zero is good enough.
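A quick way to convince yourself of this (my own check, not part of the original tutorial; the sample size of 1000 is arbitrary) is to look at the range and mean of weights drawn with the same recipe as syn0:

import numpy as np

np.random.seed(1)
w = 2 * np.random.random((1000, 1)) - 1   # same recipe as syn0, just a larger sample

print(w.min())    # >= -1
print(w.max())    # < 1
print(w.mean())   # close to 0 for a large sample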

Another thing to realize is that the "neural network" really is just this weight matrix. Although there are "layers" l0 and l1, they are transient values computed from the dataset: the states of the input and output layers change with different input data and do not need to be saved. During learning and training, only the syn0 weight matrix is stored.

Line 25: This is where the actual network training code begins. The for loop iterates over the training code many times, so that our network becomes a better and better fit to the training set.

Line 28: Recall that l0, the first layer of the network, is simply our input data (more on this below). X contains four training examples (rows). In this implementation we process all of them at the same time, a training style known as "full batch" training. So, although we have four different rows in l0, you can think of them as a single training example if you like; it makes no difference at this point. (We could load 1,000 or even 10,000 examples at once without changing a single line of code.)

Line 29: This is the forward prediction stage of the neural network. Basically, the network is first asked to “try” to predict the output based on a given input. Then, we’ll look at how well the prediction works, so that we can make some adjustments so that the network performs a little better with each iteration.


This line of code contains two steps. First, l0 is matrix-multiplied by syn0. The result is then passed through the sigmoid function. Consider the dimensions of each matrix:

(4 x 3) dot (3 x 1) = (4 x 1)

There are constraints on matrix multiplication, such that the two dimensions in the middle of the equation must agree. The resulting matrix has the rows of the first matrix and the columns of the second matrix.

Since four training instances are loaded, the final result is four guesses, namely a matrix of (4 x 1). Each output corresponds to a guess by the network about the correct result given the input. Perhaps this is an intuitive explanation for why we can “load” any number of training instances. In this case, matrix multiplication still works.
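Here is a tiny sketch of my own (illustrative only, with random placeholder values) of that dimension rule:

import numpy as np

l0 = np.random.random((4, 3))     # 4 training examples, 3 inputs each
syn0 = np.random.random((3, 1))   # 3 input nodes -> 1 output node

l1 = np.dot(l0, syn0)
print(l1.shape)                   # (4, 1): one guess per training example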

Line 32: So, for each input, l1 now has a corresponding "guess". We can then measure how well the network predicted by subtracting the guess (l1) from the true answer (y). l1_error is simply a vector of positive and negative numbers reflecting how much the network missed by.

Line 36: Now we get to the good stuff! This is the secret sauce! This line packs a lot in, so let's break it into two parts.

First half: the derivative

nonlin(l1,True)

If l1 represents the three points shown in the figure below, the code above generates the slopes of the lines at those points. Notice that the slope is very shallow where the output is very large, at x=2.0 (the green point), and where it is very small, at x=-1.0 (the purple point). The steepest slope is at x=0 (the blue point). This property is very important. Also notice that all the derivatives lie in the range 0 to 1.
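For example, evaluating the slope out*(1-out) at those three x values gives (a quick illustration of my own, not part of the original code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-1.0, 0.0, 2.0]:
    out = sigmoid(x)
    slope = out * (1 - out)
    print("x=%.1f  out=%.3f  slope=%.3f" % (x, out, slope))

# x=0.0 has the steepest slope (0.25); x=-1.0 and x=2.0 are flatter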

Second half: the error-weighted derivative

l1_delta = l1_error * nonlin(l1,True)

There are more rigorous mathematical names for the "error-weighted derivative", but I think this term captures the intent of the algorithm. l1_error is a (4,1) matrix, and nonlin(l1,True) returns a (4,1) matrix. All we are doing is multiplying them "elementwise", which produces a (4,1) matrix, l1_delta, each element of which is simply the product of the corresponding elements.

When we multiply the "slope" by the error, we are reducing the error of high-confidence predictions. Look back at the sigmoid plot: where the slope is very shallow (close to 0), the network's output was either a very large or a very small value, which means the network was quite confident one way or the other. However, if the network's guess is close to (x=0.5, y=0.5), it isn't very confident at all. We adjust these "wishy-washy" predictions the most, while the confident ones are multiplied by a number close to zero and are barely changed.
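A minimal numerical sketch of this "mute the confident predictions" effect (the values below are invented purely for illustration and are not from the tutorial's actual run):

import numpy as np

l1 = np.array([[0.99],    # very confident prediction
               [0.50],    # completely unsure
               [0.01]])   # very confident prediction
l1_error = np.array([[0.2],
                     [0.2],
                     [0.2]])       # same raw error for all three

slope = l1 * (1 - l1)              # sigmoid derivative computed from the output
l1_delta = l1_error * slope        # elementwise product

print(l1_delta)
# the middle (unsure) row gets the largest update; the confident rows barely move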

Line 39: We are now ready to update the network! Let's look at a single training example first.

In this training example, we are ready for weight updates. Let’s update the leftmost weight (9.5).

Weight update = input value * l1_delta

For the leftmost weight, this would be 1.0 times l1_delta. As you can imagine, this barely changes the weight of 9.5. Why such a small update? Because we were already very confident in the prediction, and that prediction was largely correct. A small error and a small slope mean a very small update. Considering all three connection weights, the increments would all be tiny.

However, because we use "full batch" training, the above update step is performed on all four training examples, so it looks rather more like the image above. So what does line 39 actually do? In that single line of code it computes the weight update for each weight, for each training example, sums them up, and updates the weights. You can see how it does this by carrying out the matrix multiplication yourself.
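If you want to verify that the matrix product really is a sum of four per-example updates, here is a small check of my own (reusing the variable names from the code above; l1_delta is filled with random stand-in values):

import numpy as np

np.random.seed(1)
l0 = np.array([[0,0,1],
               [0,1,1],
               [1,0,1],
               [1,1,1]], dtype=float)   # 4 examples x 3 inputs
l1_delta = np.random.random((4, 1))     # stand-in for the real deltas

# single matrix product, as used on line 39
batch_update = np.dot(l0.T, l1_delta)

# equivalent: add up one update per training example
summed = np.zeros((3, 1))
for i in range(4):
    summed += np.outer(l0[i], l1_delta[i])

print(np.allclose(batch_update, summed))   # True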

Key conclusions:

We now know how neural networks are updated. Go back and look at the training data and do some deep thinking. When both input and output are 1, we increase the connection weight between them. When the input is 1 and the output is 0, we reduce its connection weight.

Thus, across the four training examples below, the weight between the first input node and the output node will consistently increase or stay unchanged, whereas the other two weights will find themselves both increasing and decreasing across different examples (cancelling out their progress). This is what allows the network to learn from the correlation between the input and the output.


Part two: a slightly more complicated problem

Consider the following situation: given the first two columns of input, try to predict the output column. The key point is that neither of these two columns, taken on its own, tells us anything about the output; each column has a 50% chance of predicting a 1 and a 50% chance of predicting a 0.

So what is the pattern here? The output appears to have nothing to do with the third column, whose value is always 1. Columns 1 and 2, however, make things clearer: when exactly one of those two columns is 1 (but not both!), the output is 1. This is the pattern we are looking for!

This is considered a "nonlinear" pattern, because there is no direct one-to-one relationship between any individual input and the output. Instead, there is a one-to-one relationship between a combination of inputs and the output, in this case the combination of columns 1 and 2.
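In code, the pattern in this dataset can be written as an exclusive-or of the first two columns (a sketch of my own, just to make the pattern explicit; X and y are the same arrays used in the three-layer example later on):

import numpy as np

X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

# output is 1 exactly when column 1 or column 2 is 1, but not both (XOR)
pattern = (X[:, 0] ^ X[:, 1]).reshape(-1, 1)
print(np.array_equal(pattern, y))   # True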

 

Believe it or not, image recognition is a similar problem. Given 100 equally sized pictures of pipes and bicycles, no single pixel position directly determines whether an image shows a bicycle or a pipe. From a purely statistical point of view, the individual pixels might as well be random. Certain combinations of pixels, however, are not random, namely the combinations that form the image of a bicycle or a pipe.

Our strategy

As we saw above, it is a combination of the inputs that has a one-to-one relationship with the output. To let the network first form such combinations, we need to add another network layer: the first layer combines the inputs, and the second layer then maps the first layer's output to the final result. Before jumping into the implementation, take a look at this table.

After randomly initializing the weights, we obtain the hidden-layer values. Notice anything? The second column (the second hidden node) already correlates somewhat with the output! It isn't perfect, but it's there. Believe it or not, finding such correlations is a huge part of how neural networks train (arguably, it is the only thing neural network training does); subsequent training simply amplifies this correlation. The syn1 weight matrix maps the hidden layer's combinations to the final result, and syn0 is updated alongside syn1 so that it gets better at producing these combinations from the input data.
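To reproduce that kind of table yourself, you can print the hidden-layer values right after the random initialization (a sketch of my own; which hidden column happens to correlate best with the output will depend on the random seed):

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
y = np.array([[0],[1],[1],[0]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1   # 3 inputs -> 4 hidden nodes

l1 = nonlin(np.dot(X, syn0))
print(l1)   # compare each of the 4 hidden columns against y by eye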

Note: adding more layers to model more combinations of relationships like this is what is widely known as "deep learning", named for the increasingly deep stacks of layers being modeled.

Three-layer neural network:

import numpy as np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)

    return 1/(1+np.exp(-x))

X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])

y = np.array([[0],
              [1],
              [1],
              [0]])

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

for j in xrange(60000):

    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))

    # how much did we miss the target value?
    l2_error = y - l2

    if (j % 10000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))

    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*nonlin(l2,deriv=True)

    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)

    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * nonlin(l1,deriv=True)

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

Variable definitions:

X          Input dataset matrix where each row is a training example.
y          Output dataset matrix where each row is a training example.
l0         First layer of the network, i.e. the network's input layer.
l1         Second layer of the network, otherwise known as the hidden layer.
l2         Final layer of the network; its output should gradually approach the correct answer as training proceeds.
syn0       First layer of weights, Synapse 0, connecting l0 to l1.
syn1       Second layer of weights, Synapse 1, connecting l1 to l2.
l2_error   The amount by which the network "missed" in its prediction.
l2_delta   The error of the network scaled by the confidence; almost identical to l2_error, except that errors the network was very confident about are muted.
l1_error   The result of weighting l2_delta by the weights in syn1, which gives the error at the middle/hidden layer.
l1_delta   The l1 error of the network scaled by the confidence; almost identical to l1_error, except that confident errors are muted.

It all looks so familiar! This is really just two of the previous implementations stacked on top of each other, so that the output of the first layer (l1) becomes the input of the second layer. The only new thing is line 43.

Line 43: This uses the "confidence-weighted error" from l2 to construct an error for l1. To do this, it simply sends the error backwards across the connection weights from l2 to l1. The result can be called a "contribution-weighted error", because we learn how much each node value in l1 contributed to the error in l2; this backward step is what gives backpropagation its name. We then update the syn0 weight matrix using exactly the same steps as in the two-layer implementation.
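Dimension-wise, here is what that step does (a small sketch of my own with random placeholder values instead of real training state):

import numpy as np

np.random.seed(1)
l2_delta = np.random.random((4, 1))       # one confidence-weighted error per training example
syn1 = 2 * np.random.random((4, 1)) - 1   # weights connecting the 4 hidden nodes to the single output

# send the output-layer error backwards across the weights:
# (4 x 1) dot (1 x 4) = (4 x 4), one error value per hidden node per example
l1_error = l2_delta.dot(syn1.T)
print(l1_error.shape)   # (4, 4)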


Part three: summary and future work

Personal advice:

If you are serious about understanding neural networks, here is one piece of advice: try to rebuild this network from memory. I know it sounds crazy, but it genuinely helps. If you want to be able to design arbitrary neural networks based on new papers, or to read and understand sample code for different network architectures, I think this exercise is a killer. It is useful even if you use frameworks such as Torch, Caffe or Theano. I worked with neural networks for several years before doing this exercise, and it was one of the best investments of time I have made in this area (and it didn't take long).

Future work

This toy example still needs a few additional features to be truly comparable to state-of-the-art network architectures. If you want to improve your network further, here are a few things to look into. (More updates may follow.)

  • Learning rate
  • Bias units
  • Mini-batches
  • Delta trimming
  • Parameterized layer sizes
  • Regularization
  • Dropout
  • Momentum
  • Batch normalization
  • GPU compatibility
  • Other fancy ideas of your own