Before we introduce neural networks, let’s do a quick review of perceptrons from the previous article:

We started with an introduction to what a perceptron is — an intuitive look at its basic structure and what it can do. Then, the principle of realizing an AND gate with a perceptron was explained in detail, and the training of its parameters was analyzed through a mathematical formulation (the gradient descent algorithm). After that, the AND gate logic circuit was implemented in code, and the OR gate was implemented in the same way. At the end of the second section, we also raised the problem that a single-layer perceptron cannot realize the XOR gate, which served as a lead-in to the third section. There, we found that the XOR gate can be realized by combining logic circuits; in terms of model structure, that means using a multi-layer perceptron to solve the XOR problem. Finally, the validity of the model was verified in code.

For the full article, see: Deep Learning Alchemy – Taoye doesn't speak code, but hydrology, incredibly writing about something as simple as the perceptron: www.zybuluo.com/tianxingjia…

Next comes the neural network, which is basically the same as the perceptron described above, except that the perceptron uses a step function as its activation function, while the activation function used in a neural network is chosen according to the actual problem. In addition, neural networks add some other extensions that make them more powerful.

In this article, we will not derive the principles of neural networks through piles of mathematical formulas, nor will we bring in the model's learning and training process or the back-propagation algorithm, let alone too much code implementation; Taoye will deliver all of those with both hands in later articles.

This paper mainly introduces the basic structure, design and implementation of neural network, so that readers can have a comprehensive understanding of it as a whole. It is mainly divided into the following parts:

  • Smooth transition from perceptron to neural network
  • The emergence of activation functions
  • Design and Basic Implementation of Neural Network (Forward Propagation)
  • Realizing the forward propagation process based on the handwritten digit dataset

1. Smooth transition from perceptron to neural network

We can represent a simple neural network as follows:

We call the leftmost layer the input layer, which takes in our data; the rightmost layer the output layer, which outputs the results of the neural network; and all the layers in between we call the middle layers, also known as hidden layers.

Note: the middle/hidden layer can sometimes have multiple layers for practical reasons, but in the figure above there is only one.

There are three layers of neurons in the neural network above, but only two groups of weight parameters are actually involved, namely input layer -> middle layer and middle layer -> output layer. It is like taking the train from Shanghai to Shenzhen with a transfer in Nanchang, so the route is:


Shanghai -> Nanchang -> Shenzhen

Although we say there are three locations, in fact there are only two train running processes, and the structural layers of the neural network above can be understood in the same way.

In the last article, we described the structure and processing of the perceptron in detail: after summing the inner product of the weight vector and the feature vector of the sample data, we pass the result to a step function to get the output of the perceptron. The process can be illustrated as follows:

where $h(x)$ represents the activation function, which here specifically refers to the step function, namely


$$h(x) = \begin{cases} 0, & x \leq 0 \\ 1, & x > 0 \end{cases}$$

Therefore, the final output $y$ of the perceptron is:


$$y = h(w_1x_1 + w_2x_2 + b)$$
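As a quick illustration, here is a minimal NumPy sketch of this computation; the step function follows the definition above, while the weight and bias values are made-up examples (roughly AND-gate-like), not taken from the previous article.

```python
import numpy as np

def step(x):
    # step activation: 0 for x <= 0, 1 for x > 0
    return np.where(x > 0, 1, 0)

# hypothetical weights and bias for illustration (AND-gate-like values)
w = np.array([0.5, 0.5])
b = -0.7

x = np.array([1, 1])              # a single sample with two features
y = step(np.dot(w, x) + b)        # y = h(w1*x1 + w2*x2 + b)
print(y)                          # -> 1
```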

That is the perceptron from our last article, and its processing is relatively simple. We just mentioned the activation function: the function that takes the weighted sum of the input signals and converts it into the output signal is called the activation function.

The activation function is the bridge between perceptron and neural network.

In the explanation of perceptron in the previous article, the activation function used is step function, while in neural network, different activation functions need to be used for different practical problems. Sometimes, the choice of activation function will also have a great influence on the training results of neural network model.

The activation functions most commonly used in neural networks are Sigmoid, ReLU, and various variants of them. There is still a long way to go, and better activation functions are still waiting to be discovered by hard-working researchers.

Now, let’s take a concrete look at these commonly used activation functions in neural networks.

2. The emergence of activation functions

The Sigmoid function is one of the activation functions frequently used in neural networks. Its expression and the corresponding code and plot are as follows:


$$h(x) = \frac{1}{1+e^{-x}}$$
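Since the code and plot are referenced above, here is a minimal sketch of what they might look like; the sigmoid definition matches the one used in the code later in this article, and the plotting range is an arbitrary choice.

```python
import numpy as np
from matplotlib import pyplot as plt

def sigmoid(in_data):
    return 1 / (1 + np.exp(-in_data))

x = np.arange(-6.0, 6.0, 0.1)
plt.plot(x, sigmoid(x))    # smooth S-shaped curve
plt.ylim(-0.1, 1.1)        # the output always falls in (0, 1)
plt.show()
```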

Through the comparison of Sigmoid function and step function, we can find that they mainly have the following differences:

  • Smoothness

Comparing the two functions' expressions or plots, it is easy to see that the step function is piecewise: with 0 as the boundary, it outputs 0 for inputs at or below 0 and 1 for inputs above 0, so its output changes abruptly.

In contrast, the Sigmoid activation function is a smooth curve with continuous output changes as the input changes, rather than a sharp step function.

  • The output value

The step function outputs either 0 or 1, while the Sigmoid function can output real numbers such as 0.1, 0.21212, or 0.8231. In other words, the step function outputs one of two signals (a classification of 0 or 1), whereas Sigmoid outputs a continuous real-valued signal.

In addition, for Sigmoid, another important property is that its output interval is between 0 and 1, that is to say, whatever the value of the input Sigmoid is, it will be mapped to a value in 0-1, which is very consistent with the property of probability. We took full advantage of this when we talked about Logistic regression earlier, so you can jump to this for a moment: Machine Learning in Action — Taoye tells you how Logistic regression works

The differences between the step function and the Sigmoid function were briefly introduced.

In fact, both the step function and the Sigmoid function are nonlinear functions, and a neural network must use a nonlinear function as its activation function. If a neural network used a linear function as its activation function, it would not be able to improve the model's capacity by deepening its layers; a nonlinear function gives the model's output more possibilities and allows it to handle more complex variation.

The reader should focus on understanding the previous paragraph. So why can't the expressive power of the model be improved by deepening the layers of the neural network when a linear function is used as the activation function?

Because if we use a linear function as the activation function, then even if we deepen the network, we can always find a single new linear function to replace the stacked linear functions, so deepening the layers is futile. For example, take $h(x) = cx$ as the activation function and $y(x) = h(h(h(x)))$ as the output of a three-layer network; then $y(x) = c^3x$, and since $c^3$ is still a constant, a single linear function can replace the three layers of the network.

Therefore, a multilayer neural network is meaningless if a linear function is used as the activation function.
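A tiny numerical check of this point (the constant c and the input values here are arbitrary): stacking layers whose activation is h(x) = cx collapses into a single linear function with constant c^3.

```python
import numpy as np

c = 1.5
h = lambda x: c * x            # a linear "activation" h(x) = c*x

x = np.array([0.6, -0.8])
three_layers = h(h(h(x)))      # y(x) = h(h(h(x)))
one_layer = (c ** 3) * x       # a single linear function with constant c^3

print(np.allclose(three_layers, one_layer))   # -> True
```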

In the development of neural networks, the activation function most used in the early days was Sigmoid. After the ReLU activation function appeared, it became more widely favored in practice. Let's take a look at ReLU below.

The expression of ReLU is very simple: when the input is less than or equal to 0, the output is 0; when the input is greater than 0, the output equals the input. Its expression, code, and plot are shown below:


$$h(x) = \max(0, x) = \begin{cases} 0, & x \leq 0 \\ x, & x > 0 \end{cases}$$
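A minimal ReLU sketch in the same style as the Sigmoid code above (this block is an added illustration, not part of the original listing):

```python
import numpy as np

def relu(in_data):
    # ReLU: 0 for x <= 0, x for x > 0
    return np.maximum(0, in_data)

print(relu(np.array([-2.0, 0.0, 3.5])))   # -> [0.  0.  3.5]
```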

We can find that ReLU activation function is also a piecewise function. Compared with Sigmoid, ReLU has two obvious advantages:

  • ReLU is somewhat more computationally efficient than Sigmoid because its expression is simpler
  • ReLU can alleviate the vanishing gradient problem to a certain extent, whereas Sigmoid is prone to vanishing gradients during back propagation, which leaves the parameters almost un-updated. For this reason, Sigmoid is less suitable for training deep neural networks

We will explain these two points in detail later.

Note: there are more activation functions in neural networks than these two, and readers can explore others on their own. If other activation functions are needed later, Taoye will also set them out separately for you.

3. Design and Basic Implementation of Neural Network (Forward Propagation)

Resources: Chapter 3 of Introduction to Deep Learning: Python-based Theory and Implementation

Next, we will introduce the design and basic implementation of a neural network in detail, that is, its forward processing from input to output. Take the following three-layer neural network as an example:

In the figure above, $x_1$ and $x_2$ represent two different attribute features of a single sample, and $y_1$ and $y_2$ represent the two final outputs after processing by the neural network.

Each of these circles can be viewed as a neuron, or node. The intermediate neuron/node is only used as the intermediate bridge from the input node to the output node, and the intermediate node is calculated by the neuron and weight parameter of the previous layer. Let’s take the signal transmission from the input layer to the neuron of the first layer as an example to explain this process in detail.

In order to uniformly incorporate the bias $b$ into the transmission of the neuron signals, we introduce an additional input node/neuron whose value is always 1.

It is also important to note that every neuron in one layer is connected to every neuron in the next layer by a $w$ weight parameter, much like a Cartesian product. (Readers familiar with MySQL joins can think of it as a full join.)

Therefore, we can obtain $a_1^{(1)}$, the result computed by the first neuron of the first layer:


$$a_1^{(1)}=x_1w_{11}^{(1)}+x_2w_{12}^{(1)}+b_1^{(1)}$$

Similarly, we can obtain the specific results of the other first-layer neurons $a_2^{(1)}$ and $a_3^{(1)}$:


$$\begin{aligned} & a_2^{(1)}=x_1w_{21}^{(1)}+x_2w_{22}^{(1)}+b_2^{(1)} \\ & a_3^{(1)}=x_1w_{31}^{(1)}+x_2w_{32}^{(1)}+b_3^{(1)} \end{aligned}$$

The above is the result obtained after the input nodes are processed by one set of parameters. To make the mathematical representation and the later code implementation easier, we generally express this process with matrices and vectors. If we treat the first-layer results $a_1^{(1)}, a_2^{(1)}, a_3^{(1)}$ as a vector, we can express them as follows:


$$\begin{aligned} \left( \begin{matrix} a_1^{(1)}\\ a_2^{(1)}\\ a_3^{(1)}\\ \end{matrix} \right) & = \left( \begin{matrix} x_1w_{11}^{(1)}+x_2w_{12}^{(1)}+b_1^{(1)}\\ x_1w_{21}^{(1)}+x_2w_{22}^{(1)}+b_2^{(1)}\\ x_1w_{31}^{(1)}+x_2w_{32}^{(1)}+b_3^{(1)}\\ \end{matrix} \right) \\ & = \left( \begin{matrix} w_{11}^{(1)} & w_{12}^{(1)}\\ w_{21}^{(1)} & w_{22}^{(1)}\\ w_{31}^{(1)} & w_{32}^{(1)}\\ \end{matrix} \right) \left( \begin{matrix} x_1\\ x_2\\ \end{matrix} \right) + \left( \begin{matrix} b_1^{(1)}\\ b_2^{(1)}\\ b_3^{(1)}\\ \end{matrix} \right) \end{aligned}$$

We may as well define the specific values of the above vectors and matrices, and then simulate the above operation process:


$$\begin{aligned} & Assume: \\ & (x_1, x_2)^T = (0.6, -0.8)^T \\ & \left( \begin{matrix} w_{11}^{(1)} & w_{12}^{(1)}\\ w_{21}^{(1)} & w_{22}^{(1)}\\ w_{31}^{(1)} & w_{32}^{(1)}\\ \end{matrix} \right) = \left( \begin{matrix} 0.05 & 1.6\\ 0.3 & -0.7\\ 0.8 & -1.2\\ \end{matrix} \right) \\ & (b_1^{(1)}, b_2^{(1)}, b_3^{(1)})^T = (0.05, -0.6, 1.3)^T \end{aligned}$$

Then, by carrying out this operation, we obtain the value of $(a_1^{(1)}, a_2^{(1)}, a_3^{(1)})^T$ **(the specific values are arbitrarily defined just to simulate the forward propagation process and do not carry any particular meaning)**:


$$\begin{aligned} \left( \begin{matrix} a_1^{(1)}\\ a_2^{(1)}\\ a_3^{(1)}\\ \end{matrix} \right) & = \left( \begin{matrix} 0.05 & 1.6\\ 0.3 & -0.7\\ 0.8 & -1.2\\ \end{matrix} \right) \left( \begin{matrix} 0.6\\ -0.8\\ \end{matrix} \right) + \left( \begin{matrix} 0.05\\ -0.6\\ 1.3\\ \end{matrix} \right) \\ & = (-1.2, 0.14, 2.74)^T \end{aligned}$$

At this point, we have completed the calculation of the weighted sum (wx + b) of the first hidden layer. As mentioned above, in a neural network, to improve the expressive power of the model, we usually pass the result of each hidden-layer calculation through an activation function. Here we can use Sigmoid as our activation function and denote the activated results of $(a_1^{(1)}, a_2^{(1)}, a_3^{(1)})$ as $(z_1^{(1)}, z_2^{(1)}, z_3^{(1)})$, namely:


$$\begin{aligned} (z_1^{(1)},z_2^{(1)},z_3^{(1)})^T & = Sigmoid((a_1^{(1)},a_2^{(1)},a_3^{(1)})^T) \\ & = (Sigmoid(a_1^{(1)}),Sigmoid(a_2^{(1)}),Sigmoid(a_3^{(1)}))^T \\ & = (0.2315, 0.5349, 0.9393)^T \end{aligned}$$

We can also carry out this computation with code (NumPy operations):
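A minimal NumPy sketch of this first-layer computation, assuming the parameter values defined above:

```python
import numpy as np

def sigmoid(in_data):
    return 1 / (1 + np.exp(-in_data))

x = np.array([[0.6], [-0.8]])                            # (x1, x2)^T
w_1 = np.array([[0.05, 1.6], [0.3, -0.7], [0.8, -1.2]])  # 3x2 weight matrix of the first layer
b_1 = np.array([[0.05], [-0.6], [1.3]])                  # bias vector of the first layer

a_1 = np.matmul(w_1, x) + b_1    # weighted sum wx + b
z_1 = sigmoid(a_1)               # activation

print(a_1.T)   # -> [[-1.2   0.14  2.74]]
print(z_1.T)   # -> approximately [[0.2315 0.5349 0.9393]]
```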

This is the whole process from the input layer to the first hidden layer. Generally speaking, it is very simple. It is nothing more than weighting and then processing through the activation function and passing the result signal to the next layer.

The whole process is quite simple; the most important thing is to know the input, the output, and the shapes of the w weight parameter and the b bias parameter. The shape of w depends on both the input and the output, because every input is connected to every output through the matrix product: if the input has m elements and the output has n, then w has shape (m, n) (or its transpose, depending on whether you compute xw or wx). This can be understood through the matrix calculation above **(do not learn it by rote; understanding is what matters most)**. In addition, the number of b bias parameters depends only on the output, because each output corresponds to one bias, so the number of biases equals the number of outputs **(again, focus on understanding)**.

The signal transmission process from the input layer to the first hidden layer is shown in the left figure below:

In the picture on the right above, we similarly transfer the signal from the first hidden layer to the second hidden layer; only the data changes, and the internal details are exactly the same. The parameters $w^{(2)}$ and $b^{(2)}$ process $z^{(1)}$ as follows:


$$\begin{aligned} \left( \begin{matrix} a_1^{(2)}\\ a_2^{(2)}\\ \end{matrix} \right) & = \left( \begin{matrix} z_1^{(1)}w_{11}^{(2)}+z_2^{(1)}w_{12}^{(2)}+z_3^{(1)}w_{13}^{(2)}+b_1^{(2)}\\ z_1^{(1)}w_{21}^{(2)}+z_2^{(1)}w_{22}^{(2)}+z_3^{(1)}w_{23}^{(2)}+b_2^{(2)}\\ \end{matrix} \right) \\ & = \left( \begin{matrix} w_{11}^{(2)} & w_{12}^{(2)} & w_{13}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} & w_{23}^{(2)}\\ \end{matrix} \right) \left( \begin{matrix} z_1^{(1)}\\ z_2^{(1)}\\ z_3^{(1)}\\ \end{matrix} \right) + \left( \begin{matrix} b_1^{(2)}\\ b_2^{(2)}\\ \end{matrix} \right) \end{aligned}$$

$$\begin{aligned} (z_1^{(2)},z_2^{(2)})^T & = Sigmoid((a_1^{(2)},a_2^{(2)})^T) \\ & = (Sigmoid(a_1^{(2)}),Sigmoid(a_2^{(2)}))^T \end{aligned}$$

OK, the second layer of processing is also complete, and you can see that the first two steps are almost identical. Then comes the last layer of processing, that is, the signal transmission from the second hidden layer to the final output of the neural network. This step generally differs from the previous hidden layers and needs to be defined according to the actual problem.

Generally, we can use the identity function for regression problems, the Sigmoid function for binary classification problems, and the Softmax function for multi-class classification problems.

At this point, we might as well use the identity function to complete the last layer of signaling. An identity function is one that outputs an input without processing it.

The above is the complete process of forward propagation of the three-layer fully connected neural network. Next, we can realize the forward propagation process in the form of code. Note: there is no parameter training involved here, just to make readers familiar with the forward propagation of neural networks.

First of all, let’s define the activation function. Through the processing of the above three layers of neural network, we can know that there are two activation functions involved. Sigmoid is used in the processing of the first layer and the second layer, and identity function is used in the third layer.

Subsequently, we need to define a forward method to implement a single layer of signal processing. This mainly involves NumPy matrix operations; Taoye previously wrote an article introducing NumPy in detail, and readers who need a refresher can jump to: Ah! It smells good to play NumPy like this!. The forward method is defined as follows:
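A minimal sketch of such a forward method (it matches the single-layer step used in the complete listing below, where the input is a column vector and w has shape (outputs, inputs)):

```python
import numpy as np

def forward(x_data, w_data, b_data):
    # one layer of signal processing: the weighted sum w*x + b
    return np.matmul(w_data, x_data) + b_data
```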

Since this part aims to introduce the forward propagation process of a neural network to readers in detail and does not involve the training of parameters (that is left to the next article), we need to manually define the parameters that each layer works with; we use the initial_params method to do this. Finally, the forward signal propagation of the three-layer neural network is completed:

The complete code for the above procedure is as follows:

```python
import numpy as np

def sigmoid(in_data):
    return 1 / (1 + np.exp(-in_data))

def identity(in_data):
    return in_data

"""
Explain: implements single-layer forward propagation
Parameters:
    x_data: neuron data of the previous layer
    w_data: w weight matrix parameter
    b_data: b bias vector parameter
"""
def forward(x_data, w_data, b_data):
    return np.matmul(w_data, x_data) + b_data

"""
Explain: manually define the parameters of each layer (no training involved)
"""
def initial_params():
    w_1 = np.array([[0.05, 1.6], [0.3, -0.7], [0.8, -1.2]])
    b_1 = np.array([[0.05], [-0.6], [1.3]])
    w_2 = np.array([[0.15, 0.65, 0.3], [1.4, 0.38, 0.53]])
    b_2 = np.array([[1.3], [0.72]])
    return w_1, b_1, w_2, b_2

if __name__ == "__main__":
    x_data = np.array([[0.6], [-0.8]])
    w_1, b_1, w_2, b_2 = initial_params()
    a_1 = forward(x_data, w_1, b_1)   # signal transfer from the input layer to the first hidden layer
    z_1 = sigmoid(a_1)                # first sigmoid activation
    a_2 = forward(z_1, w_2, b_2)      # signal transfer from the first hidden layer to the second hidden layer
    z_2 = sigmoid(a_2)                # second sigmoid activation
    y = identity(z_2)                 # identity output of the last layer
    print(a_2.T, z_2.T)
    print(y.T)
```

4. Realize the forward propagation process based on the handwritten digit dataset

As for handwritten digit recognition, we explained its algorithmic principles in detail earlier when covering KNN. For the specifics, you can jump to: Machine Learning in Action — Female students asked Taoye how KNN should be played to beat the level

To do handwritten digit recognition with a neural network, the model's w and b parameters need to be trained. We leave the training process to a later article; this section only simulates the forward propagation of handwritten digit recognition. In other words, assuming we have already obtained the parameters w and b, how do we compute the model's result through the neural network?

In the previous section, we simulated forward propagation through a small example. Note that the previous section dealt with the signal transmission of a single sample, whereas in practical problems we often process batches of multiple samples, as in this handwritten digit recognition task. In addition, for the neural network described above, the input of each sample should be a vector, while a handwritten digit has shape = (28, 28), which amounts to a matrix input. It therefore needs to be flattened before being used as the input signal; the intent of this processing is the same as in KNN.

Let’s look at the process in detail.

First comes the import of the handwritten digit data; here we load it through tensorflow.keras.datasets.mnist.load_data(). Note: TensorFlow is used here only to import the data, not to use its internal interfaces to implement the forward propagation of handwritten digits.

We can see that for a handwritten digit image, the shape after flattening is 784, that is, each handwritten digit has 784 attribute features. So for a single handwritten digit sample, the number of input signals is 784; and since each digit image corresponds to a label from 0 to 9, the number of output signals is 10.
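A short sketch of this loading and flattening step, assuming the tensorflow.keras MNIST loader mentioned above (shapes shown in the comments):

```python
import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape)          # (60000, 28, 28): 28x28 images

x_data = x_train[:100].reshape([100, -1]) / 255   # flatten each 28x28 image
print(x_data.shape)           # (100, 784) -> 784 input signals per sample
print(np.unique(y_train))     # labels 0-9 -> 10 output signals
```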

That settles the number of input signals and output signals; next, we consider the design of the hidden layers.

For the hidden layers, we usually determine the parameter sizes (the dimensions of the parameter matrices and the number of layers) through experience and repeated attempts. Here we define two hidden layers: the first hidden layer has 50 neurons and the second has 100. In this way, the parameters of the first hidden layer are w_1.shape=(784, 50), b_1.shape=(50,); those of the second hidden layer are w_2.shape=(50, 100), b_2.shape=(100,); and those of the output layer are w_3.shape=(100, 10), b_3.shape=(10,). The parameter information of the neural network is summarized as follows:


$$\begin{aligned} & w\_1.shape=(784, 50),\ b\_1.shape=(50,) \\ & w\_2.shape=(50, 100),\ b\_2.shape=(100,) \\ & w\_3.shape=(100, 10),\ b\_3.shape=(10,) \end{aligned}$$

The corresponding array shape transform:
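A quick way to sanity-check these shape transformations (using randomly initialized parameters, as in the complete listing below; the activations are omitted here since they do not change shapes):

```python
import numpy as np

x_data = np.random.rand(100, 784)     # 100 flattened samples
w_1, b_1 = np.random.randn(784, 50), np.random.rand(50)
w_2, b_2 = np.random.randn(50, 100), np.random.rand(100)
w_3, b_3 = np.random.randn(100, 10), np.random.rand(10)

a_1 = np.matmul(x_data, w_1) + b_1    # (100, 784) x (784, 50) -> (100, 50)
a_2 = np.matmul(a_1, w_2) + b_2       # (100, 50)  x (50, 100) -> (100, 100)
a_3 = np.matmul(a_2, w_3) + b_3       # (100, 100) x (100, 10) -> (100, 10)
print(a_1.shape, a_2.shape, a_3.shape)
```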

The forward propagation of this neural network is basically the same as described in section 3: the signals into the first two hidden layers are computed as weighted sums (wx + b) and then processed by the Sigmoid activation function. One difference is that, for the step from the second hidden layer to the final output, Softmax can be used instead of the identity function. Because the output here has ten possibilities (a 10-class problem), we can convert the results into probabilities through Softmax, and the index corresponding to the maximum probability is taken as the model's final output.

The specific expression of Softmax is as follows:


$$y_k = \frac{e^{a_k}}{\sum_{i=1}^n e^{a_i}}$$

Using Softmax does convert the results into probabilities for classification purposes. However, during the calculation, if the input values are large, exponentiating them may cause numerical overflow. To address this, we can optimize Softmax as follows:

Since subtracting the same constant $C^{'}$ from every input does not change the Softmax result (the common factor cancels in the numerator and denominator), we usually take $C^{'} = \max(a_1, a_2, \dots, a_n)$ and subtract it before exponentiation. The softmax method is defined in this way to produce the output of the last layer:
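A minimal sketch of that numerically stable softmax, written with broadcasting (it should be equivalent to the np.tile version in the complete listing below):

```python
import numpy as np

def softmax(in_data):
    # subtract the row-wise max (the constant C') before exp() to avoid overflow
    shifted = in_data - in_data.max(axis=1, keepdims=True)
    exp_data = np.exp(shifted)
    return exp_data / exp_data.sum(axis=1, keepdims=True)

scores = np.array([[1010.0, 1000.0, 990.0]])   # a naive exp() here would overflow
print(softmax(scores))   # ~ [[9.9995e-01 4.5398e-05 2.0611e-09]]
```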

As we know, the data of a single sample processed by Softmax is a probability vector, in which each element represents the predicted probability of one digit. If we need the digit corresponding to the maximum probability, we can use np.argmax() as follows:

```python
np.argmax(y, axis = 1)
```

To do this, let’s see if the forward propagation operation for the full code works:

We can find that the forward propagation of handwritten digits has been achieved normally, and finally get the predicted results for each data sample. Since we are only dealing with the forward propagation process of analog handwritten digit recognition, we do not involve the real training and prediction of the model, so the training of parameters and the accuracy of results will not be introduced here. The complete code for this section is shown below:

```python
import numpy as np
from tensorflow import keras

def establish_data():
    # load the handwritten digit dataset
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    # we are only simulating forward propagation, so take the first 100 samples and flatten them
    x_data = x_train[:100].reshape([100, -1])
    return x_data / 255

def sigmoid(in_data):
    return 1 / (1 + np.exp(-in_data))

def identity(in_data):
    return in_data

def softmax(in_data):
    # subtract the row-wise max before exponentiation to avoid overflow
    in_data = in_data - np.tile(in_data.max(axis = 1).reshape([in_data.shape[0], 1]), [1, in_data.shape[1]])
    exp_data = np.exp(in_data)
    exp_sum_data = np.tile(exp_data.sum(axis = 1).reshape([in_data.shape[0], 1]), [1, in_data.shape[1]])
    return exp_data / exp_sum_data

"""
Explain: implements single-layer forward propagation
"""
def forward(x_data, w_data, b_data):
    return np.matmul(x_data, w_data) + b_data

"""
Explain: manually define the parameters of each layer (no training involved)
"""
def initial_params():
    w_1 = np.random.randn(784, 50)
    b_1 = np.random.rand(50)
    w_2 = np.random.randn(50, 100)
    b_2 = np.random.rand(100)
    w_3 = np.random.randn(100, 10)
    b_3 = np.random.rand(10)
    return w_1, b_1, w_2, b_2, w_3, b_3

if __name__ == "__main__":
    x_data = establish_data()
    w_1, b_1, w_2, b_2, w_3, b_3 = initial_params()
    a_1 = forward(x_data, w_1, b_1)   # input layer -> first hidden layer
    z_1 = sigmoid(a_1)
    a_2 = forward(z_1, w_2, b_2)      # first hidden layer -> second hidden layer
    z_2 = sigmoid(a_2)
    a_3 = forward(z_2, w_3, b_3)      # second hidden layer -> output layer
    y = softmax(a_3)
    print(np.argmax(y, axis = 1))
```

That's it for this article. Although it looks long, there is actually not that much to it; the main goal is to help readers understand the forward propagation process of a neural network. For forward propagation, the important things are to understand how the shape values change and how many parameters are involved during signal transmission, as well as how the final output layer is designed. Once you understand these, you have basically understood forward propagation, and this article will have served its purpose. Next time, we will introduce the learning process of neural networks in detail.

I am Taoye. I love studying and sharing, and I am keen on all kinds of technology; outside of study I like anime, playing chess, listening to music, and chatting. I hope to use this space to record my growth and everyday life, and to meet more like-minded friends along the way. Welcome to visit my WeChat official account: Cynical Coder.

I’ll see you next time. Bye

References:

[1] Introduction to Deep Learning: Python-Based Theory and Implementation, Koki Saitoh, Posts and Telecom Press

Recommended reading

  • Deep Learning Alchemy — Taoye doesn't speak code, but hydrology, incredibly writing about something as simple as the perceptron
  • Machine Learning in Action — Taoye tells you how Logistic regression works
  • Machine Learning in Action — Talk about linear regression
  • Machine Learning in Action — Female students asked Taoye how KNN should be played to beat the level
  • Machine Learning in Action — Both know it and know why it is so: nonlinear support vector machines
  • Machine Learning in Action — Taoye takes a look at support vector machines: optimization with SMO
  • Machine Learning in Action — Analysis of support vector machines, hand-tearing linear SVM

The hand-tearing machine learning series has paused updates for now. So far, support vector machines (SVM), decision trees, KNN, Bayes, linear regression, and Logistic regression have been completed; for the other algorithms, please allow Taoye to owe you those for the time being.

All the content in this series was written by Taoye by hand, with reference to many books and open resources. The series totals about 150,000 words (including source code) over 138 pages, and it will continue to be filled in over time. To improve the reading experience, Taoye has compiled the hand-tearing machine learning series into PDF and HTML; the results look great, and you can download them for free from the official account [Cynical Coder]. The documents may be circulated freely, but please do not modify their contents.