This is my 10th day of the November Challenge

The previous method of weight selection drew each weight and bias from an independent Gaussian random variable, normalized to mean 0 and standard deviation 1. This approach has an obvious disadvantage. Suppose we use this normalized Gaussian distribution to initialize the weights connecting into the first hidden layer, and ignore everything else so we can focus on those connection weights, as shown in the figure below:

To simplify the problem, suppose the training input $x$ has $n = 1000$ input neurons, where half of the input values $x_j$ are 0 and the other half are 1. Consider the weighted input to a neuron, $z = \sum_j w_j x_j + b$: the 500 terms with $x_j = 0$ vanish, leaving 500 weights $w_j$ and the single bias $b$. Each of these 501 terms follows a Gaussian distribution with mean 0 and standard deviation 1, so $z$ follows a Gaussian distribution with mean 0 and standard deviation $\sqrt{501} \approx 22.4$, as shown in the figure below:
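As a quick numerical check (not part of the original text), a minimal NumPy simulation of the setup just described, with an arbitrary number of random draws, reproduces this standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in = 1000                  # number of input neurons
x = np.zeros(n_in)
x[:500] = 1.0                # half the inputs are 1, the other half 0

samples = 10_000             # arbitrary number of random (w, b) draws
w = rng.standard_normal((samples, n_in))   # weights ~ N(0, 1)
b = rng.standard_normal(samples)           # bias ~ N(0, 1)
z = w @ x + b                # z = sum_j w_j x_j + b for each draw

print(z.std())               # ~22.4, i.e. sqrt(501)
```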

Since $|z|$ is very likely to be large, $\sigma(z)$ will be close to 1 or 0, which means the hidden neuron is close to saturation. When hidden neurons saturate, learning slows down; this is the cause of the slow learning.

Therefore, to initialize the weights $w$ better, assume there are $n_{in}$ input neurons and initialize the weights with a Gaussian distribution of mean 0 and standard deviation $\frac{1}{\sqrt{n_{in}}}$. In the same situation as above, the distribution of $z$ becomes a Gaussian with mean 0 and standard deviation $\sqrt{3/2}$, as shown in the figure below:

The neurons are then far less likely to saturate, which alleviates the slow-learning problem.

Exercise: Assume $n_{in}$ input neurons, initialize the weights with a Gaussian distribution of mean 0 and standard deviation $\frac{1}{\sqrt{n_{in}}}$, and initialize the bias with a Gaussian distribution of mean 0 and standard deviation 1. Suppose $n_{in} = 1000$ and that half of the input values $x_j$ are 0 while the other half are 1. Verify that the standard deviation of $z = \sum_j w_j x_j + b$ is $\sqrt{3/2}$.

Hint: (a) The variance of a sum of independent random variables is the sum of their variances. (b) The variance is the square of the standard deviation.

Based on the above conditions, the variance of $z$ is:


$$\left(\frac{1}{\sqrt{n_{in}}}\right)^2 \times 500 + 1^2 \times 1 = \frac{1}{2} + 1 = \frac{3}{2}$$

So the standard deviation of $z$ is $\sqrt{\frac{3}{2}}$.
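Repeating the same hypothetical simulation as before, but with the rescaled weights, confirms this value:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in = 1000
x = np.zeros(n_in)
x[:500] = 1.0                # half the inputs are 1, the other half 0

samples = 10_000
w = rng.standard_normal((samples, n_in)) / np.sqrt(n_in)   # weights ~ N(0, 1/n_in)
b = rng.standard_normal(samples)                           # bias still ~ N(0, 1)
z = w @ x + b

print(z.std())               # ~1.22, i.e. sqrt(3/2)
```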

Code implementation and comparison

Original weight ($w$) initializer:

```python
def large_weight_initializer(self):
    # Both weights and biases are drawn from a standard normal N(0, 1)
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```

New weight initialization method:

```python
def default_weight_initializer(self):
    # Biases keep N(0, 1); weights are rescaled to N(0, 1/x),
    # where x is the number of inputs feeding the layer
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x) / np.sqrt(x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```
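Note that only the weights are divided by $\sqrt{x}$, where $x$ is the layer's fan-in; the biases keep the standard normal distribution. The following standalone sketch (with a made-up `sizes` list, independent of any network class) shows the effect of the rescaling on the weight matrices:

```python
import numpy as np

sizes = [784, 30, 10]        # example architecture (input, hidden, output)

# Old scheme: every weight ~ N(0, 1)
large_w = [np.random.randn(y, x)
           for x, y in zip(sizes[:-1], sizes[1:])]

# New scheme: weights ~ N(0, 1/x), where x is the layer's fan-in
default_w = [np.random.randn(y, x) / np.sqrt(x)
             for x, y in zip(sizes[:-1], sizes[1:])]

for old, new in zip(large_w, default_w):
    # new.std() is smaller than old.std() by a factor of 1/sqrt(fan-in)
    print(old.shape, round(old.std(), 3), round(new.std(), 3))
```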

Comparison of results:

It can be seen that the improved initialization significantly speeds up learning, although it does not change the final performance of the network.

How to select the hyperparameters of a neural network

In practice it is difficult to choose the learning rate η and the regularization parameter λ. The selection of hyperparameters here is basically based on heuristics; there is no universally correct selection method:

Broad strategy

Do not make the network structure too complex at the beginning; a simple three-layer architecture is enough to test how parameter changes affect the results and to find patterns for tuning the network.


Adjusting the learning rate η

Test the influence of three different learning rate values on learning:

You can see that when η = 2.5 the learning behaves poorly: the cost oscillates instead of decreasing steadily. To understand this phenomenon, go back to the principle of stochastic gradient descent:

The value of η determines the size of the descent step. If η is too large, the algorithm may overshoot the bottom of the valley and oscillate back and forth near the minimum. Therefore η should be kept small (but if it is too small, learning becomes slow).
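As a toy illustration of this step-size effect (not from the original text), consider plain gradient descent on the one-dimensional cost C(w) = w²/2, whose gradient is simply w:

```python
def descend(eta, w=1.0, steps=8):
    """Plain gradient descent on C(w) = w**2 / 2, i.e. dC/dw = w."""
    path = [w]
    for _ in range(steps):
        w = w - eta * w          # update rule: w <- w - eta * dC/dw
        path.append(w)
    return path

print(descend(0.25))   # shrinks steadily towards the minimum at w = 0
print(descend(2.5))    # overshoots 0 each step and oscillates with growing amplitude
```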

The learning rate is usually set to a constant, but a variable learning rate can be more efficient: in the early stages of learning, use a larger learning rate so that the weights change quickly, then lower it later so the network can make finer adjustments. How do you set the learning rate based on this idea? A natural approach is:

  • Initially, keep η as a constant until validation accuracy begins to deteriorate
  • Then reduce η by some factor, such as 10 or 2, and repeat this several times until η falls below 1/1000 of its initial value (a sketch of this schedule follows)
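A minimal sketch of this schedule; `train_epoch` and `val_acc` are hypothetical placeholders for the real training and validation routines, and the default values are arbitrary:

```python
def train_with_schedule(train_epoch, val_acc, eta=0.5, factor=10, patience=10):
    """Keep eta constant while validation accuracy improves; otherwise divide
    it by `factor`, stopping once eta drops below 1/1000 of its initial value."""
    floor = eta / 1000.0
    best, stalled = 0.0, 0
    while eta >= floor:
        train_epoch(eta)                 # one epoch of training at the current eta
        acc = val_acc()                  # accuracy on the validation set
        if acc > best:
            best, stalled = acc, 0
        else:
            stalled += 1
            if stalled >= patience:      # accuracy has stopped improving
                eta /= factor
                stalled = 0
    return best
```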

Epochs: use early stopping to determine the number of training iterations

As mentioned in the earlier section on overfitting, stop training when the accuracy on the validation set is no longer improving.
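One possible sketch of such an early-stopping check (a "no improvement for n epochs" rule; the function name and patience value are illustrative, not from the original code):

```python
def should_stop(val_accuracies, patience=10):
    # Stop when the most recent `patience` epochs brought no new best accuracy
    if len(val_accuracies) <= patience:
        return False
    best_epoch = val_accuracies.index(max(val_accuracies))
    return len(val_accuracies) - 1 - best_epoch >= patience

print(should_stop([0.90, 0.91, 0.92] + [0.92] * 10))   # True: 10 epochs without a new best
print(should_stop([0.90, 0.91, 0.92]))                 # False: still improving
```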


Selecting the regularization parameter λ

The value of λ can be left undetermined at first (i.e., start with λ = 0); once η has been established, go back and experiment with different values of λ.

Mini-batch size selection

How should the mini-batch size be set? To simplify the problem, first suppose the mini-batch size is 1, so that the weight update is:


$$w \rightarrow w' = w - \eta \nabla C_x$$

Compare this with the weight update for a mini-batch of size 100, $w \rightarrow w' = w - \frac{\eta}{100} \sum_x \nabla C_x$: the size-1 mini-batch updates the weights 100 times as often. The problem, however, is that a mini-batch of size 1 computes gradients one example at a time in a loop, whereas a mini-batch of size 100 lets us compute the gradients for all 100 examples at once using matrix techniques, which greatly speeds up the computation.
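A small sketch of this speed argument (the layer and batch sizes are made up, and exact timings depend on the machine):

```python
import numpy as np, time

rng = np.random.default_rng(0)
W = rng.standard_normal((30, 784))    # one layer's weights (example sizes)
X = rng.standard_normal((784, 100))   # a mini-batch of 100 inputs, one per column

t0 = time.perf_counter()
z_loop = [W @ X[:, i] for i in range(X.shape[1])]   # one example at a time
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
z_batch = W @ X                                      # whole mini-batch in one matrix product
t_batch = time.perf_counter() - t0

print(np.allclose(np.column_stack(z_loop), z_batch))   # same numbers either way
print(f"loop: {t_loop:.6f}s  batched: {t_batch:.6f}s") # batched is usually much faster
```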

This is the trade-off to consider when choosing the mini-batch size: if it is too small, you lose the speed-up from matrix operations; if it is too large, the weights are not updated often enough. Fortunately, the mini-batch size is a relatively independent hyperparameter (it is not part of the overall network architecture), so you do not need to re-optimize the other hyperparameters in order to find a good mini-batch size.