1. What is Batch_Size? The purpose of setting Batch_Size is to have the model select one batch of data to process at each step of the training process. In general, the objective function in machine learning or deep learning training can be understood as the sum of the objective function values obtained on the individual training samples, and the weights are then adjusted according to that objective value. Most of the time, the parameters are updated with gradient descent.

Intuitively, Batch_Size is the number of samples selected for one training step. It affects both how well and how fast the model is optimized, and it directly determines GPU memory usage: if your GPU memory is small, the value should be set to a smaller number.
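As a minimal sketch of what "selecting Batch_Size samples per step" means in practice, here is a toy PyTorch training loop. The dataset, model, and hyperparameters are placeholders of my own, not taken from the article:

```python
# Sketch (assumed toy setup): how a framework such as PyTorch draws
# batch_size samples for each training step.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1,000 samples with 20 features and a scalar target.
x = torch.randn(1000, 20)
y = torch.randn(1000, 1)
dataset = TensorDataset(x, y)

# batch_size controls how many samples are selected per step.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:                 # xb holds at most 64 samples
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)     # objective averaged over this batch
    loss.backward()                   # gradient w.r.t. this batch only
    optimizer.step()                  # weights adjusted from that gradient
```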

2. Why Batch_Size? Without a batch size, all data (the whole dataset) are fed into the network at once during training, and the gradient over all of them is computed for backpropagation. Because the whole dataset is used, the computed gradient direction is more accurate. However, the gradient values then differ so much in magnitude that it is difficult to use a single global learning rate, so a sign-based training algorithm such as Rprop is generally used to update each weight separately. On small datasets, skipping Batch_Size in this way is feasible and works well, but on a large dataset, feeding all the data into the network at once will certainly blow up memory. This is why the concept of Batch_Size was proposed.
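To make the contrast concrete, here is a small NumPy sketch (my own toy example, not from the article) of a full-batch gradient versus a mini-batch gradient for linear least squares; the full-batch version must touch every sample, which is what becomes infeasible on a large dataset:

```python
# Toy comparison: full-batch vs. mini-batch gradient of 0.5 * mean((Xw - y)^2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
y = rng.normal(size=10_000)
w = np.zeros(50)

def gradient(X_part, y_part, w):
    # Gradient averaged over the given subset of samples.
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)

# Full batch: one very accurate gradient, but all 10,000 samples in memory at once.
g_full = gradient(X, y, w)

# Mini-batch: only batch_size samples are touched per update.
batch_size = 64
idx = rng.choice(len(y), size=batch_size, replace=False)
g_mini = gradient(X[idx], y[idx], w)
```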

3. How to set the Batch_Size value? Suppose only one sample is used per step, i.e. Batch_Size = 1. For a linear neuron with a mean squared error cost, the error surface is a paraboloid whose cross sections are ellipses; for multi-layer, nonlinear networks, the surface is still approximately a paraboloid locally. With Batch_Size = 1, each update follows the gradient direction of a single sample, so successive corrections point in different directions and it is difficult to achieve convergence.

Since both extremes, Batch_Size equal to the full dataset and Batch_Size = 1, have their drawbacks, how should an appropriate Batch_Size be set? The variance of the gradient estimate is related to the number of samples in the batch. When the batch is small, the variance is large; this noise means gradient descent does not settle firmly into shallow local optima (points that are only slightly convex) or saddle points, because a single large noisy step can knock it back out of such a point. Conversely, when the batch is large the variance is small, the gradient estimate is much more accurate and stable, and the optimizer will confidently stay put even at a bad local optimum or saddle point, so the network can converge to a very poor solution.

The batch size should be neither too large nor too small, so mini-batches are the most common choice in practical projects, usually a few dozen to a few hundred samples. For second-order optimization algorithms, the speed-up gained by shrinking the batch is far outweighed by the performance loss caused by the extra noise, so large batches are used instead, often thousands or even 10,000–20,000 samples, to get the best out of them. When setting Batch_Size, pay attention to the following points:

1) If the batch size is too small and there are many categories, the loss may oscillate and fail to converge, especially when the network is complex.

2) As the batch size increases, the same amount of data is processed faster.

3) As the batch size increases, more and more epochs are needed to reach the same accuracy.

4) Because these two factors pull in opposite directions, there is an intermediate Batch_Size that is optimal in terms of total training time.

5) Since the final convergence accuracy depends on which local optimum training falls into, there is likewise an intermediate Batch_Size that gives the best final convergence accuracy.

6) Too large a batch size makes the network prone to converging to a bad local optimum; too small a batch size brings its own problems, such as slow training and difficulty converging.

7) The specific batch size to choose is also related to the number of samples in the training set.

8) GPUs perform better when the batch size is a power of 2, so values such as 16, 32, 64, 128… are usually better choices than multiples of 10 or 100.

When I set Batch_Size, I first pick a larger value that fills up the GPU and watch whether the loss converges. If it does not converge, or converges poorly, I reduce the Batch_Size, generally to 16, 32, 64, and so on; a sketch of this procedure is shown below.
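The following sketch only restates that procedure in code; the helper `train_fn` and the `loss_converged` flag are hypothetical names of my own, not part of any real library:

```python
# Hedged sketch of the batch-size selection procedure described above.
def find_batch_size(train_fn, candidates=(256, 128, 64, 32, 16)):
    """Try large batch sizes first; fall back when the batch does not fit
    in GPU memory or the loss does not converge well."""
    for bs in candidates:
        try:
            history = train_fn(batch_size=bs)   # hypothetical training call
        except RuntimeError:                    # e.g. CUDA out-of-memory
            continue
        if history["loss_converged"]:           # assumed convergence flag
            return bs
    return candidates[-1]
```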

4. What are the benefits of increasing Batch_Size within a reasonable range? Memory utilization improves, and with it the parallel efficiency of the large matrix multiplications. The number of iterations needed to complete one epoch (one pass over the full dataset) is reduced, so the same amount of data is processed faster. Within a certain range, the larger Batch_Size is, the more accurately the descent direction is determined and the less the training oscillates.

5. What are the disadvantages of blindly increasing Batch_Size? Memory utilization goes up, but the memory capacity may no longer hold the batch. The number of iterations needed to run one epoch (full dataset) is reduced, but the time needed to reach the same accuracy is greatly increased, so the parameters are effectively corrected more slowly. Once Batch_Size grows beyond a certain point, the descent direction it determines essentially stops changing.

6. What is the effect of adjusting Batch_Size on the training result? If Batch_Size is too small, the model performs extremely poorly (the error blows up). As Batch_Size increases, the same amount of data is processed faster, but more and more epochs are needed to reach the same accuracy. Because these two factors conflict, there is an intermediate Batch_Size that is optimal in training time; and since the final accuracy depends on which local optimum training falls into, there is likewise an intermediate Batch_Size that gives the best final convergence accuracy.

7. Why does increasing the Batch size make the gradient more accurate? The variance of the mini-batch gradient can be written as

$$\mathrm{Var}(g_{\mathrm{batch}}) = \mathrm{Var}\left(\frac{1}{m}\sum_{i=1}^{m} g_i\right) = \frac{1}{m^{2}}\sum_{i=1}^{m}\mathrm{Var}(g_i)$$

Because the samples are drawn at random and are independent and identically distributed, every per-sample gradient has the same variance $\mathrm{Var}(g)$.

So the equation above can be simplified to

$$\mathrm{Var}(g_{\mathrm{batch}}) = \frac{1}{m^{2}}\cdot m\cdot\mathrm{Var}(g) = \frac{1}{m}\mathrm{Var}(g)$$

It can be seen that when the batch size is m, the variance of the gradient estimate is reduced by a factor of m, so the gradient is more accurate.
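A quick numerical check of this 1/m reduction, using a toy example of my own with synthetic i.i.d. "per-sample gradients":

```python
# Verify empirically that averaging m iid gradients shrinks the variance ~ 1/m.
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)  # variance 4

for m in (1, 16, 64, 256):
    usable = (len(per_sample_grads) // m) * m
    batch_grads = per_sample_grads[:usable].reshape(-1, m).mean(axis=1)
    print(m, batch_grads.var())   # roughly 4 / m
```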

If you want the updates to carry roughly the same effective gradient noise as in the original small-batch setting, you can compensate by increasing the learning rate LR.

This also indicates that when the batch size is set larger, the learning rate should generally be increased as well. However, LR should not be set to its large value right from the start; instead, it is increased gradually during the early phase of training, as sketched below.
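Here is a minimal sketch of that idea, combining a scaled target learning rate with a gradual warmup. The linear scaling rule and the specific numbers are common practice and my own assumptions, not prescribed by the article:

```python
# Sketch: scale the learning rate with the batch size, then warm it up gradually.
base_lr = 0.1          # learning rate tuned for the baseline batch size
base_batch = 64
batch_size = 512       # larger batch -> larger target learning rate
target_lr = base_lr * (batch_size / base_batch)

warmup_steps = 1000

def lr_at(step: int) -> float:
    # Ramp up linearly instead of starting at the full target value.
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```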

A specific example: in distributed data-parallel training, the effective batch size grows with the number of workers. Suppose the baseline batch size is B, the learning rate is LR, and the number of training epochs is N. If the baseline LR is kept unchanged, the convergence speed and accuracy are generally not very good. Reason: with k workers, each effective batch is kB, so one epoch takes 1/k of the baseline number of iterations while the learning rate LR stays the same; to match the baseline convergence, the number of epochs must be increased, by up to kN according to the formula above, although in practice far fewer extra epochs are usually needed. As for convergence accuracy, the larger batch makes the gradient more accurate and less noisy, so it is easier to converge. The worked numbers below make this concrete.
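A small worked example of the bookkeeping above; every number here (dataset size, B, k, LR, N) is an assumption chosen only for illustration:

```python
# Worked numbers for the data-parallel example above.
dataset_size = 1_280_000
B = 256                 # baseline per-step batch size
k = 8                   # number of workers; effective batch becomes k * B
lr = 0.1                # baseline learning rate
N = 90                  # baseline number of epochs

iters_baseline = dataset_size // B         # 5000 iterations per epoch
iters_parallel = dataset_size // (k * B)   # 625, i.e. 1/k of the baseline
scaled_lr = lr * k                         # linear scaling to compensate for kB
max_epochs = k * N                         # upper bound on epochs per the formula above
```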