preface

We often set batch_size when training a network. What is batch_size for? For a dataset of 10,000 images, what difference does it make whether it is 1, 10, 100 or 10,000?

# Handwritten number recognition network training method
network.fit(
  train_images,
  train_labels,
  epochs=5,
  batch_size=128)

Batch Gradient Descent (BGD)

The gradient descent algorithm is usually used to minimize the loss function: the raw data is fed into the network, the network performs its calculations and produces a loss value that represents the gap between the network's output and the actual answer, and gradient descent then adjusts the parameters so that the training result fits reality better. That is the purpose of gradient descent.
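To make the update rule concrete, here is a minimal toy sketch (not the Keras internals, just a single made-up parameter and loss): each step moves the parameter a small amount against the gradient of the loss.

# Toy gradient descent: nudge a parameter against the gradient of the loss.
w = 0.0                       # single parameter to learn (made up for illustration)
lr = 0.1                      # learning rate
for step in range(50):
    loss = (w - 3.0) ** 2     # toy loss, minimized at w = 3
    grad = 2 * (w - 3.0)      # derivative of the loss with respect to w
    w = w - lr * grad         # the gradient descent update
print(w)                      # ends up close to 3.0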

Batch gradient descent is the most primitive form of gradient descent. The idea is to use all of the training data for every gradient update: the loss is computed over the whole training set and then differentiated with respect to the parameters. As you can imagine, if the training set is large, all of the data has to be read in and pushed through the network together, forming a huge matrix, and the amount of computation is enormous. The advantage is that, because the entire training set is taken into account, every update moves the network toward the optimum (or at least an extremum) of the whole training set.
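The sketch below illustrates "all of the data per update" on a toy linear regression; the data and weights are made up for demonstration and have nothing to do with the handwritten-digit example above. Every epoch performs exactly one parameter update, computed from the entire dataset at once.

# Batch gradient descent on a toy linear regression: every update uses ALL samples.
import numpy as np

X = np.random.randn(10000, 3)           # 10,000 samples, 3 features (made-up data)
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known weight vector
w = np.zeros(3)
lr = 0.05

for epoch in range(100):
    pred = X @ w                         # forward pass over the whole dataset
    grad = 2 * X.T @ (pred - y) / len(X) # gradient of the MSE averaged over all samples
    w -= lr * grad                       # one parameter update per pass over the data
print(w)                                 # approaches [1.0, -2.0, 0.5]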

Stochastic Gradient Descent (SGD)

Unlike batch gradient descent, stochastic gradient descent takes out a single training sample at a time, fits it, and iterates. Training proceeds like this: take the first training sample, modify the parameters to fit it, then take the next sample and fit it with the just-modified network, and so on until every sample has passed through the network; then start over, again and again, until the parameters stabilize. The advantage is that each update fits only one training sample, so each iteration is extremely fast. The disadvantage is that each update considers only one sample, so the update direction is not necessarily the best direction for the training set as a whole, and training often jitters or converges to a local optimum. A sketch of this per-sample update loop follows below.
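Here is the same toy regression trained sample by sample; again the data is made up. Note how each update depends on the parameters produced by the previous update, which is also why the iterations cannot be parallelized.

# Stochastic gradient descent on the same toy problem: one sample per update.
import numpy as np

X = np.random.randn(10000, 3)            # made-up data
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr = 0.01

for epoch in range(5):
    order = np.random.permutation(len(X))  # visit the samples in random order
    for i in order:
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ w - yi)      # gradient from this single sample only
        w -= lr * grad                     # update immediately, then move to the next sample
print(w)                                   # noisy path, but ends up near [1.0, -2.0, 0.5]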

Mini-batch Gradient Descent (MBGD)

Mini-batch gradient descent (MBGD) is a compromise between the two and is the one most commonly used in practice. Instead of feeding the entire training set into the network at once, training is performed on a small subset of it, for example 20 samples at a time. As you can imagine, this neither produces a huge amount of data that is slow to compute, nor lets the noise of a single training sample cause severe jitter or push the network away from the optimum.
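The sketch below shows the same toy regression with a mini-batch loop; batch_size plays exactly the role of the batch_size argument in fit() above, and the value 20 is just the example from the paragraph.

# Mini-batch gradient descent: each update uses a small slice of the data.
import numpy as np

X = np.random.randn(10000, 3)            # made-up data
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr = 0.05
batch_size = 20                          # the knob this article is about

for epoch in range(5):
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]                   # one mini-batch
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb) # gradient averaged over the mini-batch
        w -= lr * grad                            # one update per mini-batch
print(w)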

Comparing the three gradient descent algorithms: batch gradient descent operates on one large matrix, so matrix-level optimizations and parallel computation can be applied, but it places high demands on hardware such as memory. In stochastic gradient descent, each iteration depends on the result of the previous one, so it cannot be parallelized, although its hardware requirements are low. Mini-batch gradient descent works on a small matrix in each iteration, so the hardware requirements are modest; the matrix computation within a batch can be parallelized while the successive iterations run serially, which saves time overall.
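As a rough illustration of why batching maps well onto hardware, the sketch below (made-up layer sizes, timing is only indicative) compares a sample-by-sample Python loop with a single batched matrix product; both produce the same result, but the batched form lets the linear algebra library parallelize the work.

# One matrix multiply over a batch vs. a per-sample loop (illustrative only).
import time
import numpy as np

X = np.random.randn(10000, 128)              # 10,000 inputs with 128 features
W = np.random.randn(128, 10)                 # weights of a made-up layer

t0 = time.time()
out_loop = np.stack([x @ W for x in X])      # sample by sample, serial
t1 = time.time()
out_batch = X @ W                            # whole batch as one matrix product
t2 = time.time()

print(np.allclose(out_loop, out_batch))      # True: same result
print(t1 - t0, t2 - t1)                      # the batched version is much faster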

The following figure shows the iterative optimization paths of the three gradient descent algorithms and gives a more intuitive impression.

conclusion

Choosing among the gradient descent algorithms: if the training data set is very small, use batch gradient descent directly; if only one training sample can be obtained at a time, or the data arrives as a real-time online stream, use stochastic gradient descent; in other words, in the general case, mini-batch gradient descent is the better choice.

  • This post was originally posted by RAIS