• I have to defend my doctoral thesis tomorrow, I only have a 12 GB second-hand card, and I have to finish 10 model experiments overnight

  • I started digging into a new idea and ended up with an absolute thunderbolt of a model, but it won't fit on my 12 GB second-hand card. I can already feel this year's Best Paper award slipping away

The main problem is that there are not enough machines and not enough memory. During deep learning training, the Batch size of the data is limited by GPU memory, which affects both the final accuracy of the model and the performance of the training process. With GPU memory fixed, models keep getting larger, which means the Batch size of the data can only get smaller. In this situation, Gradient Accumulation can be used as a simple solution to the problem.

The orange part in the figure below marks the approximate position of the gradient accumulation algorithm in an AI system: it generally sits in the expression layer of the AI framework/AI system and is closely coupled with the algorithm.

The role of Batch size

The Batch size of the training data has a key influence on the convergence of the training process and on the final accuracy of the trained model. In general, for each neural network and data set there is an optimal Batch size or range of Batch sizes.

Different neural networks and different data sets may have different optimal Batch sizes.

When selecting Batch size, two issues are mainly considered:

Generalization: a large Batch size may cause the network to fall into a local minimum. Falling into a local minimum means the neural network will perform poorly on samples outside the training set; the ability to perform well on unseen samples is called generalization, and poor generalization generally indicates overfitting.

Convergence rate: a small Batch size may slow down the convergence of the learning algorithm. The update made on each Batch determines the starting point for the next Batch's update. Each Batch randomly samples training examples from the data set, so the resulting gradient is a noisy estimate based on partial data. The fewer samples used in a single Batch, the less accurate the gradient estimate. In other words, a smaller Batch size can make the learning process more volatile and essentially prolong the time the algorithm needs to converge.

Considering the above two major problems, an appropriate Batch size needs to be selected before training.

The impact of Batch size on memory

Traditional computers have access to a large amount of RAM alongside the CPU, and SSDs can also be used as a second-level cache or virtual memory. AI acceleration chips such as GPUs, however, have far less memory. As a result, the Batch size of the training data has a large impact on GPU memory.

To understand this further, let us first look at what is held in the AI chip's memory during training:

  • Model parameters: weight parameters and biases used by the network model.
  • Optimizer variables: variables required by the optimizer algorithm, such as momentum.
  • Intermediate calculation variables: intermediate values produced by the network model's computation and temporarily stored in the AI acceleration chip's memory, for example the activation output of each layer.
  • Workspace: temporary memory needed by the local variables of the AI acceleration chip's kernel implementations, for example the local variable holding B/C when the operator D = A + B/C is computed.

Therefore, the larger the Batch size, the more samples the neural network processes per training step, and the more variables need to be stored in the AI chip's memory. In many cases the AI acceleration chip does not have enough memory, and setting the Batch size too large leads to out-of-memory (OOM) errors.
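As a rough illustration (not from the original text), the back-of-envelope sketch below shows how these memory components grow with Batch size for a single hypothetical convolution layer. All shapes, the fp32 assumption, and the SGD-with-momentum optimizer are made up for illustration, and workspace memory is ignored.

# Back-of-envelope memory estimate, illustrative numbers only (fp32 = 4 bytes per element):
# a hypothetical conv layer with 64 output channels on a 224x224 feature map,
# trained with SGD + momentum (one extra state tensor per parameter).
BYTES = 4
in_channels, channels, height, width, kernel = 3, 64, 224, 224, 3

params = channels * in_channels * kernel * kernel + channels   # weights + bias (fixed cost)
optimizer_state = params                                        # momentum buffer (fixed cost)

def activation_elems(batch_size):
    # output feature map kept for the backward pass: grows linearly with Batch size
    return batch_size * channels * height * width

for batch_size in (8, 32, 128):
    total = (params + optimizer_state + activation_elems(batch_size)) * BYTES
    print(f"batch_size={batch_size:4d}  ~{total / 2**20:7.1f} MiB "
          f"(activations: {activation_elems(batch_size) * BYTES / 2**20:7.1f} MiB)")

The parameters and optimizer state stay fixed, while the stored activations scale linearly with Batch size, which is exactly where the OOM pressure comes from.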

Methods for using a large Batch size

One way to work around the memory limit of an AI acceleration chip and still run a large Batch size is to split the Batch of data samples into smaller batches, called mini-batches. These mini-batches can be processed independently, and their gradients are averaged or summed during network model training. There are two main implementations.

1) Data parallelism: multiple AI acceleration chips train all the mini-batches in parallel, with each mini-batch on a separate AI acceleration chip. The gradients of all mini-batches are accumulated, and the summed result is used to update the network parameters at the end of each Epoch.

2) Gradient accumulation: the mini-batches are executed in sequence while their gradients are accumulated. After the last mini-batch, the accumulated result is used to update the model variables (averaged over the mini-batches).

The two techniques are quite similar and address the same problem, namely that the larger Batch size cannot fit in memory. The difference is that gradient accumulation can be done with a single AI acceleration chip, whereas data parallelism requires multiple AI acceleration chips. So students who only have one second-hand 12 GB card on hand should hurry up and use gradient accumulation.
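For contrast, here is a minimal sketch of what the data-parallel variant could look like (not from the original article). It assumes a torch.distributed process group has already been initialized and that model, criterion, optimizer, images and labels are defined as in the later code examples; in practice torch.nn.parallel.DistributedDataParallel performs this gradient reduction automatically.

import torch.distributed as dist

# Each process drives one AI acceleration chip and sees its own mini-batch.
outputs = model(images)                 # forward pass on the local mini-batch
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()                         # local gradient only

# Sum the gradients of all chips, then average them before the shared update.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= dist.get_world_size()
optimizer.step()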

Gradient accumulation principle

Gradient accumulation is a training method that divides the data samples of a Batch into several small batches and then computes them in sequence.

Before we discuss gradient accumulation further, let’s look at the neural network calculation.

A deep learning model is composed of many interconnected neural network units, and sample data is propagated forward through all of the network layers. After passing through all layers, the network model outputs a prediction for each sample, and the loss function then computes a loss value (error) for each sample. Through back propagation, the neural network computes the gradient of the loss value with respect to the model parameters. Finally, this gradient information is used to update the parameters of the network model.

The optimizer is the mathematical rule used to update the weight parameters of the network model. Take the simple stochastic gradient descent (SGD) algorithm as an example.

Assume that the formula of Loss Function is:


$$Loss(\theta)=\frac{1}{2}\left(h(x^{k})-y^{k}\right)^{2}$$

During training, the optimizer is the algorithm used to minimize the loss. Here, the SGD algorithm uses the Loss function to update the weight parameters with the following formula:


$$\theta_{i}=\theta_{i-1}-lr * grad_{i}$$

where theta is a trainable parameter (weight or bias) of the network model, lr is the learning rate, and grad is the gradient of the loss with respect to the network model parameters.
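As a tiny worked example of this update rule (the parameter value, gradient, and learning rate below are made up):

# One plain SGD step: theta_i = theta_{i-1} - lr * grad_i
theta = 0.80    # current parameter value theta_{i-1} (made up)
lr = 0.1        # learning rate
grad = 0.25     # gradient of the loss w.r.t. theta on this batch (made up)

theta = theta - lr * grad
print(theta)    # prints approximately 0.775, i.e. 0.80 - 0.1 * 0.25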

With gradient accumulation, the neural network model is still computed on each mini-batch, but the parameters of the network model are not updated right away. Instead, the gradient information obtained during the computation is accumulated, and the accumulated gradient is finally used to update the parameters.


$$accumulated=\sum_{i=0}^{N} grad_{i}$$

While the model variables are not being updated, the original data Batch is effectively divided into several mini-batches, and each step uses only a small subset of the samples.

The variables are not updated for N steps, so all mini-batches compute their gradients against the same model variables. This ensures the gradients are computed with identical weights, which is equivalent to using the original, unsplit Batch size. That is:


$$\theta_{i}=\theta_{i-1}-lr * \sum_{i=0}^{N} grad_{i}$$

Accumulating the gradient over the above steps eventually produces a gradient sum of the same magnitude as using the full global Batch size.
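As a sanity check of this equivalence, here is a small PyTorch sketch (not part of the original text) that compares the gradient of a full batch with the gradient accumulated over mini-batches. The toy linear model and random data are made up, and the loss is summed (not averaged) over samples so the two gradients match exactly.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Gradient of the full batch (loss summed over all 8 samples).
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y, reduction="sum").backward()
full_grad = model.weight.grad.clone()

# The same gradient accumulated over 4 mini-batches of 2 samples each.
model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    torch.nn.functional.mse_loss(model(xb), yb, reduction="sum").backward()

print(torch.allclose(full_grad, model.weight.grad))  # True: accumulation equals the full-batch gradient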

Of course, in practical engineering there are two points to pay attention to regarding parameter tuning and the algorithm:

Learning rate: under certain conditions, a larger Batch size gives better training results, and gradient accumulation simulates the effect of enlarging the Batch size. If Accumulation steps is 4, the Batch size is effectively enlarged 4 times, so the learning rate usually needs to be adjusted (typically scaled up) to match the larger effective Batch size; see the sketch after these two points.

Batch Norm: when Accumulation steps is 4, gradient accumulation simulates a 4x larger Batch size, but the data distribution seen during normalization is not exactly the same as with a real 4x Batch size: the mean and variance computed by BN are still based on the actual small mini-batch, not on the 4x Batch. Therefore, some implementations use Group Norm instead of Batch Norm.
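A minimal sketch of how these two adjustments often look in practice. The linear learning-rate scaling is a common heuristic rather than a rule from the original text, and the layer sizes and group count below are made up.

import torch

accumulation_steps = 4
batch_size = 8
effective_batch_size = batch_size * accumulation_steps   # 32: what gradient accumulation simulates

# Common heuristic: scale the learning rate roughly linearly with the effective Batch size.
base_lr = 0.01
lr = base_lr * accumulation_steps

# Swap Batch Norm for Group Norm so normalization statistics do not depend on the small mini-batch.
norm = torch.nn.GroupNorm(num_groups=8, num_channels=64)   # instead of torch.nn.BatchNorm2d(64)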

Gradient accumulation implementation

Pseudocode for normal training of one batch:

for i, (images, labels) in enumerate(train_data):
    # 1. Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # 2. Calculate the gradient with backward propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  • model(images): feed the input images through the network model (forward computation).
  • criterion(outputs, labels): compute the loss function from the forward prediction and the labels.
  • optimizer.zero_grad(): clear the historical gradient information.
  • loss.backward(): perform back propagation and compute the gradient of the current batch.
  • optimizer.step(): update the network parameters using the gradient obtained by back propagation.

That is: feed one batch of data into the network, compute the gradient once, and update the network once.

After using gradient accumulation:

# gradient accumulation parameter
accumulation_steps = 4


for i, (images, labels) in enumerate(train_data):
    # 1. Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)

    # 2.1 Normalize the loss (divide by the number of accumulation steps)
    loss = loss / accumulation_steps

    # 2.2 Backward pass: compute the gradient and add it to the accumulated gradient
    loss.backward()

    # 3. Update the parameters of the net
    if (i + 1) % accumulation_steps == 0:
        # optimize the net with the accumulated gradient
        optimizer.step()
        optimizer.zero_grad()  # reset gradient
  • model(images): feed the input images through the network model (forward computation).
  • criterion(outputs, labels): compute the loss function from the forward prediction and the labels.
  • loss / accumulation_steps: the loss is accumulated over several steps, so it is divided by the number of accumulation steps so that the accumulated gradient has the same scale as the gradient of the original, unsplit Batch.
  • loss.backward(): perform back propagation and compute the gradient of the current batch.
  • Steps 1-2 of the pseudocode are repeated several times without clearing the gradient, so each new gradient is added to the historical gradient.
  • optimizer.step(): after the gradient has been accumulated a certain number of times, update the network parameters using the accumulated gradient.
  • optimizer.zero_grad(): clear the historical gradients in preparation for the next round of gradient accumulation.

In short, gradient accumulation means: each time a batch of data is obtained, the gradient is computed once, but instead of being cleared it keeps accumulating. After a certain number of accumulations, the network parameters are updated with the accumulated gradient, and then all gradient information is cleared before the next cycle begins.
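To tie everything together, below is a self-contained toy version of the loop above that actually runs; the tiny linear model, random data, and hyperparameters are made up purely for illustration. Note that if the number of mini-batches were not a multiple of accumulation_steps, the leftover gradients would never be applied, which real training code should handle.

import torch
from torch.utils.data import DataLoader, TensorDataset

accumulation_steps = 4
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))   # made-up data
train_data = DataLoader(dataset, batch_size=4)                     # 16 mini-batches

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
for i, (images, labels) in enumerate(train_data):
    outputs = model(images)
    loss = criterion(outputs, labels) / accumulation_steps   # normalize the loss
    loss.backward()                                          # accumulate the gradient
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()          # update with the gradient accumulated over 4 mini-batches
        optimizer.zero_grad()     # clear history for the next accumulation window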
