Preface

The official account has already published three articles interpreting, analyzing, and summarizing BatchNorm (links at the end of this article). Readers who have read those three articles should have a fairly deep understanding of BatchNorm and of normalization methods in general. In this article I introduce an important paper on BatchNorm, “Rethinking ‘Batch’ in BatchNorm”, which carries out a large number of experiments and examines the “Batch” in BatchNorm very comprehensively.

Welcome to CV Technical Guide, which focuses on summarizing computer vision techniques, tracking the latest technology, and interpreting classic papers.

Motivation

The key factor that distinguishes BatchNorm from other deep learning operators is that it operates on batches of data rather than on individual samples. BatchNorm mixes information across the batch to compute the normalization statistics, while other operators process each sample in the batch independently. Thus, the output of BatchNorm depends not only on the properties of individual samples but also on how the samples are grouped into batches.

As shown in the figure on the upper left, grouped by sampling size, the batches used by BatchNorm in the top, middle, and bottom rows are the entire dataset, a mini-batch, and a subset of a mini-batch, respectively.

As shown in the figure on the upper right, grouped by sampling source, the batches in the top, middle, and bottom rows are drawn from the entire domain, from each domain separately, and from a mixture of domains, respectively.

The paper examines these choices of how to construct the batch in BatchNorm and shows that naively applying BatchNorm without considering how the batch is built can hurt the model in many ways, while careful choices about batching can improve model performance.

Review of BatchNorm

Within a mini-batch, each BN layer computes the per-channel mean and variance and uses them to normalize the data, so that the normalized values have zero mean and unit variance. Finally, two learnable parameters, gamma and beta, scale and shift the normalized data.

In addition, the per-channel mean and variance of each mini-batch are tracked during training, and their expected values over all mini-batches are used as the per-channel mean and variance of the BN layer at inference time.
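As a reference, the following is a minimal sketch of the two computations described above (per-channel normalization of an NCHW tensor plus the running-statistics update). It illustrates the standard formulation rather than the paper's code, and all names are placeholders.

```python
import torch

def batchnorm_train_step(x, gamma, beta, running_mean, running_var,
                         momentum=0.1, eps=1e-5):
    # Mini-batch statistics, one value per channel of an NCHW tensor.
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)

    # Normalize to zero mean / unit variance, then scale and shift with the
    # learnable parameters gamma and beta.
    x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)
    y = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

    # Running statistics kept during training and used at inference time
    # (momentum corresponds to 1 - lambda in the EMA notation used below).
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var
    return y, running_mean, running_var
```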

Whole Population as a Batch

During training, BatchNorm uses mini-batch samples to compute the normalization statistics. However, when the model is used at test time, there is usually no longer a notion of a mini-batch. The original BatchNorm paper proposes that, at test time, features be normalized by the population statistics μ and σ computed over the entire training set; in other words, μ and σ are the batch statistics obtained when the entire population is treated as the “Batch”.

The EMA algorithm is widely used to estimate μ and σ, but it does not always approximate the population statistics accurately. The paper therefore proposes a new algorithm, PreciseBN.

Inaccuracy of EMA

EMA: exponential moving average

The algorithm formula is as follows:
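In its standard form, the EMA statistics are updated at every training iteration as μ_EMA ← λ·μ_EMA + (1 − λ)·μ_B and σ²_EMA ← λ·σ²_EMA + (1 − λ)·σ²_B, where μ_B and σ²_B are the mean and variance of the current mini-batch and λ ∈ [0, 1) is the decay (momentum) factor.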

EMA makes the model's estimate of the population statistics suboptimal for the following reasons:

  • When λ is large, the statistics converge slowly. Since each iteration contributes only a small fraction (1 − λ) to the EMA, a large number of updates is needed for the EMA to converge to a stable estimate. The situation is made worse by the fact that the model itself keeps changing: the EMA is dominated by input features from the past, which become outdated as training progresses.

  • When λ is small, the EMA statistics are dominated by a small number of recent mini-batches and do not represent the whole population.

PreciseBN

PreciseBN approximates the population statistics in two steps (a PyTorch sketch follows the list):

  • Apply the (fixed) model to many mini-batches to collect per-batch statistics;

  • Aggregate the per-batch statistics into population statistics.
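For concreteness, here is a minimal sketch of that procedure. It is an illustration rather than the authors' implementation; the function name is made up, and the data loader is assumed to yield (inputs, labels) pairs.

```python
import itertools
import torch

@torch.no_grad()
def compute_precise_bn_stats(model, data_loader, num_iters=200):
    # Run the *fixed* model over many mini-batches and average the
    # per-batch BN statistics with equal weights.
    bn_layers = [m for m in model.modules()
                 if isinstance(m, torch.nn.modules.batchnorm._BatchNorm)]
    if not bn_layers:
        return

    # With momentum=1.0, running_mean / running_var hold exactly the
    # statistics of the most recent batch, which we accumulate ourselves.
    saved_momentum = [bn.momentum for bn in bn_layers]
    for bn in bn_layers:
        bn.momentum = 1.0

    mean_sum = [torch.zeros_like(bn.running_mean) for bn in bn_layers]
    var_sum = [torch.zeros_like(bn.running_var) for bn in bn_layers]

    model.train()  # BN layers must be in train mode to compute batch statistics
    n = 0
    for inputs, _ in itertools.islice(data_loader, num_iters):
        model(inputs)
        for i, bn in enumerate(bn_layers):
            mean_sum[i] += bn.running_mean
            var_sum[i] += bn.running_var
        n += 1

    # Step 2: aggregate per-batch statistics into population statistics
    # (simple aggregation: average the per-batch means and variances).
    for i, bn in enumerate(bn_layers):
        bn.running_mean.copy_(mean_sum[i] / n)
        bn.running_var.copy_(var_sum[i] / n)
        bn.momentum = saved_momentum[i]
```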

Compared with EMA, PreciseBN has two important properties:

  • The statistics are computed entirely from a fixed model state, unlike EMA, which mixes in statistics from historical model states;

  • All samples are equally weighted.

Experimental conclusions

1. PreciseBN is more stable than EMA.

2. When the batch size is large, the EMA algorithm is unstable. The authors attribute the instability to two factors that impair the convergence of the EMA statistics in large-batch training: (1) the 32× larger learning rate causes the features to change more drastically; (2) because the total number of training iterations shrinks, the number of EMA updates is reduced by 32×.

3. PreciseBN needs only a small number of samples to obtain stable results.

4. Small batches accumulate errors.

Batch in Training and Testing

The batch statistics used during training and inference are inconsistent: mini-batch statistics are used during training, while population statistics, approximated from all mini-batches by EMA, are used during inference. This paper analyzes the effect of this inconsistency on model performance and points out that in some cases the inconsistency can easily be eliminated, improving performance.

To avoid confusion, the SGD batch size (or total batch size) is defined as the total batch size across all GPUs, and the normalization batch size is defined as the batch size on a single GPU. (Note: this was mentioned in the article “Summary of normalization methods”. When multiple GPUs are used, the mini-batch statistics are computed from only batch size / number of GPUs samples; for example, a total batch size of 256 on 8 GPUs gives a normalization batch size of 32.)

The normalization batch size has a direct impact on training noise and on the train-test inconsistency: larger batches push the mini-batch statistics closer to the population statistics, thereby reducing both training noise and the train-test inconsistency.

To facilitate the analysis, the paper reports the error rate under three different evaluation settings:

  • Mini-batch statistics evaluated on the training set

  • Mini-batch statistics evaluated on the validation set

  • Population statistics evaluated on the validation set

Experimental conclusions

Small normalization batch sizes (such as 2 or 4) perform poorly, but the model actually performs well when mini-batch statistics are used at evaluation time (blue curves). The results show that the large inconsistency between mini-batch statistics and population statistics is the main factor hurting performance at small normalization batch sizes.

On the other hand, when the normalization batch size is larger, small inconsistencies provide regularization to reduce validation errors. This causes the red curve to perform better than the blue curve.

Based on the above conclusions, two methods are presented to eliminate the inconsistency and improve performance (a sketch of the first option follows the list):

  • Use mini-batch statistics during inference

  • Use population statistics during training
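The first option can be illustrated with a minimal PyTorch sketch (an assumed helper, not the paper's code): keep only the BN layers in train mode during evaluation, so that they normalize each forward pass with the current mini-batch statistics.

```python
import torch

@torch.no_grad()
def evaluate_with_minibatch_stats(model, data_loader):
    # Put only the BN layers in train mode so they normalize with the current
    # mini-batch statistics; everything else (e.g. dropout) stays in eval mode.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.train()
            m.momentum = 0.0  # keep the stored running statistics untouched

    correct = total = 0
    for inputs, targets in data_loader:
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / total
```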

Batch from Different Domains

The training of a BatchNorm model can be viewed as two separate stages: the features are first learned by SGD, and the population statistics are then trained on these features using EMA or PreciseBN. The paper calls these two phases “SGD training” and “population statistics training”.

In this section, the paper analyzes two situations in which domain gaps occur: when the model is trained on one domain but tested on others, and when the model is trained on multiple domains. Both complicate the use of BatchNorm.

Experimental conclusions

  1. When there is a significant domain shift, the model obtains the best error rate when the population statistics are trained on the domain used for evaluation rather than on the SGD training data. Intuitively, composing batches per domain reduces the train-test inconsistency and improves generalization to the new data distribution.

  2. Training BatchNorm in a domain-specific way on a mixture of multi-domain data has often been proposed in previous work, under names such as “Domain-Specific BN”, “Split BN”, “Mixture BN”, “Auxiliary BN”, and “Transferable Norm”. Each of these approaches involves some of the following three choices.

  • Domain-specific SGD training

  • Domain-specific population statistics

  • Domain-specific affine transform

By ablating the three choices above, the paper shows the importance of using a consistent strategy between SGD training and population statistics training, even though such an implementation may not seem intuitive.
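As an illustration of what “domain-specific” means here, a minimal sketch of a per-domain BN module might look like the following. This is my own illustration under the assumption that every forward pass contains samples from a single known domain, not the implementation of any of the papers above.

```python
import torch.nn as nn

class DomainSpecificBN2d(nn.Module):
    # Hypothetical module: one BatchNorm2d per domain, so each domain keeps
    # its own population statistics and its own affine (gamma/beta) parameters.
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_features) for _ in range(num_domains)
        )

    def forward(self, x, domain_id):
        # All samples in x are assumed to come from the same domain, so both
        # SGD training and population-statistics training stay domain-specific.
        return self.bns[domain_id](x)
```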

Information Leakage within a Batch

In “Summary of normalization methods”, I concluded that one of the three defects of BN is its poor performance when the samples within a mini-batch are not independent and identically distributed; the authors attribute this to information leakage.

The experiments found that the validation error increases when mini-batch statistics computed from randomly sampled batches are used, and that when population statistics are used the validation error gradually increases as the number of epochs grows, which confirms the existence of the information leakage problem in BN.

To deal with the information leakage problem, SyncBN is commonly used to weaken the correlation between samples within a mini-batch. Another solution is to randomly shuffle the RoI features across GPUs before they enter the head, which assigns each GPU a random subset of samples to normalize and likewise weakens the correlation between the samples in a mini-batch, as shown in the figure below.
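As a usage sketch of the first remedy (assuming a standard PyTorch distributed data-parallel setup; build_model and local_rank are placeholders), converting a model's BN layers to SyncBN looks like this:

```python
import torch

# Assumes torch.distributed is already initialized and `local_rank` is the
# GPU index of this process; build_model() is a hypothetical constructor.
model = build_model()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(local_rank), device_ids=[local_rank]
)
```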

As shown in the figure below, the experiments demonstrate that both shuffling and SyncBN effectively address the information leakage problem, allowing the head to generalize well with population statistics at test time.

In terms of speed, shuffling requires fewer cross-GPU synchronizations for a deep model, but each synchronization transmits more data than a SyncBN layer does. The relative efficiency of shuffling versus SyncBN therefore depends on the specific model architecture.

Conclusion

This article reviewed the BatchNorm algorithm and the paper's PreciseBN algorithm for approximating the population statistics, which is more stable than the commonly used EMA. It also analyzed how composing mini-batches from different domains affects performance, and discussed two methods for dealing with mini-batch samples that are not independent and identically distributed.

Combined with the three earlier articles, “Batch Normalization”, “Visualizing BatchNorm: how it works and why neural networks need it”, and “Summary of normalization methods | a.k.a. BN and the waves after it”, readers should now have a very comprehensive understanding of BatchNorm and a deeper understanding of neural networks.

This article comes from the paper-sharing series of the public account CV Technical Guide.

Welcome to follow the public account CV Technical Guide, which focuses on summarizing computer vision techniques, tracking the latest technology, and interpreting classic papers.

Reply with the keyword “technical summary” in the public account to obtain a PDF collection of the account's original technical summary articles.

Other articles

CV technical Guide – Summary and classification of essential articles

Summary of tuning methods for hyperparameters of neural networks

CVPR2021 | Rethinking “Batch” in BatchNorm

ICCV2021 | Rethinking the spatial dimensions of vision transformers

CVPR2021 | End-to-end video instance segmentation with Transformers

ICCV2021 | (Tencent YouTu) Rethinking counting and localization in crowds: a purely point-based framework

Complexity analysis of convolutional neural networks

A review of the latest research on small target detection in 2021

Self attention in computer vision

Review column | A survey of pose estimation

CUDA optimizations

Why is GEMM at the heart of deep learning

Why are 8 bits enough for deep neural networks?

Capsule Networks: The New Deep Learning Network

Classic paper series | Object detection: CornerNet and the defects of anchor boxes

What about the artificial intelligence bubble

Use Dice Loss to achieve clear boundary detection

PVT: a versatile convolution-free backbone for dense prediction

CVPR2021 | Open-world object detection

Siamese network summary

Past, present and possibility of visual object detection and recognition

What concepts or techniques have you learned as an algorithm engineer that have made you feel like you’ve improved by leaps and bounds?

Summary of computer vision terms (I): building the knowledge system of computer vision

Summary of under-fitting and over-fitting techniques

Summary of normalization methods

Summary of common ideas of paper innovation

Summary of efficient Reading methods of English literature in CV direction

A review of small sample learning in computer vision

A brief overview of knowledge distillation

Optimize the read speed of OpenCV video

NMS summary

Loss function technology summary

Summary of attention mechanism technology

Summary of feature pyramid technology

Summary of pooling techniques

Summary of data enhancement methods

Summary of CNN structure Evolution (I) Classical model

Summary of CNN structural evolution (II) Lightweight model

Summary of CNN structure evolution (III) Design principles

How to view the future trend of computer vision

Summary of CNN visualization technology (I) Feature map visualization

Summary of CNN visualization technology (II) Convolution kernel visualization

Summary of CNN visualization technology (III) Class visualization

Summary of CNN visualization technology (IV) Visualization tools and projects