
Please follow my official account [Jizhi Vision] for more shared notes.

Hi, I’m Jizhi Vision. This article introduces the implementation and comparison of BN and IN from a deployment perspective.

Anyone working in deep learning has probably used the classic and practical residual network (ResNet), which commonly comes in ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152 variants; the main difference between them is the number of residual blocks. ResNet introduced the residual block, in which the identity mapping adds no extra parameters or computation and eases the flow of information, and it was an early demonstration that networks can be made much deeper while avoiding the degradation problem. The residual block structure of ResNet is as follows:
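As a reference, here is a minimal sketch of such a residual block in PyTorch (a simplified BasicBlock with an identity shortcut; the class name and shapes are illustrative, not taken from torchvision):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A minimal ResNet-style residual block (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut: adds no extra parameters

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```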

For detection and recognition tasks, ResNet is a very good backbone: it extracts features effectively, and a neck and head can then be attached to form a complete network. The reason for bringing up ResNet is that the BN and IN implementations discussed here are embedded in ResNet, so let's get back to BN and IN.


1. Talk about BN

Batch Normalization (BN) normalizes activations so that they fall into the well-behaved (roughly linear, non-saturated) range of the activation function. This keeps gradients larger and lets gradient descent take bolder steps, which mitigates the gradient dispersion that comes with deeper networks; and because it perturbs the original data distribution, it also relieves over-fitting to some extent.

The principle of batch normalization is as follows. As shown in the following figure, BN normalizes across the whole batch per channel: for each channel C, the N, H and W dimensions are pooled together and normalized as one group.

The computation of the BN operator is expressed mathematically as follows:
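For reference, the standard BN computation over a mini-batch for channel $c$ (using the usual notation: $\mu$, $\sigma^2$ are the batch statistics, $\gamma$, $\beta$ the learned scale and bias, $\epsilon$ a small constant) is:

$$
\mu_c = \frac{1}{NHW}\sum_{n,h,w} x_{n,c,h,w},\qquad
\sigma_c^2 = \frac{1}{NHW}\sum_{n,h,w}\bigl(x_{n,c,h,w}-\mu_c\bigr)^2
$$

$$
y_{n,c,h,w} = \gamma_c\,\frac{x_{n,c,h,w}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}} + \beta_c
$$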

ResNet contains many conv + BN + activation structures, and any algorithm accelerator that sees them will want to fuse them into a single operator. Taking the most classical conv + BN fusion as the example, the mathematical principle is as follows:

Writing out the conv layer and the BN layer, substituting the conv output into the BN formula, and regrouping the terms yields one equivalent "large" convolution with a new weight and bias.
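A sketch of that algebra, with assumed notation ($W$, $b$: conv weight and bias; $\mu$, $\sigma^2$, $\gamma$, $\beta$, $\epsilon$: the BN parameters):

Conv layer: $z = W * x + b$

BN layer: $y = \gamma\,\dfrac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

Substituting the conv output into BN:

$$
y = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}}\bigl(W * x + b - \mu\bigr) + \beta
$$

which is equivalent to one large convolution $y = W_{\text{new}} * x + b_{\text{new}}$ with

$$
W_{\text{new}} = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}}\,W,\qquad
b_{\text{new}} = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}}\,(b-\mu) + \beta
$$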

Here is the key question: why do operator fusion at all? Intuitively, fusion reduces the number of operators, and with it the inter-operator communication and intermediate storage, which accelerates the algorithm. That is true, but the most substantial gain of conv + BN fusion is this: because the mean and var of BN are computed offline, the fused w_new and b_new can be computed in advance. What does that buy us? Originally we needed one convolution (matrix multiplies, on a Tensor Core / Cube unit) plus one BN (elementwise work, on a vector unit); after fusion the vector-unit pass disappears and only one operation remains, which is a huge performance improvement.
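Here is a minimal sketch of that weight folding in PyTorch (the helper name fuse_conv_bn and the module shapes are my own illustration, not from the original article):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a trained BatchNorm2d into the preceding Conv2d for inference."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        w_new = conv.weight * scale.reshape(-1, 1, 1, 1)
        b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        b_new = (b - bn.running_mean) * scale + bn.bias
        fused.weight.copy_(w_new)
        fused.bias.copy_(b_new)
    return fused

# The folded weights are constants, so inference runs one convolution instead of conv + BN.
conv = nn.Conv2d(3, 64, 3, padding=1)
bn = nn.BatchNorm2d(64).eval()
fused = fuse_conv_bn(conv, bn).eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(fused(x), bn(conv(x)), atol=1e-5))  # True, up to float error
```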

2. Talk about IN

IN stands for Instance Normalization, and it was designed for image style transfer. Because the result of image generation depends mainly on a single image instance, normalizing over the whole batch the way BN does is not appropriate. Using IN in style transfer speeds up model convergence while keeping each image instance independent.

The principle of instance normalization is as follows. As shown in the following figure, IN normalizes within a single image, channel by channel: each H×W feature map is normalized on its own, independent of the other channels and of the batch size.

The computation of the IN operator is expressed mathematically as follows:
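The standard instance-normalization equations (following the notation of the original IN paper, with $j, k$ indexing the spatial positions and $\epsilon$ a small constant) are:

$$
\mu_{ti} = \frac{1}{HW}\sum_{j=1}^{H}\sum_{k=1}^{W} x_{tijk},\qquad
\sigma_{ti}^2 = \frac{1}{HW}\sum_{j=1}^{H}\sum_{k=1}^{W}\bigl(x_{tijk}-\mu_{ti}\bigr)^2
$$

$$
y_{tijk} = \frac{x_{tijk}-\mu_{ti}}{\sqrt{\sigma_{ti}^2+\epsilon}}
$$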

Among them,

  • t: the index of the image in the batch;
  • i: the index of the feature map (channel).

Since IN is used mainly in style transfer, what does it have to do with detection/recognition and other tasks? The following network structure and a set of data make it clear. Below is a comparison of the IBN structures against the ResNet residual block described above. IBN-a splits the feature map in half, sending one half through IN and the other half through BN, forming a parallel IN + BN structure; IBN-b inserts an IN after the residual addition, before the ReLU activation, forming a series BN + IN structure. A sketch of the IBN-a split follows.
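For intuition, here is a minimal PyTorch sketch of that IBN-a style channel split (an illustrative module in the spirit of the IBN-Net design, not the official implementation):

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """Parallel IN/BN over a channel split, in the spirit of IBN-a (illustrative)."""
    def __init__(self, planes: int):
        super().__init__()
        self.half = planes // 2
        self.IN = nn.InstanceNorm2d(self.half, affine=True)
        self.BN = nn.BatchNorm2d(planes - self.half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.IN(a), self.BN(b)], dim=1)

# Intended as a drop-in replacement for the first BN inside a residual block.
print(IBN(64)(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```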

Here’s a set of experimental data:

For classification, IBN-a and IBN-b both outperform the ResNet50 baseline, as shown below.

For segmentation, when the training and test sets come from the same data, the IBN-a model's mIoU can be about 4 points higher than the original ResNet50. When the training and test sets differ, however, the IBN-b model has the edge, indicating that IBN-b performs better on cross-domain problems.

Some takeaways about IBN:

(1) IBN-a suits problems where the current domain is consistent with the target domain; for example, IBN-a can be used when the classification ability of ResNet50 needs to be improved;

(2) IBN-b suits problems where the current domain and the target domain are inconsistent; for example, person re-identification often involves cross-domain scenarios, which is why IBN-Net is widely used in that field.

For these reasons, networks targeting specific scenarios have begun to adopt backbones with IBN structures to improve feature extraction, so we also need to study how to deploy the IN operator.

3. Talk about BN and IN in deployment

Now let's look at BN and IN from the deployment perspective. The BN section above already covered part of the story, including conv + BN fusion and why fusion brings acceleration, so here let's focus on how IN differs from BN when deployed.

IN was originally used for image style transfer. For style generation, the independence and dynamics of each individual image matter, which is why the mean and variance (var) of IN are usually computed online during inference. PyTorch's nn.InstanceNorm2d() operator does have a parameter that lets it store the mean and variance offline during training. Pause and look at this from the deployment side: BN needs four weights, namely mean, var, scale and bias, and IN needs the same four (though the two compute them differently). Coming back to the earlier point: if the mean and var of IN are stored offline during training, I can do the same kind of fusion as with BN; with all four weights offline, the new_scale and new_bias of the fused large convolution can be computed in advance, and the whole thing can run on a Tensor Core or Cube unit for acceleration.
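A small illustration of that PyTorch behaviour (the parameter is track_running_stats of torch.nn.InstanceNorm2d; the shapes here are arbitrary):

```python
import torch
import torch.nn as nn

# By default, InstanceNorm2d recomputes mean/var from each input at inference time.
# With track_running_stats=True it keeps running estimates during training and,
# in eval() mode, normalizes with those stored statistics instead, which is what
# makes an offline conv + IN fusion possible.
inorm = nn.InstanceNorm2d(64, affine=True, track_running_stats=True)

x = torch.randn(2, 64, 56, 56)
inorm.train()
_ = inorm(x)                      # updates running_mean / running_var
inorm.eval()
y = inorm(x)                      # uses the stored statistics, not per-image ones
print(inorm.running_mean.shape)   # torch.Size([64])
```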

Here comes the problem. Our experiments show that, in our scenario, the model accuracy obtained with offline-stored mean and variance is much lower than with online computation at inference time... In that case we have to compute the mean and variance of IN online, so even if operator fusion is still possible, the mean and var inside new_scale and new_bias must be computed dynamically. The original conv + IN pipeline of four computations (conv ×1 + mean ×1 + var ×1 + IN ×1) then shrinks to three (mean ×1 + var ×1 + conv ×1), which does improve performance, but far less than conv + BN fusion does.

I implemented the BN and IN operators with TVM's TE module; here is the compute part for you to see.

The compute for BN is a single elementwise te.compute over offline statistics.
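A minimal sketch of that compute, assuming an NCHW layout and illustrative shapes (names and eps are my own, not from the original code):

```python
import tvm
from tvm import te

# Assumed shapes for illustration; a real kernel would take them as arguments.
N, C, H, W = 1, 64, 56, 56
eps = 1e-5

data  = te.placeholder((N, C, H, W), name="data")
mean  = te.placeholder((C,), name="mean")    # offline weight
var   = te.placeholder((C,), name="var")     # offline weight
gamma = te.placeholder((C,), name="gamma")   # scale
beta  = te.placeholder((C,), name="beta")    # bias

# BN at inference is purely elementwise: every statistic is a stored weight.
bn_out = te.compute(
    (N, C, H, W),
    lambda n, c, h, w: (data[n, c, h, w] - mean[c])
    / tvm.tir.sqrt(var[c] + eps) * gamma[c] + beta[c],
    name="bn_out",
)
```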

The compute for IN needs two extra reduction stages for the online mean and var before the same elementwise normalization.
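A corresponding sketch for IN with online statistics (again with assumed shapes and names):

```python
import tvm
from tvm import te

N, C, H, W = 1, 64, 56, 56
eps = 1e-5

data  = te.placeholder((N, C, H, W), name="data")
gamma = te.placeholder((C,), name="gamma")   # scale
beta  = te.placeholder((C,), name="beta")    # bias

# Stage 1: per-instance, per-channel mean over H x W.
rh = te.reduce_axis((0, H), name="rh")
rw = te.reduce_axis((0, W), name="rw")
in_mean = te.compute(
    (N, C),
    lambda n, c: te.sum(data[n, c, rh, rw] / float(H * W), axis=[rh, rw]),
    name="in_mean",
)

# Stage 2: per-instance, per-channel variance.
rh2 = te.reduce_axis((0, H), name="rh2")
rw2 = te.reduce_axis((0, W), name="rw2")
in_var = te.compute(
    (N, C),
    lambda n, c: te.sum(
        (data[n, c, rh2, rw2] - in_mean[n, c])
        * (data[n, c, rh2, rw2] - in_mean[n, c]) / float(H * W),
        axis=[rh2, rw2],
    ),
    name="in_var",
)

# Stage 3: the normalization itself, the same elementwise form as BN.
in_out = te.compute(
    (N, C, H, W),
    lambda n, c, h, w: (data[n, c, h, w] - in_mean[n, c])
    / tvm.tir.sqrt(in_var[n, c] + eps) * gamma[c] + beta[c],
    name="in_out",
)
```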

The TVM compute parts are given above. The op scheduling is essentially the same for both; compared with BN, IN simply has the additional mean and var stages to schedule.

To sum up, from the deployment perspective: if the mean and var of the IN operator are stored offline during training, then IN and BN have similar deployment and inference efficiency; if the mean and var of IN must be computed online at inference time, then IN is less efficient than BN, and that kind of IN is not very deployment-friendly.

That's my sharing of the BN and IN operators from the deployment perspective; I hope it helps you a little.

