Introduction to BoTNet: Bottleneck Transformers for Visual Recognition

gossip

Recently my thoughts have been a little negative and I am confused about my future. I don't know what to do, and I worry about how fiercely competitive the industry has become. I haven't written summaries of several papers after finishing them, which is why I don't update often. I keep doubting myself while also trying to give myself strength. Maybe the road of scientific research is a walk in the dark; we need to carry a light for ourselves, or become our own sun. When I feel negative, I usually read a book, watch a show or a movie, go for a walk, or exercise. Life is precious, and there may be some things we cannot change, but how we improve ourselves is up to us. There are countless beautiful things in the future waiting to meet a better you. Believe that the sweat you shed, the books you read, the roads you walk, and the mountains you climb will all come back to reward you.

preface

Bottleneck Transformers for Visual Recognition

code

BoTNet: a simple but powerful backbone that integrates self-attention into a variety of computer vision tasks, including image classification, object detection and instance segmentation. It significantly improves the baselines for instance segmentation and object detection while reducing the number of parameters, with only minimal extra latency.

The innovation point: on the left is the ResNet bottleneck structure, and on the right is the bottleneck with multi-head self-attention (MHSA), called the BoT block. The only difference between the two is that the 3×3 convolution is replaced by MHSA, nothing else. When the ResNet-50 backbone in Mask R-CNN is given BoT blocks and all other hyperparameters are kept unchanged, the benchmark Mask AP for COCO instance segmentation improves by 1.2%.

In this paper, the classical ResNet is modified by replacing the 3×3 convolution with multi-head self-attention; no other modifications are made. This simple change alone improves performance.

BoTNet is a hybrid model (CNN + Transformer). Recently, Huawei Noah's Ark Lab also proposed CMT, which uses this hybrid design.

CMT: Convolutional Neural Networks Meet Vision Transformers

code

DeiT directly divides the input image into non-overlapping patches, so the structural information within each patch is only weakly modeled by the linear projection. The CMT stem is similar to ResNet's and consists of three 3×3 convolutions, but the activation function is GELU instead of the ReLU used in ResNet.

Similar to classical CNN architectures such as ResNet, the proposed CMT consists of four stages that generate multi-scale features (important for dense prediction tasks). To build this hierarchical representation, a convolution is used before each stage to reduce the feature resolution and increase the channel dimension. Within each stage, multiple CMT modules are stacked for feature transformation while keeping the resolution unchanged. Each CMT module can capture local and long-range dependencies simultaneously. At the end of the model, GAP + FC is used for classification.
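As a rough illustration of this macro layout (not a reimplementation of CMT), here is a PyTorch sketch with a GELU conv stem, a strided "patch aggregation" convolution before each of the four stages, placeholder stage blocks, and a GAP + FC head. The channel widths, depths and the placeholder block are illustrative assumptions; the real CMT block combines a local perception unit, lightweight self-attention and an inverted-residual FFN.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Stem as described above: one strided 3x3 conv followed by two 3x3 convs,
    each with BatchNorm and GELU (instead of ResNet's ReLU)."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        def conv_bn_gelu(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                                 nn.BatchNorm2d(cout), nn.GELU())
        self.stem = nn.Sequential(conv_bn_gelu(in_ch, out_ch, 2),
                                  conv_bn_gelu(out_ch, out_ch, 1),
                                  conv_bn_gelu(out_ch, out_ch, 1))

    def forward(self, x):
        return self.stem(x)


class PlaceholderBlock(nn.Module):
    """Stand-in for the real CMT block (which mixes local convolution with
    lightweight self-attention); here just a residual depthwise conv + MLP."""
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + self.local(x)
        return x + self.mlp(x)


class CMTSkeleton(nn.Module):
    """Macro design only: a conv stem, four stages, a strided conv before each
    stage to halve the resolution and raise the channels, and GAP + FC at the end."""
    def __init__(self, dims=(46, 92, 184, 368), depths=(2, 2, 4, 2), num_classes=1000):
        super().__init__()
        self.stem = ConvStem(3, 32)
        stages, in_ch = [], 32
        for dim, depth in zip(dims, depths):
            downsample = nn.Conv2d(in_ch, dim, kernel_size=2, stride=2)   # halve H, W; raise C
            blocks = nn.Sequential(*[PlaceholderBlock(dim) for _ in range(depth)])
            stages.append(nn.Sequential(downsample, blocks))
            in_ch = dim
        self.stages = nn.Sequential(*stages)
        self.head = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.stages(self.stem(x))
        return self.head(x.mean(dim=(2, 3)))              # global average pooling + FC

# logits = CMTSkeleton()(torch.randn(1, 3, 224, 224))     # -> shape (1, 1000)
```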

The comparison table in the paper shows how the proposed method compares with other CNNs and Transformers; from it we can see:

  • The proposed CMT achieves better accuracy with fewer parameters and lower computational cost.
  • With 4.0B FLOPs, CMT-S reaches a top-1 accuracy of 83.5%, which is 3.7% higher than DeiT-S and 2.0% higher than CPVT.
  • CMT-S is also 0.6% higher than EfficientNet-B4, with lower computational cost.

Reviewing ResNet

Taking you through a paper series on computer vision: ResNet

ResNet was proposed by Microsoft Research in 2015 and won first place in the ImageNet classification and detection tasks, as well as first place in detection and segmentation on the COCO dataset.

Note: the output feature-map shapes of the main branch and the shortcut must be the same.

The 34-layer ResNet network architecture

Highlights in the network:

  • Ultra-deep network structure (more than 1,000 layers are possible)
  • Introduces the residual module
  • Uses Batch Normalization to accelerate training (and drops Dropout)

Is a deeper network always better? Not necessarily. Simply stacking more layers runs into two problems:

  • vanishing or exploding gradients
  • degradation

The vanishing/exploding-gradient problem can be handled with data normalization, careful weight initialization and Batch Normalization. The degradation problem is what the residual structure addresses: with residual connections, continuing to deepen the network keeps improving the results.
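To make the residual idea concrete, here is a minimal PyTorch sketch of a basic residual block; note how the main-branch output must have the same shape as the shortcut before the element-wise addition (the note above).

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal ResNet basic block: two 3x3 convs with BatchNorm plus an identity
    shortcut. The shortcut lets gradients flow directly to earlier layers,
    which is what counters the degradation problem."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut branch
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))       # must match the shortcut's shape
        return self.relu(out + identity)      # element-wise add, then ReLU

# x = torch.randn(1, 64, 56, 56)
# y = BasicResidualBlock(64)(x)               # same shape in and out
```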

BoTNet

In the figure, the traditional Transformer block is on the left, the overall BoTNet is in the middle, and a BoT block is on the right.

A ResNet bottleneck with an MHSA layer can be thought of as a Transformer block with a bottleneck, but there are some subtle differences, such as the residual connections, the choice of normalization layer, and so on.

Differences between the MHSA in the Transformer and the MHSA in BoTNet:

1. Normalization. The Transformer uses Layer Normalization, while BoTNet uses Batch Normalization.

2. Non-linearities. The Transformer uses only one non-linear activation in its FFN module, while BoTNet uses three.

3. Output projection. The MHSA in the Transformer contains an output projection; the MHSA in BoTNet does not.

4. Optimizer. The Transformer is typically trained with the Adam optimizer, while BoTNet uses SGD with momentum.
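To make the four differences concrete, here is an illustrative PyTorch sketch (not the authors' code) of the two block styles side by side: a canonical Transformer block with LayerNorm, an attention layer that includes an output projection, and a single GELU in the FFN, versus a BoT-style bottleneck with BatchNorm, three ReLUs, and an attention module injected without any output projection of its own.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Canonical Transformer encoder block: LayerNorm, MHSA with an output
    projection (nn.MultiheadAttention includes one), and an FFN with one GELU."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),                                   # the single non-linearity
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                                # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


class BoTBlock(nn.Module):
    """BoT-style bottleneck: 1x1 conv -> MHSA -> 1x1 conv, BatchNorm after each,
    three ReLUs in total, and no output projection inside the attention module."""
    def __init__(self, channels, mhsa: nn.Module, expansion=4):
        super().__init__()
        mid = channels // expansion
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))   # ReLU 1
        self.attn = nn.Sequential(mhsa,
                                  nn.BatchNorm2d(mid), nn.ReLU(inplace=True))     # ReLU 2
        self.expand = nn.Sequential(nn.Conv2d(mid, channels, 1, bias=False),
                                    nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)                                         # ReLU 3

    def forward(self, x):                                # x: (B, channels, H, W)
        out = self.expand(self.attn(self.reduce(x)))
        return self.relu(out + x)                        # residual add, then the third ReLU
```

The `mhsa` argument can be any attention module that operates on a feature map, for example the MHSA sketch shown later in this post.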

Deep-learning optimizers have evolved roughly along the path SGD → SGDM → NAG → AdaGrad → AdaDelta → Adam → Nadam.

Algorithms are great, but data is fundamental.

Fast descent with Adam followed by SGD tuning

  1. When should you switch optimizers? If you switch too late, Adam may already have settled into its own basin of attraction, and SGD may never get out of it.
  2. What learning rate should be used after the switch? Adam uses an adaptive learning rate that relies on the accumulated second-order momentum; if SGD takes over the training, what learning rate should it use?

1. First of all, there is no consensus on which optimizer is best. For beginners, SGD + Nesterov momentum or Adam is the usual recommendation.

2. Choose an algorithm you are familiar with, so you can use your experience to tune its hyperparameters more skillfully.

3. Know your data well. If the data is very sparse, prefer an adaptive learning-rate algorithm.

4. Choose according to your needs. During model design and experimentation, Adam can be used first for quick optimization to verify that a new model works; before the model goes live or the results are published, carefully tuned SGD can then be used to push it to its best performance.

5. Experiment on a small data set first. Some work points out that the convergence behavior of stochastic gradient descent has little to do with the size of the data set ("The mathematics of stochastic gradient descent are amazingly independent of the training set size. In particular, the asymptotic SGD convergence rates are independent from the sample size." [2]). Therefore, we can first pick the best optimizer on a representative small data set, and then search for the optimal training hyperparameters.

6. Consider combining different algorithms: use Adam first for a rapid descent, then switch to SGD for thorough fine-tuning (a minimal sketch follows after this list). You can refer to the switching strategy described in this article.

7. Shuffle the data set thoroughly. Otherwise, when using an adaptive learning-rate algorithm, some features may appear only in bunches, so the model sometimes learns too much and sometimes too little from them, and the descent direction becomes biased.

8. During training, continuously monitor the objective value, accuracy, AUC and other metrics on both the training data and the validation data. Monitoring the training data ensures the model is being trained sufficiently: the descent direction is correct and the learning rate is high enough. Monitoring the validation data helps avoid overfitting.

9. Use an appropriate learning-rate decay strategy. You can decay periodically, for example every few epochs, or monitor a performance metric such as accuracy or AUC and reduce the learning rate when that metric stops improving or starts to decline.
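Here is a hedged PyTorch sketch of tips 6 and 9: a quick descent with Adam, then a switch to SGD with momentum for fine tuning, with a plateau-based learning-rate decay. The model, data and hyperparameters are dummies for illustration; in practice the SGD learning rate after the switch is itself something you must choose, and the monitored metric should come from the validation set.

```python
import torch
import torch.nn as nn

# Toy setup: a dummy model and dummy batches stand in for a real network / DataLoader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

def run_epochs(optimizer, epochs, scheduler=None):
    for _ in range(epochs):
        inputs = torch.randn(32, 10)                      # one dummy batch per "epoch"
        targets = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            # ReduceLROnPlateau cuts the LR when the monitored metric stops improving;
            # in real training this should be a validation metric, not the training loss.
            scheduler.step(loss.item())

# Phase 1 (tip 6): Adam for a fast initial descent.
run_epochs(torch.optim.Adam(model.parameters(), lr=1e-3), epochs=5)

# Phase 2 (tips 6 and 9): switch to SGD + momentum for fine tuning.
# The SGD learning rate after the switch is a hyperparameter you must pick yourself.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(sgd, mode='min',
                                                       factor=0.1, patience=3)
run_epochs(sgd, epochs=20, scheduler=scheduler)
```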

The table in the paper compares ResNet-50 and BoTNet-50: the only difference is in c5. Moreover, BoTNet-50 has only 0.82 times the parameters of ResNet-50, although the step time increases slightly.

Only the 3×3 convolution inside the residual blocks of ResNet's c5 stage is replaced with the MHSA structure.

Experimental results

BoTNet vs ResNet. Training schedule: one schedule is defined as 12 epochs, two schedules as 24 epochs, and so on. Multi-scale jitter, i.e. randomly varying the input resolution, helps BoTNet more than it helps ResNet. The figure also shows that the larger the image resolution, the larger the performance gain.
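Multi-scale jitter simply means randomly rescaling the training images so the network sees many input resolutions. A minimal sketch for a single image tensor (the scale range is an assumption; for detection and segmentation the boxes and masks would need to be rescaled as well):

```python
import random
import torch
import torch.nn.functional as F

def multiscale_jitter(image, scale_range=(0.8, 1.25), divisor=32):
    """Randomly rescale one image tensor (C, H, W) so the network sees many
    resolutions during training. Sizes are rounded to a multiple of `divisor`
    so a stride-32 backbone gets cleanly divisible feature maps."""
    _, h, w = image.shape
    s = random.uniform(*scale_range)
    new_h = max(divisor, int(round(h * s / divisor)) * divisor)
    new_w = max(divisor, int(round(w * s / divisor)) * divisor)
    resized = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                            mode='bilinear', align_corners=False)
    return resized.squeeze(0)

# img = torch.rand(3, 800, 1333)
# aug = multiscale_jitter(img)   # a different resolution on every call
```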

Add Relative Position Encodings to improve performance even further!

The first row is R50 with nothing added, i.e. the original ResNet. The second row adds only the self-attention mechanism, which gives a gain of 0.6. The third row adds only relative position encoding, which performs slightly better than self-attention alone. The fourth row adds both the self-attention mechanism and relative position encoding, improving performance by 1.5. The fifth row adds self-attention with absolute position encoding and improves by only 0.4.

This graph compares the performance of several networks: BoTNet in red, SENet in black and EfficientNet in blue. T3 and T4 perform worse than SENet, while T5 performs almost the same, so a pure convolutional model can still reach about 83% accuracy. With 84.7% accuracy, the T7 model matches EfficientNet-B7 while being up to 1.64 times faster in compute, and it also outperforms DeiT-384.

BoTNet replaces the 3×3 convolution part of ResNet with MHSA; part of the MHSA code follows.
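Below is a hedged PyTorch sketch of such a 2D MHSA layer, following common re-implementations rather than the official code: 1×1 convolutions produce the queries, keys and values; the attention logits combine a content-content term with a content-position term built from learned relative position embeddings (factorised into a height part and a width part); and there is no output projection, since the 1×1 expansion convolution of the bottleneck plays that role. The feature-map size must be fixed in advance because the relative embeddings are sized for it.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Sketch of BoTNet-style multi-head self-attention over an HxW feature map:
    logits = q·k (content-content) + q·r (content-position), softmax, then
    aggregate the values; no output projection at the end."""
    def __init__(self, dim, heads=4, h=14, w=14):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dim_head = heads, dim // heads
        self.scale = self.dim_head ** -0.5
        self.to_q = nn.Conv2d(dim, dim, 1, bias=False)
        self.to_k = nn.Conv2d(dim, dim, 1, bias=False)
        self.to_v = nn.Conv2d(dim, dim, 1, bias=False)
        # learned 2D relative position embeddings, split into height and width parts
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.dim_head, h, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.dim_head, 1, w) * 0.02)

    def forward(self, x):                                    # x: (B, dim, H, W)
        b, c, h, w = x.shape
        # the feature-map size must match the size the embeddings were built for
        assert h == self.rel_h.shape[3] and w == self.rel_w.shape[4]
        q = self.to_q(x).view(b, self.heads, self.dim_head, h * w)
        k = self.to_k(x).view(b, self.heads, self.dim_head, h * w)
        v = self.to_v(x).view(b, self.heads, self.dim_head, h * w)

        content_content = torch.matmul(q.transpose(-2, -1), k)      # (B, heads, HW, HW)
        r = (self.rel_h + self.rel_w).view(1, self.heads, self.dim_head, h * w)
        content_position = torch.matmul(q.transpose(-2, -1), r)     # broadcasts over batch
        attn = torch.softmax((content_content + content_position) * self.scale, dim=-1)

        out = torch.matmul(v, attn.transpose(-2, -1))                # (B, heads, d, HW)
        return out.reshape(b, c, h, w)

# x = torch.randn(2, 512, 14, 14)
# y = MHSA2d(512, heads=4, h=14, w=14)(x)   # same shape as the input
```

In a BoT block this layer would sit where the 3×3 convolution used to be; when the original convolution had stride 2, the paper replaces the stride with a 2×2 average pooling after the attention.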

conclusion

  • Without any bells and whistles, BoTNet reaches 44.4% Mask AP and 49.7% Box AP on the COCO instance segmentation benchmark using the Mask R-CNN framework.
  • Convolutional and hybrid (convolution + self-attention) models are still strong models.
  • BoTNet achieves a top-1 accuracy of 84.7% on ImageNet, outperforming SENet and EfficientNet.

Similar to this paper, there is also JD AI's open-source ResNet variant, CoTNet: a plug-and-play visual recognition module.

The paper

code

In their exploration of the self-attention mechanism, Tao Mei's team at the JD AI Research Institute creatively unified the dynamic context aggregation of Transformer self-attention with the static context aggregation of convolution, unlike existing attention mechanisms that gather context in a purely local or purely global way. They propose a novel "plug-and-play" Transformer-style CoT module that can directly replace the convolutions in existing ResNet architectures and bring significant performance improvements. On both ImageNet classification and COCO detection and segmentation, the resulting CoTNet architecture delivers clear gains while keeping the number of parameters and FLOPs at the same level. For example, the proposed SE-CoTNetD-152 achieves 84.6% top-1 accuracy with 2.75 times faster inference, compared with 84.3% for EfficientNet-B6.


Specifically, CoTNet-50 directly replaces the convolutions in the bottleneck with CoT; similarly, CoTNeXt-50 uses the CoT module to replace the corresponding group convolutions. To keep the computation comparable, the numbers of channels and groups are adjusted: CoTNeXt-50 has 1.2 times the parameters of ResNeXt-50 and 1.01 times the FLOPs.

You can see:

  • Both CoTNet and CoTNeXt perform better than other improved versions of ResNet.
  • Compared with ResNeSt-50 and ResNeSt-101, the proposed CoTNeXt-50 and CoTNeXt-101 achieve performance improvements of 1.0% and 0.9% respectively.
  • Compared with BoTNet, CoTNet also performs better. Moreover, SE-CoTNetD-152 (320) achieves better performance than BoTNet-S1-128 (320) and EfficientNet-B7.