
A ConvNet for the 2020s

Reference: www.bilibili.com/video/BV16Y…

Abstract

The "Roaring 20s" of visual recognition began with the introduction of ViTs, which quickly replaced ConvNets as the state-of-the-art image classification model. Although a vanilla ViT is difficult to apply to downstream tasks, hierarchical ViTs such as the Swin Transformer reintroduced several ConvNet priors, making the Transformer viable as a generic vision backbone with remarkable performance on all kinds of downstream tasks. However, the effectiveness of this hybrid approach is still largely credited to the intrinsic superiority of Transformers, rather than to the inherent inductive biases of convolution.

This paper reexamines the design space and tests the limits of what a pure ConvNet can achieve. Specifically, it gradually "modernizes" a standard ResNet toward the design of a ViT, discovering along the way several key components that contribute to the performance difference.

Constructed entirely from standard ConvNet modules, ConvNeXt competes with Transformers in accuracy and scalability, achieving 87.8% top-1 accuracy on ImageNet and surpassing Swin Transformers on COCO object detection and ADE20K semantic segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Review

The evolutionary path of ConvNet

The rise of convolutional neural networks can be traced back to AlexNet, which won the ImageNet classification challenge in 2012 and showed the world the possibilities of convolutional neural networks in computer vision.

Two years later, VGGNet showed that stacking small convolution kernels in series achieves the same receptive field as a large kernel with fewer parameters, while allowing more nonlinearity to be inserted between layers for better performance. Ever since VGG, stacks of 3×3 convolutions have been the mainstream design paradigm for convolutional neural networks.

Inception (GoogLeNet), another paper from the same year, won that year's ImageNet challenge, with VGG taking second place. The Inception family explored the impact of multi-branch structures on the model, an idea that persists to this day.

In 2015, a paper that attracted worldwide attention appeared: Deep Residual Learning for Image Recognition by Kaiming He et al. It proposed the residual structure, and ResNet, built on that design, also won the ImageNet challenge and rose to fame overnight. Its implications are profound: even today, many papers take ResNet as the baseline for research and exploration, carrying it past the 100,000-citation mark.

In 2016, ResNeXt combined ResNet with Inception-style multi-branch design and used grouped convolution to reduce parameters, a choice that has had a profound influence on subsequent lightweight network design.

In the same year, DenseNet took inter-layer connections to the extreme, exploiting features from different layers and alleviating the vanishing-gradient problem.

In 2017 and 2018, the two MobileNet papers further explored the power of grouped convolution in the form of depthwise separable convolution, and introduced the inverted residual structure, the opposite of ResNet's bottleneck; both have influenced the design of many lightweight networks.

In 2019, EfficientNet proposed compound model scaling, a method that scales a model along three dimensions: depth, width, and input image resolution. This design has become common in subsequent neural network design.

There’s more…

These designs established the current convolutional neural network design paradigm and have influenced countless networks. But they have also somewhat constrained ConvNet design in the 2020s, so today's ConvNets are in need of an overhaul, and that is exactly what ConvNeXt sets out to do.

Commonalities

What are the commonalities behind these excellent neural network designs?

Locality of computation, translation equivariance, and hierarchical feature maps.

These excellent design priors were also brought into ViTs, giving rise to the Swin Transformer.

The Rise of Transformers

Transformer

In 2017, the Transformer made its debut in NLP and soon became the dominant architecture there.

Constrained by the differences between NLP and CV, the Transformer was not brought into the vision field until ViT in 2020, and its road to dominance began.

Challenges

At the time, ViT's success was limited to image classification, but computer vision is much more than that.

At the same time, the quadratic time complexity of global self-attention limits the input size, making high-resolution images a stretch, yet downstream tasks such as segmentation and detection need high resolution to capture more information.
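
A rough back-of-the-envelope estimate makes this quadratic blow-up concrete (plain Python; a patch size of 16 is assumed, as in the original ViT):

```python
# Global self-attention compares every token with every other token, so its
# cost grows with the square of the token count; doubling the image side
# quadruples the tokens and multiplies the cost by roughly 16x.
def attention_pairs(side_px: int, patch: int = 16) -> int:
    tokens = (side_px // patch) ** 2   # number of patch tokens
    return tokens * tokens             # pairwise token interactions

for side in (224, 448, 896):
    print(side, attention_pairs(side))
# 224 -> 38416 (196 tokens), 448 -> 614656 (784 tokens), 896 -> 9834496 (3136 tokens)
```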

Development

ViT did not stop there, and its success seemed inevitable: ViT variants of all kinds have since proved themselves capable.

The Swin Transformer came out in 2021 and took SOTA on a range of tasks at a stroke. Its paper also won the Marr Prize, ICCV's best paper award that year.

The Swin Transformer borrows priors from ConvNets: it computes self-attention within local windows and shares weights across windows, and its hierarchical structure likewise becomes an asset in downstream tasks.

However, Swin pays for this in complexity: to enable interaction between local windows, it introduces cyclic shifting, which is criticized as inelegant, while sliding-window attention lacks an efficient native implementation.

But convolution already has all of those properties!

ConvNeXt

Let's start from a plain ResNet and modernize it step by step!

Training strategy

To ensure a fair comparison, ConvNets and ViTs should be trained with the same configuration. The modern recipe alone (300 epochs, AdamW, and stronger augmentation and regularization) improved ResNet-50's top-1 accuracy by 2.6%.
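
As a minimal sketch (assuming PyTorch and torchvision; the hyperparameters follow the paper's reported recipe but should be treated as illustrative, and the full recipe additionally uses Mixup/CutMix, RandAugment, Random Erasing, and Stochastic Depth, typically via the timm library):

```python
import torch
import torchvision
from torch import nn

# Modern recipe: AdamW instead of SGD, cosine decay over 300 epochs,
# and label smoothing; the augmentations are omitted here for brevity.
model = torchvision.models.resnet50()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```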

Macro design

In ResNet-50, the blocks per stage follow a ratio of 3:4:6:3, whereas Swin uses 1:1:3:1; following Swin, ConvNeXt adjusts its stage ratio to 3:3:9:3.

The other change is the stem: ResNet's initial 7×7 stride-2 convolution followed by max pooling is replaced with a "patchify" layer, a 4×4 convolution with stride 4, mirroring ViT's patch embedding.
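
A minimal PyTorch sketch contrasting the two stems (channel widths 64 and 96 are taken from ResNet-50 and Swin-T respectively); both reduce a 224×224 input by a factor of 4:

```python
import torch
from torch import nn

# ResNet stem: 7x7 stride-2 conv + 3x3 stride-2 max pool (overall stride 4).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: one non-overlapping 4x4 stride-4 conv, like ViT's patch embedding.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```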

ResNeXt-ify

This step uses depthwise convolution, i.e., grouped convolution with the number of groups equal to the number of channels. Like the weighted-sum operation in self-attention, it mixes information only in the spatial dimension, on a per-channel basis; the network width is then increased (from 64 to 96 channels, as in Swin) to compensate for the capacity loss.
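
A small sketch of the idea (PyTorch; 96 channels assumed, as in Swin-T):

```python
import torch
from torch import nn

dim = 96
# Depthwise conv: groups == channels, so each channel is filtered on its own,
# mixing information only spatially -- loosely like attention's weighted sum.
depthwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
# A 1x1 (pointwise) conv then mixes information across channels.
pointwise = nn.Conv2d(dim, dim, kernel_size=1)

x = torch.randn(1, dim, 56, 56)
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 96, 56, 56])
```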

Inverted bottleneck structure

In the self-attention calculation, the number of channels expands from C to 3C to be split into Q, K, and V, and the hidden layer of the subsequent MLP also expands fourfold; ConvNeXt therefore adopts an inverted bottleneck that widens and then narrows the channel dimension.
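
A sketch of the corresponding convolutional inverted bottleneck (expand 4×, then project back, mirroring the Transformer MLP):

```python
import torch
from torch import nn

dim = 96
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),  # expand: dim -> 4*dim
    nn.GELU(),
    nn.Conv2d(4 * dim, dim, kernel_size=1),  # project back: 4*dim -> dim
)

x = torch.randn(1, dim, 56, 56)
print(inverted_bottleneck(x).shape)  # torch.Size([1, 96, 56, 56])
```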

Bigger convolution kernel

The classical 3×3 convolution is enlarged to 7×7, the same size as Swin's window. Going larger brings no further gains on ImageNet, an observation echoed in RepLKNet, probably because 7×7 is already sufficient for ImageNet classification; for downstream tasks that need larger receptive fields, the kernel size can be increased further.
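
Because the enlarged kernel is depthwise, the parameter cost stays modest; a quick check (PyTorch, 96 channels assumed):

```python
from torch import nn

dim = 96

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

dw3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
dw7 = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
print(param_count(dw3), param_count(dw7))  # 960 vs 4800 parameters
```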

Micro design

Borrowing from the Transformer's micro design, as follows (a sketch of the resulting block appears after the list):

  1. Replace ReLU with GELU;
  2. Use fewer activation functions and normalization layers;
  3. Replace BatchNorm with LayerNorm;
  4. Use a separate downsampling layer (a 2×2 convolution with stride 2, preceded by LayerNorm) between stages, so resolution stays the same within each block.
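
Putting the steps above together, here is a sketch of the resulting ConvNeXt-style block (the official implementation additionally uses LayerScale and stochastic depth, omitted here):

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> inverted MLP with a single GELU."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # LayerNorm instead of BatchNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as Linear (expand)
        self.act = nn.GELU()                    # GELU instead of ReLU, used once
        self.pwconv2 = nn.Linear(4 * dim, dim)  # 1x1 conv as Linear (project)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x                     # residual connection

block = ConvNeXtBlock(96)
print(block(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```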

Results

Comparing the modernized ConvNeXt with its contemporaries, ConvNeXt is clearly simple and easy to implement, and its performance is on par with the wildly popular ViTs.

ConvNeXt provides guidance for convolutional network design in the post-ViT era. Its influence on the ConvNet design paradigm of the 2020s and on subsequent convolutional networks may well be immeasurable, perhaps kicking off a new wave of ConvNet design innovation.