This article is reprinted from Heart of the Machine.

Microsoft has open-sourced the code and pre-trained models of Swin Transformer for major CV tasks.

Since Google introduced the Transformer in June 2017, it has become the mainstream model for natural language processing. Recently, the Transformer has begun crossing over into computer vision, and many new Transformer-based models have appeared, such as Google's ViT for image classification and SETR from Fudan University, Oxford, Tencent, and other organizations. As a result, "Can Transformer do everything?" has been a hot topic in the machine learning community for a while.

Not long ago, researchers at Microsoft Research Asia proposed a hierarchical vision Transformer whose representation is computed with shifted windows. They call it Swin Transformer. Compared with the earlier ViT model, Swin Transformer makes two improvements: first, it introduces the hierarchical construction commonly used in CNNs to build a hierarchical Transformer; second, it introduces the idea of locality and computes self-attention within non-overlapping window regions.
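To make the second idea concrete, here is a minimal PyTorch sketch (not the official implementation) of partitioning a feature map into non-overlapping windows and running ordinary multi-head self-attention independently inside each window; the helper names and the example sizes (a 56x56, 96-channel feature map with 7x7 windows, typical of a Swin-T first stage) are assumptions for illustration.

```python
import torch
import torch.nn as nn

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows*B, window_size*window_size, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

class WindowSelfAttention(nn.Module):
    """Standard multi-head self-attention, applied only within each local window."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, windows):            # (num_windows*B, tokens_per_window, C)
        out, _ = self.attn(windows, windows, windows)
        return out

# Example: 56x56 feature map, 96 channels, 7x7 windows.
x = torch.randn(1, 56, 56, 96)
windows = window_partition(x, window_size=7)   # (64, 49, 96)
attn = WindowSelfAttention(dim=96, num_heads=3)
y = attn(windows)                              # attention is restricted to each 7x7 window
```

Because attention is computed per window rather than over all tokens, the cost grows linearly with the number of windows (and hence with image size) instead of quadratically with the number of tokens.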

Paper link: https://arxiv.org/pdf/2103.14…

First, let's look at the overall workflow of Swin Transformer. Figure 3(a) shows the overall architecture of Swin Transformer, and Figure 3(b) shows two successive Swin Transformer blocks.

The highlight of this study is that the representation of the hierarchical Transformer is computed with shifted windows: self-attention computation is limited to non-overlapping local windows, while cross-window connections are still allowed. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. Figure 2 of the paper illustrates how self-attention is computed with the shifted window approach in the Swin Transformer architecture.
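A minimal sketch of the cross-window idea, assuming the cyclic-shift trick described in the paper: between two consecutive blocks the window grid is displaced by half a window, which can be implemented by rolling the feature map before partitioning and rolling it back afterwards. This reuses the hypothetical `window_partition` helper from the sketch above and ignores the attention masking the official code applies to the wrapped-around regions.

```python
import torch

def shift_then_partition(x, window_size, shift_size):
    # x: (B, H, W, C). Cyclically shift so the new windows straddle the old window borders.
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(shifted, window_size)   # helper from the previous sketch

def unshift(x, shift_size):
    # Undo the cyclic shift after window attention has been applied and windows merged back.
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

# With window_size=7, the shifted block uses shift_size = 7 // 2 = 3, so tokens that sat in
# different windows in the previous block now share a window, giving cross-window connections.
```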

These characteristics enable the model to achieve competitive performance on a range of vision tasks: 86.4% top-1 image classification accuracy on the ImageNet-1K dataset, and 58.7 box AP for object detection and 51.1 mask AP on the COCO test-dev dataset. Currently, Swin-L (a variant of Swin Transformer) achieves SOTA on both the COCO minival and COCO test-dev sets for object detection and instance segmentation.

In addition, Swin-L also achieves SOTA on the semantic segmentation task on the ADE20K val and ADE20K datasets.

Open-source code and pre-trained models

Shortly after the publication of the Swin Transformer paper, Microsoft open-sourced the code and pre-trained models for image classification, object detection, and semantic segmentation on GitHub. In just two days, the project has already gained 2,100 stars.

Project address: https://github.com/microsoft/…
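As a hedged usage sketch (not taken from the official README), the pre-trained classification weights can also be loaded through the `timm` library; the model name below follows timm's naming convention for the Swin-T variant and is an assumption, as is the dummy input in place of real ImageNet preprocessing.

```python
import timm
import torch

# Assumed timm model name for the Swin-T ImageNet-1K checkpoint.
model = timm.create_model('swin_tiny_patch4_window7_224', pretrained=True)
model.eval()

# Classify a dummy 224x224 image; real use would apply the ImageNet preprocessing pipeline.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)   # torch.Size([1, 1000]) -- ImageNet-1K classes
```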

First, for the image classification task, the accuracy of the Swin-T, Swin-S, Swin-B, and Swin-L variants on the ImageNet-1K and ImageNet-22K datasets is as follows:

Second, for the object detection task, the results of the Swin-T, Swin-S, Swin-B, and Swin-L variants on the COCO object detection (2017 val) dataset are as follows:

Finally, for the semantic segmentation task, the results of the Swin-T, Swin-S, Swin-B, and Swin-L variants on the ADE20K semantic segmentation (val) dataset are shown below. Currently, Swin-L has achieved a SOTA mIoU of 53.50% on the validation set.