As Transformers cross over into computer vision, one problem remains to be solved: how can they handle inputs of different sizes directly, the way CNNs do? To address this, Meituan proposed a new implicit conditional position encoding method, and the CPVT model built on it outperforms ViT and DeiT.

**Heart of Machine release, Heart of Machine Editorial Department.**

With the introduction of Facebook's DETR (ECCV 2020) [2] and Google's ViT (ICLR 2021) [3], the application of Transformers in vision has heated up rapidly and become one of the hottest topics in current vision research. However, vision Transformers are constrained by their fixed-length position encodings and, unlike CNNs, cannot directly handle inputs of different sizes. This greatly limits their applicability, because many vision tasks, such as detection, require the input size to change dynamically at test time.

One solution is to interpolate ViT's position encodings to fit a different image size, but the model then has to be fine-tuned again, otherwise performance deteriorates.
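
For illustration, here is a minimal sketch of that interpolation workaround in PyTorch (the function name `interpolate_pos_embed` and the grid sizes are illustrative assumptions, not code from the papers); even after resizing the embedding, the model generally needs to be fine-tuned again to recover accuracy.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid*old_grid, dim); the first token is the class token."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so 2D bicubic interpolation can be applied
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# 224x224 with 16x16 patches gives a 14x14 grid; 384x384 gives a 24x24 grid
new_pos = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 192), old_grid=14, new_grid=24)
```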

Recently, Meituan proposed an implicit conditional position encoding (CPE) [1] for vision Transformers. It lifts the input-size restriction imposed by explicit position encodings and lets the Transformer handle inputs of different sizes with ease. Experiments show that a Transformer equipped with CPE outperforms ViT and DeiT.

Paper address: arxiv.org/pdf/2102.10…

Project address: github.com/Meituan-Aut… (to be open-sourced soon)

Background

Google's ViT typically splits a 224×224 image into 196 patches of size 16×16 and applies a linear projection to each of them in turn to obtain an input sequence, which lets the Transformer process an image as if it were a sequence of characters. To retain the positional relationship between image patches, a position encoding with the same dimension as the input sequence embeddings is added. DeiT [4] improves the training efficiency of ViT, removing the requirement of pre-training on huge datasets (such as JFT-300M): the Transformer can be trained directly on ImageNet.
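
The patchification and fixed-length position encoding described above can be sketched roughly as follows (a minimal PyTorch sketch with an assumed embedding size of 192 and the class token omitted; not the official ViT code). The fixed `pos_embed` parameter is exactly what ties the model to a single input resolution.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, project them linearly, and add a fixed-length position encoding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 196 for a 224x224 image
        # a strided convolution is equivalent to cutting 16x16 patches and projecting each one
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # one learnable position vector per patch: this fixes the sequence length
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):
        x = self.proj(x)                   # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, dim)
        return x + self.pos_embed          # shape mismatch if the input resolution changes

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 192)
```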

For vision Transformers, position encoding is essential

In experiments with ViT and CPVT, the performance of a Transformer without position encoding drops significantly. In addition, Table 1 shows that learnable position encoding performs on par with sin-cos position encoding, while 2D RPE is worse, although still better than using no position encoding at all.

Meituan and the University of Adelaide propose a new position encoding method

Design requirements for position encoding

Explicit position encoding limits the input size, so this study considers an implicit, variable-length encoding that changes with the input. In addition, the encoding needs to meet the following requirements:

  • Maintain good performance;

  • Avoid permutation equivariance;

  • Easy to implement.

Based on the above requirements, the study proposes a conditional Positional Encoding Generator (PEG) to produce the implicit position encoding.

Generating the implicit conditional position encoding

In the PEG, the 1D output of the previous Encoder is reshaped into 2D, a transformation module then learns its position information, and the result is reshaped back to 1D and added to the previous 1D output to serve as the input of the next Encoder, as shown in Figure 2. The transformation unit can be a depthwise convolution, a depthwise separable convolution, or another, more complex module.

By inserting the PEG into the model (as in Figure 1, it is placed after the first Encoder), position encoding information can be supplied to every subsequent Encoder. The advantage of this encoding is that it does not need to be specified explicitly and its length varies with the input, hence the name implicit conditional position encoding.
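
Based on this description, a PEG can be sketched as follows (a minimal PyTorch sketch: the kernel size, embedding dimension, and sequence handling are assumptions rather than the released Meituan implementation, and a class token, if present, would be excluded before the reshape).

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: a depthwise convolution over the 2D token map."""
    def __init__(self, dim=192, kernel_size=3):
        super().__init__()
        # depthwise conv (groups == channels); zero padding keeps the H x W grid unchanged
        self.proj = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x, height, width):
        """x: (B, N, dim) output of the previous Encoder, with N = height * width."""
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, height, width)  # 1D token sequence -> 2D map
        pos = self.proj(feat)                                   # conditional position encoding
        pos = pos.flatten(2).transpose(1, 2)                    # back to (B, N, dim)
        return x + pos                                          # input to the next Encoder
```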

Experiments

ImageNet dataset

The study names the vision Transformer model equipped with the PEG as CPVT (Conditional Position Encoding Vision Transformer). On the ImageNet dataset, CPVT models outperform ViT and DeiT models of comparable size. Thanks to the implicit conditional encoding's ability to adjust dynamically to the input, a model trained on 224×224 inputs can directly process 384×384 inputs (last column of Table 3), with performance improving and no fine-tuning required. In contrast, models using other, explicit encodings suffer a performance drop when they are not fine-tuned.
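
As an illustration of that flexibility, the hypothetical PEG sketch above accepts a token grid of any size, so the same weights handle both the 14×14 grid of a 224×224 input and the 24×24 grid of a 384×384 input with no interpolation step:

```python
import torch

peg = PEG(dim=192)                                      # PEG class from the sketch above
out_224 = peg(torch.randn(1, 14 * 14, 192), 14, 14)     # (1, 196, 192), 224x224 input
out_384 = peg(torch.randn(1, 24 * 24, 192), 24, 24)     # (1, 576, 192), 384x384 input, same weights
```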

Comparison with other encoding methods

Table 5 shows the performance of the CPVT-Ti model under different encoding strategies. Among them, inserting a PEG after each of Encoders 0 through 5 performs best, with a top-1 accuracy of 73.4%. Whether CPVT uses the PEG alone or combines it with a learnable encoding, it outperforms DeiT-Tiny under every encoding strategy.

The role of the PEG at different positions

The ViT backbone consists of 12 Encoders. CPVT compares the results of placing the PEG at positions -1, 0, 3, 6, and 10, among others. Experiments show that the PEG performs best when it follows the first Encoder (index 0). According to the study, this position not only provides a global receptive field but also lets the model use position information as early as possible.
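
For concreteness, a minimal sketch of this placement (using the hypothetical PEG class above and standard PyTorch Encoder layers; not the released CPVT code) puts the PEG right after Encoder 0, so that every later Encoder sees position-aware tokens:

```python
import torch
import torch.nn as nn

class TinyCPVT(nn.Module):
    """Toy 12-Encoder stack with a PEG inserted after the first Encoder (index 0)."""
    def __init__(self, dim=192, depth=12, heads=3, grid=14):
        super().__init__()
        self.grid = grid
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        self.peg = PEG(dim)                       # PEG class from the earlier sketch

    def forward(self, tokens):                    # tokens: (B, grid*grid, dim)
        tokens = self.blocks[0](tokens)           # Encoder at index 0
        tokens = self.peg(tokens, self.grid, self.grid)   # position info injected here
        for blk in self.blocks[1:]:
            tokens = blk(tokens)
        return tokens

out = TinyCPVT()(torch.randn(1, 14 * 14, 192))    # (1, 196, 192)
```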

Conclusion

**The implicit position encoding proposed in CPVT is a plug-and-play, general-purpose approach.** It lifts the restriction on input size, which is expected to further improve the performance of vision Transformers on segmentation, detection, super-resolution, and other tasks. This research should have a positive impact on the subsequent development of vision Transformers.

References

1. Do We Really Need Explicit Position Encodings for Vision Transformers? arxiv.org/pdf/2102.10…

2. End-to-end object detection with transformers arxiv.org/abs/2005.12…

3. An image is worth 16×16 words: Transformers for image recognition at scale openreview.net/pdf?id=Yicb…

4. Training data-efficient image transformers & distillation through attention arxiv.org/abs/2012.12…