This article introduces the hot paper of 2021: the Vision Transformer (ViT).

The Hot Paper of 2021: Transformer (ViT)

This article was originally written by Jin Tian and first published at blog.tsuai.cn. You are welcome to repost it, but please keep this copyright notice. Any questions can be sent via WeChat: jintianiloveu.

It’s been a long time since I last wrote an article, so before I start, please allow me a small advertisement. Since github.io is no longer comfortably accessible (you know why), I have redirected my blog to my domestic domain, blog.tsuai.cn. Technical articles will be posted there from time to time, so interested readers can check it out; of course, you can also follow my account on Zhihu, though I won’t duplicate every update there.

This article is mainly a translation of a Medium article, but I will add some supplementary points from my own understanding along with reference material, hoping to give you a comprehensive picture of the Transformer. My level is limited, so please kindly point out any shortcomings. The original address is in the reference links at the bottom.

When you think of Transformers in vision, you may first think of Facebook’s DETR, but I’d like to start with this one from the Google Brain team: An Image is Worth 16×16 Words.

Why start with this paper? Because it is the work of the heavyweights at Google. It is not about simply stacking attention mechanisms to chase a benchmark, nor about how a Transformer can be applied to one specific task. Rather, it explores whether visual tasks themselves can be stripped away from traditional CNNs and still reach their current level. It’s a must-read paper for 2021. Before I explain how impressive the Google Brain team’s work is, here is a summary of what people have been using Transformers for in the vision world (including 3D point clouds) over the past year.

01. Transformers in NLP

Before we get started, let’s look back at the Transformer’s phenomenal success in NLP. For NLP practitioners, the Transformer has become the standard. How standard? Much like when I started NLP three years ago and LSTMs ruled everything. The well-known GPT-3 is also a giant Transformer-based model, and its paper won the NeurIPS 2020 Best Paper award.

So in the vision field of 2021, will Transformers be as successful as they are in NLP? In fact, using Transformers for visual tasks has become a new hot research direction, with people studying how to apply the technique to lower model complexity and improve training efficiency.

02. Transformers in CV 2020

Over the past year, at least a few papers have built Transformer-based models that surpass many of the leading traditional approaches in terms of metrics:

  • DETR: End-to-End Object Detection with Transformers, which uses a Transformer for object detection;
  • ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, which applies the Transformer to basic visual tasks such as classification;
  • Image GPT: Generative Pretraining from Pixels, which uses a Transformer for image completion;
  • LDTR: Transformers for lane detection, which uses a Transformer for lane line detection.

Generally speaking, across all the papers and works on Transformers, there are two broad architectural families: one combines traditional CNNs with a Transformer, and the other uses a pure Transformer.

  • Pure Transformers;
  • Hybrid Transformers (CNN + Transformer).

The ViT paper, the Google Brain paper mentioned at the beginning, uses a pure Transformer for visual tasks, which means it doesn’t use any CNNs. I know you’re wondering what that means and why no CNNs are used. Don’t worry, I’ll talk about that later in this article, and I’ll make a video when I have time to walk through Google’s open-source code. For now, just know that it’s a pure Transformer.

What is a hybrid Transformer? The DETR paper, for example, builds on the Transformer idea but keeps a traditional CNN as the feature extraction backbone. We can call this a hybrid Transformer.

At this point, curious readers probably have a lot of questions:

  • Why use it in vision? How does it work?
  • How does this compare to existing results?
  • Are there any constraints or challenges?
  • Transformers come in a variety of structures. Are there typical ones that combine accuracy and efficiency, and why?

Good, you’ve learned to ask questions, though some of them are too hardcore to fit in the scope of this article. Still, I’m going to dig into Google Brain’s ViT and try to figure out the answers with you. In fact, to get to the bottom of these very basic issues, we need to switch scenes and look at how the Transformer itself was proposed and why attention mechanisms are useful.

As for videos on the Transformer, I recommend one by a YouTuber, but you may not be able to watch it there (you understand the reason), so I did my best to re-upload it to Bilibili. While you’re there, please give my poor Bilibili account a like and a follow; I only have five followers now. Poor thing.

www.bilibili.com/video/BV1py…

So, Attention Is All You Need. Some readers may ask: it seems all the videos and papers you’ve shared are about NLP, so how do they relate to vision? Yes, the Transformer we are talking about is exactly the Transformer from NLP.

Let’s take a closer look at ViT.

03. Vision Transformer

This is a figure from the Google Brain paper. Going back to the question raised at the beginning: why am I using this paper to talk about the Transformer? As I mentioned in the last section, Transformers are divided into pure and hybrid, and this is the first paper to discuss and use a pure Transformer exclusively for visual tasks.

This is its value: the Google Brain team made almost no change to the NLP Transformer structure itself. They only adapted the input, cutting the image into many small patches and feeding them to the model as a sequence, then completed the classification task, and the result catches up with CNN-based SOTA.

Their approach is simple: splitting an image into patches naturally produces a sequence of inputs that can be plugged straight into a Transformer. To further maintain the local and global relationships between these small patches, each patch keeps a number corresponding to its position in the original image. This is a good way to preserve spatial and positional information even though the patches would otherwise be order-free. In the paper they also ran a comparative experiment with and without this positional encoding; if you are interested, read it carefully.
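To make this patch-to-sequence step concrete, here is a minimal PyTorch sketch (my own illustration, not Google’s open-source code) of cutting an image into 16×16 patches, projecting each patch to an embedding, and adding a learnable position embedding plus a classification token, assuming a 224×224 input:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch-to-sequence adapter in the spirit of ViT.
    Names and defaults here are assumptions, not the official code."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts, flattens and projects the patches in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and position embeddings preserve location information.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                           # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend the classification token
        return x + self.pos_embed                  # add position information

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]): ready for a standard Transformer encoder
```

Everything after this adapter is just the standard NLP Transformer encoder, which is exactly the point of the paper.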

The paper also compares directly against traditional CNNs, with models pre-trained on datasets of different sizes, such as:

  • ILSVRC-2012 ImageNet, with 1K categories and 1.3M images;
  • ImageNet-21k, with 21K categories and 14M images;
  • JFT, with 18K categories and 303M high-resolution images.

These datasets are so large that training is measured not in days but in thousands of days of compute:

2.5K days of compute; who besides the big players at Google could do this paper? Let’s look at ViT’s results. In fact, the ViT model mirrors BERT, and I even think Google has always wanted to do for vision what BERT did for NLP; here the two are really well combined by the Google researchers. Their models come in “Base”, “Large” and “Huge” variants. The “Large” model’s accuracy already surpasses ResNet152x4, and the training time appears to be shorter.

One of the interesting conclusions of the paper, matching our intuition, is that the Transformer does not perform well when the data scale is small. In other words, it takes a large enough dataset to train a Transformer well.

As you can see from this graph, when there is not enough data, the accuracy falls short.

As mentioned earlier, we have ViT, which completely replaces CNNs, and DETR, which only partially replaces them. So which architecture is best? Google’s chart reveals some answers:

  • The pure Transformer structure is more efficient and scales better than traditional CNNs; whether small or large, it performs better.
  • Hybrid structures work better than pure ones at small scales. This may be because the Transformer needs not only more data but also a larger model, yet it has a higher upper limit.

04. DETR

This is the first paper to use a Transformer for object detection, and of course it is a hybrid model as described earlier. Today DETR also has some shortcomings, even though it reaches the level of Faster R-CNN on the metrics; for example, it shows weakness in small-object detection. There are now papers that improve on it, such as Deformable DETR. These improvements are not the core point of this article, so let’s review the idea of using a Transformer in DETR.

The characteristics of this model are:

  • A traditional CNN is used to learn a 2D feature representation and to extract features.
  • The CNN output is flattened and fed to the Transformer as input; this flattened input carries the features’ positional encoding.
  • The Transformer’s output, i.e. the decoder output, is fed into an FFN, which then predicts classes and boxes (a rough sketch follows this list).
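To tie these three points together, here is a rough PyTorch sketch of such a pipeline. It is only my own illustration under simplifying assumptions (a ResNet-50 backbone, 100 learned object queries, a naive learned positional encoding); the real DETR also needs the Hungarian matching loss and other details.

```python
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Sketch of the pipeline above: CNN backbone -> flattened features with
    positional encoding -> Transformer -> FFN heads for classes and boxes."""
    def __init__(self, num_classes=91, hidden_dim=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # drop pool/fc
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)     # 2048 -> d_model
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)         # learned object queries
        self.pos_embed = nn.Parameter(torch.zeros(50 * 50, 1, hidden_dim))  # naive learned pos enc
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)         # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)                        # (cx, cy, w, h)

    def forward(self, images):                                  # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))          # (B, 256, H/32, W/32)
        B, C, H, W = feats.shape
        src = feats.flatten(2).permute(2, 0, 1)                 # (H*W, B, 256) sequence
        src = src + self.pos_embed[: H * W]                     # add positional encoding
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, B, 1)  # (num_queries, B, 256)
        hs = self.transformer(src, tgt)                         # decoder output
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 640, 640))
print(logits.shape, boxes.shape)  # torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```

Each of the 100 query slots directly proposes one box (or “no object”), which is why the anchors and NMS of classical detectors are no longer needed.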

Compared with traditional object detection, such a structure at least removes the anchor settings and the redundant NMS step. These hand-crafted operations are eliminated, even though they still play a significant role in today’s detection algorithms.

The real beauty of DETR is not its object detection results, but the amazing effect it shows when extended to panoptic segmentation:

So what did they do? Panoptic segmentation is actually two tasks: semantic segmentation, which assigns a category to each pixel, and instance segmentation, which detects each object and segments its region. DETR combines the two and shows amazing results.

One of the interesting highlights of this paper is the algorithm’s ability to distinguish overlapping objects, which reflects the power of attention mechanisms; the Transformer itself is a huge attention engine. For example, it can separate objects that heavily overlap:

05. Image GPT

Before OpenAI’s DALL-E work, which demonstrates the power of the Transformer, they had already done some related work. Image GPT is an image completion model based on GPT-2. I should add here that, as LeCun says, Transformers are really good at fill-in tasks, as if they were born for them.

Image GPT treats an image as a sequence of pixels and generates images by autoregressively predicting the next pixel (a sketch follows the list below). Highlights of Image GPT’s work are:

  • It uses the same Transformer architecture as GPT-2;
  • It is trained with unsupervised learning;
  • It needs more computation if we want a better feature representation.
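For intuition, here is a toy PyTorch sketch of the autoregressive next-pixel idea, assuming pixels have already been quantized into a small vocabulary of cluster ids; the sizes and names are illustrative and far smaller than OpenAI’s model.

```python
import torch
import torch.nn as nn

class TinyImageGPT(nn.Module):
    """Toy sketch of the Image GPT idea: treat an image as a sequence of
    quantized pixel tokens and predict the next pixel with a causal Transformer."""
    def __init__(self, vocab_size=512, seq_len=32 * 32, d_model=256, n_layers=4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)
        # Causal mask: each position may only attend to earlier pixels.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, tokens):                         # tokens: (B, L) pixel-cluster ids
        L = tokens.size(1)
        x = self.tok_embed(tokens) + self.pos_embed[:, :L]
        x = self.blocks(x, mask=self.causal_mask[:L, :L])
        return self.head(x)                            # (B, L, vocab) next-pixel logits

model = TinyImageGPT()
pixels = torch.randint(0, 512, (2, 1024))              # a 32x32 image flattened row by row
logits = model(pixels)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 512),
                                   pixels[:, 1:].reshape(-1))   # shift-by-one objective
print(logits.shape, loss.item())
```

Because the objective only requires the raw pixels themselves, no labels are needed, which is why the training is unsupervised.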

In fact, OpenAI may have built on this work and combined it with the text-based GPT-3 to create the impressive DALL-E.

06. Conclusion

In this article, we reviewed some of the better-known classic Transformer works, which point to a new direction for vision. This may also be one of the important main lines of 2021.

My purpose in writing this article is not only to summarize, but also to point everyone toward one direction: this structure is totally different from traditional CNNs. It gives the model a new way of expressing the relationship between part and whole, and even between the time and space domains. GPT-3's success also suggests that, as with CNNs before it, this direction may open a new era for vision, because it makes it possible to train at a very large scale and, like BERT, to serve as a pre-training building block for all other tasks. As I write this, OpenAI has just released their new text-to-image generation model, DALL-E, which shows how a model can deeply learn the relationship between natural language and images, and how the two can be integrated; this should inspire us further. Perhaps with this new architecture, the vision models of the future will not only learn features such as textures, but also higher-order information such as associations with natural language. Today, object detection just tells you that something is class 0, 1, 2 or 3; future object detection might tell you that this is a table and this is an apple, and let the model do the reasoning.

All in all, we still need to stay imaginative about new technology hotspots. In the near future, perhaps visual tasks, like GPT-3, can truly approach human-like reasoning, and that may be the turning point of this round of technological iteration.

As for more details and code walkthroughs of the Transformer, I will continue to explain them in the following posts. If you find this useful, follow my account on Zhihu and click “Like”. Happy Year of the Ox.

Reference

  1. Source: towardsdatascience.com/transformer…
  2. DETR: arxiv.org/pdf/2005.12…
  3. ViT: arxiv.org/pdf/2010.11…
  4. Image GPT: cdn.openai.com/papers/Gene…
  5. Attention is all you need: arxiv.org/pdf/1706.03…