Attention Is All You Need

RNN and LSTM sequence models have some problems: 1. The memory length is limited: an RNN can only remember a short span of the sequence, which is why LSTM was proposed later; 2. There is no parallelization: time step t1 can only be computed after t0, so computation is relatively inefficient.

Google therefore proposed the Transformer, which in theory (setting aside hardware limits) can have arbitrarily long memory and can be fully parallelized.

What is the Embedding layer for?

The Keras (Chinese) documentation says little about the Embedding layer beyond one sentence: “the Embedding layer turns positive integers (indices) into dense vectors of fixed size”. So why do we use Embedding? There are two main reasons:

1. Vectors produced by one-hot encoding are very high-dimensional and sparse. Suppose we have a 2,000-word dictionary in an NLP task: with one-hot encoding, each word is represented by a vector of 2,000 integers, 1,999 of which are zeros. The larger the dictionary, the lower the computational efficiency of this representation.

2. During training, every embedding vector is updated. Projecting the embedded vectors into a multidimensional space lets us measure how similar words are and visualize the relationships between them; and this applies not only to words but to anything that can be turned into a vector by an Embedding layer.

If the idea above still feels abstract, let’s look at how an Embedding layer handles a concrete example. The concept of Embedding comes from word embeddings; if you are interested in reading more, check out Word2Vec.
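A minimal Keras sketch of the idea (the vocabulary size of 2000 and embedding dimension of 64 are illustrative, not fixed choices):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Toy vocabulary of 2000 words embedded into 64 dimensions: instead of a
# 2000-dimensional one-hot vector per word, each word index is mapped to a
# dense, trainable 64-dimensional vector.
embedding = Embedding(input_dim=2000, output_dim=64)

word_ids = np.array([[3, 17, 250, 1999]])  # one "sentence" of 4 word indices
vectors = embedding(word_ids)              # shape: (1, 4, 64)
print(vectors.shape)
```

The 64-dimensional vectors are trained together with the rest of the network, so words used in similar contexts end up close to each other in this space.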

Image from: A Survey on Visual Transformer

1. Introduction

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Research background

Since it was proposed, the Transformer has achieved excellent results in NLP. Its full-attention structure not only strengthens feature extraction but also retains parallel computation, allowing it to handle almost every NLP task quickly and well, which has greatly advanced natural language processing.

Although the Transformer is powerful, its use in computer vision had been limited. Before ViT, only DETR in object detection had used the Transformer at scale; other areas made little use of it, and there was no pure-Transformer network.

The following figure appears in almost every paper on this topic:

Advantages of the Transformer

1. Parallel computation;

2. Global view (every token can attend to every other token);

3. Flexible stacking (encoder blocks can be stacked to great depth);

Research results and significance

ViT and the ResNet baselines achieve similar results

JFT: Google’s closed-source dataset, about 30 times larger than ImageNet

ViT-H/14: the ViT-Huge model with 14×14 pixel input patches

ViT’s historical significance: Demonstrates the possibility of using a pure Transformer structure in computer vision.

Before: image → backbone (CNNs) → Transformer → result

ViT: image → Transformer → result

2. Overview of algorithm models

Where it all begins: self-attention

What is Attention? Take machine translation for example.

Each output is related to only a few of the input words: where the relationship is strong the weight is large, and where the relationship is weak the weight is small.

Attention Mechanism

The essence of Attention: a weighted average

The computation of Attention: in essence, a similarity computation

How do you calculate self-attention?

Self-attention computation: it is really a similarity computation, computing the similarity between each query Q and every key K.

Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

What are Q, K and V?

Query (Q): the query vector;

Key (K): the key vector;

Value (V): the value vector.

Why does the dot product measure how similar q is to k?

Formula: q₁ · k₁ = |q₁| × |k₁| × cos θ₁

q₁ · k₂ = |q₁| × |k₂| × cos θ₂

The Attention computation

Suppose the input is a sequence x1–x4. Each input vector is first multiplied by a matrix W to get an embedding, i.e. the vectors a1–a4. These embeddings then enter the self-attention layer, where each of a1–a4 is multiplied by three different transformation matrices Wq, Wk and Wv. Taking a1 as an example, this gives three different vectors q1, k1 and v1. Each query q is then used to attend over every key k: the attention score is just the dot product of the two vectors, measuring how well they match, divided by the square root of the dimension of q and k.
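The description above can be condensed into a few lines of numpy. This is a rough single-head sketch; the shapes and names are illustrative, not the paper’s code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # scaled dot-product similarity
    weights = softmax(scores, axis=-1)      # attention weights (each row sums to 1)
    return weights @ V                      # weighted average of the values

# toy example: 4 input vectors (a1..a4), model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4)
```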

MultiHead Attention
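Multi-head attention runs several such attention heads in parallel on lower-dimensional projections and concatenates their outputs. A rough numpy sketch, with illustrative dimensions, building on the single-head version above:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Split the projections into num_heads heads, attend per head, then merge."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape each projection to (num_heads, n, d_head)
    Q, K, V = (m.reshape(n, num_heads, d_head).transpose(1, 0, 2) for m in (Q, K, V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # per-head scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # per-head softmax
    heads = weights @ V                                    # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ Wo                                     # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=2).shape)  # (4, 8)
```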

ViT structure

Inspired by the success of scaling up the Transformer in NLP, the authors apply a standard Transformer directly to images with as few modifications as possible. To do this, the image is split into patches, and the sequence of linear embeddings of these patches is fed to the Transformer as input. Image patches are treated the same way as tokens (words) in NLP applications. The model is trained for image classification in a supervised manner.

1. Split the image into serialized data: the original H×W×C image is transformed into N D-dimensional vectors (i.e. an N×D two-dimensional matrix);

2. Position Embedding: the position embedding (purple box) is added to the patch embedding (pink box) to incorporate position information;

3. Learnable Embedding: Xclass (the pink box marked with an asterisk) is a learnable vector, the class token. This token carries no semantic information (it is unrelated to any word in a sentence or to any patch in the image); it is tied to the image label, and the encoder output at this position gives an overall representation biased toward this designated embedding. A rough sketch of these three steps follows below.
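A rough numpy sketch of steps 1–3 (patch splitting, linear projection, class token and position embedding); all names and sizes here are illustrative, not the paper’s code:

```python
import numpy as np

def image_to_tokens(img, patch_size, W_embed, x_class, pos_embed):
    """Turn an HxWxC image into the ViT input sequence z0."""
    H, W, C = img.shape
    p = patch_size
    # 1. split into N = (H/p)*(W/p) patches and flatten each to p*p*C values
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)     # (N, p*p*C)
    # 2. linear projection to D dimensions: the patch embeddings
    tokens = patches @ W_embed                   # (N, D)
    # 3. prepend the learnable class token and add the position embeddings
    tokens = np.vstack([x_class, tokens])        # (N+1, D)
    return tokens + pos_embed                    # (N+1, D)

H = W = 224; C = 3; p = 16; D = 768
rng = np.random.default_rng(0)
z0 = image_to_tokens(rng.random((H, W, C)), p,
                     rng.normal(size=(p * p * C, D)),          # patch projection
                     rng.normal(size=(1, D)),                  # class token
                     rng.normal(size=((H // p) ** 2 + 1, D)))  # position embeddings
print(z0.shape)  # (197, 768)
```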

Transformer Encoder: take the z0 obtained above as the initial input to the Transformer. The Transformer Encoder consists of alternating MSA (multi-head self-attention) and MLP blocks, and LN normalization is applied before each MSA and each MLP.
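A minimal sketch of one such pre-LN encoder block (the MSA is left as a placeholder here, and the MLP uses ReLU instead of the paper’s GELU for brevity; everything is illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each token (row) to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(z, msa, mlp):
    """One pre-LN encoder block: z = z + MSA(LN(z)); z = z + MLP(LN(z))."""
    z = z + msa(layer_norm(z))   # multi-head self-attention sublayer + residual
    z = z + mlp(layer_norm(z))   # MLP sublayer + residual
    return z

rng = np.random.default_rng(0)
D = 8
W1, W2 = rng.normal(size=(D, 4 * D)), rng.normal(size=(4 * D, D))
mlp = lambda x: np.maximum(x @ W1, 0.0) @ W2   # toy 2-layer MLP
msa = lambda x: x                              # placeholder for the MSA sketch above
print(encoder_block(rng.normal(size=(5, D)), msa, mlp).shape)  # (5, 8)
```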

Positional Encoding

Why do we need positional encoding?

After the image is split into patches and rearranged (from two dimensions to one), position/spatial information is lost, and the Transformer’s internal operations are insensitive to token order, so the position information needs to be encoded and fed back into the network.

ViT uses learnable position embedding vectors for this. The position embedding and the patch embedding are added directly together to form the input.

Why just add, not concat?

Because addition is a special case of concatenation.

Add form: W(I + P) = W·I + W·P

Concat form: [W1 W2]·[I; P] = W1·I + W2·P

When W1 = W2 = W, the two expressions are equivalent.
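A quick numerical check of this identity (toy shapes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
I = rng.normal(size=(4, 1))     # patch embedding (toy 4-dimensional column vector)
P = rng.normal(size=(4, 1))     # position embedding
W = rng.normal(size=(2, 4))     # shared projection matrix

add_form = W @ (I + P)                               # W(I + P)
concat_form = np.hstack([W, W]) @ np.vstack([I, P])  # [W1 W2][I; P] with W1 = W2 = W
print(np.allclose(add_form, concat_form))            # True
```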

The difference between BN and LN

Why LN?

LN (Layer Normalization) essentially takes the mean and variance of each individual sample and converts its input to zero mean and unit variance. BN (Batch Normalization) is not used here because BN operates on a group of samples, normalizing the same feature dimension across those samples, whereas LN operates on a single sample, normalizing all feature dimensions of that sample. Moreover, with N+1 token sequences here, different sequences may have different lengths, which makes batch statistics unreliable.
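The difference in normalization axes is easy to see in numpy (the learnable scale and shift parameters are omitted; this is a simplified sketch):

```python
import numpy as np

x = np.random.default_rng(2).normal(size=(8, 16))   # a batch of 8 samples, 16 features each

# BatchNorm (training statistics): normalize each feature across the batch
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# LayerNorm: normalize all features within each individual sample
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(bn.mean(axis=0).round(6))   # ~0 for every feature (column-wise)
print(ln.mean(axis=1).round(6))   # ~0 for every sample (row-wise)
```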

3. Analysis of experimental results

Pre-train on large datasets, then fine-tune on small datasets

After transfer, the original MLP head is replaced with an FC layer matching the number of target classes, and the positional encodings must be interpolated when handling inputs of a different size; a sketch of this interpolation follows below.
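One common way to perform that interpolation is bicubic resizing of the 2D grid of position embeddings. The following is a hedged PyTorch sketch; the function name and grid sizes are illustrative, not the paper’s code:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate ViT position embeddings when the input resolution changes.
    pos_embed: (1, 1 + old_grid*old_grid, dim), with the class token first."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

pe = torch.randn(1, 1 + 14 * 14, 768)    # pretrained with a 14x14 patch grid
pe_new = resize_pos_embed(pe, 14, 24)    # fine-tune with a 24x24 patch grid
print(pe_new.shape)                      # torch.Size([1, 577, 768])
```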

Transformer models do not perform as well as ResNets on medium-scale datasets such as ImageNet.

When the dataset is scaled up, the Transformer model approaches or exceeds some current SOTA (state-of-the-art) results.

BiT: a large ResNet model for supervised + transfer learning

Noisy Student: EfficientNet-L2 for semi-supervised learning

ViT-H/14: the ViT-Huge model with 14×14 pixel input patches

The relationship between attention distance and network depth: attention distance is roughly equivalent to the receptive field size in a CNN. The deeper the layer, the greater the distance the attention spans. However, even in the lowest layers, some heads already cover a long distance, which shows that the Transformer can integrate information globally.