This article covers the basic pipeline of the Transformer: the two ways of implementing patch partitioning, several ways of implementing Position Embedding, how to implement the Encoder, the two common ways of producing the final classification, and, most importantly, how the data format changes along the way.

This article is from the Technical Summary series of the CV Technical Guide public account.

Welcome to follow the CV Technical Guide public account, which focuses on computer vision technical summaries, tracking of the latest techniques, interpretation of classic papers, and CV job information.

Before we talk about how to build the model, let's review how the Transformer is structured in computer vision, using a typical ViT as the example.

As shown in the figure, the image is first split into N×N patches, the patches are flattened, and then mapped to tokens through a fully connected layer. A position embedding is added to each token, and a randomly initialized token (the class token) is concatenated to the tokens generated from the image. The result passes through multiple Transformer Encoder layers; finally, that randomly initialized token is extracted and mapped through a fully connected layer to produce the classification, just as in an ordinary classification network.

Let's follow this process step by step to build a Transformer model.

Patch partitioning

At present, there are two ways to implement patch partitioning: one is direct partitioning, and the other is a convolution whose kernel size and stride both equal the patch size.

Direct partitioning

Direct partitioning divides the image directly into multiple patches. The einops library is used in the implementation; the operation reshapes (B, C, H, W) into (B, (H/P * W/P), P*P*C).

import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

self.to_patch_embedding = nn.Sequential(
           Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
           nn.Linear(patch_dim, dim),
      )

Here is a brief introduction to Rearrange.

Rearrange is used to rearrange the dimensions of tensors and can replace reshape, view, transpose and permute in PyTorch. A couple of examples:

images = torch.randn(32, 200, 400, 3)        # (b, h, w, c)
rearrange(images, 'b h w c -> (b h) w c')    # shape becomes (32*200, 400, 3)
rearrange(images, 'b h w c -> b c h w')      # shape becomes (32, 3, 200, 400)
rearrange(images, 'b h w c -> (b c w) h')    # shape becomes (32*3*400, 200)

As these examples show, Rearrange is simple and easy to use. The b, c, h and w here can be understood as symbols describing the transformation. With these examples in mind, you should be able to see how the following line of code partitions the image.

Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width)

One important point: inside a pair of parentheses, the product of the two variables gives the length of that dimension, so the h and w here should not be read as the height and width of the image. In fact h = H/p1 and w = W/p2, i.e. the number of patches along the height and along the width. Neither h nor w needs to be assigned; einops infers them from the expression, while b and c automatically correspond to the B and C of the input data.

The resulting shape is (B, (H/P * W/P), P*P*C).

After partitioning, each patch vector still needs to be mapped to a token through a fully connected layer.

This is the direct partitioning approach used in ViT.
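To make the shape bookkeeping concrete, here is a minimal sketch of direct partitioning; the 224×224 input, patch size of 16 and token dimension of 128 are arbitrary choices for illustration, not values from the original article.

import torch
from torch import nn
from einops.layers.torch import Rearrange

imgs = torch.randn(2, 3, 224, 224)                # (B, C, H, W)
to_patches = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = 16, p2 = 16)
x = to_patches(imgs)
print(x.shape)                                    # torch.Size([2, 196, 768]) = (B, H/P * W/P, P*P*C)
x = nn.Linear(16 * 16 * 3, 128)(x)                # map each patch vector to a token
print(x.shape)                                    # torch.Size([2, 196, 128]) = (B, n, d)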

Convolutional partitioning

Convolutional partitioning is easy to understand: simply convolve the image once with a kernel size and stride both equal to the patch size.

self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

x = self.proj(x).flatten(2).transpose(1, 2)  # B Ph*Pw C

Swin Transformer uses this kind of convolutional partitioning; note that no fully connected layer is added after the convolution in Swin Transformer.
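As a rough sketch of what this looks like with concrete numbers (in_chans = 3, embed_dim = 96 and patch_size = 4 are assumptions here, similar to a typical Swin configuration, not values taken from the original text):

import torch
from torch import nn

proj = nn.Conv2d(3, 96, kernel_size = 4, stride = 4)   # kernel size and stride both equal the patch size
x = torch.randn(2, 3, 224, 224)                        # (B, C, H, W)
x = proj(x).flatten(2).transpose(1, 2)
print(x.shape)                                         # torch.Size([2, 3136, 96]) = (B, Ph*Pw, C), with Ph = Pw = 56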

Position Embedding

Position Embedding can be divided into absolute Position Embedding and relative Position Embedding.

If you have studied the original Transformer, you may have noticed that it uses sine and cosine encodings, but those only apply to 1-dimensional data such as speech and text. Images are highly structured data, and sine/cosine encodings are not well suited to them.

Both ViT and Swin Transformer instead randomly initialize a set of learnable parameters with the same shape as the tokens; adding them to the tokens completes the absolute position embedding.

Implementation in ViT:

self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

x += self.pos_embedding[:, :(n + 1)]
# ViT randomly initializes a class token and concatenates it to the patch tokens,
# so the number of positions is num_patches + 1.

Implementation in SWIN Transformer:

from timm.models.layers import trunc_normal_
self.absolute_pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
trunc_normal_(self.absolute_pos_embed, std=.02)

Implementation in TimeSformer:

self.pos_emb = torch.nn.Embedding(num_positions + 1, dim)

The approaches above are the simple ones; they all belong to absolute position embedding.

There are more complicated methods, which will be introduced in a separate article later.

Interested readers can first take a look at the ICCV 2021 paper "Rethinking and Improving Relative Position Encoding for Vision Transformer".

Encoder

The Encoder consists of multi-head self-attention and a FeedForward module.

Multi-head Self-attention

Multi-head self-attention first maps the tokens to Q, K and V, computes the dot product of Q and K, applies Softmax to obtain the attention weights, uses those weights to take a weighted sum over V, and finally passes the result through a fully connected layer.

The formula is as follows:
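Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

(the standard scaled dot-product attention formula)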

The so-called multi-head refers to splitting Q, K and V into multiple heads; d_k in the formula is the dimension of each head.

The specific code is as follows:

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5          # 1 / sqrt(d_k)

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        # a single linear layer produces Q, K and V at once
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        # split into Q, K, V and separate the heads: (b, n, h*d) -> (b, h, n, d)
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        # scaled dot product of Q and K, then Softmax to get attention weights
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = self.attend(dots)
        attn = self.dropout(attn)

        # weight V by the attention, merge the heads back, then project out
        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

For self-attention, Q = K = V, i.e. all three come from the same tokens. For general attention, K = V while Q is something else, such as tokens at a different scale, or tokens from other frames in video tasks.
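As a quick usage sketch of the Attention module defined above (the sizes here are made up for illustration), self-attention simply means that Q, K and V are all computed from the same tokens:

import torch

attn = Attention(dim = 64, heads = 8, dim_head = 8)
tokens = torch.randn(2, 197, 64)      # (B, n, d), e.g. 196 patch tokens + 1 cls token
out = attn(tokens)                    # Q, K and V all come from `tokens`
print(out.shape)                      # torch.Size([2, 197, 64])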

FeedForward

There is no need for further introduction.

class FeedForward(nn.Module):
   def __init__(self, dim, hidden_dim, dropout = 0.):
       super().__init__()
       self.net = nn.Sequential(
           nn.Linear(dim, hidden_dim),
           nn.GELU(),
           nn.Dropout(dropout),
           nn.Linear(hidden_dim, dim),
           nn.Dropout(dropout)
      )
   def forward(self, x):
       return self.net(x)

Putting these two together gives the Encoder.

class Transformer(nn.Module):
   def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
       super().__init__()
       self.layers = nn.ModuleList([])
       for _ in range(depth):
           self.layers.append(nn.ModuleList([
               PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
               PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
          ]))
   def forward(self, x):
       for attn, ff in self.layers:
           x = attn(x) + x
           x = ff(x) + x
       return x

depth is the number of Encoder layers. PreNorm refers to layer normalization applied before each sub-module.

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

Classification method

There are two typical ways to obtain the final prediction vector after the Encoder. In ViT, a cls_token is randomly initialized and concatenated to the patch tokens; after the Encoder, the cls_token is extracted and mapped to the final prediction dimension through a fully connected layer, just like an ordinary classification network.

# cls_token part
from einops import repeat

self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

cls_tokens = repeat(self.cls_token, '1 n d -> b n d', b = b)
x = torch.cat((cls_tokens, x), dim = 1)

#################################
# classification part
self.mlp_head = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, num_classes)
)

x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
x = self.to_latent(x)
return self.mlp_head(x)

Swin Transformer does not use a cls_token. Instead, after the Encoder, all tokens are simply averaged (global average pooling) and then passed through the fully connected layer.

self.avgpool = nn.AdaptiveAvgPool1d(1)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()

x = self.avgpool(x.transpose(1, 2))  # B C 1
x = torch.flatten(x, 1)
x = self.head(x)

Putting all of these together gives a complete model:

import torch
from torch import nn
from einops import repeat
from einops.layers.torch import Rearrange

# helper used below: allow image_size / patch_size to be an int or an (h, w) tuple
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

class ViT(nn.Module):
   def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
       super().__init__()
       image_height, image_width = pair(image_size)
       patch_height, patch_width = pair(patch_size)

       num_patches = (image_height // patch_height) * (image_width // patch_width)
       patch_dim = channels * patch_height * patch_width
       assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

       self.to_patch_embedding = nn.Sequential(
           Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
           nn.Linear(patch_dim, dim),
      )

       self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
       self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
       self.dropout = nn.Dropout(emb_dropout)
       self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

       self.pool = pool
       self.to_latent = nn.Identity()
       self.mlp_head = nn.Sequential(
           nn.LayerNorm(dim),
           nn.Linear(dim, num_classes)
      )

   def forward(self, img):
       x = self.to_patch_embedding(img)
       b, n, _ = x.shape

       cls_tokens = repeat(self.cls_token, '1 n d -> b n d', b = b)
       x = torch.cat((cls_tokens, x), dim=1)
       x += self.pos_embedding[:, :(n + 1)]
       x = self.dropout(x)
       x = self.transformer(x)
       x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

       x = self.to_latent(x)
       return self.mlp_head(x)

Transformation of data

The code above is relatively simple; overall, the trickiest part is understanding how the data is transformed.

The initial input has shape (B, C, H, W), which becomes (B, n, d) after partitioning.

In a CNN model it is easy to understand that (H, W) is the feature map and C is the number of feature maps. So of the n and d here, which corresponds to channels and which to image features?

Let's look at the partitioning line again:

Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width)

From this we know that n is the number of patches and d is the content of each patch. So n here plays the role of C in the CNN model, while d corresponds to the features.

In general, the Encoder works with data in the form (B, n, d).

In Swin Transformer, the convolutional partitioning produces the form (B, C, L), and a transpose then gives (B, L, C), which is exactly the same form as the direct partitioning in ViT: L in Swin Transformer corresponds to n in ViT, and C corresponds to d.

Therefore, note that in multi-head self-attention the data has the form (batch_size, channels, features), and it is the features dimension that gets split across the multiple heads.
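A quick shape check of that head split (the sizes are assumed for illustration):

import torch
from einops import rearrange

tokens = torch.randn(2, 197, 64)                  # (B, n, d)
per_head = rearrange(tokens, 'b n (h d) -> b h n d', h = 8)
print(per_head.shape)                             # torch.Size([2, 8, 197, 8]): the feature dim d is what gets split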

As mentioned earlier, ViT also concatenates a randomly initialized cls_token, which has shape (B, 1, d); you can think of it as one extra token along the n dimension.
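Putting it all together, here is a minimal sketch tracing the shapes through the ViT class defined above (the hyper-parameters are arbitrary choices for illustration):

import torch

model = ViT(image_size = 224, patch_size = 16, num_classes = 10, dim = 128,
            depth = 2, heads = 4, mlp_dim = 256)
img = torch.randn(2, 3, 224, 224)            # (B, C, H, W)
tokens = model.to_patch_embedding(img)
print(tokens.shape)                          # torch.Size([2, 196, 128]) = (B, n, d)
logits = model(img)                          # inside forward, the cls_token makes it 197 tokens
print(logits.shape)                          # torch.Size([2, 10])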

That covers the details of building a Transformer model; overall it is fairly simple. After reading this article, you can look through a few Transformer codebases, such as ViT, Swin Transformer and TimeSformer, to consolidate your understanding.

ViT: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py
Swin Transformer: https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py
TimeSformer: https://github.com/lucidrains/TimeSformer-pytorch/blob/main/timesformer_pytorch/timesformer_pytorch.py

In the next article, we will introduce how to write the training function, including setting up the optimizer, setting the learning rate, setting different learning rates for different layers, parsing arguments, and so on.
