• Original article from the WeChat official account “Machine Learning Alchemy”
  • Author: Brother Alchemist
  • Contact: WeChat CYX645016617

  • The code comes from GitHub

[Introduction]: Reading the code alone may not make the meaning of every component in ViT clear, but the goal of this article is to understand the implementation, so that after reading the paper you have a concrete picture in mind rather than a blank.

ViT class

Initialization

As before, start with the top-level model class and then work down to the smaller component classes:

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        assert image_size % patch_size == 0, 'Image dimensions must be divisible by the patch size.'
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size ** 2
        assert num_patches > MIN_NUM_PATCHES, f'your number of patches ({num_patches}) is way too small for attention to be effective (at least 16). Try decreasing your patch size'
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.patch_size = patch_size

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.patch_to_embedding = nn.Linear(patch_dim, dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

In actual use, the model is instantiated like this (depth, heads, and mlp_dim are required arguments as well; the values below match the comments later in this article):

model = ViT(
    dim=128,
    image_size=224,
    patch_size=32,
    num_classes=2,
    channels=3,
    depth=12,
    heads=8,
    mlp_dim=128,
).to(device)

Input parameter explanation:

  • image_size: size of the input image;
  • patch_size: the image is split into small patches; this is the size of each patch;
  • num_classes: total number of classes for this classification task;
  • channels: number of input channels of the image.

Components initialized in the ViT class:

  • num_patches: how many patches the image is divided into. With image_size 224 and patch_size 32, that is 7×7 = 49 patches;
  • patch_dim: 3x32x32, i.e. the number of elements in one patch (see the quick check right after this list).
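As a sanity check, here is a minimal sketch (my addition, not from the original article) that reproduces those two numbers with the same arithmetic as the __init__ above:

image_size, patch_size, channels = 224, 32, 3
num_patches = (image_size // patch_size) ** 2   # (224 // 32) ** 2 = 7 * 7 = 49
patch_dim = channels * patch_size ** 2          # 3 * 32 * 32 = 3072
print(num_patches, patch_dim)                   # 49 3072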

Isn’t it a hassle to present it like that, jumping up and down between the code and the explanations? So from here on I’ll write the explanations directly as comments:

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        # image_size = 224, patch_size = 32, num_classes = 2, channels = 3, dim = 128
        super().__init__()
        assert image_size % patch_size == 0, 'Image dimensions must be divisible by the patch size.'
        # num_patches = (224 // 32) ** 2 = 7 * 7 = 49
        num_patches = (image_size // patch_size) ** 2
        # patch_dim = 3 * 32 * 32
        patch_dim = channels * patch_size ** 2
        assert num_patches > MIN_NUM_PATCHES, f'your number of patches ({num_patches}) is way too small for attention to be effective (at least 16). Try decreasing your patch size'
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
        # self.patch_size = 32
        self.patch_size = patch_size
        # self.pos_embedding is a learnable parameter of shape (1, 50, 128)
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # self.patch_to_embedding is a linear layer mapping from 3*32*32 to 128
        self.patch_to_embedding = nn.Linear(patch_dim, dim)
        # self.cls_token is a randomly initialized learnable parameter of shape (1, 1, 128)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        # Transformer will be explained later
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )
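A minimal shape check of these embedding components, written for this article (the sizes assume dim=128, num_patches=49, patch_dim=3*32*32 as above):

import torch
import torch.nn as nn

dim, num_patches, patch_dim = 128, 49, 3 * 32 * 32
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))   # (1, 50, 128)
cls_token = nn.Parameter(torch.randn(1, 1, dim))                     # (1, 1, 128)
patch_to_embedding = nn.Linear(patch_dim, dim)                       # 3072 -> 128

patches = torch.randn(2, num_patches, patch_dim)                     # fake patches for a batch of 2 images
print(pos_embedding.shape, cls_token.shape, patch_to_embedding(patches).shape)
# torch.Size([1, 50, 128]) torch.Size([1, 1, 128]) torch.Size([2, 49, 128])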

forward

Now look at the ViT forward (inference) process:

    def forward(self, img, mask = None):
        # p = 32
        p = self.patch_size
        x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = p, p2 = p)
        x = self.patch_to_embedding(x) # x.shape = [b, 49, 128]
        b, n, _ = x.shape # n = 49

        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1) # x.shape = [b, 50, 128]
        x += self.pos_embedding[:, :(n + 1)] # x.shape = [b, 50, 128]
        x = self.dropout(x)

        x = self.transformer(x, mask) # x.shape = [b, 50, 128], mask = None

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)
  • The code here uses from einops import rearrange, repeat. einops is a tensor-manipulation library that supports PyTorch, TensorFlow, and other backends.
  • einops.rearrange takes the input img from shape [b, 3, 224, 224], conceptually splits it into [b, 3, 7, 32, 7, 32], permutes it to [b, 7, 7, 32, 32, 3], and finally merges it into [b, 49, 32x32x3];
  • self.patch_to_embedding then gives x an output shape of [b, 49, 128];
  • einops.repeat copies self.cls_token from [1, 1, 128] to [b, 1, 128].

Now, we know that the transition from patch to embedding is realized using a linear layer.
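Here is a small self-contained sketch I added to verify those einops shapes (the tensor values are random; only the shapes matter):

import torch
from einops import rearrange, repeat

img = torch.randn(2, 3, 224, 224)                 # b = 2
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=32, p2=32)
print(x.shape)                                    # torch.Size([2, 49, 3072])

cls_token = torch.randn(1, 1, 128)
cls_tokens = repeat(cls_token, '() n d -> b n d', b=2)
print(cls_tokens.shape)                           # torch.Size([2, 1, 128])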

transformer

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout):
        # dim=128, depth=12, heads=8, dim_head=64, mlp_dim=128
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout))),
                Residual(PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout)))
            ]))
    def forward(self, x, mask = None):
        for attn, ff in self.layers:
            x = attn(x, mask = mask)
            x = ff(x)
        return x
  • self.layers contains depth groups of Attention + FeedForward modules.
  • Remember, the size of the input x is [b, 50, 128].

Attention

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads # 64 x 8
        self.heads = heads # 8
        self.scale = dim_head ** -0.5

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask = None):
        b, n, _, h = *x.shape, self.heads # n=50, h=8
        # self.to_qkv(x) has dimensions [b, 50, 64x8x3]; chunk then splits it into 3 pieces
        # qkv is a tuple of 3 tensors, each of size [b, 50, 64x8]
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # change each piece from [b, 50, 64x8] to [b, 8, 50, 64]
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
        # this step takes some thought: q and k are both [b, 8, 50, 64]; 50 is the number of tokens, 64 is the per-head feature dimension
        # dots.shape = [b, 8, 50, 50]
        dots = torch.einsum('bhid,bhjd->bhij', q, k) * self.scale
        # the mask branch can be ignored here (mask is None)
        mask_value = -torch.finfo(dots.dtype).max

        if mask is not None:
            mask = F.pad(mask.flatten(1), (1, 0), value = True)
            assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
            mask = mask[:, None, :] * mask[:, :, None]
            dots.masked_fill_(~mask, mask_value)
            del mask
        # softmax over the last dimension of [b, 8, 50, 50]
        attn = dots.softmax(dim=-1)

        # out.shape = [b, 8, 50, 64]
        out = torch.einsum('bhij,bhjd->bhid', attn, v)
        # out.shape becomes [b, 50, 8x64]
        out = rearrange(out, 'b h n d -> b n (h d)')
        # out.shape becomes [b, 50, 128]
        out = self.to_out(out)
        return out

In summary, this Attention is a self-attention module: the input is [b, 50, 128] and the output is [b, 50, 128]. The implementation looks a bit involved because of torch.einsum, but overall it is the same idea as the “non-local” module in my earlier article. torch.einsum follows the same principle as torch.mm, except that torch.mm does not support batched matrix multiplication over higher-dimensional tensors.
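To demystify the einsum calls, here is a small check I added showing that the two einsums above are just batched matrix multiplications (written with @ / torch.matmul):

import torch

b, h, n, d = 2, 8, 50, 64
q, k, v = torch.randn(b, h, n, d), torch.randn(b, h, n, d), torch.randn(b, h, n, d)

dots = torch.einsum('bhid,bhjd->bhij', q, k)
assert torch.allclose(dots, q @ k.transpose(-1, -2), atol=1e-4)   # [b, 8, 50, 50]

attn = dots.softmax(dim=-1)
out = torch.einsum('bhij,bhjd->bhid', attn, v)
assert torch.allclose(out, attn @ v, atol=1e-4)                   # [b, 8, 50, 64]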

PreNorm

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        # dim=128, fn=Attention/FeedForward
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

It applies LayerNorm to the input x (x.shape = [b, 50, 128]) and then passes the normalized result to fn, e.g. the Attention module above for self-attention.

Residual

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(x, **kwargs) + x

It’s just a residual module.

FeedForward

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        # dim=128, hidden_dim=128
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

The GELU() activation function can be called directly using torch.nn.GELU(). I’ll talk more about GELU() later.
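As a quick reference (my addition, not from the original article), nn.GELU() in its default exact form computes x · Φ(x), which you can reproduce with the error function:

import math
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
gelu = nn.GELU()                                         # exact (erf-based) GELU by default
manual = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))     # x * Phi(x), Phi = standard normal CDF
assert torch.allclose(gelu(x), manual, atol=1e-6)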

Transformer summary

Residual(PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout))),
Residual(PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout)))
  • The first line: the input goes through LayerNorm, then through Attention, and the attention output is added back to the original (pre-norm) input as a residual connection.
  • The second line: x -> LayerNorm -> FeedForward (linear layers) -> y, and y is added to the input x as a residual connection. A minimal sketch of this pre-norm residual pattern follows below.
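Here is a minimal, self-contained sketch of that pattern (my own toy re-statement, with fn standing in for Attention or FeedForward), showing that the shape is preserved:

import torch
import torch.nn as nn

def prenorm_residual(x, norm, fn):
    # x -> LayerNorm -> fn -> add back to the original x
    return x + fn(norm(x))

dim = 128
norm = nn.LayerNorm(dim)
fn = nn.Linear(dim, dim)                     # stand-in for Attention or FeedForward
x = torch.randn(2, 50, dim)
print(prenorm_residual(x, norm, fn).shape)   # torch.Size([2, 50, 128])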

ViT summary

To review the whole process:

  • A 224×224 image is divided into 49 patches of size 32×32;
  • Embedding those patches produces 49 vectors of dimension 128;
  • Adding cls_token makes it 50 vectors of dimension 128;
  • Adding pos_embedding keeps it at 50 vectors of dimension 128;
  • These vectors are fed into the Transformer for self-attention feature extraction;
  • It outputs 50 vectors of dimension 128, from which a single 128-dimensional vector is taken (the cls token, or the mean over all 50, depending on pool);
  • A linear layer then maps the 128 dimensions to 2 dimensions, completing the transformer model for this binary classification task (an end-to-end sketch follows below).
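Putting it together, here is a hedged end-to-end sketch, assuming the ViT class defined above (together with the Transformer/Attention/PreNorm/Residual/FeedForward classes, the MIN_NUM_PATCHES constant, and the einops imports it relies on) is available in the same file or imported from the GitHub repo:

import torch

model = ViT(
    dim=128,
    image_size=224,
    patch_size=32,
    num_classes=2,
    channels=3,
    depth=12,
    heads=8,
    mlp_dim=128,
)

img = torch.randn(4, 3, 224, 224)   # a fake batch of 4 RGB images
logits = model(img)
print(logits.shape)                 # torch.Size([4, 2])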

Q: I don’t know much about NLP. Can anyone answer this question: what exactly are cls_token and pos_embedding for?