In recent years, attention-based methods have gained popularity in academia and industry because of their interpretability and effectiveness. However, since the network structures proposed in papers are usually embedded in larger code frameworks for classification, detection, segmentation and so on, the code tends to be redundant, and it is difficult for a beginner like me to find the core code of the network, which makes it harder to understand the paper and the ideas behind the network. Therefore, I have recently collected and reproduced the core code of Attention, MLP and Re-parameter modules so that readers can understand them more easily. This article gives a brief introduction to the Attention part of the project. The project will be continuously updated with the latest papers; you are welcome to follow and star it. If you find any problems in the reproductions or in the organization of the project, please raise them in an issue and I will reply promptly.

Author information

You are welcome to follow me on GitHub: xmu-Xiaoma666, and on Zhihu: Try harder, try harder.

Project address

Github.com/xmu-xiaoma6…

1. External Attention

1.1. Reference

Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks—arXiv 2021.05.05

Address: arxiv.org/abs/2105.02…

1.2. Model structure

1.3. Introduction

This is a May 2021 article on arXiv that addresses two pain points of self-attention (SA): (1) its O(n^2) computational complexity; (2) SA computes attention only across different positions within the same sample, ignoring the relationship between different samples. Therefore, this paper uses two cascaded MLP structures as memory units, which reduces the computational complexity to O(n). In addition, these two memory units are learned from the entire training set, so the relationship between different samples is also implicitly taken into account.
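
To make the structure concrete, here is a minimal PyTorch sketch of the idea described above. The class name, the memory size S and the double normalization follow my reading of the paper and are illustrative assumptions; this is not the code from the repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttentionSketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, d_model=512, S=64):
        super().__init__()
        self.mk = nn.Linear(d_model, S, bias=False)   # first memory unit
        self.mv = nn.Linear(S, d_model, bias=False)   # second memory unit

    def forward(self, x):                             # x: (B, N, d_model)
        attn = self.mk(x)                             # (B, N, S): similarity to the external memory
        attn = F.softmax(attn, dim=1)                 # normalize over the N positions
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                          # (B, N, d_model)

x = torch.randn(50, 49, 512)
print(ExternalAttentionSketch(d_model=512, S=8)(x).shape)  # torch.Size([50, 49, 512])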

1.4. Usage

from attention.ExternalAttention import ExternalAttention
import torch
input = torch.randn(50,49,512)
ea = ExternalAttention(d_model=512,S=8)
output = ea(input)
print(output.shape)

2. Self Attention

2.1. Reference

Attention Is All You Need—NeurIPS2017

Address: arxiv.org/abs/1706.03…

2.2. Model structure

2.3. Introduction

This is a Google paper published at NeurIPS 2017. It has been hugely influential in CV, NLP, multimodal learning and other fields, and has been cited more than 22,000 times. The Self-Attention proposed in the Transformer is a kind of attention used to compute the weights between different positions of a feature, thereby updating the feature. First, the input feature is mapped to Q, K and V through FC layers; the attention map is then obtained from the dot product of Q and K; the weighted features are obtained from the dot product of the attention map and V; finally, a new feature is obtained through another FC mapping. (There are many excellent tutorials on the Transformer and Self-Attention online, so the details are not repeated here.)
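
For reference, here is a minimal single-head sketch of scaled dot-product self-attention. The repository's ScaledDotProductAttention is multi-head; this simplified, hypothetical version only illustrates the Q/K/V mappings and the two dot products described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSketch(nn.Module):
    # Illustrative single-head sketch (hypothetical class name), not the repository implementation.
    def __init__(self, d_model=512):
        super().__init__()
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)        # final FC mapping
        self.scale = d_model ** -0.5

    def forward(self, x):                              # x: (B, N, d_model)
        q, k, v = self.fc_q(x), self.fc_k(x), self.fc_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N) attention map
        return self.fc_o(attn @ v)                     # weighted features, then FC

x = torch.randn(50, 49, 512)
print(SelfAttentionSketch(d_model=512)(x).shape)  # torch.Size([50, 49, 512])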

2.4. Usage

from attention.SelfAttention import ScaledDotProductAttention
import torch
input = torch.randn(50,49,512)
sa = ScaledDotProductAttention(d_model=512, d_k=512, d_v=512, h=8)
output = sa(input,input,input)
print(output.shape)

3. Squeeze-and-Excitation(SE) Attention

3.1. Reference

Squeeze-and-Excitation Networks—CVPR2018

Address: arxiv.org/abs/1709.01…

3.2. Model structure

3.3. Introduction

This is a CVPR2018 paper, which is also very influential and currently has more than 7k citations. It proposes a channel attention mechanism that sparked a small wave of follow-up work thanks to its simple structure and effectiveness. The idea is very simple: first, AdaptiveAvgPool is applied over the spatial dimension; channel attention is then learned through two FC layers, and the channel attention map is normalized with a Sigmoid; finally, the weighted features are obtained by multiplying the channel attention map with the original features.
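
A minimal sketch of this squeeze-and-excitation structure could look as follows (hypothetical class name; illustrative only, not the repository's SEAttention).

import torch
import torch.nn as nn

class SESketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: pool over the spatial dimension
        self.fc = nn.Sequential(                       # excitation: two FC layers
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),                              # normalize the channel attention map
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # reweight each channel

x = torch.randn(50, 512, 7, 7)
print(SESketch(channel=512, reduction=8)(x).shape)  # torch.Size([50, 512, 7, 7])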

3.4. Usage

from attention.SEAttention import SEAttention
import torch
input = torch.randn(50,512,7,7)
se = SEAttention(channel=512,reduction=8)
output = se(input)
print(output.shape)

4. Selective Kernel(SK) Attention

4.1. Reference

Selective Kernel Networks—CVPR2019

Address: arxiv.org/pdf/1903.06…

4.2. Model structure

4.3. Introduction

This is a CVPR2019 paper that builds on the idea of SENet. In a traditional CNN, each convolution layer uses convolution kernels of the same size, which limits the representational capacity of the model. The "wider" structure of Inception also verifies that learning with several convolution kernels of different sizes can indeed improve that capacity. The authors borrow the idea of SENet: a channel weight is dynamically computed for each convolution kernel, and the results of the different kernels are dynamically fused.

In my opinion, this paper can also be called lightweight, because the channel attention over the features from different kernels is shared (i.e., since the features are fused before the attention is computed, the results of the different convolution kernels share the parameters of a single SE module).

The method is divided into three parts: Split, Fuse and Select. Split is a multi-branch operation that convolves the input with different convolution kernels to obtain different features. The Fuse part uses an SE-style structure to obtain the channel attention matrices (n convolution kernels give n attention matrices, and this operation shares its parameters across all the features), which yields the post-SE features of the different kernels. The Select operation then sums these features.
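
A minimal sketch of the three steps with two branches could look as follows. The kernel sizes, the group number of the grouped convolutions and the class name are illustrative assumptions; this is not the repository's SKAttention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SKSketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512, reduction=8, kernels=(3, 5)):
        super().__init__()
        # Split: one convolution branch per kernel size (grouped convs keep the sketch light)
        self.convs = nn.ModuleList(
            [nn.Conv2d(channel, channel, k, padding=k // 2, groups=32) for k in kernels])
        d = channel // reduction
        self.fc = nn.Linear(channel, d)                # shared squeeze layer
        self.fcs = nn.ModuleList([nn.Linear(d, channel) for _ in kernels])  # one head per branch

    def forward(self, x):                              # x: (B, C, H, W)
        feats = torch.stack([conv(x) for conv in self.convs], dim=0)  # (K, B, C, H, W)
        u = feats.sum(dim=0)                           # Fuse: sum the branches
        z = self.fc(u.mean(dim=(2, 3)))                # (B, d) global channel context
        weights = torch.stack([fc(z) for fc in self.fcs], dim=0)      # (K, B, C)
        weights = F.softmax(weights, dim=0).unsqueeze(-1).unsqueeze(-1)
        return (weights * feats).sum(dim=0)            # Select: attention-weighted sum

x = torch.randn(2, 512, 7, 7)
print(SKSketch(channel=512, reduction=8)(x).shape)  # torch.Size([2, 512, 7, 7])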

4.4. Usage

from attention.SKAttention import SKAttention
import torch
input = torch.randn(50,512,7,7)
se = SKAttention(channel=512,reduction=8)
output = se(input)
print(output.shape)

5. CBAM Attention

5.1. Reference

CBAM: Convolutional Block Attention Module—ECCV2018

Address: openaccess.thecvf.com/content_EC…

5.2. Model structure

5.3. Introduction

This is an ECCV2018 paper in which Channel Attention and Spatial Attention are used together and connected in series (the paper also reports ablation experiments on the parallel arrangement and on the two possible serial orders).

For Channel Attention, the overall structure is still similar to SE. However, the authors argue that AvgPool and MaxPool capture different representations, so AvgPool and MaxPool are applied separately to the original features over the spatial dimension. The two pooled descriptors then pass through an SE-style structure (note that it is shared) to extract channel attention; the two results are added and normalized to obtain the channel attention matrix.

Spatial Attention is similar to Channel Attention. Two poolings are first performed along the channel dimension, the two resulting maps are concatenated, and a 7×7 convolution is then used to extract the spatial attention (a 7×7 kernel is used because a sufficiently large receptive field is needed to extract spatial attention). A normalization then yields the spatial attention matrix.
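
A minimal sketch of this serial channel-then-spatial structure could look as follows (hypothetical class name; illustrative only, not the repository's CBAMBlock).

import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for the avg- and max-pooled features
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention: AvgPool and MaxPool over space, shared MLP, sum, sigmoid
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * self.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: mean and max over channels, concat, large-kernel conv, sigmoid
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.sigmoid(self.conv(s))

x = torch.randn(50, 512, 7, 7)
print(CBAMSketch(channel=512, reduction=16, kernel_size=7)(x).shape)  # torch.Size([50, 512, 7, 7])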

5.4. Usage

from attention.CBAM import CBAMBlock
import torch
input = torch.randn(50,512,7,7)
kernel_size = input.shape[2]
cbam = CBAMBlock(channel=512,reduction=16,kernel_size=kernel_size)
output = cbam(input)
print(output.shape)

6. BAM Attention

6.1. Reference

BAM: Bottleneck Attention Module—BMVC2018

Address: arxiv.org/pdf/1807.06…

6.2. Model structure

6.3. Introduction

This is work by the same authors from the same period as CBAM, and it is very similar to CBAM: it is also a dual attention. The difference is that CBAM applies the two attentions in series, whereas BAM directly adds the two attention matrices.

For Channel Attention, the structure is basically the same as SE. For Spatial Attention, the channel dimension is first compressed, two 3×3 dilated convolutions are then applied, and finally a 1×1 convolution produces the Spatial Attention matrix.

Finally, the Channel Attention and Spatial Attention matrices are added (using the broadcast mechanism) and normalized, which gives an attention matrix that combines spatial and channel information.
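
A minimal sketch of the two branches and the broadcast addition could look like this. The class name is hypothetical and the hyperparameters are illustrative; the residual form x + x * att follows the BAM paper, and this is not the repository's BAMBlock.

import torch
import torch.nn as nn

class BAMSketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512, reduction=16, dilation=2):
        super().__init__()
        self.channel_att = nn.Sequential(              # SE-like channel branch -> (B, C, 1, 1)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channel, channel // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channel // reduction, channel, 1),
        )
        self.spatial_att = nn.Sequential(              # spatial branch with dilated 3x3 convs -> (B, 1, H, W)
            nn.Conv2d(channel, channel // reduction, 1),
            nn.Conv2d(channel // reduction, channel // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channel // reduction, channel // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channel // reduction, 1, 1),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))  # broadcast add, then normalize
        return x + x * att                             # the BAM paper applies the attention residually

x = torch.randn(50, 512, 7, 7)
print(BAMSketch(channel=512, reduction=16, dilation=2)(x).shape)  # torch.Size([50, 512, 7, 7])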

6.4. Usage

from attention.BAM import BAMBlock
import torch
input = torch.randn(50,512,7,7)
bam = BAMBlock(channel=512,reduction=16,dia_val=2)
output = bam(input)
print(output.shape)

7. ECA Attention

7.1. Reference

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks—CVPR2020

Address: arxiv.org/pdf/1910.03…

7.2. Model structure

7.3. Introduction

This is an article from CVPR2020.

As shown in the figure above, SE implements channel attention with two fully connected layers, while ECA only needs a single convolution. The authors do this partly because they consider it unnecessary to compute the attention between every pair of channels, and partly because using two fully connected layers introduces a relatively large number of parameters and computation.

Therefore, after applying AvgPool, the author only uses a one-dimensional convolution with receptive field k (equivalent to computing attention only among k adjacent channels), which greatly reduces the number of parameters and the amount of computation (i.e., SE amounts to a global channel attention, while ECA is a local one).
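
A minimal sketch of this pooling-plus-1D-convolution structure could look like this (hypothetical class name; illustrative only, not the repository's ECAAttention).

import torch
import torch.nn as nn

class ECASketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, kernel_size=3):
        super().__init__()
        # 1D convolution over the channel dimension: each channel only sees k neighbours
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W)
        y = x.mean(dim=(2, 3)).unsqueeze(1)            # (B, 1, C): AvgPool over space
        y = self.sigmoid(self.conv(y)).transpose(1, 2) # (B, C, 1): local channel attention
        return x * y.unsqueeze(-1)                     # reweight the channels

x = torch.randn(50, 512, 7, 7)
print(ECASketch(kernel_size=3)(x).shape)  # torch.Size([50, 512, 7, 7])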

7.4. Usage

from attention.ECAAttention import ECAAttention
import torch
input = torch.randn(50,512,7,7)
eca = ECAAttention(kernel_size=3)
output = eca(input)
print(output.shape)

8. DANet Attention

8.1. Reference

Dual Attention Network for Scene Segmentation—CVPR2019

Address: arxiv.org/pdf/1809.02…

8.2. Model structure

8.3. Introduction

This is a CVPR2019 paper. The idea is very simple: apply self-attention to the scene segmentation task. The difference is that self-attention only attends to the relationships between spatial positions, whereas this paper extends self-attention with an additional channel attention branch. The operations in that branch are the same as in self-attention, except that the three Linear layers that generate Q, K and V are removed, and the attention is computed between the different channels. Finally, the features produced by the two attention branches are combined with an element-wise sum.
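
A minimal sketch of the two branches (omitting the learnable residual weights used in the original paper) could look like this; the class name is hypothetical and this is not the repository's DAModule.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionSketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512):
        super().__init__()
        self.q = nn.Conv2d(channel, channel // 8, 1)
        self.k = nn.Conv2d(channel, channel // 8, 1)
        self.v = nn.Conv2d(channel, channel, 1)

    def position_attention(self, x):                   # attention between spatial positions
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.k(x).flatten(2)                       # (B, C/8, HW)
        v = self.v(x).flatten(2)                       # (B, C, HW)
        attn = F.softmax(q @ k, dim=-1)                # (B, HW, HW)
        return (v @ attn.transpose(1, 2)).view(b, c, h, w)

    def channel_attention(self, x):                    # same idea, but without Q/K/V projections
        b, c, h, w = x.shape
        f = x.flatten(2)                               # (B, C, HW)
        attn = F.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C)
        return (attn @ f).view(b, c, h, w)

    def forward(self, x):                              # x: (B, C, H, W)
        return self.position_attention(x) + self.channel_attention(x)  # element-wise sum

x = torch.randn(2, 512, 7, 7)
print(DualAttentionSketch(channel=512)(x).shape)  # torch.Size([2, 512, 7, 7])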

8.4. Usage

from attention.DANet import DAModule
import torch
input = torch.randn(50,512,7,7)
danet = DAModule(d_model=512,kernel_size=3,H=7,W=7)
print(danet(input).shape)

9. Pyramid Split Attention(PSA)

9.1. Reference

EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network—arXiv 2021.05.30

Address: arxiv.org/pdf/2105.14…

9.2. Model structure

9.3. Introduction

This is an article from Shenzhen University, uploaded to arXiv on May 30. The purpose of this paper is to capture and exploit spatial information at different scales in order to enrich the feature space. The network structure is relatively simple and can be divided into four steps. In the first step, the original feature is split into n groups along the channel dimension, and each group is convolved at a different scale to obtain a new feature W1. In the second step, SE is applied to the features of each group so as to obtain the channel attention of the different groups. In the third step, a softmax is applied across the different groups. In the fourth step, the resulting attention is multiplied with the multi-scale feature W1.
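
A minimal sketch of the four steps could look like this. The kernel sizes, the hypothetical SEWeight helper and the class name are illustrative assumptions; this is not the repository's PSA module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SEWeight(nn.Module):
    # Hypothetical helper: an SE-style module that returns a per-channel weight vector.
    def __init__(self, channel, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        return self.fc(x.mean(dim=(2, 3)))             # (B, C)

class PSASketch(nn.Module):
    # Illustrative sketch (hypothetical class name), not the repository implementation.
    def __init__(self, channel=512, reduction=8, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.S = len(kernels)
        c = channel // self.S
        # Step 1: one convolution per group, with a different kernel size per scale
        self.convs = nn.ModuleList([nn.Conv2d(c, c, k, padding=k // 2) for k in kernels])
        # Step 2: one SE module per group
        self.ses = nn.ModuleList([SEWeight(c, reduction) for _ in kernels])

    def forward(self, x):                              # x: (B, C, H, W)
        b, _, h, w = x.shape
        groups = x.chunk(self.S, dim=1)                # split the channels into S groups
        w1 = torch.stack([conv(g) for conv, g in zip(self.convs, groups)], dim=1)        # (B, S, c, H, W)
        attn = torch.stack([se(f) for se, f in zip(self.ses, w1.unbind(dim=1))], dim=1)  # (B, S, c)
        attn = F.softmax(attn, dim=1).unsqueeze(-1).unsqueeze(-1)   # Step 3: softmax across groups
        return (w1 * attn).reshape(b, -1, h, w)        # Step 4: reweight and merge the groups

x = torch.randn(2, 512, 7, 7)
print(PSASketch(channel=512, reduction=8)(x).shape)  # torch.Size([2, 512, 7, 7])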

9.4. Usage

from attention.PSA import PSA
import torch
input = torch.randn(50,512,7,7)
psa = PSA(channel=512,reduction=8)
output = psa(input)
print(output.shape)

10. Efficient Multi-Head Self-Attention(EMSA)

10.1. Reference

ResT: An Efficient Transformer for Visual Recognition—arXiv 2021.05.28

Address: arxiv.org/abs/2105.13…

10.2. Model structure

10.3. Introduction

This is an article from Nanjing University uploaded to arXiv on May 28. It mainly addresses two pain points of SA: (1) the computational complexity of self-attention is quadratic in n (where n is the size of the spatial dimension); (2) each head only has partial information of Q, K and V, and if the dimensions of Q, K and V are too small, continuous information cannot be captured, resulting in a performance loss. The idea given in this paper is also very simple: in SA, before the FC layers, a convolution is used to reduce the spatial dimension, so that K and V are smaller in the spatial dimension.
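
A minimal single-head sketch of this spatial-reduction idea is shown below. The multi-head splitting and the additional transform used in EMSA are omitted; the class name and the strided-convolution reduction are illustrative assumptions, not the repository's EMSA.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRAttentionSketch(nn.Module):
    # Illustrative single-head sketch (hypothetical class name), not the repository implementation.
    def __init__(self, d_model=512, H=8, W=8, ratio=2):
        super().__init__()
        self.H, self.W = H, W
        self.sr = nn.Conv2d(d_model, d_model, ratio, stride=ratio)  # reduce the spatial dimension
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                              # x: (B, N, d_model), N = H * W
        b, n, d = x.shape
        q = self.fc_q(x)                               # (B, N, d)
        # reshape the tokens back into a feature map and shrink it before computing K and V
        xr = self.sr(x.transpose(1, 2).reshape(b, d, self.H, self.W))
        xr = xr.flatten(2).transpose(1, 2)             # (B, N / ratio^2, d)
        k, v = self.fc_k(xr), self.fc_v(xr)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N / ratio^2)
        return attn @ v                                # (B, N, d)

x = torch.randn(50, 64, 512)
print(SRAttentionSketch(d_model=512, H=8, W=8, ratio=2)(x).shape)  # torch.Size([50, 64, 512])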

10.4. Usage

from attention.EMSA import EMSA
import torch
from torch import nn
from torch.nn import functional as F
input = torch.randn(50,64,512)
emsa = EMSA(d_model=512, d_k=512, d_v=512, h=8, H=8, W=8, ratio=2, apply_transform=True)
output = emsa(input,input,input)
print(output.shape)

【Closing remarks】

At present, the attention modules collected in this project are not yet comprehensive. As my reading continues, the project will keep improving; you are welcome to star and support it. If anything in the article is incorrect, or any code implementation is wrong, please feel free to point it out.
