Non-local operation is an early attempt to apply self-attention to visual tasks. Its core idea is to enhance the feature at the current position with a similarity-weighted sum of the features at all other positions. The implementation is simple and has served as a reference for many subsequent studies.

Source: WeChat official account [Algorithm Engineering Notes of Xiaofei]

Non-local Neural Networks

  • Paper address: Arxiv.org/abs/1711.07…
  • Paper code: Github.com/facebookres…

Introduction


Convolution extracts features only within a local neighborhood, so capturing information over a wider range requires stacking many convolutional layers, which is both time-consuming and makes training harder. This paper therefore proposes an efficient non-local operation that represents each point on the feature map as a weighted sum of all points, directly capturing long-range feature information. Non-local operation also applies to tasks with a temporal dimension: in the video classification task shown in Figure 1, for example, features from several frames can be aggregated to enhance the features of the current frame. Non-local operation has the following advantages:

  • Compared with stacking convolutions, non-local operation directly captures long-range feature information through interactions between any pair of feature points.
  • Experiments show that simply embedding a few non-local layers effectively improves network performance.
  • Non-local operation supports variable-sized inputs and combines well with other network operators.

Non-local Neural Networks


Formulation

First, define the generic non-local operation:

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where $i$ is the position in the feature map whose feature is being computed, $j$ enumerates all positions in the feature map, $x$ is the input feature at the corresponding position, $y$ is the enhanced output, $f$ computes the similarity between positions $i$ and $j$, $g$ transforms the feature at position $j$, and $\mathcal{C}(x)$ normalizes the output. In short, the core of non-local operation is to compute the similarity between the feature at the current position and every feature in the feature map, then output a similarity-weighted sum of all features. Compared with fixed-parameter operations such as convolution and fully connected layers, non-local operation is more flexible.
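To make the formula concrete, here is a minimal sketch of the generic operation, assuming $f$ and $g$ are supplied as callables and the feature map has been flattened to an (N, C) matrix of position features (names and shapes are illustrative, not from the paper's code):

```python
import torch

def non_local(x, f, g):
    """Generic non-local operation (Formula 1).
    x: (N, C) features, one row per position; f: pairwise similarity; g: feature transform."""
    sim = f(x, x)                                  # (N, N) similarity between all position pairs
    weights = sim / sim.sum(dim=-1, keepdim=True)  # divide by the normalizer C(x)
    return weights @ g(x)                          # similarity-weighted sum of transformed features

# Example: exponentiated dot-product similarity (the Gaussian version) with identity g.
x = torch.randn(16, 64)
y = non_local(x, f=lambda a, b: torch.exp(a @ b.t()), g=lambda a: a)
```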

Instantiations

In implementation, there are many choices for $f$ and $g$. For simplicity, the paper chooses $g$ to be a linear transformation $g(x_j) = W_g x_j$, where $W_g$ is a learnable weight matrix, generally implemented as a $1\times 1$ convolution. The function $f$ has several possible choices (experiments in the paper found that the specific form of $f$ has little influence); a code sketch of all four follows the list:

  • Gaussian

$f(x_i, x_j) = e^{x_i^T x_j}$, where $x_i^T x_j$ is dot-product similarity (Euclidean distance could also be used), and $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$, making the normalization similar to a softmax.

  • Embedded Gaussian

$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$, where $\theta(x_i) = W_{\theta} x_i$ and $\phi(x_j) = W_{\phi} x_j$ are two linear transformations, and $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$. This implementation is close to self-attention.

  • Dot product

$f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$: linear transformations followed by a dot product for similarity, with $\mathcal{C}(x) = N$, the number of positions, which simplifies gradient computation.

  • Concatenation

$f(x_i, x_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)])$: concatenate the two transformed features directly and map them to a scalar with the weight vector $w_f^T$, with $\mathcal{C}(x) = N$.
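As a rough sketch (my own illustrative code, not the authors' implementation), the four choices of $f$ can be written as follows; `theta`, `phi`, and `w_f` stand for the learned linear transformations and weight vector described above:

```python
import torch
import torch.nn.functional as F

# Illustrative similarity functions; xi, xj are (N, C) position features,
# theta/phi are linear maps and w_f a (2C',) weight vector defined elsewhere.

def gaussian(xi, xj):
    return torch.exp(xi @ xj.t())                     # e^{x_i^T x_j}, normalized by sum over j

def embedded_gaussian(xi, xj, theta, phi):
    return torch.exp(theta(xi) @ phi(xj).t())         # e^{theta(x_i)^T phi(x_j)}

def dot_product(xi, xj, theta, phi):
    return theta(xi) @ phi(xj).t()                    # theta(x_i)^T phi(x_j), C(x) = N

def concatenation(xi, xj, theta, phi, w_f):
    n = xi.size(0)
    ti = theta(xi).unsqueeze(1).expand(n, n, -1)      # theta(x_i) broadcast over all j
    pj = phi(xj).unsqueeze(0).expand(n, n, -1)        # phi(x_j) broadcast over all i
    return F.relu(torch.cat([ti, pj], dim=-1) @ w_f)  # ReLU(w_f^T [theta(x_i), phi(x_j)]), C(x) = N
```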

Non-local Block

The non-local operation of Formula 1 can be wrapped into a non-local block that can be inserted into existing network architectures. The non-local block is defined as:

$$z_i = W_z y_i + x_i \qquad (6)$$

Formula 6 applies a linear transformation to the output of the non-local operation and adds it to the original feature, similar to the residual connection in a residual block.

One implementation of the non-local block is shown in Figure 2. First, three different linear transformations are applied to $x$; the output features are then computed according to Formula 1 and added back to the original features, which is essentially the same as self-attention.
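A minimal PyTorch sketch of the embedded Gaussian non-local block in Figure 2 (my own reconstruction, not the official code); the halved internal channel count follows the paper's default:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded Gaussian non-local block, following Figure 2."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                        # the paper halves channels inside the block
        self.theta = nn.Conv2d(channels, inter, 1)   # the 1x1 convs are the linear transforms
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.w_z = nn.Conv2d(inter, channels, 1)     # maps back to the input channel count

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)  # (n, hw, c')
        phi = self.phi(x).flatten(2)                      # (n, c', hw)
        g = self.g(x).flatten(2).transpose(1, 2)          # (n, hw, c')
        attn = torch.softmax(theta @ phi, dim=-1)         # f plus normalization: softmax over j
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)
        return self.w_z(y) + x                            # Formula 6: residual addition

# Usage: the block keeps the input shape, so it can be dropped into a backbone.
block = NonLocalBlock(channels=256)
out = block(torch.randn(2, 256, 14, 14))
```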

Experiment


Table 2a compares the implementations of function $f$; the specific choice has little impact on performance.

Comparison on video classification.

Comparison on COCO segmentation, detection, and keypoint tasks.

Conclusion


Non-local operation is an early attempt to apply self-attention to visual tasks. Its core idea is to enhance the feature at the current position with a similarity-weighted sum of the features at all other positions. The implementation is simple and has served as a reference for many subsequent studies.






For more content, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei].