This article is shared from the Huawei Cloud community post "Paper Interpretation Series 30: Interpretation of the parameter-free attention module SimAM paper", by Gu Yurun Yimai.

Abstract

In this paper, we propose an attention module that is conceptually simple yet very effective for convolutional neural networks. In contrast to existing channel attention and spatial attention mechanisms, the module directly infers three-dimensional attention weights for a feature layer without adding any parameters to the network. Specifically, drawing on well-known neuroscience theories, the paper proposes optimizing an energy function to find the importance of each neuron. By deriving a closed-form solution of this energy function, the implementation is kept within roughly ten lines of code. Another advantage of SimAM is that most of its operations follow directly from the solution of the defined energy function, so little effort is needed for structural tuning. Quantitative experiments on various visual tasks show that the proposed module is flexible and effective in improving the representation ability of convolutional networks.

Motivation

Existing attention modules have two problems. First, they can only refine features along either the channel or the spatial dimension, and lack the flexibility of attention weights that vary across space and channel simultaneously. Second, their structures often rely on a series of complex operations, such as pooling. The module proposed in this paper, built on well-established neuroscience theory, addresses both problems. Specifically, to enable the network to learn more discriminative neurons, the paper proposes to infer three-dimensional weights directly from the current neurons and then use these weights, in turn, to refine those neurons. To infer the three-dimensional weights efficiently, the paper defines an energy function grounded in neuroscience and derives its analytic solution.

Methods

In neuroscience, information-rich neurons usually display firing patterns distinct from those of surrounding neurons. Moreover, an active neuron usually suppresses its surrounding neurons, a phenomenon known as spatial suppression. In other words, neurons that exhibit spatial suppression should be given greater importance in visual processing tasks. The simplest way to find these important neurons is to measure the linear separability between a target neuron and the other neurons. Based on these neuroscientific findings, the paper defines the following energy function for each neuron:
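Reconstructed from the variable definitions given below, equation (1) reads:

$$
e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2 \tag{1}
$$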

Here, $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are linear transforms of $t$ and $x_i$, where $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$. $i$ is an index over the spatial dimension, and $M = H \times W$ is the number of neurons in a channel. $w_t$ and $b_t$ are the weight and bias of the linear transform. All values in equation (1) are scalars. Equation (1) is minimized when $\hat{t} = y_t$ and $\hat{x}_i = y_o$, where $y_t$ and $y_o$ are two distinct values. Minimizing equation (1) is therefore equivalent to finding the linear separability between the target neuron $t$ and all other neurons in the same channel. For simplicity, the paper uses binary labels and adds a regularization term. The final energy function is as follows:
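Reconstructed in the same notation, with binary labels $y_t = 1$ and $y_o = -1$ and $\lambda$ as the regularization coefficient, the final energy function referred to as equation (2) is:

$$
e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_t x_i + b_t)\bigr)^2 + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2 \tag{2}
$$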

Theoretically, each channel has $M$ such energy functions, and solving them with a gradient-descent optimizer such as SGD would be computationally expensive. Fortunately, both $w_t$ and $b_t$ in equation (2) can be quickly solved analytically, as follows:
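Reconstructed from the paper, the closed-form solutions referred to as equations (3) and (4) are:

$$
w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda} \tag{3}
$$

$$
b_t = -\frac{1}{2}(t + \mu_t)\, w_t \tag{4}
$$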

Here, $\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i$ and $\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i - \mu_t)^2$ are the mean and variance of all neurons in the channel other than $t$. Since the analytical solutions in equations (3) and (4) are obtained within a single channel, it is reasonable to assume that the other neurons in the same channel follow the same distribution. Under this assumption, the mean and variance can be computed once over all neurons of a channel and reused by every neuron in that channel. This greatly reduces the overhead of recomputing $\mu$ and $\sigma$ at every position, and the minimum energy at each position is then given by the following formula:
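Reconstructed from the paper, the minimal energy referred to as equation (5) is:

$$
e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \tag{5}
$$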

Here, $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(x_i - \hat{\mu})^2$. Equation (5) indicates that the lower the energy $e_t^*$, the more neuron $t$ differs from its surrounding neurons, and the more important it is for visual processing. The importance of each neuron is therefore expressed by $1/e_t^*$. According to Hillyard et al. [1], attentional modulation in the mammalian brain typically manifests as a gain effect on neuronal responses. Accordingly, this paper uses a scaling operation rather than addition to refine the features. The refining process of the whole module is as follows:
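Reconstructed from the description below, the refinement referred to as equation (6) is:

$$
\tilde{\mathbf{X}} = \mathrm{sigmoid}\!\left(\frac{1}{\mathbf{E}}\right) \odot \mathbf{X} \tag{6}
$$

where $\odot$ denotes element-wise multiplication.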

Here, $\mathbf{E}$ groups all $e_t^*$ across the channel and spatial dimensions, and the sigmoid is used to restrict values that are too large; because it is a monotonic function, it does not affect the relative importance of the neurons.

Virtually all operations, except for computing the mean and variance of each channel, are element-wise operations, so the function of formula (6) can be implemented in PyTorch with only a few lines of code, as shown in Figure 1.

Figure 1: PyTorch-style implementation of SimAM
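For reference, a minimal PyTorch sketch consistent with formula (6) might look as follows; the function name and the default value of $\lambda$ here are illustrative, not the paper's official code:

```python
import torch


def simam(x: torch.Tensor, lambda_: float = 1e-4) -> torch.Tensor:
    """Refine a feature map x of shape (N, C, H, W) with parameter-free 3-D attention."""
    # Number of neurons in each channel, minus one (excluding the target neuron itself).
    n = x.shape[2] * x.shape[3] - 1
    # Squared deviation of every neuron from its channel mean: (t - mu)^2 at each position.
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    # Channel-wise variance estimate sigma^2, shared by all neurons in the channel.
    v = d.sum(dim=(2, 3), keepdim=True) / n
    # Inverse energy 1 / e_t^* per neuron, following formula (5).
    e_inv = d / (4 * (v + lambda_)) + 0.5
    # Formula (6): scale the features by the sigmoid of the importance.
    return x * torch.sigmoid(e_inv)


if __name__ == "__main__":
    # Example usage on a random feature map.
    feat = torch.randn(2, 64, 32, 32)
    out = simam(feat)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because everything reduces to per-channel means, variances, and element-wise arithmetic, the module adds no learnable parameters.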

Experiments

CIFAR classification experiment

Experiments are carried out on CIFAR-10 and CIFAR-100, and the proposed module is compared with four other attention mechanisms. Without adding any parameters, it shows an advantage across multiple models; the experimental results are shown in Figure 2.

Figure 2: Top-1 accuracy on CIFAR image classification for five different attention modules across different models

[1] Hillyard, S. A., Vogel, E. K., and Luck, S. J. Sensory gain control (amplification) as a mechanism of selective attention: electrophysiological and neuroimaging evidence. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 353(1373): 1257–1270, 1998.
