
Self-attention mechanism

Self-attention became popular with the introduction of the Transformer model in Natural Language Processing (NLP). In NLP applications such as language translation, the model usually needs to read a sentence word by word to understand it before producing an output. The neural networks used before the Transformer were Recurrent Neural Networks (RNNs) or variants such as Long Short-Term Memory (LSTM). An RNN keeps an internal state, which lets it process sequential information well, for example the relationship between an earlier word and a later word in a sentence. However, RNNs also have drawbacks: as the number of words increases, the gradient for the first words may vanish. In other words, as the RNN reads more words, the words at the beginning of the sentence gradually lose importance. The Transformer handles this differently. It reads all the words at once and weighs the importance of each word, so more attention is paid to the more important words, which is why the mechanism is called attention.

The self-attention mechanism is a variant of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal relevance of data or features. Self-attention is the cornerstone of the latest NLP models, such as BERT and GPT-3.

Self-attention in computer vision

First, let's briefly review the main points of Convolutional Neural Networks (CNNs): a CNN is composed mainly of convolutional layers. A convolutional layer with a 3×3 kernel looks at only 3×3 = 9 features of the input activation (this region is also called the receptive field of the kernel) to compute each output feature, and it does not look at pixels beyond this range. To capture pixels beyond this range, we can increase the kernel size slightly to 5×5 or 7×7, but the receptive field is still small compared to the size of the feature map.

We have to increase the depth of the network so that the receptive field of the convolution kernels becomes large enough to capture what we want. As with RNNs, the relative importance of the input features decreases as we move deeper into the network. Therefore, we can use self-attention to look at every pixel in the feature map and focus attention on the more important ones.
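
To get a feel for how slowly the receptive field grows with depth, here is a minimal sketch (the layer counts and the 256×256 feature-map size mentioned below are purely illustrative):

# Receptive field of a stack of stride-1 KxK convolutions:
# it grows by (K - 1) pixels per layer, starting from 1.
def receptive_field(num_layers, kernel_size=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

print(receptive_field(1))   # 3  -> one 3x3 layer sees a 3x3 region
print(receptive_field(10))  # 21 -> still small next to a 256x256 feature map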

Now we will examine how the self-attention mechanism works. The first step in self-attention is to project each input feature onto three vectors, called the key, query, and value. Although these terms are not commonly used in computer vision, introducing them first will help us better understand self-attention and the literature on Transformers and NLP:

  1. The value represents the input feature. We don’t want the self-attention module to look at every pixel, as this is computationally expensive and unnecessary. Instead, we are more interested in local regions of the input activation. Therefore, the value reduces the dimensionality of the input features, both in terms of the activation map size (for example, it can be downsampled to a smaller height and width) and in terms of the number of channels. For convolutional layer activations, the number of channels is reduced with 1×1 convolutions, and the spatial size is reduced with max pooling or average pooling.
  2. The keys and queries are used to calculate the importance of the features in the self-attention map. To calculate the output feature at position x, we take the query at position x and compare it with the keys at all positions (see the sketch right after this list).
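
As a rough sketch of this projection step (the channel-reduction factor of 8 and the 2×2 pooling below are illustrative assumptions that mirror the SAGAN-style module shown later), the queries, keys, and values can be produced with 1×1 convolutions, with the keys and values additionally downsampled:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D

x = tf.random.normal((1, 32, 32, 64))           # input activation (N, H, W, C)
query = Conv2D(64 // 8, 1)(x)                   # (1, 32, 32, 8): one query per position
key   = Conv2D(64 // 8, 1)(x)                   # channels reduced by a 1x1 convolution
value = Conv2D(64 // 8, 1)(x)
key   = tf.nn.max_pool2d(key, 2, 2, 'VALID')    # (1, 16, 16, 8): smaller spatial size
value = tf.nn.max_pool2d(value, 2, 2, 'VALID')  # (1, 16, 16, 8)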

To further illustrate this, suppose we have a portrait. When the network processes one of the portrait’s eyes, it first makes a query that has the semantic meaning of “eye” and checks it against the keys of the other areas of the portrait. If one of the other keys corresponds to an eye, then the network knows it has found the other eye, and this is an area the network needs to pay attention to so that it can process it further.

More generally, expressed as a mathematical equation: for feature 0, we calculate the dot products q_0 × k_0, q_0 × k_1, q_0 × k_2, …, q_0 × k_{N−1}. Then softmax is used to normalize them so that they add up to 1.0; these are the attention scores we want. The attention scores are then used as weights to multiply the values element-wise and sum them, which gives the attention output.
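
Here is a tiny worked sketch of this computation for a single query q_0 (the vectors below are made up purely to show the mechanics):

import tensorflow as tf

q0     = tf.constant([1.0, 0.0])                              # query at position 0
keys   = tf.constant([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # k_0, k_1, k_2
values = tf.constant([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])  # v_0, v_1, v_2

scores = tf.linalg.matvec(keys, q0)                # [q_0.k_0, q_0.k_1, q_0.k_2]
attn   = tf.nn.softmax(scores)                     # attention scores, summing to 1.0
output = tf.reduce_sum(attn[:, None] * values, 0)  # weighted sum of the values
print(attn.numpy(), output.numpy())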

The following diagram illustrates how to generate an attention map from a query:

The leftmost column shows the image with the query locations marked by red dots. The next five images show the attention maps obtained from those queries. The first attention map at the top comes from querying one of the rabbit’s eyes; note that the areas around both eyes are whiter (indicating important regions), while almost everything else is black (less important regions).

There are several ways to implement self-attention. The following figure shows the attention module used in SAGAN, where θ, φ, and g correspond to the queries, keys, and values:

Most computations in deep learning are vectorized for speed, and self-attention is no different. Ignoring the batch dimension for simplicity, the activation after the 1×1 convolution has the shape (H, W, C). The first step is to reshape it into a 2D matrix of shape (H×W, C) and compute the attention map by matrix multiplication of θ and φ. In the self-attention module used in SAGAN, there is another 1×1 convolution that restores the number of channels to the number of input channels, followed by scaling with a learnable parameter.
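
A shape-only sketch of this vectorized step (using an arbitrary 32×32 activation with 8 channels after the 1×1 convolutions, plus the 2×2 pooling of the keys) looks like this:

import tensorflow as tf

h, w, c = 32, 32, 8
theta = tf.random.normal((h * w, c))       # queries, reshaped from (H, W, C)
phi   = tf.random.normal((h * w // 4, c))  # keys, max-pooled and then reshaped
attn  = tf.nn.softmax(tf.matmul(theta, phi, transpose_b=True))
print(attn.shape)                          # (1024, 256): one row per output position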

Implementing the self-attention module in TensorFlow

First, define all the 1×1 convolutional layers and weights in build() of the custom layer. Here, spectral normalization is used as the kernel constraint of the convolutional layers:

import tensorflow as tf
from tensorflow.keras.layers import Layer, Conv2D

class SelfAttention(Layer):
    def __init__(self):
        super(SelfAttention, self).__init__()

    def build(self, input_shape):
        n, h, w, c = input_shape
        self.n_feats = h * w
        # 1x1 convolutions producing the queries (theta), keys (phi), and values (g),
        # each reducing the channel count to C/8. SpectralNorm() is the spectral
        # normalization constraint referenced above.
        self.conv_theta = Conv2D(c//8, 1, padding='same', kernel_constraint=SpectralNorm(), name='Conv_Theta')
        self.conv_phi = Conv2D(c//8, 1, padding='same', kernel_constraint=SpectralNorm(), name='Conv_Phi')
        self.conv_g = Conv2D(c//8, 1, padding='same', kernel_constraint=SpectralNorm(), name='Conv_g')
        # final 1x1 convolution that restores the channel count to C
        self.conv_attn_g = Conv2D(c, 1, padding='same', kernel_constraint=SpectralNorm(), name='Conv_AttnG')
        # learnable scale applied to the attention output
        self.sigma = self.add_weight(shape=[1], initializer='zeros', trainable=True, name='sigma')

Note that:

  1. The internal activations are reduced in size so that the computation runs faster.
  2. After each convolutional layer, the activation of shape (H, W, C) is reshaped into a two-dimensional matrix of shape (H*W, C), so that we can then use matrix multiplication on it.

The layers are then wired together in the call() function to perform the self-attention operations. First, calculate θ, φ, and g:

    def call(self, x):
        n, h, w, c = x.shape
        # Queries: keep the full spatial resolution, reshape to (N, H*W, C/8)
        theta = self.conv_theta(x)
        theta = tf.reshape(theta, (-1, self.n_feats, theta.shape[-1]))
        # Keys: max-pool by 2, then reshape to (N, H*W/4, C/8)
        phi = self.conv_phi(x)
        phi = tf.nn.max_pool2d(phi, ksize=2, strides=2, padding='VALID')
        phi = tf.reshape(phi, (-1, self.n_feats//4, phi.shape[-1]))
        # Values: max-pool by 2, then reshape to (N, H*W/4, C/8)
        g = self.conv_g(x)
        g = tf.nn.max_pool2d(g, ksize=2, strides=2, padding='VALID')
        g = tf.reshape(g, (-1, self.n_feats//4, g.shape[-1]))

Then the attention map is calculated as follows:

        # Attention map of shape (N, H*W, H*W/4); softmax normalizes over the key positions
        attn = tf.matmul(theta, phi, transpose_b=True)
        attn = tf.nn.softmax(attn)

Finally, the attention map is multiplied with the values g to produce the final output:

        # Weighted sum of the values, reshaped back to (N, H, W, C/8),
        # projected back to C channels, scaled, and added to the input
        attn_g = tf.matmul(attn, g)
        attn_g = tf.reshape(attn_g, (-1, h, w, attn_g.shape[-1]))
        attn_g = self.conv_attn_g(attn_g)
        output = x + self.sigma * attn_g
        return output
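
As a quick sanity check of the layer (assuming the SpectralNorm constraint referenced above is available), a dummy forward pass keeps the input shape:

x = tf.random.normal((4, 32, 32, 64))  # a batch of dummy activations (N, H, W, C)
attn_layer = SelfAttention()
y = attn_layer(x)
print(y.shape)                         # (4, 32, 32, 64): same shape as the input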