1. The mechanism of attention

The attention mechanism, as its name implies, is a technique that enables a model to focus on the important information in its input and learn it thoroughly. It can be applied to any sequence model.

Attention can be classified along two axes. From the application level, it can be divided into Spatial Attention and Temporal Attention. From the way it operates, it can be divided into Soft Attention and Hard Attention; that is, whether the distribution of the Attention output vector is one-hot (hard) or a soft probability distribution directly affects how the context information is selected. A tiny numpy sketch of that distinction follows (the scores are made-up numbers): soft attention produces a smooth distribution over all positions, while hard attention produces a one-hot selection.
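import numpy as np

scores = np.array([1.2, 0.3, 2.5, 0.8])          # hypothetical alignment scores for 4 positions

soft = np.exp(scores) / np.exp(scores).sum()     # soft attention: smooth, differentiable weights
hard = np.eye(len(scores))[np.argmax(scores)]    # hard attention: one-hot pick, here [0, 0, 1, 0]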

Why add Attention:

When the input sequence is very long, it is difficult for the model to learn a reasonable vector representation of it.

As the input sequence grows, the performance of the original time-step-based approach gets worse and worse. This is caused by a flaw in the structural design of the original time-step model: all of the input's context information is squeezed into a fixed-length vector, so the capacity of the entire model is also limited. We call this original model the simple encoder-decoder model.

The encoder-decoder structure is also hard to interpret, which makes it difficult to design in a principled way.

The basic idea of the Attention mechanism is to break the limitation of the traditional encoder-decoder structure, which relies on a single fixed-length internal vector during encoding and decoding. Attention is implemented by retaining the intermediate outputs of the LSTM encoder for each step of the input sequence, and then training the model to selectively attend to these outputs and associate them with the output sequence as the model generates it.

To put it another way, the probability of producing each item in the output sequence depends on which items are selected from the input sequence.

An attention-based model is actually a similarity measure: the more similar the current input is to the target state, the larger the weight assigned to that input. Attention is an idea added on top of the original model, not a new model in itself.

Specifically, the LSTM/RNN model with the traditional encoder-decoder structure has a problem: it encodes the input into a fixed-length vector representation regardless of the input's length, which gives the model poor learning performance (poor decoding performance) on long input sequences. The attention mechanism overcomes this problem by selectively focusing on the relevant input information while producing each output. It has been widely used in various sequence prediction tasks, including text translation and speech recognition.
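As a rough numpy illustration of the two ideas above (all names and sizes here are invented for the sketch, not taken from the code below): keep every intermediate encoder state, score each one by its similarity to the current decoder state, and build the context vector as their weighted sum.

import numpy as np

T, d = 5, 8                                       # 5 input time steps, hidden size 8 (illustrative)
encoder_states = np.random.randn(T, d)            # retained intermediate encoder outputs, one per step
decoder_state = np.random.randn(d)                # current target/decoder hidden state

scores = encoder_states @ decoder_state           # similarity of each input step to the target state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: more similar -> larger weight
context = weights @ encoder_states                # context vector: weighted sum of encoder states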

2. Code

import numpy as np

from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer  # on older Keras versions: from keras.engine.topology import Layer

np.random.seed(2018)

class Attention(Layer):
    """Scores each time step of a (batch, timesteps, features) input, softmaxes
    the scores, and returns the weighted sum over time: (batch, features)."""

    def __init__(self,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias

        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3  # (batch, timesteps, features)
        self.step_dim = input_shape[1]
        self.features_dim = input_shape[-1]

        # One scoring weight per feature.
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)

        if self.bias:
            # One bias per time step.
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        # Score every time step: (batch*steps, features) dot (features, 1),
        # reshaped back to (batch, steps).
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))),
                        (-1, step_dim))
        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # Apply the mask after the exp; the weights are re-normalized below.
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano.
            a *= K.cast(mask, K.floatx())

        # Normalize to a probability distribution over the time steps.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # (batch, steps) -> (batch, steps, 1) so the weights broadcast over features.
        a = K.expand_dims(a)
        weighted_input = x * a
        # Weighted sum over time: (batch, steps, features) -> (batch, features).
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # The time dimension is summed out: (batch, features).
        return input_shape[0], self.features_dim

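A minimal usage sketch for the layer above (the task, vocabulary size, and hyperparameters are placeholders, not from the original post): the Attention layer must sit on top of an RNN that returns its full sequence of hidden states, so there is a 3-D tensor for it to weight and sum.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_len, vocab_size = 100, 20000            # placeholder hyperparameters

model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=max_len))
model.add(LSTM(64, return_sequences=True))  # keep all time steps for the Attention layer
model.add(Attention())                      # (batch, max_len, 64) -> (batch, 64)
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

Note that return_sequences=True is essential: the layer's build method asserts a 3-D input.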

3. Summary

In general, the attention mechanism is a weighted-summation mechanism: no matter how fancy the variant, as long as you compute a weighted sum of hidden states based on the information at hand, you are using attention. Self-attention is just a weighted sum taken inside the sentence itself (unlike a seq2seq decoder, which weights the hidden states of the encoder).

Personally, I think self-attention has the larger scope, while key-value attention is really a broader definition of attention, and the earlier forms of attention can all be described in key-value terms. For example, we often treat k and v as the same thing, and in self-attention query = key = value.
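A numpy sketch of the query/key/value view under that self-attention simplification, Q = K = V = X (the dimensions are made up; this is the plain, unscaled dot-product variant):

import numpy as np

X = np.random.randn(4, 8)   # a "sentence" of 4 tokens, embedding size 8 (illustrative)
Q = Kx = V = X              # self-attention: query = key = value

scores = Q @ Kx.T           # (4, 4): similarity of every token to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V        # each token becomes a weighted sum over the whole sentence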