Preface

In mid-2017 there appeared two similar papers that I really admire: FaceBook’s Convolutional Sequence to Sequence Learning and Google’s Attention is All You Need. Both are innovations on Seq2Seq; in essence, both discard the RNN structure for Seq2Seq tasks.

In this article, I will analyze Attention is All You Need. Both papers are of course already quite popular, so there are plenty of interpretations online (though many of them are direct translations of the paper, with little of the writers’ own understanding). Here I will try, as far as possible, to use my own words and avoid repeating what has already been said elsewhere.

Sequence encoding

Deep learning approaches to NLP basically first split a sentence into words and then turn each word into its word vector. In this way each sentence corresponds to a matrix $X=(x_1,x_2,\dots,x_t)$, where $x_i$ is the word vector (a row vector) of the $i$-th word, with dimension $d$, so $X\in\mathbb{R}^{n\times d}$. The problem then becomes how to encode these sequences.

The first basic idea is the RNN layer. The RNN scheme is very simple: a recursion

$$y_t = f(y_{t-1}, x_t)$$

The widely used LSTM, GRU and, more recently, SRU do not depart from this recursive framework. The structure of an RNN is relatively simple and well suited to sequence modelling, but one obvious disadvantage is that it cannot be parallelized and is therefore slow, which is a natural defect of recursion.

In addition, I personally think RNN cannot learn the global structure information well, because it is essentially a Markov decision process.

The second idea is the CNN layer. CNN’s scheme is also very natural: a window-style traversal. For example, a convolution of size 3 is

$$y_t = f(x_{t-1}, x_t, x_{t+1})$$

In FaceBook’s paper, Seq2Seq is also studied using pure convolution, which is a fine and extreme use case of convolution. Readers who are keen on convolution must read this paper carefully.

CNN is convenient to parallelize, and it can easily capture some global structural information. I personally prefer CNN. In current work and competition models, I have tried replacing existing RNN models with CNN and have built up my own experience with it, which I will discuss some other time.

Google’s masterpiece offers a third idea: pure Attention.

RNN needs to recurse step by step to obtain global information, so bidirectional RNNs generally work better. CNN can in fact only obtain local information, and has to enlarge its receptive field by stacking layers. Attention takes the most brute-force approach: it grabs global information in a single step. Its solution is

$$y_t = f(x_t, A, B)$$

where $A$ and $B$ are two other sequences (matrices). If we take $A=B=X$, it is called Self Attention, which means comparing $x_t$ directly with every word of the original sequence and finally computing $y_t$.

Attention layer

The definition of Attention

Google’s general Attention idea is also a sequence encoding scheme, so we can also consider it as a sequence encoding layer like RNN and CNN.

The description given above is a general framework. The scheme Google actually adopts is very specific. First, it defines the Attention operation:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The notation used here is consistent with the Google paper, where $Q\in\mathbb{R}^{n\times d_k}$, $K\in\mathbb{R}^{m\times d_k}$, $V\in\mathbb{R}^{m\times d_v}$.

If you omit the Softmax activation, this is essentially the product of three matrices of sizes $n\times d_k$, $d_k\times m$ and $m\times d_v$, which yields an $n\times d_v$ matrix.

So we can think of this as an Attention layer that encodes an $n\times d_k$ sequence $Q$ into a new $n\times d_v$ sequence.

So how should we understand this structure? Let’s look at it one vector at a time:

$$\mathrm{Attention}(q_t,K,V)=\sum_{s=1}^{m}\frac{1}{Z}\exp\!\left(\frac{\langle q_t,k_s\rangle}{\sqrt{d_k}}\right)v_s$$

where $Z$ is the normalization factor. In fact $q,k,v$ are short for query, key and value; $K$ and $V$ are in one-to-one correspondence, just like a key–value relationship. So the formula above means: use $q_t$ as the query, take a Softmax over its inner products with each $k_s$ to obtain the similarity of $q_t$ to each $v_s$, and then take the weighted sum to obtain a $d_v$-dimensional vector.

The factor $\sqrt{d_k}$ in the denominator keeps the inner product from getting too large (if it got too large, the result after Softmax would be essentially either 0 or 1, which is not “soft” enough).
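To make the shapes concrete, here is a minimal NumPy sketch of this scaled dot-product Attention (the function name and the toy dimensions are my own choices for illustration, not anything from the paper):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m) scaled inner products
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

n, m, d_k, d_v = 4, 6, 8, 5
Q, K, V = np.random.randn(n, d_k), np.random.randn(m, d_k), np.random.randn(m, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 5), i.e. (n, d_v)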

The definition of Attention isn’t new, but thanks to Google’s influence, we can assume that it’s now more formally defined and treated as a layer.

In addition, this definition is only one form of Attention. There are other options: for example, the query and the key do not have to interact through a dot product (they can be concatenated and then projected with a parameter vector), and the weights do not even have to be normalized, and so on.

Multi-Head Attention

This is a new concept proposed by Google; it is an improvement on the Attention mechanism.

Map $Q,K,V$ through parameter matrices, then apply Attention; repeat this process $h$ times and concatenate the results. It could hardly be simpler. To be specific:

$$head_i=\mathrm{Attention}(QW_i^Q,\;KW_i^K,\;VW_i^V)$$

 

where $W_i^Q\in\mathbb{R}^{d_k\times\tilde d_k}$, $W_i^K\in\mathbb{R}^{d_k\times\tilde d_k}$, $W_i^V\in\mathbb{R}^{d_v\times\tilde d_v}$. And then:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(head_1,\dots,head_h)$$

The final result is a sequence of size $n\times(h\tilde d_v)$. So-called multi-head is just doing the same thing several times (with parameters not shared) and concatenating the results.
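As a sketch of that description (reusing scaled_dot_product_attention and the toy Q, K, V from the NumPy example above; the projection shapes follow the $\tilde d_k,\tilde d_v$ notation, everything else is illustrative):

def multi_head_attention(Q, K, V, WQ, WK, WV):
    # WQ, WK: h matrices of shape (d_k, d~_k); WV: h matrices of shape (d_v, d~_v)
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1)                # (n, h * d~_v)

h, dk_tilde, dv_tilde = 8, 16, 16
WQ = [np.random.randn(d_k, dk_tilde) for _ in range(h)]
WK = [np.random.randn(d_k, dk_tilde) for _ in range(h)]
WV = [np.random.randn(d_v, dv_tilde) for _ in range(h)]
print(multi_head_attention(Q, K, V, WQ, WK, WV).shape)   # (4, 128), i.e. (n, h * d~_v)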

Self Attention

So far, the Attention layer has been described generically, and we can already use it in some applications. For example, if $Q$ is the sequence of word vectors of a passage and $K=V$ is the sequence of word vectors of a question, then the output is the so-called Aligned Question Embedding.

In Google’s paper, most of the Attention is Self Attention, or “self-attention,” or internal Attention.

Self Attention means taking $\mathrm{Attention}(X,X,X)$, where $X$ is the input sequence. In other words, Attention is done inside the sequence itself, looking for connections within the sequence.

One of the main contributions of the Google paper is that it shows the importance of internal attention in the sequence encoding of machine translation (and even Seq2Seq tasks in general), whereas previous studies on Seq2Seq have mostly focused on the attention mechanism at the decoding side.

Similarly, R-Net, the model currently leading the SQuAD reading-comprehension leaderboard, also improved its results by adding a self-attention mechanism.

Of course, it is more accurate to say that Google uses Self Multi-Head Attention:

$$Y=\mathrm{MultiHead}(X,X,X)$$

Position Embedding

However, a moment’s thought reveals that such a model cannot capture the order of the sequence. In other words, if you shuffle the rows of $K$ and $V$ together, the Attention result is exactly the same.
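This is easy to check with the NumPy sketch above: shuffling the rows of $K$ and $V$ with the same permutation leaves the output untouched.

perm = np.random.permutation(m)                 # shuffle K and V rows together
out1 = scaled_dot_product_attention(Q, K, V)
out2 = scaled_dot_product_attention(Q, K[perm], V[perm])
print(np.allclose(out1, out2))                  # True: the row order of K, V does not matter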

This suggests that, so far, the Attention model is nothing more than a very sophisticated bag of words model.

This problem is serious. For sequences, and especially for NLP tasks, order is very important information: it reflects the local and even the global structure. If the order information cannot be learned, the results will suffer heavily (in machine translation, for instance, we might manage only a word-by-word translation that cannot be organized into fluent sentences).

So Google offers another way out: Position Embedding, also called the “position vector”: every position is given a number, and each number corresponds to a vector. By combining the position vector with the word vector, a certain amount of position information is injected into every word, which allows Attention to distinguish words at different positions.

Position Embedding is not new, and is also used in FaceBook Convolutional Sequence to Sequence Learning. However, in this work of Google, its Position Embedding has several differences:

1. Position Embedding had already appeared in RNN and CNN models in the past, but in those models it was only an auxiliary, icing-on-the-cake feature: things were a little better with it and a little worse without it, because RNN and CNN can capture position information by themselves.

However, in this pure Attention model, Position Embedding is the only source of Position information, so it is one of the core components of the model and not just a simple auxiliary means.

2. In the past, Position Embedding was basically a vector trained for the specific task. Google instead gives a formula for constructing the Position Embedding directly:

$$PE_{2i}(p)=\sin\!\left(p/10000^{2i/d_{pos}}\right),\qquad PE_{2i+1}(p)=\cos\!\left(p/10000^{2i/d_{pos}}\right)$$

The idea here is to map a position with index $p$ to a $d_{pos}$-dimensional position vector whose $i$-th element is $PE_i(p)$.

Google says in the paper that they compared directly trained position vectors with position vectors computed by the formula above, and the results were close. So, naturally, we prefer the Position Embedding constructed by the formula.
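Here is a minimal NumPy sketch of this construction (assuming, as in the formula above, that even dimensions take the sine and odd dimensions the cosine; the function name is my own):

import numpy as np

def position_embedding(max_len, d_pos):
    # row p of the returned (max_len, d_pos) matrix is the position vector of position p
    pe = np.zeros((max_len, d_pos))
    p = np.arange(max_len)[:, None]                                   # (max_len, 1)
    freq = 1.0 / np.power(10000.0, np.arange(0, d_pos, 2) / d_pos)    # one frequency per sin/cos pair
    pe[:, 0::2] = np.sin(p * freq)
    pe[:, 1::2] = np.cos(p * freq)
    return pe

print(position_embedding(80, 128).shape)    # (80, 128)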

3. Position Embedding by itself is absolute position information, but relative position also matters a great deal in language. An important reason Google chose the position vector formula above is the following:

Since $\sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta$ and $\cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta$, the vector at position $p+k$ can be expressed as a linear transformation of the vector at position $p$, which opens up the possibility of expressing relative position information.
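Concretely, for a sine/cosine pair sharing a frequency $\omega_i=1/10000^{2i/d_{pos}}$, the identities above give (my own restatement of the argument, not a formula from the paper):

$$\begin{pmatrix}\sin\omega_i(p+k)\\ \cos\omega_i(p+k)\end{pmatrix}=\begin{pmatrix}\cos\omega_i k & \sin\omega_i k\\ -\sin\omega_i k & \cos\omega_i k\end{pmatrix}\begin{pmatrix}\sin\omega_i p\\ \cos\omega_i p\end{pmatrix}$$

so the position vector at $p+k$ is a fixed linear transformation (depending only on $k$) of the position vector at $p$.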

There are several options for combining a position vector and a word vector: you can concatenate them into a new vector, or you can make the position vector the same size as the word vector and add the two together.

The FaceBook paper used the former, while the Google paper used the latter. Intuitively, adding them up seems to lose information, which feels undesirable, but Google’s results suggest that addition is also a good solution. I suppose my understanding is just not deep enough yet.

Some shortcomings

At this point, we have basically covered the Attention mechanism. The benefit of the Attention layer is that it captures global connections in one step, because it compares every pair of elements of the sequence directly (at the cost of $\mathcal{O}(n^2)$ computation; of course, since it is pure matrix arithmetic, this cost is not too serious in practice).

In contrast, RNN needs step-by-step recursion to capture them, and CNN needs stacked layers to expand its receptive field; this is an obvious advantage of the Attention layer.

The rest of the Google paper is to show how it can be used in machine translation, which is a matter of application and tuning, and we don’t particularly care about it here. Of course, Google’s results show that using pure attention mechanisms in machine translation can achieve the best results so far, which is indeed brilliant.

However, I still want to talk about some shortcomings of this paper and Attention layer itself.

1. The title of the paper is Attention is All You Need, so the words RNN and CNN are deliberately avoided in the paper, but I think this approach is too deliberate.

In fact, the paper specifically names a position-wise feed-forward network, which is in fact a one-dimensional convolution with a window size of 1. So renaming it in order to avoid mentioning convolution seems a little ungenerous. (Or maybe I am reading too much into it.)

2. Although Attention is not directly related to CNN, it clearly borrows CNN’s ideas. For example, Multi-Head Attention does Attention several times and then concatenates the results, which matches the idea of multiple convolution kernels in CNN; the paper also uses residual structures, which likewise come from CNN.

3. The inability to model position information well is a real weakness. Although Position Embedding can be introduced, I think it is only a stopgap and does not solve the problem at its root.

For example, training a text categorization model or machine translation model with this pure Attention mechanism is fine, but training a sequence tagging model (word segmentation, entity recognition, etc.) is not so good.

So why is it good for machine translation tasks? I think the reason is that word order is not particularly emphasized in machine translation, so the location information brought by Position Embedding is enough. In addition, BLEU, the evaluation index of translation task, does not emphasize word order.

4. Not every problem needs long-range, global dependencies; many problems depend only on local structure, and for those pure Attention is not so good.

In fact, Google seems to be aware of this problem, so the paper also mentions a restricted version of Self Attention (although the experiments in the paper probably do not use it).

It assumes that the current word is related only to the $r$ words before and after it, so Attention happens only among these $2r+1$ words and the amount of computation is $\mathcal{O}(nr)$; this also captures the local structure of the sequence. But obviously, that is just the convolution window of CNNs again; see the sketch below.
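One simple way to sketch this restricted variant (continuing the NumPy sketches above; purely an illustration of the idea, not Google’s implementation, and it still materializes the full score matrix) is to mask out, before the Softmax, every pair of positions more than $r$ apart:

def local_self_attention(X, r):
    # Self Attention in which position t only attends to positions within distance r
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    idx = np.arange(n)
    inside_window = np.abs(idx[:, None] - idx[None, :]) <= r   # True inside the 2r+1 window
    scores = np.where(inside_window, scores, -1e9)             # outside the window: effectively -inf
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X                                         # (n, d)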

From the discussion above we can see that Attention, used as a separate layer and mixed with CNN, RNN and other structures, should let us combine their strengths more fully, rather than claiming, as the Google paper does, that Attention is All You Need: that would be overcorrecting, and it would not work well.

As far as the work in the paper goes, perhaps a lower-profile title such as Attention is All Seq2Seq Need would have won even more recognition.

Code implementation

Finally, to give this article some practical value, I try to provide an implementation of the paper’s Multi-Head Attention. Readers who need it can use it directly or refer to it and modify it.

Note that although multi-head means something simple — repeat a few times and concatenate — you cannot actually write the program that way, or it would be very slow, because TensorFlow does not parallelize such code automatically. For example:

import tensorflow as tf

a = tf.zeros((10, 10))
b = a + 1
c = a + 2

Here b and c are computed serially, even though they do not depend on each other. Therefore we have to combine the multi-head operations into a single tensor, because multiplications on a single tensor are automatically parallelized internally.
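A common way to do that (a NumPy sketch of the tensor layout only, not the code in the repositories linked below) is to project all $h$ heads with one big matrix each and keep the head axis inside a single array, so the whole Multi-Head computation becomes a few large batched matrix products instead of $h$ small independent ones:

import numpy as np

def fused_multi_head(X, WQ, WK, WV, h):
    # X: (n, d); WQ, WK, WV: (d, h * d_head); all heads are computed at once
    n, d = X.shape
    d_head = WQ.shape[1] // h
    Q = (X @ WQ).reshape(n, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)
    K = (X @ WK).reshape(n, h, d_head).transpose(1, 0, 2)
    V = (X @ WV).reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (h, n, n), every head in one product
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V                                       # (h, n, d_head)
    return out.transpose(1, 0, 2).reshape(n, h * d_head)    # concatenate the heads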

In addition, we Mask the sequence to ignore the effect of the padded part. In an ordinary Mask, the padded positions are simply set to zero; in the Mask used inside Attention, a large number is subtracted from the padded positions before the Softmax (so that they become essentially zero after the Softmax). All of this has a corresponding implementation in the code.
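As a tiny illustration of the Attention-style Mask described here (continuing the NumPy sketches above; assuming a 0/1 indicator of real tokens versus padding, with names of my own choosing):

def masked_softmax(scores, pad_mask):
    # scores: (n, m) attention scores; pad_mask: (m,), 1 for real tokens, 0 for padding
    scores = scores - (1.0 - pad_mask) * 1e12     # push padded keys towards minus infinity
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)   # padded columns end up close to 0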

TensorFlow version

Github.com/bojone/atte…

Keras version

Github.com/bojone/atte…

The test code

A simple test on IMDB with Keras (without the Mask):

from __future__ import print_function
from keras.preprocessing import sequence
from keras.datasets import imdb


max_features = 20000

maxlen = 80

batch_size = 32


print('Loading data... ')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

print(len(x_train), 'train sequences')

print(len(x_test), 'test sequences')


print('Pad sequences (samples x time)')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)

x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

print('x_train shape:', x_train.shape)

print('x_test shape:', x_test.shape)

from keras.models import Model
from keras.layers import *


S_inputs = Input(shape=(None,), dtype='int32')

embeddings = Embedding(max_features, 128)(S_inputs)
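# Position_Embedding and Attention below are the custom layers from the Keras implementation linked above, not built-in Keras layers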
embeddings = Position_Embedding()(embeddings)  # adding Position_Embedding slightly improves accuracy

O_seq = Attention(8, 16)([embeddings, embeddings, embeddings])

O_seq = GlobalAveragePooling1D()(O_seq)

O_seq = Dropout(0.5)(O_seq)

outputs = Dense(1, activation='sigmoid')(O_seq)


model = Model(inputs=S_inputs, outputs=outputs)
# try using different optimizers and different optimizer configs

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


print('Train... ')

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_data=(x_test, y_test))

Results without Position Embedding:

Results with Position Embedding:

The best accuracy appears to be slightly higher than that of a single-layer LSTM. It can also be seen that Position Embedding improves accuracy and reduces overfitting.

Computational analysis

As you can see, Attention is actually not cheap computationally. In Self Attention, for example, you first apply three linear maps to $X$; that alone already costs as much as a one-dimensional convolution with kernel size 3, although it is still only $\mathcal{O}(n)$. Then there are two matrix multiplications involving the sequence itself, each of which costs $\mathcal{O}(n^2)$; if the sequence is long enough, this cost is actually very hard to accept.

This also suggests that the restricted version of Attention will be a focus of follow-up research, and that mixing Attention with CNN and RNN is a relatively moderate path.

Conclusion

Thanks to the wonderful use case Google has provided, I have not only broadened my horizons but also gained a deeper understanding of Attention. This result of Google’s reflects, to some extent, the idea that the greatest truths are the simplest, and it is indeed a rare piece of fine work in NLP.