This article is also published on the WeChat official account: Alitou Feeding House

Takeaway

This is meant as a refresher: at a time when Transformer dominates almost every area, the details of the architecture can become a bit of a blur. Its position in deep learning is beyond doubt; whether it is BERT in NLP or Vision Transformer in CV, we can see its figure everywhere. It is therefore worth understanding Transformer thoroughly. In this article, we review the Transformer structure and its implementation details.

Recommended articles from the BERT series

What does BERT Learn

TinyBert: an ultra-detailed application of model distillation, enough for any distillation question

Quantization technique and Albert dynamic quantization

DistillBert: Bert is too expensive? I’m cheap and easy to use

[Paper share] RoBERTa: hello XLNet, you have just been beaten

XLNet paper introduction: the next wave beyond BERT

  • Paper link: Arxiv.org/pdf/1706.03…
  • Source code: Github.com/jadore80112…

Attention mechanism

Transformer is an encoder-decoder structure. The encoder and decoder are similar in structure: each is a stack of identical layers. Every encoder layer has two sub-layers, an attention layer and a fully connected layer. Every decoder layer has three sub-layers, namely two attention layers and one fully connected layer.

  • Single attention

The structure of the attention layer is shown in the figure below. The computation is $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$. The $QK^T$ term measures the correlation between queries and keys; after scaling by $\sqrt{d_k}$ and applying softmax, these correlations become weights that combine the values $V$, so each word's representation is a weighted mixture of its context according to relevance. In self-attention, the Q, K, and V matrices all come from the same input.

A PyTorch implementation of single-layer attention is shown below.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature        # scaling factor, typically sqrt(d_k)
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # q, k, v: (batch, n_head, seq_len, d_k); scores: (batch, n_head, len_q, len_k)
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))

        if mask is not None:
            # masked positions get a large negative score so softmax gives them ~0 weight
            attn = attn.masked_fill(mask == 0, -1e9)

        attn = self.dropout(F.softmax(attn, dim=-1))
        output = torch.matmul(attn, v)

        return output, attn
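A quick shape check of this module; the tensor sizes below are illustrative assumptions, not values taken from the paper or the repository.

# Illustrative usage: batch of 2, 8 heads, sequence length 10, head dimension 64.
attn_layer = ScaledDotProductAttention(temperature=64 ** 0.5)
q = k = v = torch.rand(2, 8, 10, 64)
out, attn = attn_layer(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 8, 10, 64]) torch.Size([2, 8, 10, 10])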
  • MultiHead Attention

The structure of multi-head attention is shown in the figure below: it runs several single attentions in parallel and concatenates their results, so it will not be explained in detail here; a brief sketch of how it can wrap the module above follows.
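Below is a minimal sketch of multi-head attention built on ScaledDotProductAttention. The projection-layer names (w_qs, w_ks, w_vs, fc) and the residual-plus-LayerNorm placement follow the conventions of the referenced repository, but treat the exact details as illustrative assumptions rather than a verbatim copy.

class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module (sketch) '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()
        self.n_head, self.d_k, self.d_v = n_head, d_k, d_v

        # project the input into n_head sets of queries / keys / values
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):
        residual = q
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        # split the last dimension into (n_head, d_k / d_v) and move the head dim forward
        q = self.w_qs(q).view(sz_b, len_q, self.n_head, self.d_k).transpose(1, 2)
        k = self.w_ks(k).view(sz_b, len_k, self.n_head, self.d_k).transpose(1, 2)
        v = self.w_vs(v).view(sz_b, len_v, self.n_head, self.d_v).transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)   # broadcast the mask over the head dimension

        q, attn = self.attention(q, k, v, mask=mask)

        # concatenate the heads and project back to d_model
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))
        q = self.layer_norm(q + residual)   # residual connection + LayerNorm
        return q, attn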

Transformer structure details

That covers the structure of attention. In this section we introduce the overall Transformer structure, starting with the position embedding.

Transformer is an encoder-decoder structure. The encoder input includes not only the word embedding but also a position embedding; the position embedding is added so that the model can use the order information of the input text. The position embedding is computed as follows.
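For reference, the sinusoidal formula from the original paper that the code below implements is $PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$, where $pos$ is the token position and $i$ indexes the embedding dimension.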

class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter: the table is fixed, so register it as a buffer
        self.register_buffer('pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO: make it with torch instead of numpy

        def get_position_angle_vec(position):
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

        sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        # add the fixed positional encoding for the first x.size(1) positions
        return x + self.pos_table[:, :x.size(1)].clone().detach()

1. Encoder

The encoder input is the sum of the position embedding and the word embedding. Each encoder layer passes its input through multi-head self-attention and a position-wise fully connected network, and the encoder output is handed to the decoder; a sketch of the feed-forward sub-layer it uses follows.
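The EncoderLayer and DecoderLayer below call a PositionwiseFeedForward module that is not shown in the original excerpt. Here is a minimal sketch, assuming the usual two-linear-layer design with a residual connection and LayerNorm; the class name matches the referenced repo, the details are an assumption.

class PositionwiseFeedForward(nn.Module):
    ''' Two-layer position-wise feed-forward network (sketch) '''

    def __init__(self, d_in, d_hid, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_in, d_hid)   # expand to the inner dimension
        self.w_2 = nn.Linear(d_hid, d_in)   # project back to the model dimension
        self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        residual = x
        x = self.w_2(F.relu(self.w_1(x)))
        x = self.dropout(x)
        return self.layer_norm(x + residual)   # residual connection + LayerNorm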

class EncoderLayer(nn.Module):
    ''' Compose with two layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(self, enc_input, slf_attn_mask=None):
        enc_output, enc_slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, mask=slf_attn_mask)
        enc_output = self.pos_ffn(enc_output)
        return enc_output, enc_slf_attn

2. Decoder

The decoder has two attention structures. The first layer is self-attention; the second layer takes the encoder output (as keys and values) and the self-attention output (as queries). To prevent the decoder from attending to positions beyond the current one, the self-attention input must be masked with a look-ahead (subsequent) mask, sketched below. The mask used in the second attention is the same padding mask as in the encoder.
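A minimal sketch of that look-ahead mask, assuming the get_subsequent_mask helper name from the referenced repo: position i may only attend to positions up to i.

def get_subsequent_mask(seq):
    ''' Mask out positions after the current one for decoder self-attention (sketch). '''
    sz_b, len_s = seq.size()
    # lower-triangular matrix of ones: 1 = visible, 0 = masked
    subsequent_mask = (1 - torch.triu(
        torch.ones((1, len_s, len_s), device=seq.device), diagonal=1)).bool()
    return subsequent_mask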

class DecoderLayer(nn.Module):
    ''' Compose with three layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.enc_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(
            self, dec_input, enc_output,
            slf_attn_mask=None, dec_enc_attn_mask=None):
        dec_output, dec_slf_attn = self.slf_attn(
            dec_input, dec_input, dec_input, mask=slf_attn_mask)
        dec_output, dec_enc_attn = self.enc_attn(
            dec_output, enc_output, enc_output, mask=dec_enc_attn_mask)
        dec_output = self.pos_ffn(dec_output)
        return dec_output, dec_slf_attn, dec_enc_attn
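To tie the pieces together, here is a hedged end-to-end sketch that wires one encoder layer and one decoder layer with the positional encoding above, using the modules defined earlier (including the sketched MultiHeadAttention, PositionwiseFeedForward, and get_subsequent_mask); all hyperparameters and tensor sizes are illustrative assumptions.

# Illustrative hyperparameters in the spirit of the base Transformer.
d_model, d_inner, n_head, d_k, d_v = 512, 2048, 8, 64, 64

pos_enc = PositionalEncoding(d_model, n_position=200)
enc_layer = EncoderLayer(d_model, d_inner, n_head, d_k, d_v)
dec_layer = DecoderLayer(d_model, d_inner, n_head, d_k, d_v)

src = pos_enc(torch.rand(2, 10, d_model))   # (batch, src_len, d_model)
tgt = pos_enc(torch.rand(2, 7, d_model))    # (batch, tgt_len, d_model)

enc_out, _ = enc_layer(src)                          # encoder self-attention + FFN
tgt_mask = get_subsequent_mask(torch.ones(2, 7))     # look-ahead mask for the decoder
dec_out, _, _ = dec_layer(tgt, enc_out, slf_attn_mask=tgt_mask)
print(dec_out.shape)   # torch.Size([2, 7, 512])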

Conclusions and Reflections

This article is not so much an interpretation of Transformer as an interpretation of Transformer's source code. When I first read the paper I only skimmed some parts; it was only when I actually read the source code that I noticed how many details I had overlooked. Perhaps that is the value of reading the code alongside the paper. Here are some questions to think about:

  1. Why does the decoder input need to include the original input?
  2. Why is the mask needed? Can it be left out?

Reference

  1. Rush A. The Annotated Transformer[C]// Proceedings of Workshop for NLP Open Source Software (NLP-OSS). 2018.
  2. Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[J]. arXiv preprint arXiv:1706.03762, 2017.