This article uses a Transformer to build a simple "repeater" (copy-task) model. The code follows the Harvard NLP Annotated Transformer.

Torch version: 1.6.0

Import the third-party packages

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

The code is organized from the overall framework down to the individual components, starting with the full encoder-decoder structure.

1 Encoder Decoder architecture

Encoding: the encoder encodes the input sequence src together with the source mask src_mask

Decoding: the decoder decodes from the encoder's memory output, the source mask src_mask, the decoder input sequence tgt, and the decoder-side mask tgt_mask

class EncoderDecoder(nn.Module):

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask,
                            tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

The encoder and decoder in the code are the Transformer encoder and decoder, src_embed is the embedding (word-vector) matrix on the encoder side, and tgt_embed is the embedding matrix on the decoder side. A generator is also defined: it maps the decoder output vectors to the vocab dimension and applies log softmax; it can be regarded as part of the decoder.

class Generator(nn.Module):
    "Map the decoder output to vocab logits and apply log softmax."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

2 Encoder implementation

The encoder consists of six layers with the same architecture (the same architecture does not mean shared parameters). First, implement a helper that clones a module N times.

def clones(module, N):
    "Produce N identical layers (independent deep copies)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

Defining the encoder framework

Each layer takes x and the mask and produces an updated x; a final LayerNorm is applied at the end.

class Encoder(nn.Module):
    "Core encoder: a stack of N identical layers followed by a LayerNorm."
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input and mask through each layer in turn, then normalize."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

Layer Normalization

Arxiv.org/abs/1607.06…

class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

Connection between sub-layers

As mentioned above, the encoder is composed of six identical layers, and each layer contains two sub-layers (multi-head self-attention and the FFN). They are connected with residual connections, i.e. LayerNorm(x + Sublayer(x)).

Note that the output dimension of every sub-layer (and of the embedding layers) stays d_{\text{model}} = 512

class SublayerConnection(nn.Module):

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "residual connection"
        return x + self.dropout(sublayer(self.norm(x)))

Define EncoderLayer

EncoderLayer is a single layer of the Encoder class. It takes the input x and the mask, runs multi-head self-attention wrapped in a SublayerConnection, then passes the result through the FFN wrapped in a second SublayerConnection, and returns the updated vector.

class EncoderLayer(nn.Module):

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

The concrete implementation of the two sub-layers is described in detail later; next, build the decoder following the same pattern.

3. Decoder implementation

Like the encoder, the decoder also consists of six identical layers.

Define the decoder framework

The decoder still passes through six layers in sequence, and finally through a LayerNorm

The difference from the encoder is that the encoder's forward takes two arguments (x and the mask), while the decoder's takes four: the decoder input, the encoder memory, the encoder-side mask and the decoder-side mask.

class Decoder(nn.Module):

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

Define DecoderLayer

EncoderLayer has only two sub-layers (multi-head self-attention and the feed-forward network), while DecoderLayer has three: besides the decoder-side multi-head self-attention and the feed-forward network, there is an additional multi-head attention from the decoder side over the encoder side, similar to the attention mechanism in traditional seq2seq models.

class DecoderLayer(nn.Module):
    "Decoder layer: self-attention, source attention and feed forward."
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn      # decoder self-attention
        self.src_attn = src_attn        # attention over the encoder memory
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In DecoderLayer, the decoder input provides query, key and value for self-attention combined with the decoder-side mask, followed by a SublayerConnection; the output x is then used as the query while the encoder memory provides the keys and values for a second attention, followed by another SublayerConnection; finally the feed-forward network, wrapped in a third SublayerConnection.

Mask of the decoder

In particular, the decoder-side mask differs from the encoder-side mask. Both sides mask the pad tokens inside the batch, but the decoder-side mask must additionally prevent the model from looking ahead: unlike an RNN, which by construction only depends on previous steps, self-attention can see the whole sequence, so at step t the decoder may only attend to positions t and earlier, never later ones. Otherwise it would be using the known to predict the known, which is cheating and meaningless.

So, we want to construct a triangular matrix

Construct an upper triangle using np.triu

See: juejin.cn/post/693125…

def subsequent_mask(size):
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0

Let’s look at an example

print(subsequent_mask(5))
plt.figure(figsize=(5, 5))
plt.imshow(subsequent_mask(20)[0])
None

The output

tensor([[[ True, False, False, False, False],
         [ True,  True, False, False, False],
         [ True,  True,  True, False, False],
         [ True,  True,  True,  True, False],
         [ True,  True,  True,  True,  True]]])

The output diagram is as follows:

4. Multi-head self-attention mechanism

Set the multi-head part aside for a moment and look at plain scaled dot-product attention.

Self-attention

Just like traditional attention, the query and key compute the weights, which are then used for a weighted sum over the values. The formula is as follows:


\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

What is special here is the scaling factor \sqrt{d_k} in the denominator. The reason: when d_k is small it hardly matters whether you divide or not, but when d_k is large the dot products between query and key become very large, which can push softmax into its tiny-gradient region; scaling by \sqrt{d_k} keeps the scores at zero mean and unit variance.
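A one-line check of the variance claim (assuming, as the paper does, that the components of q and k are independent with zero mean and unit variance):

\mathrm{Var}(q \cdot k) = \mathrm{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k

so the raw dot product has standard deviation \sqrt{d_k}, and dividing by \sqrt{d_k} restores unit variance.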

The problem with dot-product attention is that everything after the dot product goes through a softmax, where the components influence each other (unlike tanh, which treats each component independently). As a result, the higher the dimension of the vectors, the wider the range of the dot products, and the more likely the maximum is much larger than the other values, so the softmax output approaches one-hot (compare softmax(np.random.random(10)) with softmax(100 * np.random.random(10)); the latter concentrates almost all probability mass on one dimension). During back-propagation, most entries of the softmax Jacobian are then close to zero, so gradients cannot flow.
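A minimal numpy sketch (not part of the original notebook) of the saturation effect just described: scaling the same random logits by 100 pushes almost all of the softmax mass onto a single dimension.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.random.random(10)
print(softmax(logits))        # fairly flat: every entry is roughly 0.1
print(softmax(100 * logits))  # close to one-hot: one entry near 1, the rest near 0

The scaled dot-product attention itself, with the optional mask and dropout, is implemented below.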

def attention(query, key, value, mask=None, dropout=None):

    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

In the implementation, wherever the mask matrix equals 0, the corresponding position in the scores matrix is set to a large negative number such as -1e9; then e^{-1e9} is essentially 0 after softmax, which is equivalent to ignoring those positions in attention.
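A tiny illustration of this trick (the scores and mask here are made-up values, not from the model): the masked position receives -1e9 and ends up with essentially zero attention weight.

import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 3.0]])
mask = torch.tensor([[1, 1, 0]])                   # the last position is masked out
masked_scores = scores.masked_fill(mask == 0, -1e9)
print(F.softmax(masked_scores, dim=-1))            # tensor([[0.7311, 0.2689, 0.0000]])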

Multi-head attention

Transformer uses multi-head attention so that the model can compute attention in different representation subspaces. The formula is as follows:


\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, \dots, \mathrm{head_h})W^O \quad \text{where}~\mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)

where W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v} and W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}.

Here d_{\text{model}} = 512, h = 8, and d_k = d_v = d_{\text{model}}/h = 64.

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # we assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # the same mask is applied to all h heads
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch: d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention to all the projected vectors in batch
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" the heads and compute the output through a final linear layer
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

The multi-head code is easy to follow: all heads are computed in parallel by packing them into one large matrix; see the code comments above for details.
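A quick shape walkthrough of the class above (batch size 2 and sequence length 5 are arbitrary values for the example; d_model = 512 and h = 8 follow the paper):

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 5, 512)        # (batch, seq_len, d_model)
out = mha(x, x, x)                # self-attention: query = key = value
print(out.shape)                  # torch.Size([2, 5, 512]) -- d_model is preserved
print(mha.attn.shape)             # torch.Size([2, 8, 5, 5]) -- one 5x5 attention map per head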

5 FFN layer

The FFN sub-layer consists of two linear (fully connected) layers with a ReLU in between. The formula is as follows:


\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

In the above equation, the inner dimension is d_{ff} = 2048.

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

6 Embedding layer

The word vector Embedding

Note: each embedding is additionally multiplied by \sqrt{d_{\text{model}}}.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # lookup table of shape (vocab size, d_model)
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

Positional Encoding

Since Transformer has no recurrence, plain self-attention cannot distinguish token order. To inject positional information, a positional encoding vector is added to the input; it must have the same dimension as the input embeddings so that the two can be summed.

Transformer's positional encoding is built from sin and cos functions; much later work drops this design in favour of learned (randomly initialized) position embeddings, which perform similarly. The formula is as follows:


PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})

PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})

In the above equations, pos is the position and i is the dimension index.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # compute the positional encodings once, in log space
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)
Plot a few dimensions of the encoding:

plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None

7 Building the complete model

All the inner modules have been implemented above; now just plug the sub-modules into the EncoderDecoder class.

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # initialize parameters with Glorot / Xavier uniform
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model
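A quick smoke test (the vocabulary sizes, N=2 and the toy sequences below are arbitrary, chosen only to check that the pieces fit together):

tmp_model = make_model(src_vocab=11, tgt_vocab=11, N=2)
src = torch.LongTensor([[1, 2, 3, 4, 5]])
tgt = torch.LongTensor([[1, 2, 3, 4]])
src_mask = torch.ones(1, 1, 5)          # no padding on the source side
tgt_mask = subsequent_mask(4)           # causal mask for the decoder input
out = tmp_model(src, tgt, src_mask, tgt_mask)
print(out.shape)                        # torch.Size([1, 4, 512])
print(tmp_model.generator(out).shape)   # torch.Size([1, 4, 11]): log-probs over the vocab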

8 Training details

Construct the batch

class Batch:
    "Hold a batch of data together with the masks needed for training."
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        # (batch size, 1, seq_len): mask out pad positions on the source side
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]      # decoder input
            self.trg_y = trg[:, 1:]     # decoder target (shifted by one)
            self.trg_mask = \
                self.make_std_mask(self.trg, pad)
            # number of valid (non-pad) target tokens
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask that hides both padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        # broadcast with the subsequent mask: (batch size, seq_len-1, seq_len-1)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask

The Batch class builds the encoder-side and decoder-side mask matrices from the input sequence, the output sequence and the pad index. In addition, following the usual seq2seq convention, the decoder input and the decoder target are obtained by shifting the output sequence by one position, as the sketch below shows.
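A hypothetical mini-batch (token ids chosen arbitrarily, 0 as the pad index) makes the shift and the two masks concrete:

src = torch.LongTensor([[1, 3, 4, 2, 0]])
trg = torch.LongTensor([[1, 5, 6, 2, 0]])
b = Batch(src, trg, pad=0)
print(b.trg)             # tensor([[1, 5, 6, 2]])  decoder input: last token dropped
print(b.trg_y)           # tensor([[5, 6, 2, 0]])  decoder target: first token dropped
print(b.src_mask)        # tensor([[[ True,  True,  True,  True, False]]])
print(b.trg_mask.shape)  # torch.Size([1, 4, 4])   pad mask AND-ed with the subsequent mask
print(b.ntokens)         # tensor(3): the pad position in trg_y is not counted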

Training loop

def run_epoch(data_iter, model, loss_compute):
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, 
                            batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                    (i, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens

The only thing to note here: the reported loss is normalized by batch.ntokens, so pad positions do not contribute to it.

The optimizer

  1. Adam with \beta_1 = 0.9, \beta_2 = 0.98 and \epsilon = 10^{-9}
  2. Warmup is used for the learning rate, and the formula is:

lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})

Here warmup\_steps = 4000.

The above formula is a piecewise function: when step_num is smaller than warmup_steps, lrate = d_{\text{model}}^{-0.5} \cdot step\_num \cdot warmup\_steps^{-1.5}, i.e. the learning rate grows linearly; beyond that point it decays as a negative power of the step, quickly at first and then more slowly.

The implementation is as follows:

class NoamOpt:
    "Optimizer wrapper that implements the warmup learning-rate schedule."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update the learning rate, then the parameters."
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement the lrate formula above."
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
             min(step ** (-0.5), step * self.warmup ** (-1.5)))

def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
                   torch.optim.Adam(model.parameters(), lr=0,
                                    betas=(0.9, 0.98), eps=1e-9))
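A quick numeric spot check of rate() (model_size=512, factor=1 and warmup=4000 are the same toy settings used in the plot below; the optimizer can be None because only rate() is called):

opt = NoamOpt(512, 1, 4000, None)
print(opt.rate(400))      # ~7.0e-05, still warming up (linear growth)
print(opt.rate(4000))     # ~7.0e-04, the peak at step == warmup
print(opt.rate(40000))    # ~2.2e-04, decaying as step ** -0.5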

Let’s draw the learning rate curve

opts = [NoamOpt(512, 1, 4000, None), 
        NoamOpt(512, 1, 8000, None),
        NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None

Regularization-Label Smoothing

Label smoothing penalizes the network for being over-confident in its predictions: part of the probability mass of the "1" in the one-hot ground truth is distributed evenly over the "0" entries. For a three-class example, the original y = (0, 1, 0) becomes y = (0.1, 0.8, 0.1) after smoothing.

The implementation is as follows:

class LabelSmoothing(nn.Module):
    "Implement label smoothing with a KL-divergence loss."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size            # vocab size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        # spread the smoothing mass over the size-2 "other" classes
        true_dist.fill_(self.smoothing / (self.size - 2))
        # put the confidence on the true class
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        # never predict <pad>
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            # rows whose target is <pad> contribute nothing
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))

Why does fill_ divide by size - 2? Suppose the raw dictionary is A, B, C (size 3); because the model needs padding, the dictionary becomes A, B, C, <pad> (size 4). Suppose the target label is A, so the one-hot vector is (1, 0, 0, 0). With smoothing = 0.2, the true class keeps confidence = 0.8, the 0.2 is split over the size - 2 = 2 remaining real classes B and C (0.1 each), and the <pad> position stays at 0 because we never want the model to predict <pad>. The smoothed target is therefore (0.8, 0.1, 0.1, 0).

scatter_ is then used to put the confidence value at the position of the true label.

Then the <PAD> column of the smoothed target matrix is set to 0.

Finally, when the target itself is <PAD> (sequences in a batch are padded to the maximum length; e.g. with max_len = 5 the output sequence B B A C becomes B B A C <PAD>), predicting a distribution for that position is meaningless and should not contribute to the loss, so the corresponding row of true_dist is set to all zeros (this is what index_fill_ does).

Let’s look at a simple example

crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
         Variable(torch.LongTensor([2, 1, 0])))
print(crit.true_dist)
# tensor([[0.0000, 0.1333, 0.6000, 0.1333, 0.1333],
#         [0.0000, 0.6000, 0.1333, 0.1333, 0.1333],
#         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
plt.imshow(crit.true_dist)
None

Let's look at how the loss changes as the predicted distribution becomes more and more peaked.

crit = LabelSmoothing(5, 0, 0.1)
def loss(x):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d],
                                 ])
    return crit(Variable(predict.log()),
                Variable(torch.LongTensor([1]))).data.item()
plt.plot(np.arange(1, 100), [loss(x) for x in range(1, 100)])
None

With a traditional one-hot loss the curve would keep decreasing, but with label smoothing the loss starts to increase slightly once the prediction becomes overly confident.

Loss calculation

class SimpleLossCompute:
    "A simple loss-compute and train function."
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        # map to the vocab dimension and take log softmax
        x = self.generator(x)
        # norm is the number of valid (non-pad) tokens in the batch
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data.item() * norm

9 Small experiment – a "repeater" copy task

Fake data

def data_gen(V, batch, nbatches):
    for i in range(nbatches):
        data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
        data[:, 0] = 1
        src = Variable(data, requires_grad=False).long()
        tgt = Variable(data, requires_grad=False).long()
        yield Batch(src, tgt, 0)

Set the dictionary size to 11, where 1 to 10 are normal tokens and 0 is the pad token.

Training

V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
                    torch.optim.Adam(model.parameters(), lr=0,
                                     betas=(0.9, 0.98), eps=1e-9))

for epoch in range(10):
    model.train()
    run_epoch(data_gen(V, 30, 20), model,
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    test_loss = run_epoch(data_gen(V, 30, 5), model,
                          SimpleLossCompute(model.generator, criterion, None))
    print("test_loss", test_loss)

Running record

Epoch Step: 1 Loss: 2.949874 Tokens per Sec: 557.973450
Epoch Step: 1 Loss: 1.857541 Tokens per Sec: 557.973450
test_loss tensor(1.8417)
Epoch Step: 1 Loss: 2.048431 Tokens per Sec: 596.984863
Epoch Step: 1 Loss: 1.577389 Tokens per Sec: 861.355225
test_loss tensor(1.6092)
Epoch Step: 1 Loss: 1.865752 Tokens per Sec:
Epoch Step: 1 Loss: 1.395658 Tokens per Sec: 942.581787
test_loss tensor(1.3495)
Epoch Step: 1 Loss: 2.041692 Tokens per Sec: 608.372864
Epoch Step: 1 Loss: 1.183396 Tokens per Sec:
Epoch Step: 1 Loss: 1.291280 Tokens per Sec: 667.504517
Epoch Step: 1 Loss: 0.924788 Tokens per Sec: 906.874023
test_loss tensor(0.9144)
Epoch Step: 1 Loss: 1.222422 Tokens per Sec:
Epoch Step: 1 Loss: 0.733476 Tokens per Sec: 1043.809326
test_loss tensor(0.7075)
Epoch Step: 1 Loss: 0.829088 Tokens per Sec: 663.332275
Epoch Step: 1 Loss: 0.296809 Tokens per Sec: 1100.190186
test_loss tensor(0.3417)
Epoch Step: 1 Loss: 1.048580 Tokens per Sec: 638.724670
Epoch Step: 1 Loss: 0.277764 Tokens per Sec: 970.994873
test_loss tensor(0.2576)
Epoch Step: 1 Loss: 0.393721 Tokens per Sec:
Epoch Step: 1 Loss: 0.385875 Tokens per Sec: 690.867737
test_loss tensor(0.3720)
Epoch Step: 1 Loss: 0.544152 Tokens per Sec: 441.701752
Epoch Step: 1 Loss: 0.238676 Tokens per Sec: 965.472900
test_loss tensor(0.2562)

Greedy decoding

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len-1):
        out = model.decode(memory, src_mask, 
                           Variable(ys), 
                           Variable(subsequent_mask(ys.size(1)).type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim = 1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, 
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
    return ys

model.eval()
src = Variable(torch.LongTensor([[1, 3, 2, 2, 4, 6, 7, 9, 10, 8]]) )
src_mask = Variable(torch.ones(1, 1, 10) )
print(greedy_decode(model, src, src_mask, max_len=10, start_symbol=1))

Generated result:

tensor([[ 1,  3,  2,  2,  4,  6,  7,  9, 10,  8]])

Attention visualization

def draw(data, x, y, ax):
    seaborn.heatmap(data,
                    xticklabels=x, square=True, yticklabels=y,
                    vmin=0.0, vmax=1.0, cbar=False, ax=ax)
sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Encoder Layer", layer + 1)
    for h in range(4):
        draw(model.encoder.layers[layer].self_attn.attn[0, h].data,
             sent, sent if h == 0 else [], ax=axs[h])
    plt.show()

Encoder Layer 1

Encoder Layer 2

tgt_sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Decoder Self Layer", layer + 1)
    for h in range(4):
        draw(model.decoder.layers[layer].self_attn.attn[0, h].data[:len(tgt_sent), :len(tgt_sent)],
             tgt_sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()
    print("Decoder Src Layer", layer + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        # use src_attn here: attention of the decoder over the encoder memory
        draw(model.decoder.layers[layer].src_attn.attn[0, h].data[:len(tgt_sent), :len(sent)],
             sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()

Decoder Self Layer 1

Decoder Src Layer 1

Decoder Self Layer 2

Decoder Src Layer 2

Reference

Github.com/harvardnlp/…