
In a word-level language model we use a recurrent neural network: its input is a sequence of indefinite length, but its output has a fixed length. For example, given the input "They are", the output might be "watching" or "sleeping". In many problems, however, the output is also a sequence of indefinite length. Take machine translation: the input is an English sentence and the output is a French sentence, and both are of variable length, for example

English: They are watching

French: Ils regardent

When both the input and the output sequences are of variable length, we can use an encoder-decoder, also known as seq2seq. Both architectures are based on two works from 2014:

  • Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
  • Sutskever et al., Sequence to Sequence Learning with Neural Networks

Both works essentially use two recurrent neural networks, called the encoder and the decoder. The encoder processes the input sequence and the decoder produces the output sequence.

Encoder-decoder

The encoder and the decoder are two recurrent neural networks corresponding to the input sequence and the output sequence, respectively. We usually append a special token '<eos>' (end of sequence) to both the input sequence and the output sequence to mark the end of a sequence. When running the model at test time, the output sequence terminates as soon as '<eos>' is generated.
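
As a minimal illustration of this stopping rule, here is a sketch of a greedy decoding loop. The `decoder_step` function is a hypothetical stand-in for a trained decoder (here it just replays a canned French output); only the '<eos>' stopping logic is the point.

```python
# Minimal sketch of greedy decoding that stops at '<eos>'.
def decoder_step(prev_token, state):
    # Hypothetical stand-in for a trained decoder: maps the previous
    # token (and some state) to the next token.
    canned = {'<bos>': 'Ils', 'Ils': 'regardent', 'regardent': '<eos>'}
    return canned[prev_token], state

def greedy_decode(max_len=10):
    output, token, state = [], '<bos>', None
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == '<eos>':        # stop as soon as '<eos>' is produced
            break
        output.append(token)
    return output

print(greedy_decode())  # ['Ils', 'regardent']
```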

The encoder

The encoder converts an input sequence of indefinite length into a fixed-length background vector $\boldsymbol{c}$. This background vector summarizes the information of the input sequence. A common choice of encoder is a recurrent neural network.

First, let us review some recurrent neural network basics. Suppose the recurrent neural network unit is $f$ and the input at time step $t$ is $\boldsymbol{x}_t$, $t=1,\ldots,T$. Suppose $\boldsymbol{x}_t$ is the embedding of a single word, e.g. $\boldsymbol{x}_t$ is the product $\boldsymbol{o}^\top \boldsymbol{E}$ of the word's one-hot vector $\boldsymbol{o} \in \mathbb{R}^x$ and the embedding layer parameter matrix $\boldsymbol{E} \in \mathbb{R}^{x \times h}$. The hidden state is then


$$\boldsymbol{h}_t = f(\boldsymbol{x}_t, \boldsymbol{h}_{t-1})$$

The background vector of the encoder is


$$\boldsymbol{c} = q(\boldsymbol{h}_1, \ldots, \boldsymbol{h}_T)$$

A simple choice of background vector is the hidden state $\boldsymbol{h}_T$ at the final time step. We call this recurrent neural network the encoder.
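
A minimal numpy sketch of such an encoder, using a plain tanh RNN cell for $f$ and taking $q(\boldsymbol{h}_1, \ldots, \boldsymbol{h}_T) = \boldsymbol{h}_T$; the dimensions and random weights are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, T = 4, 3, 5                      # assumed embedding size, hidden size, sequence length
W_xh = rng.normal(size=(x_dim, h_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(h_dim, h_dim)) * 0.1   # hidden-to-hidden weights
b_h = np.zeros(h_dim)

def f(x_t, h_prev):
    """One step of a plain tanh RNN cell: h_t = f(x_t, h_{t-1})."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

X = rng.normal(size=(T, x_dim))                # a toy input sequence x_1, ..., x_T
h = np.zeros(h_dim)
hidden_states = []
for t in range(T):
    h = f(X[t], h)
    hidden_states.append(h)

c = hidden_states[-1]                          # background vector: c = q(h_1, ..., h_T) = h_T
print(c.shape)                                 # (3,)
```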

Bidirectional recurrent neural network

The encoder can read its input either forward or backward. If the input sequence is $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$, the hidden state in the forward pass is


$$\overrightarrow{\boldsymbol{h}}_t = f(\boldsymbol{x}_t, \overrightarrow{\boldsymbol{h}}_{t-1})$$

In the backward pass, the hidden state is computed as


$$\overleftarrow{\boldsymbol{h}}_t = f(\boldsymbol{x}_t, \overleftarrow{\boldsymbol{h}}_{t+1})$$

When we want the encoder's output to contain information from both the forward and backward passes, we can use a bidirectional recurrent neural network. For example, given the input sequence $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$, the forward-pass hidden states are $\overrightarrow{\boldsymbol{h}}_1, \overrightarrow{\boldsymbol{h}}_2, \ldots, \overrightarrow{\boldsymbol{h}}_T$, and the backward-pass hidden states are $\overleftarrow{\boldsymbol{h}}_1, \overleftarrow{\boldsymbol{h}}_2, \ldots, \overleftarrow{\boldsymbol{h}}_T$. In a bidirectional recurrent neural network, the hidden state at time step $i$ is the concatenation of $\overrightarrow{\boldsymbol{h}}_i$ and $\overleftarrow{\boldsymbol{h}}_i$, for example

```python
import numpy as np

# Forward and backward hidden states at the same time step
h_forward = np.array([1, 2])
h_backward = np.array([3, 4])

# Concatenate them to form the bidirectional hidden state
h_bi = np.concatenate((h_forward, h_backward), axis=0)
print(h_bi)  # [1 2 3 4]
```

The decoder

The encoder finally outputs a background vector $\boldsymbol{c}$ that summarizes the information of the input sequence $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T$.

Suppose the output sequence in the training data is $\boldsymbol{y}_1, \boldsymbol{y}_2, \ldots, \boldsymbol{y}_{T'}$. We want the output at each time step $t'$ to depend on both the previous outputs and the background vector, so that we can maximize the joint probability of the output sequence


$$P(\boldsymbol{y}_1, \boldsymbol{y}_2, \ldots, \boldsymbol{y}_{T'}) = \prod_{t'=1}^{T'} P(\boldsymbol{y}_{t'} \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_{t'-1}, \boldsymbol{c})$$

and thereby obtain the loss function of the output sequence


$$-\log P(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_{T'})$$
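
Because the joint probability factorizes over time steps, this loss is simply the sum of the per-step negative log conditional probabilities. A tiny numeric sketch with made-up per-step probabilities:

```python
import numpy as np

# Assumed per-step conditional probabilities P(y_t' | y_1, ..., y_{t'-1}, c)
step_probs = np.array([0.8, 0.6, 0.9])

# -log P(y_1, ..., y_T') = sum over t' of -log P(y_t' | ...)
loss = -np.sum(np.log(step_probs))
print(loss)  # ≈ 0.839
```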

To model these conditional probabilities, we use another recurrent neural network as the decoder. The decoder uses a function $p$ to represent the probability of a single output $\boldsymbol{y}_{t'}$:


$$P(\boldsymbol{y}_{t'} \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_{t'-1}, \boldsymbol{c}) = p(\boldsymbol{y}_{t'-1}, \boldsymbol{s}_{t'}, \boldsymbol{c})$$

where $\boldsymbol{s}_{t'}$ is the hidden state of the decoder at time step $t'$, computed as


$$\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}, \boldsymbol{s}_{t'-1})$$

where the function $g$ is a recurrent neural network unit.

Note that the encoder and the decoder often use multi-layer recurrent neural networks.
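
A minimal sketch of one decoder step, using a plain tanh RNN cell for $g$ and a softmax over a linear map of the state as a simplified stand-in for $p(\boldsymbol{y}_{t'-1}, \boldsymbol{s}_{t'}, \boldsymbol{c})$; all dimensions and random weights are assumptions, and a real implementation would typically use a (possibly multi-layer) GRU or LSTM.

```python
import numpy as np

rng = np.random.default_rng(1)
y_dim, s_dim, c_dim, vocab = 4, 3, 3, 6        # assumed sizes
W_ys = rng.normal(size=(y_dim, s_dim)) * 0.1
W_cs = rng.normal(size=(c_dim, s_dim)) * 0.1
W_ss = rng.normal(size=(s_dim, s_dim)) * 0.1
W_out = rng.normal(size=(s_dim, vocab)) * 0.1

def g(y_prev, c, s_prev):
    """Decoder recurrence: s_t' = g(y_{t'-1}, c, s_{t'-1})."""
    return np.tanh(y_prev @ W_ys + c @ W_cs + s_prev @ W_ss)

def p(s):
    """Softmax distribution over the output vocabulary from the decoder state."""
    logits = s @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

y_prev = rng.normal(size=y_dim)                # embedding of the previous output y_{t'-1}
c = rng.normal(size=c_dim)                     # background vector from the encoder
s_prev = np.zeros(s_dim)                       # previous decoder state s_{t'-1}

s = g(y_prev, c, s_prev)
probs = p(s)
print(probs.sum())                             # ≈ 1.0
```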

Attention mechanism

In the decoder design above, the same background vector $\boldsymbol{c}$ is used at every time step. What if the decoder could use a different background vector at different time steps?

Taking English-French translation as an example, given the input sequence "They are watching" and the output sequence "Ils regardent", the decoder can use a background vector that encodes mostly the information of "They are" to generate "Ils" at time step 1, and a background vector that encodes mostly the information of "watching" to generate "regardent" at time step 2. In other words, at each time step the decoder allocates different amounts of attention to different parts of the input sequence. This is where the attention mechanism comes in.

Now, let us make a few changes to the decoder above. Assume the background vector at time step $t'$ is $\boldsymbol{c}_{t'}$. Then the hidden state of the decoder at time step $t'$ is


$$\boldsymbol{s}_{t'} = g(\boldsymbol{y}_{t'-1}, \boldsymbol{c}_{t'}, \boldsymbol{s}_{t'-1})$$

Let the hidden state of the encoder at time step $t$ be $\boldsymbol{h}_t$. The background vector of the decoder at time step $t'$ is then


$$\boldsymbol{c}_{t'} = \sum_{t=1}^{T} \alpha_{t' t} \boldsymbol{h}_t$$

In other words, given the decoder's current time step $t'$, we take a weighted average of the encoder hidden states at different time steps. The weights $\alpha_{t' t}$ are called attention weights and are computed as


$$\alpha_{t' t} = \frac{\exp(e_{t' t})}{\sum_{k=1}^{T} \exp(e_{t' k})}$$

where $e_{t' t} \in \mathbb{R}$ is


$$e_{t' t} = a(\boldsymbol{s}_{t'-1}, \boldsymbol{h}_t)$$

There are several ways to design the function $a$. In Bahdanau et al.'s paper,


$$e_{t' t} = \boldsymbol{v}^\top \tanh(\boldsymbol{W}_s \boldsymbol{s}_{t'-1} + \boldsymbol{W}_h \boldsymbol{h}_t),$$

where $\boldsymbol{v}$, $\boldsymbol{W}_s$, and $\boldsymbol{W}_h$, together with the weight and bias terms of the encoder and decoder and the parameters of the embedding layers, are model parameters that are all learned jointly.
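
A minimal numpy sketch of this additive attention: compute the scores $e_{t' t}$, softmax them into weights $\alpha_{t' t}$, and form $\boldsymbol{c}_{t'}$ as the weighted sum of the encoder hidden states. The dimensions and random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
T, h_dim, s_dim, a_dim = 3, 4, 4, 5            # assumed sizes
H = rng.normal(size=(T, h_dim))                # encoder hidden states h_1, ..., h_T
s_prev = rng.normal(size=s_dim)                # decoder state s_{t'-1}
W_s = rng.normal(size=(a_dim, s_dim)) * 0.1
W_h = rng.normal(size=(a_dim, h_dim)) * 0.1
v = rng.normal(size=a_dim) * 0.1

# Scores e_{t't} = v^T tanh(W_s s_{t'-1} + W_h h_t), one per encoder time step
e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ H[t]) for t in range(T)])

# Attention weights alpha_{t't}: softmax over the encoder time steps
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Background vector c_{t'} = sum_t alpha_{t't} h_t
c_t = alpha @ H
print(alpha.sum(), c_t.shape)                  # ≈ 1.0 (4,)
```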

In Bahdanau et al.'s paper, GRUs are used in both the encoder and the decoder.

In the decoder, we need to modify the design of the GRU slightly. Suppose $\boldsymbol{y}_t$ is the embedding of a single output word, e.g. $\boldsymbol{y}_t$ is the product $\boldsymbol{o}^\top \boldsymbol{B}$ of the word's one-hot vector $\boldsymbol{o} \in \mathbb{R}^y$ and the embedding layer parameter matrix $\boldsymbol{B} \in \mathbb{R}^{y \times s}$. Suppose the background vector at time step $t'$ is $\boldsymbol{c}_{t'}$. Then the hidden state of the decoder at time step $t'$ is


$$\boldsymbol{s}_{t'} = \boldsymbol{z}_{t'} \odot \boldsymbol{s}_{t'-1} + (1 - \boldsymbol{z}_{t'}) \odot \tilde{\boldsymbol{s}}_{t'}$$

where the reset gate, the update gate, and the candidate hidden state are, respectively,


$$\begin{aligned} \boldsymbol{r}_{t'} &= \sigma(\boldsymbol{W}_{yr} \boldsymbol{y}_{t'-1} + \boldsymbol{W}_{sr} \boldsymbol{s}_{t'-1} + \boldsymbol{W}_{cr} \boldsymbol{c}_{t'} + \boldsymbol{b}_r),\\ \boldsymbol{z}_{t'} &= \sigma(\boldsymbol{W}_{yz} \boldsymbol{y}_{t'-1} + \boldsymbol{W}_{sz} \boldsymbol{s}_{t'-1} + \boldsymbol{W}_{cz} \boldsymbol{c}_{t'} + \boldsymbol{b}_z),\\ \tilde{\boldsymbol{s}}_{t'} &= \tanh(\boldsymbol{W}_{ys} \boldsymbol{y}_{t'-1} + \boldsymbol{W}_{ss} (\boldsymbol{s}_{t'-1} \odot \boldsymbol{r}_{t'}) + \boldsymbol{W}_{cs} \boldsymbol{c}_{t'} + \boldsymbol{b}_s) \end{aligned}$$
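
A minimal numpy sketch of this modified GRU step, computing the reset gate, update gate, candidate state, and new state from $\boldsymbol{y}_{t'-1}$, $\boldsymbol{s}_{t'-1}$, and $\boldsymbol{c}_{t'}$; all sizes and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
y_dim, s_dim, c_dim = 4, 3, 3                  # assumed sizes
params = {name: rng.normal(size=shape) * 0.1 for name, shape in {
    'W_yr': (s_dim, y_dim), 'W_sr': (s_dim, s_dim), 'W_cr': (s_dim, c_dim),
    'W_yz': (s_dim, y_dim), 'W_sz': (s_dim, s_dim), 'W_cz': (s_dim, c_dim),
    'W_ys': (s_dim, y_dim), 'W_ss': (s_dim, s_dim), 'W_cs': (s_dim, c_dim),
}.items()}
b_r = b_z = b_s = np.zeros(s_dim)

def gru_attention_step(y_prev, s_prev, c_t, p):
    r = sigmoid(p['W_yr'] @ y_prev + p['W_sr'] @ s_prev + p['W_cr'] @ c_t + b_r)              # reset gate
    z = sigmoid(p['W_yz'] @ y_prev + p['W_sz'] @ s_prev + p['W_cz'] @ c_t + b_z)              # update gate
    s_tilde = np.tanh(p['W_ys'] @ y_prev + p['W_ss'] @ (s_prev * r) + p['W_cs'] @ c_t + b_s)  # candidate state
    return z * s_prev + (1 - z) * s_tilde      # new state s_t' (elementwise gating)

y_prev = rng.normal(size=y_dim)                # embedding of the previous output y_{t'-1}
s_prev = np.zeros(s_dim)                       # previous decoder state s_{t'-1}
c_t = rng.normal(size=c_dim)                   # attention background vector c_{t'}
print(gru_attention_step(y_prev, s_prev, c_t, params).shape)  # (3,)
```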

Conclusion

  • The input and output of an encoder-decoder (seq2seq) can both be variable-length sequences
  • Applying the attention mechanism to the decoder allows it to use a different background vector at each time step; each background vector corresponds to allocating different amounts of attention to different parts of the input sequence