Author: Li Xuedong

Editor: Li Xuedong

Foreword

Seq2Seq stands for Sequence to Sequence. It is a general-purpose encoder-decoder framework that can be used for machine translation, text summarization, conversational modeling, image captioning, and other scenarios. Seq2Seq is not an official open-source implementation of the Google Neural Machine Translation (GNMT) system; the framework is designed for a wide range of tasks, and neural machine translation is just one of them. In the article on recurrent neural networks we learned how to map a sequence to a fixed-length output. In this article, we explore how to map a sequence to an output sequence of variable length (for example, in machine translation the source and target sentences are often not the same length).

1

Brief introduction

(1) Design objectives

  • This framework was originally built for machine translation, but has since been used for a variety of other tasks, including text summarization, conversational modeling, and image captioning. As long as our task can be framed as encoding input data in one format and decoding it into another, we can use or extend this framework.

  • Usability: supports many types of input data, including standard raw text.


  • Extensibility: the code is built in a modular manner, so adding a new attention mechanism or encoder architecture requires only minimal code changes.

  • Documentation: All code is documented using standard Python docstrings, and usage guides are written to help us get started with common tasks.

  • Good performance: for the sake of code simplicity, the development team did not try to squeeze out every last bit of performance, but the current implementation is fast enough for almost all production and research projects. In addition, tf-seq2seq also supports distributed training.

(2) Main concepts

  • Configuration

Many objects are configured using key-value pairs. These parameters are usually passed in the form of YAML through a configuration file or directly through the command line. Configurations are usually nested, as shown in the following example:

model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerBahdanau
  attention.params:
    num_units: 512
  embedding.dim: 1024
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: LSTMCell
      cell_params:
        num_units: 512

  • Input Pipeline

The InputPipeline defines how data is read, parsed, and separated into features and labels. To read a new data format, we need to implement our own input pipeline (a rough sketch of the idea appears after this list).

  • Encoder

  • Decoder

  • The Model (Attention)
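
As a very rough Python sketch of the idea behind an input pipeline (this is not the tf-seq2seq InputPipeline API; the function name and file paths are made up for illustration), one can read a line-aligned parallel corpus and yield (features, labels) pairs:

def parallel_text_pipeline(source_path, target_path):
    """Yield (features, labels) pairs from a line-aligned parallel corpus."""
    with open(source_path, encoding="utf-8") as src_file, \
         open(target_path, encoding="utf-8") as tgt_file:
        for src_line, tgt_line in zip(src_file, tgt_file):
            features = src_line.strip().split()   # tokenized source sentence
            labels = tgt_line.strip().split()     # tokenized target sentence
            yield features, labels

# Hypothetical usage:
# for features, labels in parallel_text_pipeline("train.sources.txt", "train.targets.txt"):
#     ...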

2

Encoder-Decoder

The whole process can be illustrated by the following figure:

Figure 1: The simplest Encoder-Decoder model

where X and Y consist of their respective word sequences (X and Y may be in the same language or in two different languages):

X = (x1, x2, …, xm)

Y = (y1, y2, …, yn)

Encoder: transforms the input sequence, through nonlinear transformations, into a fixed-length semantic vector C (the middle of the figure). There are several ways to obtain C: the simplest is to take the encoder's last hidden state as C; another is to apply a further transformation to that last hidden state; a third is to apply a transformation to all of the hidden states.

C = F(x1, x2, …, xm)

Decoder: generates the word yi at time step i based on the semantic vector C (the encoder's output) and the previously generated history y1, y2, …, yi-1.

yi = G(C, y1, y2, …, yi-1)
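
To make the two formulas above concrete, here is a minimal PyTorch sketch of an encoder-decoder (an illustration of the idea only, not the tf-seq2seq or GNMT implementation; all class names and hyperparameters are made up for this example). The encoder compresses the input sequence into the semantic vector C, and the decoder generates the output word by word, feeding each prediction back in until an end-of-sequence symbol is produced or a length limit is reached:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, src_len) of word ids
        _, h_last = self.rnn(self.embed(x))
        return h_last                      # last hidden state -> semantic vector C

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev, state):      # one step: previous word + previous state
        output, state = self.rnn(self.embed(y_prev), state)
        return self.out(output), state     # scores over the target vocabulary

def greedy_generate(encoder, decoder, src, bos_id, eos_id, max_len=20):
    """Generate a variable-length output sequence from a source sequence."""
    state = encoder(src)                               # C initializes the decoder state
    y_prev = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        logits, state = decoder(y_prev, state)
        y_prev = logits.argmax(dim=-1)                 # yi = argmax G(C, y1, ..., yi-1)
        generated.append(y_prev)
        if (y_prev == eos_id).all():
            break
    return torch.cat(generated, dim=1)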

Below is a schematic diagram of the generation of couplets.

Figure 2: An everyday example

  • Encoding phase

In an RNN, the hidden state at the current time step is determined by the hidden state of the previous step and the input at the current step:

ht = f(ht-1, xt)

After the hidden states of all time steps have been obtained, they are summarized to produce the final semantic vector:

C = q(h1, h2, …, hm)

Of course, the simplest choice is to use the last hidden state directly as the semantic vector, i.e. C = hm.

  • Decoding phase

This stage can be viewed as the inverse of encoding. The next output word yt is predicted from the given semantic vector C and the previously generated output sequence y1, y2, …, yt-1, i.e.

yt = G(C, y1, y2, …, yt-1),

which can also be written as the conditional probability P(yt | y1, y2, …, yt-1, C).

In an RNN, this can be simplified to

P(yt | y1, …, yt-1, C) = g(st, yt-1, C)

where st is the hidden state of the decoder RNN at time step t, C is the semantic vector produced by the encoder, and yt-1 is the output of the previous time step, which in turn serves as the input to the current step. g can be a nonlinear, multi-layer neural network that outputs, for each word in the vocabulary, the probability that it is yt.
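
As a small numeric sketch of a single decoding step (the exact forms of f and g differ between implementations; the additive recurrence and softmax output below are just one common choice, and all array names are made up for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden_dim, emb_dim, vocab_size = 4, 3, 6
rng = np.random.default_rng(0)

# Toy decoder parameters: recurrence weights and the output projection used by g.
W_s = rng.normal(size=(hidden_dim, hidden_dim))    # acts on the previous state s_{t-1}
W_y = rng.normal(size=(hidden_dim, emb_dim))       # acts on the previous output y_{t-1}
W_c = rng.normal(size=(hidden_dim, hidden_dim))    # acts on the semantic vector C
V   = rng.normal(size=(vocab_size, hidden_dim))    # projects s_t onto the vocabulary

s_prev = np.zeros(hidden_dim)         # decoder state s_{t-1}
y_prev = rng.normal(size=emb_dim)     # embedding of the previously generated word
C      = rng.normal(size=hidden_dim)  # semantic vector from the encoder

# s_t = f(s_{t-1}, y_{t-1}, C);  P(y_t | y_<t, C) = g(s_t, y_{t-1}, C) = softmax(V s_t)
s_t = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ C)
p_yt = softmax(V @ s_t)
print(p_yt)            # probability of each vocabulary word being y_t
print(p_yt.argmax())   # index of the most probable next word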

3

Attention model

The encoder-decoder model is classic, but its limitations are also significant. The biggest limitation is that the only connection between encoding and decoding is a single fixed-length semantic vector C; in other words, the encoder has to compress the information of the entire sequence into one fixed-length vector. This has two drawbacks. First, the semantic vector may not be able to represent the information of the whole sequence. Second, the information carried by earlier inputs is diluted by later inputs, and the longer the input sequence, the more severe this becomes. As a result, the decoder may not have enough information about the input sequence from the very start, so decoding accuracy drops to some extent.

To address these problems, the Attention model was proposed about a year after Seq2Seq appeared. When producing each output, the model generates an attention distribution indicating which parts of the input sequence to focus on, and then generates the next output based on the attended region, and so on. Attention bears some resemblance to human behavior: when reading a passage, people usually focus only on the informative words rather than on all words, i.e. they assign different attention weights to different words. Although the attention mechanism makes the model harder to train, it improves the quality of text generation. A rough sketch of the model is shown below.

Figure 3: The classical Attention model

Each context vector automatically selects the information most relevant to the y currently being generated. Specifically, aij measures the correlation between hj, the encoder state at step j, and decoding step i; the context vector ci used at decoder step i is then the weighted sum of all hj with weights aij.

Figure 4: Schematic diagram of different concerns

The input sequence is "我爱中国" ("I love China"), so h1, h2, h3 and h4 in the encoder can be regarded as the information representing "我", "爱", "中" and "国" respectively. When translating into English, the first context vector c1 should be most relevant to "我" ("I"), so the corresponding a11 is relatively large while a12, a13 and a14 are small. c2 should be most associated with "爱" ("love"), so a22 is larger. Finally, c3 is most correlated with h3 and h4, so a33 and a34 have larger values. But how exactly is the weight aij computed?

For example, the input English sentence "Tom chase Jerry" is translated word by word into "汤姆" (Tom), "追逐" (chase), "杰瑞" (Jerry). The general process for computing the attention probability distribution is:

Figure 5: Schematic diagram of weight calculation

The attention weight that the current output word yi assigns to an input word j is determined by the decoder's current hidden state Hi and the hidden state hj of input word j. A softmax then normalizes these scores into probabilities between 0 and 1. In other words, the alignment probability between the target word yi and each input word is obtained through a function F(hj, Hi). For more details, please refer to the source articles linked at the end of this article.
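
A tiny numpy sketch of this computation (dot-product scoring is used here purely for simplicity; Bahdanau-style attention computes F(hj, Hi) with a small feed-forward network instead, and all variable names are made up for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
m, hidden_dim = 4, 8                    # e.g. four encoder steps for "我 爱 中 国"
H = rng.normal(size=(m, hidden_dim))    # encoder hidden states h_1 ... h_m
H_i = rng.normal(size=hidden_dim)       # decoder hidden state at output step i

scores = H @ H_i                        # e_ij = F(h_j, H_i), here a simple dot product
a_i = softmax(scores)                   # attention weights a_i1 ... a_im, summing to 1
c_i = a_i @ H                           # context c_i = sum_j a_ij * h_j
print(a_i)
print(c_i)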

4

CNN-based Seq2Seq

Currently, most Seq2Seq models used in practice are built on RNNs. Although they have achieved good results, some researchers have found that replacing the encoder or the decoder in Seq2Seq with a CNN can give better results. Recently, Facebook published the paper Convolutional Sequence to Sequence Learning, which builds a Seq2Seq model for machine translation entirely out of CNNs and surpasses the LSTM-based machine translation results from Google. An important reason for this network's success lies in a number of tricks that are worth learning from:

  • Capture long-distance dependencies

The lower CNN layers capture dependencies between nearby words, while the higher layers capture dependencies between distant words. Through this hierarchical structure, the network achieves, much like an RNN (LSTM), the ability to capture dependencies in sequences longer than 20 words.

  • High efficiency

Suppose the sequence length is n. Modeling it with an RNN (LSTM) requires n sequential operations, i.e. time complexity O(n). In contrast, the stacked CNN needs only on the order of n/k sequential operations, i.e. O(n/k), where k is the size of the convolution window; for example, with k = 5, relating words 25 positions apart takes roughly 25/5 = 5 stacked convolution layers instead of 25 sequential RNN steps.

  • Parallel implementation

Because an RNN models a sequence by depending on the sequence's history, it cannot be parallelized. In contrast, the stacked CNN convolves over the whole sequence without depending on that history, so it can be computed in parallel. This matters especially in industrial settings with large data volumes and real-time requirements, where it makes model training much faster.

  • Fusion of multi-layer attention

Multi-layer attention is combined with residual connections and linear mappings. Attention is used to decide which parts of the input are important and to pass that information on: the decoder output is dot-multiplied with the encoder output and normalized to obtain weights, these weights are applied to the encoder's input representations X, and the weighted sum is added back into the decoder to predict the target-language sequence.

  • Gating mechanism

A GLU (gated linear unit) is used as the gating mechanism. The convolution produces twice the usual number of channels, which are split into two halves A and B, and the GLU activation is v([A; B]) = A ⊗ σ(B), where the gate σ(B) controls which parts of A are passed on (a code sketch of this gated block follows the list below).

  • Gradient clipping and careful weight initialization are used to accelerate model training and convergence
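
A rough PyTorch sketch of one gated convolutional block with a residual connection (it ignores the padding/causality details, weight normalization, and attention wiring of the actual ConvS2S model; the class name and sizes are made up for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """ConvS2S-style block: convolution -> GLU gate -> residual connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # The convolution doubles the channels; GLU later splits them into A and B.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):              # x: (batch, channels, seq_len)
        ab = self.conv(x)              # (batch, 2 * channels, seq_len)
        gated = F.glu(ab, dim=1)       # A * sigmoid(B), back to (batch, channels, seq_len)
        return gated + x               # residual connection

block = GatedConvBlock(channels=16)
x = torch.randn(2, 16, 10)             # a batch of 2 sequences of length 10
print(block(x).shape)                  # torch.Size([2, 16, 10])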

It is hard to say categorically whether the CNN-based Seq2Seq model or the LSTM-based one is better. The biggest advantage of the CNN-based Seq2Seq is its speed and efficiency, while the disadvantage is that it has many hyperparameters to tune. When applying CNNs and RNNs to NLP problems, CNNs are also feasible, and the network structure is flexible and efficient. Since RNN training often requires the state of the previous time step, it is difficult to parallelize. Especially on large data sets, CNN-based Seq2Seq can often achieve better results than RNN-based Seq2Seq.

5

Application areas

  • Machine translation

  • Speech recognition: the input is a sequence of speech signals and the output is a text sequence.

  • Text summarization: the input is a text sequence and the output is a summary of that text. Summarization methods generally fall into two categories: extractive and abstractive. The former ranks sentences from one or more documents, picks the most informative ones, and combines them into a summary. The latter works more like a human editor, understanding the text and restating it in concise words. In practice, extractive methods are more practical and widely used, but they have problems with coherence and consistency and require some post-processing; abstractive methods can solve these problems well, but they are much harder to get right.

  • Dialogue generation: after the Seq2Seq model was proposed, a lot of work applied it to the chatbot task, hoping to train a model on massive data that can answer open-domain questions. Another group of researchers has made progress in very vertical domains, such as buying movie tickets, by combining Seq2Seq with existing knowledge bases to build task-specific chatbots.

  • Poetry generation: having a machine write poetry for you is no longer a distant dream. One of the most interesting applications of the Seq2Seq model is poetry generation: given one line of a poem, generate the next line.

  • Code completion

    Figure 8: Code completion diagram

  • Pre-training: in 2015, Google proposed using a Seq2Seq autoencoder as a pre-training step for LSTM text classification, which improved classification stability. This means the purpose of Seq2Seq is no longer limited to producing the output sequence itself, opening a new page for its applications.

  • Reading comprehension

    Encode the input text and the question separately, and then decode to obtain the answer to the question.

           

The seq-to-seq model was first proposed in the field of machine translation and was later widely applied across NLP. The reason lies in how naturally it handles sequence data and in the fact that it removes the earlier constraint that an RNN's output length is fixed, so it spread quickly. But seq-to-seq is not a panacea; it works best in the right setting.

  1. Source code: https://github.com/google/seq2seq

  2. Convolutional Sequence to Sequence Learning: https://arxiv.org/abs/1705.03122

  3. Language Modeling with Gated Convolutional Networks: https://arxiv.org/abs/1612.08083

  4. A Convolutional Encoder Model for Neural Machine Translation: https://arxiv.org/abs/1611.02344

  5. Google Neural Machine Translation: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html

  6. Jianshu: Datartisan, https://www.jianshu.com/p/124b777e0c55

  7. Zhihu: Li Ning, https://zhuanlan.zhihu.com/p/30516984

  8. Zhihu: https://zhuanlan.zhihu.com/p/28054589

  9. PaperWeekly: https://zhuanlan.zhihu.com/p/26753131
