Directory:

● Applications of recurrent neural networks

● Text classification

● Sequence labeling

● Machine translation

● Attention-based model

● Summary of the RNN series

● Applications of recurrent neural networks


At present, recurrent neural networks are applied in many fields, such as speech recognition (ASR), speech synthesis (TTS), chatbots, and machine translation. In recent years, recurrent neural networks have also been used in research on word segmentation and part-of-speech tagging in natural language processing. In this section, we will introduce several typical applications of recurrent neural networks, so as to understand how they are combined with actual application scenarios.

According to different application scenarios and requirements, the tasks of recurrent neural networks can be roughly divided into two categories: one is the sequence-to-category mode, and the other is the sequence-to-sequence mode. Sequence-to-sequence problems can be further divided into the “synchronous sequence-to-sequence mode” and the “asynchronous sequence-to-sequence mode”. Next, we will use three cases to further understand these three modes.

Text classification

At present, text classification is one of the most common problems in the field of Natural Language Processing (NLP); examples include spam detection and sentiment polarity analysis of user comments. The sequence-to-category mode is suitable for text classification problems: we input a piece of text of length N into the recurrent neural network, and the network outputs a single category, that is, an output of length 1.

Suppose we want to classify the sentiment polarity of user comments in the takeout industry. As shown in Figure 1, we input into the neural network a piece of a user's comment on a product for sale.



Figure 1. Schematic diagram of a recurrent neural network for text classification

The recurrent neural network has one output for each “time step”, but for a simple classification problem, we do not need so many outputs. A common and simple processing method is to keep only the output of the last “time step”, as shown in Figure 2:



Figure 2. Schematic diagram of the “sequence-to-category mode” recurrent neural network
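
As a concrete illustration, the following is a minimal tf.keras sketch of this “sequence-to-category” setup. The vocabulary size, embedding size and number of classes are made-up values for illustration; keeping only the last time step simply means leaving return_sequences at its default of False in the LSTM layer.

import tensorflow as tf

# Sequence-to-category sketch: classify a whole comment into one of
# num_classes labels using only the RNN's last time step.
vocab_size, num_classes = 10000, 2   # illustrative values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),   # token ids -> vectors
    tf.keras.layers.LSTM(128),                   # default return_sequences=False:
                                                 # only the last time step is kept
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])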

Sequence labeling

Word segmentation is the most basic and important task in natural language processing. With the development of deep learning, many people have begun to apply deep learning to this field, and some progress has been made in the past two years. Although traditional algorithms such as CRF and HMM are still commonly used for word segmentation, part-of-speech tagging and similar tasks, the achievements of deep learning have been recognized by more and more people, and it continues to stand out in natural language processing tasks.

Whether we use the traditional CRF algorithm or a recurrent neural network to train a word segmentation model, we need to annotate the training data first. Taking the 4-tag method as an example, suppose we have the training sample “Beijing is the capital of China”; the tagged data takes the following form:
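
For illustration, assuming the sample corresponds to the Chinese sentence 北京是中国的首都 (“Beijing is the capital of China”), a character-level 4-tag annotation would be:

北/B 京/E 是/S 中/B 国/E 的/S 首/B 都/E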



In the 4-tag method, there are four tags: B, M, E and S. B means the character is the first character of a word; M means the character is in the middle of a word (if a word consists of more than two characters, all characters except the first and last are marked with M); E means the character is the last character of a word; and S means the character forms a word by itself. In a sequence labeling task such as word segmentation, each “time step” corresponds to one input and one output. For this problem, we use the synchronous sequence-to-sequence mode, as shown in Figure 3:



Figure 3. Schematic diagram of the “synchronous sequence-to-sequence mode” recurrent neural network
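
As a sketch of this synchronous sequence-to-sequence setup, the following tf.keras model emits one tag per input character by setting return_sequences=True. The vocabulary size and tag count (B, M, E, S) are illustrative assumptions.

import tensorflow as tf

# Synchronous sequence-to-sequence sketch for 4-tag word segmentation:
# every time step produces a tag distribution over {B, M, E, S}.
vocab_size, num_tags = 5000, 4   # illustrative values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128, return_sequences=True),   # keep every time step
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_tags, activation="softmax")),  # one tag per character
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")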

Machine translation

For machine translation, the recurrent neural network adopts an “asynchronous sequence-to-sequence mode” network structure. It is still a sequence-to-sequence pattern; what distinguishes it from the “synchronous sequence-to-sequence mode” used for sequence labeling is that the “asynchronous sequence-to-sequence mode” places no restriction on the lengths of the input and output sequences. In the sequence labeling problem, each “time step” has one input and one corresponding output, so the input and output sequences have the same length. In the machine translation problem, however, the length of the input sequence and the length of the output sequence are not necessarily equal.

The “asynchronous sequence-to-sequence mode” recurrent neural network is often referred to as the Sequence-to-Sequence model, also known as the encoder-decoder model. It is called the encoder-decoder model because we divide the network into two parts: the encoder and the decoder. As shown in Figure 4, the encoder encodes the input sequence into an intermediate vector:



Figure 4. Schematic diagram of the encoder

The simplest way to encode is to assign the encoder's state at the last time step, h_T, directly to the intermediate vector c; alternatively, a function f can be applied, so that c = f(h_T), or c = f(h_1, ..., h_T) using all the intermediate states. Once we have the intermediate vector c, all that remains is to decode it. A commonly used decoding method is shown in Figure 5 (left): the model takes the encoded vector c as the initial state of the decoder, and feeds the output of each time step in as the input of the next time step until decoding is completed. “EOS” marks the end of the input and output sequences. Shown on the right of Figure 5 is an alternative decoding approach, which takes the encoded vector c as an input at every “time step” of the decoder.

For a more specific Sequence-to-Sequence model, you can read the paper published by Bengio et al. in 2014 [1] and Google's paper from 2014 [2].



Figure 5. Schematic diagram of two different decoder models
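
The following is a minimal tf.keras sketch of the encoder-decoder structure corresponding to the left side of Figure 5, where the encoder's final state initializes the decoder. Vocabulary sizes and dimensions are illustrative assumptions, and training is assumed to use teacher forcing (the target sequence shifted by one step is fed as decoder input).

import tensorflow as tf

src_vocab, tgt_vocab, units = 8000, 8000, 256   # illustrative values

# Encoder: compress the source sequence into its final LSTM state (h, c)
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, units)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: initialized with the encoder state, predicts the next target token
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, units)(dec_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(units, return_sequences=True,
                                     return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time, decoding would be run step by step, starting from a start symbol and feeding each predicted token back in as the next decoder input until “EOS” is produced.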

● Attention-based model


Although the encoder-decoder model has achieved very good results in many applications such as machine translation, speech recognition and text summarization, it also has shortcomings. The encoder compresses the input sequence into a single fixed-length vector, which the decoder then decodes to obtain the output sequence. The representational capacity of a fixed-length vector is limited, yet the decoder depends entirely on it. Therefore, when the input sequence is long, it is difficult for the encoder to encode all the important information into this fixed-length vector, which greatly reduces the effectiveness of the model.

The Attention mechanism was introduced to solve this problem. A neural network model with an Attention mechanism is also called an attention-based model. In this section, we will introduce the Soft Attention Model, which is the most common and widely used attention model. To solve the problem that the single fixed-length encoding vector of the traditional encoder-decoder model cannot retain all the useful information in a long input sequence, the attention-based model introduces multiple encoding vectors: in the decoder, each output corresponds to its own encoding vector, as shown in Figure 6.





Figure 8. Schematic diagram of the Attention calculation process

We take the calculation of the first encoding vector c_1 as an example. First, the initial state s_0 of the decoder and the output h_j of each time step in the encoder are used to compute similarity scores e_1j; a Softmax operation then converts these scores into probability values (attention weights) α_1j. Finally, the encoding vector is computed as the weighted sum c_1 = Σ_j α_1j · h_j. The decoder's output at the next time step is then used in the same way to compute the next encoding vector, and so on until the decoding process is completed.
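
The attention step described above can be sketched in a few lines of NumPy. The dot-product scoring function and the shapes below are illustrative assumptions; the similarity can also be computed with a small neural network.

import numpy as np

T, hidden = 6, 8                          # encoder length and state size (illustrative)
h = np.random.randn(T, hidden)            # encoder outputs h_1 ... h_T
s_prev = np.random.randn(hidden)          # previous decoder state s_0

scores = h @ s_prev                       # e_1j = score(s_0, h_j), here a dot product
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()               # softmax -> attention weights alpha_1j
c_1 = (alpha[:, None] * h).sum(axis=0)    # encoding vector c_1 = sum_j alpha_1j * h_j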

The above is the traditional Soft Attention Model. In addition, there are other attention-based models, for both natural language processing and images. In the paper “Attention Is All You Need” published by Google in 2017 [3], Google tried to do away with CNNs and RNNs and implement the encoder-decoder task with pure attention, achieving very good results.

● Summary of the RNN series


This concludes the chapter. In this chapter, we started from the most basic simple structure of the recurrent neural network, introduced its calculation process and how to implement it with TensorFlow, and introduced several commonly used recurrent neural network structures. In the fourth section, we introduced a problem of recurrent neural networks, the long-term dependency problem, and the corresponding solutions. After that, we introduced two kinds of gated recurrent neural networks, which are the most widely used recurrent network structures at present; by adding linear dependencies between network states, they alleviate the problems of vanishing and exploding gradients to a certain extent. In section 6, we introduced some applications of recurrent neural networks, and through them showed the different network structures used for different tasks. Finally, we introduced an improvement of the traditional encoder-decoder model: the attention-based model. Those wishing to learn more about the applications of recurrent neural networks are recommended to refer to the resources organized in the GitHub project of this book.

In the next chapter, we will implement several complete projects using recurrent neural networks, and deepen our understanding of recurrent neural networks while learning to build a recurrent neural network model using TensorFlow.

The original article was published on November 26, 2018

This article is from Panchuang AI, a cloud community partner. For more information, please follow Panchuang AI.