Editor’s note: Automatic text summarization is an important part of natural language processing research, and producing high-quality abstractive summaries is a concern for many researchers. In this article, Zhou Qingyu, a PhD student at Harbin Institute of Technology and Microsoft Research Asia, explains how he improved abstractive sentence summarization from both the encoder side and the decoder side. Let’s learn!



(The following is a transcript of the talk shared by Zhou Qingyu.)

Our topic today is improving abstractive sentence summarization from the encoder side and the decoder side. I will cover it in three parts:

I. Background on sentence summarization and related work.

II. How to improve abstractive sentence summarization from the encoder side.

III. How to improve abstractive sentence summarization from the decoder side.

Background

Automatic text summarization takes a piece of text, extracts its key points, and produces a short summary.

Automatic text summarization can be categorized in different ways. By the type of input, the shortest input gives sentence summarization; a slightly longer input gives single-document summarization; and above that lies multi-document summarization.

Besides classification by input, automatic text summarization can also be divided into extractive summarization and abstractive summarization according to how the summary is produced. Extractive summarization, as the name implies, extracts words or sentences from the original text to form the summary. Abstractive summarization is closer to how we summarize ourselves: first understand the content of the text, then write a passage that summarizes it. There are also other settings, such as query-focused summarization: given a query, the system produces a summary of the document for that query, so different queries can yield different summaries.

Next, I will briefly introduce extractive and abstractive summarization in turn.

Extractive sentence summarization is essentially a sequence labeling task: each word in the sentence is classified as 0 or 1, where a label of 1 means the word is kept. The selected key words are then strung together as the summary of the sentence (see the toy example below). Extractive sentence summarization used to be called sentence compression, and it was mainly used in service of document summarization.
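A tiny, made-up illustration of this labeling view (the sentence and labels below are invented for illustration, not taken from the talk or any dataset):

```python
# Extractive summarization as sequence labeling: each token gets a 0/1 tag,
# and the tokens tagged 1 are kept as the compressed sentence.
sentence = ["the", "company", "announced", "record", "quarterly",
            "profits", "on", "tuesday", "morning"]
labels   = [0, 1, 1, 1, 0, 1, 0, 0, 0]  # hypothetical model output

summary = " ".join(tok for tok, keep in zip(sentence, labels) if keep)
print(summary)  # -> "company announced record profits"
```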

How is it used? When extracting a document summary, we might select a long sentence and then use sentence compression to shorten it. What is the benefit of shortening sentences? In past DUC tasks, the output summary had a length limit; if a particularly long sentence was selected and the limit was reached, no more sentences could be chosen. Sentence compression leaves more room to select additional sentences.

Abstractive summarization is mostly based on sequence-to-sequence models. In a sequence-to-sequence model with attention, the encoder is a bidirectional GRU or bidirectional LSTM that encodes the input sentence, and the decoder is also a GRU or LSTM. The attention mechanism is essentially a matching procedure: it matches the current decoder state against the hidden states of the input sentence.
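As a rough sketch of that matching step, here is a minimal attention function in PyTorch. It uses a simple dot-product scorer for brevity (the models discussed in this talk use an MLP-style scorer), and all shapes and sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_states):
    """Match the current decoder state against every encoder hidden state.

    decoder_state:  (hidden,)         current GRU/LSTM decoder state s_t
    encoder_states: (src_len, hidden) bidirectional encoder outputs h_1..h_n
    """
    scores = encoder_states @ decoder_state    # one matching score per source word
    weights = F.softmax(scores, dim=0)         # soft alignment over the input
    context = weights @ encoder_states         # weighted sum of encoder states
    return weights, context

# Toy usage with random vectors: 6 source words, hidden size 128.
h = torch.randn(6, 128)
s = torch.randn(128)
alignment, context = attention(s, h)
```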

With the development of deep learning, abstractive summarization has received more attention in recent years, and it is also the focus of today's talk.

How to improve abstractive sentence summarization from the encoder side

First, let me introduce how to improve abstractive sentence summarization from the encoder side. This is our work published at ACL 2017. All the work I present today was completed during my internship at Microsoft Research Asia.

As mentioned above, abstractive sentence summarization uses a sequence-to-sequence model. Many previous works adopt the attention mechanism, which in effect provides word alignment information: there is an alignment between words on the output side and words on the input side, for example, the output word "we" is aligned to the input word "we". However, we believe that in summarization, apart from the words copied directly from the source, the remaining words do not have such a word-level alignment.

In addition, none of the previous models explicitly modeled the summarization task itself. Why do we summarize at all? The purpose of a summary is to extract the more important parts of the input. Based on this, we propose a model that determines which words in the input are important.


The model we propose is called the selective encoding model. It models every word in the input sentence: based on the already encoded sentence, we use the sentence-level information to judge whether each word is important, and thereby build a new intermediate representation of the input sentence. We call this selection process selective encoding, and the model the selective encoding model.

First, let's look at the framework of the model, which is also a sequence-to-sequence model. The difference is that we add a selective gate network that decides which parts of the input sentence are important. For example, the model diagram shows six words. After reading the sentence, the network senses which words are important and constructs a new representation for each of the six words. The decoder then works not from the original representations but from this newly constructed layer of representations.

The bottom layer of the encoder in this model uses a GRU as the RNN, as in previous sequence-to-sequence models. The decoder is similar to the decoder in machine translation, except that the vector representations it decodes from are not given directly by the encoder but by the selective encoding layer. We read a sentence, understand it, and construct a representation of its meaning. Here we take the simplest approach: we concatenate the last hidden state of the forward GRU and the last hidden state of the backward GRU (that is, the forward state at position n and the backward state at position 1) as the representation of the meaning of the whole sentence. With this sentence meaning representation, we can gate each input word against it and thereby determine which words in the sentence are more important (a minimal sketch follows).
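A minimal sketch of such a selective gate in PyTorch, assuming the common formulation gate_i = sigmoid(W h_i + U s + b) with the gate applied elementwise to h_i; the exact parameterization in the ACL 2017 paper may differ, and the layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    """Builds the second-layer word representations the decoder attends to."""

    def __init__(self, hidden_size):
        super().__init__()
        # Bidirectional GRU outputs have dimension 2 * hidden_size.
        self.w_h = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)
        self.w_s = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=True)

    def forward(self, enc_states, last_forward, last_backward):
        # enc_states:    (src_len, 2 * hidden) first-layer states h_1..h_n
        # last_forward:  (hidden,) last state of the forward GRU
        # last_backward: (hidden,) last state of the backward GRU
        sentence_repr = torch.cat([last_forward, last_backward], dim=-1)
        gate = torch.sigmoid(self.w_h(enc_states) + self.w_s(sentence_repr))
        # Important words keep most of their representation; unimportant
        # words are largely filtered out.
        return enc_states * gate
```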

How to improve abstractive sentence summarization from the decoder side

So how can we improve from the decoder side? This is our paper published at AAAI this year, called Sequential Copying Networks, which improves on the existing CopyNet. If you are interested, you can download the poster here:

https://res.qyzhou.me/AAAI2018_poster.pdf

When writing a sentence summary, a person first reads the sentence, picks out the key words, and then writes the summary down. How do people actually write it? They often copy the important parts directly. We give an example from the Gigaword dataset, and you can see that many parts are copied, such as "security mechanism" and "overseas embassy".

We hypothesized that a lot of copying happens when generating sentence summaries. To check this, I computed statistics on the Gigaword training set for the percentage of generated words versus copied words: 42.5% of the words are generated, and the rest are copied. We also counted copied spans of one word, two words, and three or more words; spans longer than two words account for about a third. This confirms our view that sequential copying is common in such tasks (a rough sketch of this kind of count is shown below).
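A rough sketch of how such counts could be computed with a greedy longest-match heuristic; this is an illustrative script, not the authors' actual analysis code, and a different matching rule would give slightly different numbers:

```python
def copy_span_stats(source_tokens, summary_tokens):
    """Count generated tokens and the lengths of copied spans in one pair.

    A summary token is treated as 'copied' if it starts a span that also
    appears verbatim in the source; the longest such span is taken greedily.
    """
    generated, span_lengths = 0, []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(source_tokens)
                   and summary_tokens[i + k] == source_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best == 0:
            generated += 1             # not found in the source: a generated word
            i += 1
        else:
            span_lengths.append(best)  # a copied span of length `best`
            i += best
    return generated, span_lengths
```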

Based on this observation, we propose the sequential copying model. Its job is relatively simple: it adds sequential, multi-word copying to the sequence-to-sequence model.


This is the overall view of our model. It also consists of an encoder and a decoder, and the decoder performs a copy operation. In older copy models, every step required a separate decision to copy or not to copy, so copying three words meant making three decisions. The model might get one of the middle decisions wrong, choosing to generate instead of copy, or copying the wrong word; in that case the three consecutive words would not all be copied. In our model, once the decision to copy is made, the whole span is copied at once: if there are three words, they are copied in their entirety, so three separate decisions are not needed.

In more detail, our model selects a segment of the input, using three modules to predict this segment:


First, a gate that controls whether to copy. It produces a probability between 0 and 1 that decides whether to copy or to generate.

Second, a pointer network that selects the start and end positions of the span to copy.

Third, a copy state, whose main function is to help the pointer network select the span.

First we define the decoder state m_t. This state contains a lot of information, such as the word from the previous step and the context at the current step. Using this state, we predict the probability of copying: the gate g is an MLP that maps m_t to a real number, followed by a sigmoid that gives the copy probability at the current step. The probability of generating is naturally 1 minus the copy probability (a minimal sketch follows).
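A minimal sketch of this copy/generate switch, with made-up layer sizes; the only assumption is an MLP followed by a sigmoid over the decoder state m_t, as described above:

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Maps the decoder state m_t to the probability of copying at this step."""

    def __init__(self, state_size, hidden_size=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),   # a single real number
        )

    def forward(self, m_t):
        p_copy = torch.sigmoid(self.mlp(m_t)).squeeze(-1)
        p_generate = 1.0 - p_copy        # generate probability is the complement
        return p_copy, p_generate
```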

With this probability, we can make predictions. If the model is currently in copy mode, we need to predict which span to copy. The first step is to select the starting position of the span: we build a start-position query vector q_s by passing the decoder state through an MLP.

We then use q_s directly as the query of a pointer network: the model performs attention over the input with q_s, and the position with the highest probability is taken as the start of the copy. In the example, the highest-scoring position is 12, so position 12 is selected as the starting position of the copy (a sketch follows).
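A sketch of this start-position pointer, assuming a Bahdanau-style additive scorer over the encoder states; layer names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StartPointer(nn.Module):
    """Picks the starting position of the span to copy."""

    def __init__(self, state_size, enc_size, attn_size=256):
        super().__init__()
        self.to_query = nn.Linear(state_size, attn_size)      # q_s = MLP(m_t)
        self.w_enc = nn.Linear(enc_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, m_t, enc_states):
        q_s = self.to_query(m_t)                               # start query vector
        scores = self.v(torch.tanh(self.w_enc(enc_states) + q_s)).squeeze(-1)
        probs = F.softmax(scores, dim=0)                       # attention over positions
        start = torch.argmax(probs).item()                     # e.g. position 12
        c_s = probs @ enc_states                               # context vector, reused below
        return start, c_s, q_s
```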


The next step is to predict where the span ends. We build a query vector for the end position, called q_e. It is constructed using the copy state mentioned earlier, which essentially transforms q_s into q_e. Why transform q_s into q_e? Because the end position lies after the starting position, predicting it must take into account information about the starting position as well as the current decoder state; only then can the model choose the right end position given the start, and thus copy the correct span.

We use a GRU to implement the copy state. The context vector c_s obtained while predicting the starting position is fed into the GRU as input, and the GRU's initial state is produced by an MLP whose input is the decoder state m_t. Combining the two, we obtain the end-position query vector q_e. Once q_e is obtained, the end position is predicted with the same pointer mechanism used for the start position (sketched below).
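A sketch of this copy state and the end-position query, with illustrative sizes; the resulting q_e would be scored against the encoder states with the same pointer mechanism as q_s above:

```python
import torch
import torch.nn as nn

class CopyState(nn.Module):
    """Turns the start-position information into the end-position query q_e."""

    def __init__(self, state_size, enc_size, attn_size=256):
        super().__init__()
        self.init_mlp = nn.Linear(state_size, attn_size)   # initial state from m_t
        self.gru = nn.GRUCell(enc_size, attn_size)         # input is the context c_s

    def forward(self, m_t, c_s):
        h0 = torch.tanh(self.init_mlp(m_t))                # initialize the copy state
        q_e = self.gru(c_s.unsqueeze(0), h0.unsqueeze(0)).squeeze(0)
        return q_e   # feed q_e to the start-position pointer scorer to pick the end
```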

To summarize, we introduce a sequential copying mechanism on top of the sequence-to-sequence model. Compared with the previous CopyNet, our proposed SeqCopyNet can copy multiple words in a single step rather than over multiple steps, which reduces the errors introduced by making multiple decisions during copying. We also found an interesting phenomenon: SeqCopyNet performs well at detecting span boundaries.

Do you have different ideas or questions about abstractive sentence summarization? Feel free to share them in the comments section!
