Key words: NLG, bottom-up, text-summarization


1. Background and Problem description

A text summary is a short text, generated from a long document, that preserves its key content. Neural abstractive (generative) summarization methods produce very fluent output, but they are poor at content selection. This paper introduces a simple content selector that first determines which parts of the document are relevant, and then generates the summary only from those parts. Experiments show that this approach improves the quality of sentence compression and produces fluent summaries. Moreover, the two-step approach is simpler and more efficient than an end-to-end model. In addition, the content selector needs very little training data to perform well, so it is easy to transfer to other settings.

2. Existing solutions

At present, the best-performing abstractive summarization models are end-to-end models built on Pointer-Generator networks.

3. Solution Overview

The bottom-up approach proposed by the authors splits the usual end-to-end model into two parts. The first step selects the potentially relevant parts of the long document; the second step runs a standard summarization model only over the selected parts. The idea is borrowed from computer vision: in object detection, a bounding box is first proposed on the image, and attention is then restricted to that box.

Content selection is implemented as a sequence-tagging problem, and the authors achieve a selector with more than 60% recall and 50% precision using only ELMo word vectors.

The output of this first step is fed into the downstream summarization model, where a simple mask restricts which words may be copied from the source text.

1. Bottom-Up Attention

First, the summarization task is defined formally: for a text pair $(x, y)$, $x \in \mathcal{X}$ denotes the source sequence $x_1, \ldots, x_n$ and $y \in \mathcal{Y}$ denotes the generated summary sequence $y_1, \ldots, y_m$, where $m < n$.

The authors treat content selection as a sequence-labeling problem, so the first step is to construct labeled data. Since text summarization datasets are usually document-summary pairs, the authors derive the supervision by aligning each summary with its document. In detail, a token $x_i$ in the document is selected when (a minimal sketch of the labeling heuristic follows the list):

  1. It is part of the longest possible subsequence $s = x_{i-j:i+k}$ (a span containing position $i$) that appears in both the source and the summary, i.e. $s \in x$ and $s \in y$.
  2. There is no earlier occurrence of the same subsequence $s$ in the document.
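A minimal sketch of this labeling heuristic, written as a greedy left-to-right approximation (my own illustrative code, not the authors' released implementation):

```python
def contains(seq, sub):
    """True if `sub` occurs as a contiguous run of tokens inside `seq`."""
    m = len(sub)
    return m > 0 and any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))


def build_selection_labels(src_tokens, tgt_tokens):
    """Mark a source token as selected (1) when it lies inside the longest
    span starting at that point that also appears in the summary, provided
    the same span has not already occurred earlier in the source."""
    n = len(src_tokens)
    labels = [0] * n
    i = 0
    while i < n:
        # Condition 1: grow the span as long as it stays inside the summary.
        k = 0
        while i + k < n and contains(tgt_tokens, src_tokens[i:i + k + 1]):
            k += 1
        if k > 0:
            span = src_tokens[i:i + k]
            # Condition 2: skip spans that occurred earlier in the source.
            if not contains(src_tokens[:i + k - 1], span):
                labels[i:i + k] = [1] * k
            i += k
        else:
            i += 1
    return labels


src = "the cat sat on the mat near the cat".split()
tgt = "the cat sat on the mat".split()
print(build_selection_labels(src, tgt))
# -> [1, 1, 1, 1, 1, 1, 0, 0, 0]  (the second "the cat" is not labeled,
#    because the same span already occurred earlier in the source)
```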

After constructing the training data, a standard sequence-labeling model is trained: the authors feed ELMo embeddings into a two-layer LSTM and compute, for each position, the probability that the token is selected.
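A minimal PyTorch sketch of such a tagger, assuming precomputed contextual embeddings (e.g., ELMo vectors) as input; the layer sizes here are illustrative rather than the authors' exact hyperparameters:

```python
import torch
import torch.nn as nn

class ContentSelector(nn.Module):
    """Bi-LSTM sequence tagger: maps per-token embeddings to a
    selection probability q_i for every source position."""
    def __init__(self, emb_dim=1024, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, src_len, emb_dim), e.g. ELMo vectors
        hidden, _ = self.lstm(token_embeddings)
        logits = self.proj(hidden).squeeze(-1)   # (batch, src_len)
        return torch.sigmoid(logits)             # q_{1:n}

# Training uses per-token binary cross-entropy against the constructed labels.
selector = ContentSelector()
loss_fn = nn.BCELoss()
emb = torch.randn(2, 30, 1024)                   # dummy contextual embeddings
labels = torch.randint(0, 2, (2, 30)).float()    # labels from the heuristic above
loss = loss_fn(selector(emb), labels)
loss.backward()
```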

2. Bottom-Up Copy Attention

The authors find that the encoder works better when it is trained on the unmodified source text. Therefore, during training, the Pointer-Generator model and the content selector are trained separately. At inference time, the selection probabilities $q_{1:n}$ of all source tokens are computed first and then used to constrain the copy distribution of the copy model, so that unselected tokens cannot be copied. Let $a^i_j$ denote the copy probability of the $i$-th source word at decoding step $j$; the adjusted probability keeps $a^i_j$ only when $q_i > \epsilon$ and sets it to zero otherwise, where $\epsilon$ is a threshold value between 0.1 and 0.2.

Note that the adjusted distribution no longer sums to one, so it must be renormalized.
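A small sketch of this inference-time masking, assuming the names and shapes below (they are mine, not the paper's):

```python
import torch

def bottom_up_copy_attention(copy_attn, select_prob, eps=0.15):
    """copy_attn:   (batch, src_len) copy distribution a_j at one decoding step
       select_prob: (batch, src_len) selector probabilities q_{1:n}
       Zero out copy attention on tokens the selector scored below eps,
       then renormalize so the distribution sums to one again."""
    mask = (select_prob > eps).float()
    masked = copy_attn * mask
    return masked / masked.sum(dim=-1, keepdim=True).clamp(min=1e-12)
```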

3. The End-to-End Solution

The “two-step” approach is simple and effective, but the authors also try to train everything inside a single model, assuming the standard copy model can be trained jointly with content selection. They compare three variants:

  1. Mask only.
  2. Multi-task. A shared encoder is trained on both the sequence-labeling task and the summarization task, but prediction still follows the “two-step” strategy.
  3. Cross training. During training, the copy probability $a^i_j$ is directly multiplied by the selection probability $q_i$ (see the sketch after this list).
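As a rough illustration of the cross-training variant (my own formulation, not the authors' code), the selector probabilities can reweight the copy attention inside the training graph so that gradients flow through both components:

```python
import torch

def soft_masked_copy_attention(copy_attn, select_prob):
    """Cross training: multiply the copy attention a_j^i by the selection
    probability q_i and renormalize, keeping the operation differentiable
    so gradients reach both the selector and the summarizer."""
    weighted = copy_attn * select_prob
    return weighted / weighted.sum(dim=-1, keepdim=True).clamp(min=1e-12)
```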

4. The Inference Stage

Generating from long documents currently has two main problems: 1. the generated summaries tend to be too short; 2. words and phrases are repeated. The authors introduce two penalties into the scoring function: a length penalty $lp$ and a coverage penalty $cp$: $s(x, y) = \log p(y \mid x) / lp(x) + cp(x; y)$.

The length penalty $lp$ encourages longer sequences and must be taken into account during the beam-search phase.
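The note does not reproduce the exact form of $lp$; a standard choice, introduced in Wu et al. (2016) and commonly used for this kind of rescoring, is $lp(y) = (5 + |y|)^{\alpha} / (5 + 1)^{\alpha}$, where $\alpha$ controls how strongly longer outputs are favored.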

The coverage penalty $cp$ is used to prevent excessive repetition; the authors introduce a new formulation for it.
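The note omits the authors' exact formula. As a purely illustrative sketch (not necessarily the paper's definition), a coverage-style penalty can punish source positions whose accumulated attention exceeds one, combined with the length-penalized score above:

```python
import torch

def coverage_penalty(attn_history, beta=5.0):
    """attn_history: (tgt_len, src_len) attention weights accumulated over
    the generated summary. Penalize source positions that received more
    than one unit of total attention, i.e. were attended to repeatedly."""
    total = attn_history.sum(dim=0)                      # (src_len,)
    return beta * torch.clamp(total - 1.0, min=0.0).sum()

def rescored(log_prob, attn_history, length, alpha=0.9, beta=5.0):
    """s(x, y) = log p(y|x) / lp + cp, with cp entering as a subtracted penalty."""
    lp = ((5.0 + length) ** alpha) / (6.0 ** alpha)      # Wu et al. (2016) form
    return log_prob / lp - coverage_penalty(attn_history, beta)
```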

4. Result analysis

The experiments on CNN-DM show that:

  1. None of the end-to-end variants designed by the authors work well.
  2. The “two-step” bottom-up method proposed in this paper significantly improves the results.
  3. The model trained with cross-entropy is better than the ones trained with reinforcement learning.

The content selector experiments show that only a small amount of data, on the order of 1,000 sentences, is enough to achieve good results.

Another experiment starts from the original Pointer-Generator model and tests the effect of the three penalty strategies at the inference stage. All three penalties prove very effective, and all three metrics improve at the same time. This also shows that the original Pointer-Generator model can already handle text summarization well, and that adding some extra processing in the prediction phase makes it even better.

5. Innovation or contribution

  1. This paper presents a simple but effective content selection model to handle the text summarization problem.
  2. The authors found that the two-step bottom-up approach was more effective.
  3. The method proposed by the authors is very data-efficient and can easily be transferred to other datasets.
  4. The authors introduce two penalty strategies that prove very effective.

6. Personal thinking

The base model of this paper is the Pointer-Generator Network, and the improvements are very practical. There are three main points worth noting:

  1. Use a bottom-up mindset and a “two-step” strategy.
  2. Adding a length penalty in the beam-search phase produces longer and richer output sequences.
  3. The new coverage penalty handles repetition.