Personal WeChat official account: Alitou Feeding House

Fine-tune BERT for Extractive Summarization: BERT and Text Summarization

Paper link: Arxiv.org/pdf/1903.10…

Source code: Github.com/nlpyang/Ber…

BERT paper introduction series

What does BERT Learn

TinyBERT: an ultra-detailed walkthrough of model distillation, enough to cover your distillation questions

Quantization techniques and ALBERT dynamic quantization

DistilBERT: BERT too expensive? I’m cheap and easy to use

[Paper share] RoBERTa: hello XLNet, come in and take a beating

XLNet paper introduction: the next wave that goes beyond BERT

Takeaway

Text summarization is mainly divided into extractive summarization and abstractive summarization. Extractive summarization has a longer history of development, so it is widely used in industry, and TextRank is its most commonly used algorithm. This article introduces a BERT-based extractive summarization model. As a comparison, it also introduces another paper that combines TextRank with BERT, hoping to give you some inspiration.

BERT for Summarization

First, the structure of the model. The original BERT outputs token-level representations rather than sentence-level ones, and its input covers at most two sentences (a sentence pair), so it is not directly suitable for text summarization.

Therefore, the author first makes some changes to BERT’s structure to make it more suitable for the summarization task. The changes are reflected in the following figure:

  1. The author uses [CLS] and [SEP] to delimit every sentence. In the original BERT, a single [CLS] represents the content of a whole sentence or sentence pair; here the author modifies the structure and inserts a [CLS] before each sentence, so that each [CLS] vector represents one sentence.
  2. The author adds interval segment embeddings to each sentence, determined by whether the sentence position is odd or even. For example, for sentences [sen1, sen2, sen3, sen4, sen5], the segment embeddings are [EA, EB, EA, EB, EA] (see the input-construction sketch after this list).
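
To make the input format concrete, here is a minimal sketch of how such an input could be built. It assumes the HuggingFace `transformers` BertTokenizer and is only illustrative, not the authors’ preprocessing code; the function and variable names are my own.

```python
# Minimal sketch of BERTSUM-style input construction (not the authors'
# preprocessing code), assuming the HuggingFace `transformers` tokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_inputs(sentences):
    """Wrap every sentence in [CLS] ... [SEP] and alternate segment ids."""
    token_ids, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        ids = tokenizer.encode(sent, add_special_tokens=False)
        cls_positions.append(len(token_ids))          # index of this sentence's [CLS]
        token_ids += [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
        segment_ids += [i % 2] * (len(ids) + 2)       # E_A for even, E_B for odd sentences
    return token_ids, segment_ids, cls_positions

tokens, segments, cls_pos = build_inputs(
    ["The cat sat on the mat.", "It was warm.", "Then it slept."]
)
```

The vector at each [CLS] position in BERT’s output is then taken as that sentence’s representation.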

Summarization layer

Once you have the sentence vectors, the next step is to decide whether each sentence should be part of the summary. This is a binary classification problem. The author tries three summarization layers (a PyTorch sketch of all three follows this list):

  1. Traditional fully connected layer

  2. Inter-sentence Transformer

    The structure is shown in the figure below. Position embeddings are added to the sentence vectors to form the input of the first layer; each subsequent layer takes the previous layer’s output and passes it through a multi-head attention layer, layer normalization, and a fully connected (feed-forward) layer. The final output is still a binary prediction for each sentence.

  3. RNN layer

    Here an LSTM layer is attached on top of BERT’s sentence vectors. LSTM is a structure well suited to sequence tasks in NLP, and the final output is again a binary prediction for each sentence.
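
Below is a minimal PyTorch sketch of the three summarization layers described above. It is not the authors’ implementation; the class names, hidden sizes, layer counts, and head counts are illustrative assumptions.

```python
# Minimal PyTorch sketch of the three summarization layers (illustrative only).
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    """Option 1: a fully connected layer + sigmoid on each sentence vector."""
    def __init__(self, hidden=768):
        super().__init__()
        self.linear = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                    # (batch, n_sents, hidden)
        return torch.sigmoid(self.linear(sent_vecs)).squeeze(-1)

class InterSentenceTransformer(nn.Module):
    """Option 2: stack Transformer layers over the sentence vectors."""
    def __init__(self, hidden=768, n_layers=2, n_heads=8, max_sents=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sents, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.linear = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                    # (batch, n_sents, hidden)
        pos = torch.arange(sent_vecs.size(1), device=sent_vecs.device)
        x = self.encoder(sent_vecs + self.pos_emb(pos))
        return torch.sigmoid(self.linear(x)).squeeze(-1)

class RNNSummarizer(nn.Module):
    """Option 3: an LSTM over the sentence vectors."""
    def __init__(self, hidden=768, rnn_hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(hidden, rnn_hidden, batch_first=True,
                            bidirectional=True)
        self.linear = nn.Linear(2 * rnn_hidden, 1)

    def forward(self, sent_vecs):                    # (batch, n_sents, hidden)
        out, _ = self.lstm(sent_vecs)
        return torch.sigmoid(self.linear(out)).squeeze(-1)
```

In each case the output is one score per sentence, trained with a binary cross-entropy loss against the extractive labels.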

Experimental results

The author ran experiments on two public datasets, CNN/DailyMail and NYT. The results are shown in the figure below:

  • Lead takes the first three sentences of the document as the summary
  • REFRESH is an extractive summarization system optimized against the ROUGE metric
  • NEUSUM is the state-of-the-art extractive summarization model
  • PGN is the Pointer-Generator Network, an abstractive summarization model
  • DCA is the current state-of-the-art abstractive summarization model

Conclusion: extractive summarization beats abstractive summarization here (even PGN scores below the rule-based Lead baseline, which I find doubtful), and BERT + inter-sentence Transformer surpasses the current extractive SOTA.

Disadvantages:

  1. The RNN layer has only one layer, so the comparison with the multi-layer Transformer is not entirely fair
  2. The results show the abstractive models doing worse than the rule-based baseline, which remains doubtful
  3. The paper does not explain how long documents are handled

Recommended reading

Sentence Centrality Revisited for Unsupervised Summarization (PACSUM). This paper combines BERT with a TextRank-style algorithm: a fine-tuned BERT is used as the sentence encoder to compute similarities between sentences, and the final results also exceed the previous SOTA (a rough sketch follows the links below).

  • Paper link: arxiv.org/pdf/1906.03…
  • Source code: github.com/mswellhao/P…
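
As a rough illustration of the TextRank-style idea (centrality scoring over BERT sentence embeddings), here is a minimal sketch. It is not PACSUM’s actual directed-centrality method; `sentence_vecs` is assumed to come from some BERT sentence encoder, and the function names are my own.

```python
# Rough sketch of degree-centrality scoring with BERT sentence embeddings
# (not PACSUM's actual method); `sentence_vecs` is (n_sents, hidden).
import torch

def centrality_scores(sentence_vecs):
    """Score each sentence by its summed cosine similarity to the others."""
    normed = torch.nn.functional.normalize(sentence_vecs, dim=-1)
    sim = normed @ normed.T                       # pairwise cosine similarity
    sim.fill_diagonal_(0.0)                       # ignore self-similarity
    return sim.sum(dim=-1)                        # degree centrality per sentence

def top_k_summary(sentences, sentence_vecs, k=3):
    """Pick the k most central sentences, in document order, as the summary."""
    scores = centrality_scores(sentence_vecs)
    idx = torch.topk(scores, k=min(k, len(sentences))).indices.sort().values
    return [sentences[int(i)] for i in idx]
```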

Here are some questions and thoughts:

  1. How does the use of BERT differ between the two papers?
  2. How does PACSUM fine-tune BERT?