Pointer Network is a Seq2Seq-style model, but its Decoder predicts tokens taken directly from the Encoder's input sequence. Because Pointer Network draws its outputs from the input sequence, it is well suited to generating text summaries and can better avoid OOV (Out of Vocabulary) problems. This article mainly introduces two text summarization algorithms that use Pointer Network: Pointer-Generator Networks and Multi-Source Pointer Network.

1. Introduction

I introduced Pointer Network in a previous article. Pointer Network is a Seq2Seq-style model, but its Decoder obtains prediction results from the Encoder's input sequence. Pointer Network changes how Attention is used in the traditional Seq2Seq model: the input token with the highest Attention score is taken as the current output. The figure below shows the difference between traditional Seq2Seq Attention and Pointer Network. Simply put, traditional Seq2Seq computes a probability distribution over the whole dictionary, while Pointer Network computes a probability distribution over the words in the input sequence.
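As a minimal illustration (the dimensions and random weights below are placeholders, not code from either paper), the sketch contrasts where the two output distributions live:

```python
# Minimal sketch: where the output distribution lives in a standard
# Seq2Seq decoder step vs. a Pointer Network decoder step.
# Dimensions and weights are random placeholders, not trained values.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

vocab_size, src_len, hidden = 10000, 8, 128
decoder_state = np.random.randn(hidden)
encoder_outputs = np.random.randn(src_len, hidden)

# Traditional Seq2Seq: project the decoder state onto the whole dictionary.
W_vocab = np.random.randn(vocab_size, hidden)
p_vocab = softmax(W_vocab @ decoder_state)      # distribution over 10000 vocab tokens

# Pointer Network: the (dot-product) attention scores over encoder positions
# are themselves normalized into the output distribution.
p_pointer = softmax(encoder_outputs @ decoder_state)  # distribution over 8 input tokens

print(p_vocab.shape, p_pointer.shape)  # (10000,) (8,)
```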

Summarization is an important task in NLP. Common summarization methods can be divided into extractive and abstractive (generative) summarization. Extractive summarization mainly pulls ready-made sentences out of the source document to serve as summary sentences; it is generally better than abstractive summarization in terms of sentence fluency, but tends to introduce more redundant information. Abstractive summarization generates the summary with the model based on the content of the source document, rather than extracting sentences from the original text.

Pointer Network is suitable for text summarization because it can copy tokens from the input sequence as outputs. In addition, Pointer Network can alleviate the OOV problem to some extent. For example, suppose the word “husky” does not appear in the training set but does appear in the input at prediction time. Seq2Seq would replace “husky” with “UNK” when generating the summary, but Pointer Network can copy “husky” directly from the input sequence as output.

This article mainly introduces two text summarization algorithms that use Pointer Network: Pointer-Generator Networks and Multi-Source Pointer Network. Pointer-Generator Networks is a summarization algorithm that combines extraction and generation, while Multi-Source Pointer Network is mainly extractive.

2. Pointer-Generator Networks

Pointer-Generator Networks comes from the paper Get To The Point: Summarization with Pointer-Generator Networks. Its main contributions include the following two points:

  • The traditional Seq2Seq model computes a probability distribution over all the tokens in the dictionary at each output step. On top of this, Pointer-Generator Networks mix in the probability distribution over input-sequence tokens computed by the Pointer network. This alleviates the OOV problem and gives the model both the ability to generate new words and the ability to copy from the source sequence.
  • To address the problem that Seq2Seq tends to repeat the same words when generating summaries, Pointer-Generator Networks add a coverage mechanism, which records what has already been generated and helps avoid repeated generation.

2.1 Model

Pointer-Generator Networks add a Pointer Network on top of Seq2Seq. First, take a look at the structure of the Seq2Seq model, shown below.

In Seq2Seq, the Decoder state at each time step and the Encoder outputs are used to calculate Attention scores, and the Encoder outputs are weighted by these scores to obtain the context vector. The context vector and the Decoder state are then passed through a Softmax layer to obtain the probability distribution over the tokens in the dictionary. The calculation formulas are as follows:
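Following the notation of the Get To The Point paper (See et al., 2017), with Encoder hidden states h_i and Decoder state s_t:

e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn})
a^t = \mathrm{softmax}(e^t)
h_t^* = \sum_i a_i^t h_i
P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')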

Pointer-Generator Networks add the Pointer Network mechanism to Seq2Seq. The model is shown in the following figure.

It can be seen that the vocabulary distribution P_vocab predicted by the Seq2Seq part of Pointer-Generator Networks (the green part in the figure) is combined with the Attention distribution over the input (the blue part in the figure) to obtain the final distribution. P_vocab does not contain the word “2-0”, but the Attention distribution does, so the final distribution includes “2-0”, which alleviates the OOV problem. Pointer-Generator Networks weight the two distributions with a learned parameter p_gen.
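In the paper's notation, p_gen is computed from the context vector h_t^*, the Decoder state s_t, and the Decoder input x_t, and gates the two distributions:

p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr})
P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t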

2.2 Coverage Mechanism

To prevent the model from generating repeated content and from omitting content, Pointer-Generator Networks add a coverage mechanism. A coverage vector c^t records the sum of all Attention distributions before time t, so the value of each token in the coverage vector indicates how much that token has already been attended to: a higher value means the token is more likely to have already been used.
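In the paper's notation, the coverage vector is the sum of the Attention distributions over all previous decoding steps:

c^t = \sum_{t'=0}^{t-1} a^{t'}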

The coverage vector is then added as an extra input when calculating Attention.
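Concretely, the Attention score becomes:

e_i^t = v^\top \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})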

Finally, a coverage loss is added to the overall loss. For the i-th word, if both its attention value and its coverage value are large, it indicates that word i has most likely already been generated, and the coverage loss becomes large in this case.
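In the paper's notation, the coverage loss and the overall training loss at step t are:

covloss_t = \sum_i \min(a_i^t, c_i^t)
loss_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)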

2.3 Experimental Results

The figure above compares the three models: blue marks the reference summary, red marks erroneous output, and green marks repeated content. It can be seen that the summaries from the plain Seq2Seq model contain many errors and UNK words. Pointer-Generator Networks produce fewer errors and unknown words but still generate repeated summaries, while Pointer-Generator Networks with the coverage mechanism better avoid repetition.

3. Multi-Source Pointer Network

Multi-Source Pointer Network (hereinafter referred to as MS-Pointer) was proposed by the Alibaba team for generating product titles.

MS-Pointer requires two data sources: one is the original title of the product, and the other is some additional information. The additional information is called “knowledge” and mainly contains the product name and brand name, similar to product labels.

Since a product title should not introduce irrelevant information and should keep the important original information, MS-Pointer adopts an extractive approach: all tokens in the summary come from the title or from the “knowledge”.

MS-Pointer is similar to Pointer-Generator Networks. The difference is that MS-Pointer uses a title Encoder to obtain the Attention distribution over title tokens and a “knowledge” Encoder to obtain the Attention distribution over “knowledge” tokens, then merges the two Attention distributions. The model diagram is as follows:

As shown in the figure above, MS-Pointer fuses the two Attention distributions. The fusion weight is calculated with a learned gate.
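As an illustrative sketch only (the gate parameterization and variable names below are assumptions, not the paper's exact formula), the two Attention distributions can be fused with a scalar gate computed from both context vectors and the Decoder state, analogous to p_gen in Pointer-Generator Networks:

```python
# Illustrative sketch: fusing two attention distributions over different
# sources with a learned scalar gate, in the spirit of MS-Pointer.
# The exact gate parameterization is defined in the paper; the weights and
# random inputs below are placeholders.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden = 128
title_ctx = np.random.randn(hidden)       # context vector from the title encoder
knowledge_ctx = np.random.randn(hidden)   # context vector from the "knowledge" encoder
decoder_state = np.random.randn(hidden)

# Attention distributions over the two input sources (random placeholders).
p_title = softmax(np.random.randn(12))       # 12 title tokens
p_knowledge = softmax(np.random.randn(5))    # 5 "knowledge" tokens

# Scalar gate computed from both context vectors and the decoder state.
w = np.random.randn(3 * hidden)
gate = sigmoid(w @ np.concatenate([title_ctx, knowledge_ctx, decoder_state]))

# Final copy distribution: the gate decides how much mass goes to each source.
p_final = np.concatenate([gate * p_title, (1.0 - gate) * p_knowledge])
assert np.isclose(p_final.sum(), 1.0)
```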

4. References

  • Get To The Point: Summarization with Pointer-Generator Networks
  • Multi-Source Pointer Network for Product Title Summarization