Keywords: pre-training model, NLG


1. Background and problem description

This paper presents a pre-trained language model, UNILM, designed for both NLU and NLG. The model is trained with three types of language modeling tasks. Its gains are particularly large on NLG tasks: on the CNN/DailyMail summarization dataset, ROUGE-L reaches 40.51, roughly a two-point improvement over the previous SOTA.

2. Existing solutions

The advent of pre-trained language models has greatly raised the bar on a wide range of NLP benchmarks. ELMo is trained with two unidirectional language models, one left-to-right and one right-to-left. GPT, which uses a unidirectional Transformer trained on a large amount of high-quality data, has also achieved strong results. BERT implements a bidirectional language model by randomly masking tokens and predicting them from both the left and right context, but this bidirectional strategy is inherently ill-suited to NLG tasks, where text must be generated left to right.

3. Solution Overview

The UNILM backbone is the same as BERT: a multi-layer Transformer trained, like BERT, by predicting masked tokens. The difference is that UNILM introduces multiple LM tasks over this shared backbone in order to learn a more general model.

  1. Input Representation

The input to the model is a sequence, which can be a single segment or a pair of segments, depending on the LM task. A [SOS] token is added at the beginning of the sequence and an [EOS] token at the end of each segment. [EOS] not only marks the end of a segment but also signals the end of decoding in NLG tasks. The model input is the sum of three embeddings: the token embedding obtained by WordPiece, the position embedding, and the segment embedding. Since UNILM uses multiple LM tasks, the segment embedding also serves to distinguish which LM task is being trained.
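
A minimal sketch of this input construction is given below; the dimensions, vocabulary size, token ids, and random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the UNILM input representation. Sizes, ids, and the random
# initialization are illustrative assumptions, not values from the paper.
vocab_size, max_len, num_segments, hidden = 100, 16, 2, 8
rng = np.random.default_rng(0)

token_emb   = rng.normal(size=(vocab_size, hidden))    # WordPiece token embeddings
pos_emb     = rng.normal(size=(max_len, hidden))       # position embeddings
segment_emb = rng.normal(size=(num_segments, hidden))  # segment embeddings; also mark the LM task

# Hypothetical ids for "[SOS] s1 s2 [EOS] t1 t2 [EOS]" with source = segment 0, target = segment 1.
token_ids   = np.array([1, 11, 12, 2, 21, 22, 2])
segment_ids = np.array([0,  0,  0, 0,  1,  1, 1])
positions   = np.arange(len(token_ids))

# The model input is the element-wise sum of the three embeddings.
x = token_emb[token_ids] + pos_emb[positions] + segment_emb[segment_ids]
print(x.shape)  # (7, 8)
```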

  2. Backbone: Multi-layer Transformer

Here, a mask matrix is used to control which positions each token can attend to, which in turn determines the direction of the language model. For the left-to-right LM, for example, a token at a given position may only use its own vector and those of the preceding positions when it is predicted. This is achieved by setting all entries above the diagonal of the mask matrix M to negative infinity and the rest to 0, so that in the self-attention computation

A_l = softmax(Q K^T / sqrt(d_k) + M) V_l

the attention weights at the blocked positions become zero after the softmax, i.e., the token simply ignores the positions after it.
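
As a rough illustration, the sketch below builds such a left-to-right mask matrix M in NumPy and applies it in a single-head attention step; the sequence length, head dimension, and random Q, K, V are assumptions for demonstration only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8                                 # sequence length and head dimension (illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Left-to-right LM mask: column j > row i is blocked with -inf, everything else is 0,
# so each token attends only to itself and the positions before it.
idx = np.arange(n)
M = np.where(idx[None, :] > idx[:, None], -np.inf, 0.0)

A = softmax(Q @ K.T / np.sqrt(d) + M)       # blocked entries become exactly 0 after softmax
out = A @ V
print(np.round(A, 2))                       # row i has non-zero weights only for columns <= i
```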

  3. Pre-training objective

UNILM's pre-training objective is a cloze task: tokens are randomly replaced with [MASK], the Transformer output vector at each masked position (computed from its visible context) is fed into a softmax over the vocabulary to predict the original token, and the cross-entropy loss is minimized. The pre-training tasks used by UNILM are as follows:

  • Unidirectional LM: covers both the left-to-right and right-to-left language models, implemented with a triangular mask matrix as described above.
  • Bidirectional LM: the same as BERT, implemented with the mask matrix set to all zeros so that every token can attend to every other token.
  • Sequence-to-sequence LM: this task requires a pair of segments, one as source and one as target, and again randomly replaces tokens with [MASK] in both segments. The difference is that a [MASK] in the source is predicted from bidirectional context within the source, while a [MASK] in the target can only be predicted from the source plus the preceding target positions and itself. The mask matrix is set accordingly (see the sketch after this list): the source-to-source and target-to-source blocks are 0, the source-to-target block is negative infinity, and in the target-to-target block (S_2 attending to S_2) the entries above the diagonal are negative infinity. Because source and target are treated as one continuous sequence in this training mode, the model implicitly learns relational information between the two segments: to predict the tokens in the target well, it must learn to use the information in the source effectively. The sequence-to-sequence LM therefore pre-trains a bidirectional encoder and a unidirectional decoder at the same time, which is why this model is well suited to conditional text generation.
  • Next Sentence Prediction, same as BERT.
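
Continuing the NumPy sketch above, the sequence-to-sequence mask described in this list could be assembled as follows; the helper name and segment lengths are assumptions rather than the paper's code.

```python
import numpy as np

def seq2seq_mask(src_len, tgt_len):
    """Illustrative UNILM-style seq-to-seq mask (0 = may attend, -inf = blocked).
    Source tokens attend bidirectionally within the source only; target tokens
    attend to the whole source plus the preceding target positions and themselves."""
    n = src_len + tgt_len
    M = np.zeros((n, n))
    M[:src_len, src_len:] = -np.inf                      # source rows never attend to the target
    r = np.arange(tgt_len)
    M[src_len:, src_len:] = np.where(r[None, :] > r[:, None], -np.inf, 0.0)  # upper triangle of S2-to-S2
    return M

print(seq2seq_mask(3, 2))
```
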
  4. Pre-training Setup

The UNILM model uses BERT-large as its base and is trained with LM+NSP objectives; within a training batch, the LM objective cycles through the modes above. Concretely, 1/3 of the time the bidirectional LM is used, 1/3 of the time the sequence-to-sequence LM, 1/6 of the time the left-to-right unidirectional LM, and 1/6 of the time the right-to-left unidirectional LM. The masking strategy is almost the same as BERT's, except that 80% of the time a single token is masked at random and 20% of the time a bigram or trigram (two or three consecutive tokens) is masked.
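
A small sketch of this schedule, assuming straightforward per-batch sampling (the function names and the use of Python's random module are ours, not the paper's):

```python
import random

# Per-batch objective schedule described above:
# 1/3 bidirectional, 1/3 sequence-to-sequence, 1/6 left-to-right, 1/6 right-to-left.
OBJECTIVES = ["bidirectional", "seq2seq", "left_to_right", "right_to_left"]
WEIGHTS    = [2/6, 2/6, 1/6, 1/6]

def pick_lm_objective():
    return random.choices(OBJECTIVES, weights=WEIGHTS, k=1)[0]

def pick_mask_length():
    # 80% of the time mask a single token, 20% of the time a bigram or trigram.
    return 1 if random.random() < 0.8 else random.choice([2, 3])

print(pick_lm_objective(), pick_mask_length())
```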

  5. Fine-tuning

To fine-tune for an NLG task, construct an input sequence of the form *"[SOS] source [EOS] target [EOS]"*, randomly mask only tokens in the target, and let the model predict the tokens at the masked positions. Because the final [EOS] of the target can also be masked, the model learns to predict [EOS] and thus to terminate generation during decoding.
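
A hedged sketch of this fine-tuning input construction; the function name, the 0.7 masking probability, and the toy tokens are assumptions for illustration only.

```python
import random

SOS, EOS, MASK = "[SOS]", "[EOS]", "[MASK]"

def build_nlg_finetune_example(source_tokens, target_tokens, mask_prob=0.7):
    """Build "[SOS] source [EOS] target [EOS]" and randomly mask only target tokens
    (including the final [EOS], so the model can learn when to stop decoding)."""
    tokens = [SOS] + source_tokens + [EOS] + target_tokens + [EOS]
    tgt_start = len(source_tokens) + 2                 # index of the first target token
    inputs, labels = list(tokens), [None] * len(tokens)
    for i in range(tgt_start, len(tokens)):            # only positions in the target segment
        if random.random() < mask_prob:
            labels[i] = tokens[i]                      # the model must predict the original token
            inputs[i] = MASK
    return inputs, labels

inp, lab = build_nlg_finetune_example(["the", "cat", "sat"], ["le", "chat", "assis"])
print(inp)
print(lab)
```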

4. Result analysis

  1. Significantly improved performance in NLG tasks

Figure 1 above shows the text summarization task; Figures 2 and 3 show extractive QA, where the correct answer span must be extracted from the passage; Figure 4 shows question generation, a typical sequence-to-sequence problem in which the source of the input sequence is the passage together with the answer, and the target is the question to be generated. On these tasks the model reaches or surpasses the previous SOTA results.

5. Innovation or contribution

  1. The UNILM model uses a single multi-layer Transformer as the language model, with the multiple LM tasks sharing its parameters, which yields better training results.
  2. The shared-parameter design allows the model to learn more general text representations: because UNILM is jointly trained with different language-model objectives, it learns contextual knowledge in multiple ways, which effectively mitigates the over-fitting that can occur with a single LM objective.
  3. Because UNILM includes a sequence-to-sequence language model, it is a natural fit for NLG tasks.

6. Personal thinking

  1. This 2019 paper appeared at the height of the pre-trained language model boom. The model is trained by introducing three language-model objectives, all implemented simply through attention mask matrices. In particular, the combination of the unidirectional LM and the sequence-to-sequence LM makes the model well suited to NLG-related tasks.