
It all starts with the Transformer. The Transformer consists of an Encoder (the left half) and a Decoder (the right half). We call the Encoder's input sentence the source and the Decoder's input sentence the target.

The Encoder performs self-attention over the source and produces a representation of each word in the sentence. The most classic Encoder-only architecture is BERT, which learns relationships between words through bidirectional (masked) language modeling; XLNet, RoBERTa, ALBERT, DistilBERT and others follow the same idea. However, an Encoder-only structure is not well suited to generation tasks.

The Decoder generates text autoregressively: to prevent the model from seeing future words, Decoder-only models such as GPT and CTRL are used for sequence generation. But a Decoder-only structure predicts each word from the left context alone and cannot learn bidirectional interactions.

Put together, the two form a Seq2Seq model that can be used for tasks such as translation. The following is the main structure of BART. It does not look much different from the Transformer; the main difference lies in what the source and target are.

During pre-training, the Encoder encodes the corrupted text bidirectionally, and the Decoder then reconstructs the original input autoregressively. During testing or fine-tuning, both the Encoder and the Decoder receive uncorrupted text.
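
As a rough illustration (not the authors' original fairseq setup, and the corrupted sentence below is just a hand-made example), here is a minimal sketch of this denoising objective with the Hugging Face transformers library: the Encoder reads the corrupted text, and the loss is the cross entropy between the Decoder's predictions and the original text.

```python
# Minimal sketch of BART's denoising objective (Hugging Face transformers).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "The quick brown fox jumps over the lazy dog."
corrupted = "The quick <mask> jumps over the lazy dog."  # a masked span, as in Text Infilling

# Encoder input is the corrupted text; the Decoder is trained to reproduce the original.
inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)  # autoregressive reconstruction (cross-entropy) loss
```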

BART vs Transformer

BART uses the standard Transformer model with some changes:

  1. As in GPT, the ReLU activation function is replaced with GeLU, and parameters are initialized from a normal distribution N(0, 0.02) (see the sketch after this list)
  2. The BART Base model has 6 Encoder layers and 6 Decoder layers, while the Large model has 12 of each
  3. Each layer of the BART Decoder additionally performs cross-attention over the final hidden states of the Encoder
  4. BERT uses an additional Feed Forward Layer before word prediction, while BART does not
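
A minimal PyTorch sketch of point 1: a Transformer feed-forward block with GeLU activation and weights drawn from N(0, 0.02). The helper name `init_bart_weights` is made up here for illustration.

```python
# Sketch of GeLU activation + N(0, 0.02) initialization (illustrative only).
import torch.nn as nn

def init_bart_weights(module: nn.Module, std: float = 0.02) -> None:
    """Initialize Linear/Embedding weights from N(0, 0.02) and biases to zero."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

feed_forward = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),            # GeLU instead of ReLU
    nn.Linear(3072, 768),
)
feed_forward.apply(init_bart_weights)  # apply recursively to all submodules
```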

Pre-training BART

The BART authors experimented with several ways of corrupting the input:

  • Token Masking: Following BERT (Devlin et al., 2019), random tokens are sampled and replaced with [MASK] tokens.
  • Sentence Permutation: A document is divided into sentences based on full stops, and these sentences are shuffled into a random order.
  • Document Rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document.
  • Token Deletion: Random tokens are deleted from the input. In contrast to token masking, the model must decide which positions are missing inputs.
  • Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single [MASK] token; 0-length spans correspond to the insertion of [MASK] tokens. Text infilling teaches the model to predict how many tokens are missing from a span. A simplified sketch follows this list.
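
Below is a simplified, illustrative sketch of Text Infilling on whitespace-separated tokens. It is only an approximation: the real implementation works on subword tokens and, in the paper's best configuration, masks about 30% of them.

```python
# Simplified Text Infilling: sample spans with Poisson(lambda=3) lengths and
# replace each span with a single [MASK]; 0-length spans insert a [MASK].
import random
import numpy as np

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0, mask_token="[MASK]"):
    masked = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # roughly how many tokens to mask
    while budget > 0:
        span_len = min(int(np.random.poisson(poisson_lambda)), budget, len(masked))
        start = random.randrange(0, len(masked) - span_len + 1)
        masked[start:start + span_len] = [mask_token]  # whole span -> one [MASK]
        budget -= max(span_len, 1)  # count 0-length insertions so the loop ends
    return masked

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```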

Fine-tuning BART

Sequence Classification Tasks

In sequence classification tasks, the same input is fed into both the Encoder and the Decoder, and the hidden state of the final Decoder token is passed to a multi-class linear classifier. BART appends an extra token to the end of the Decoder input, as shown in the figure below; the output at that token's position can be regarded as the representation of the whole sentence.
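
For example, Hugging Face's BartForSequenceClassification implements this idea: the hidden state at the final (EOS) Decoder position is fed into a linear classification head. A minimal sketch with the publicly released MNLI checkpoint (the checkpoint choice here is just for illustration):

```python
# Sequence classification with BART: the final decoder (EOS) hidden state
# is passed through a linear classifier (here, 3-way MNLI labels).
import torch
from transformers import BartTokenizer, BartForSequenceClassification

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-mnli")
model = BartForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

premise = "BART is a sequence-to-sequence model."
hypothesis = "BART has an encoder and a decoder."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [contradiction, neutral, entailment]
print(logits.softmax(dim=-1))
```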

Sequence Generation Tasks

Because BART has an autoregressive Decoder, it can be fine-tuned directly for sequence generation tasks such as question answering or text summarization.
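
For instance, the facebook/bart-large-cnn checkpoint is BART fine-tuned on CNN/DailyMail for summarization; here is a minimal generation sketch (the checkpoint and decoding settings are illustrative, not the paper's exact configuration):

```python
# Summarization with a fine-tuned BART checkpoint via beam search.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = ("BART is pre-trained by corrupting text with a noising function "
           "and learning a model to reconstruct the original text. ...")

inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs.input_ids, num_beams=4,
                             max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```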

Machine Translation

The authors replace BART Encoder's embedding layer with a new, randomly initialized Encoder, and train the model end-to-end: the new Encoder learns to map foreign-language words into an input that BART can de-noise into English. The new Encoder may use a vocabulary different from the original BART model. Training this randomly initialized Encoder proceeds in two steps, both of which backpropagate the cross-entropy loss from the BART output. In the first step, most BART parameters are frozen, and only the randomly initialized Encoder, the BART positional embeddings, and the self-attention input projection matrix of BART Encoder's first layer are updated. In the second step, all model parameters are trained for a small number of iterations. A sketch of the first step follows.
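
This is a rough sketch of the first step, assuming the Hugging Face BART implementation (module paths such as model.encoder.layers[0].self_attn follow that library, not the authors' fairseq code), with a single Transformer encoder layer standing in for the new source-language encoder:

```python
# Step one of the MT recipe: freeze BART, unfreeze only (a) the new randomly
# initialized source encoder, (b) BART's positional embeddings, and (c) the
# self-attention input projections of BART Encoder's first layer.
import torch.nn as nn
from transformers import BartForConditionalGeneration

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# (a) New encoder that maps foreign-language tokens into BART's embedding space
#     (a single layer here, only as a stand-in).
new_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=1,
)

for p in bart.parameters():          # freeze all of BART first
    p.requires_grad = False

for p in bart.model.encoder.embed_positions.parameters():   # (b) positional embeddings
    p.requires_grad = True
for p in bart.model.decoder.embed_positions.parameters():
    p.requires_grad = True

first_attn = bart.model.encoder.layers[0].self_attn          # (c) first-layer projections
for proj in (first_attn.q_proj, first_attn.k_proj, first_attn.v_proj):
    for p in proj.parameters():
        p.requires_grad = True

trainable = [p for p in list(new_encoder.parameters()) + list(bart.parameters())
             if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters in step one")
```

In step two, all parameters (the new encoder plus the whole of BART) would be unfrozen and trained for a small number of additional iterations.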

Results

As can be seen from the table above, Document Rotation and Sentence Shuffling on their own do not work very well. Intuitively, if the model only ever sees sentences in shuffled order during training, it may come to treat shuffled order as normal, so when the test input arrives in the correct order, the model may be at a loss. Text Infilling, on the other hand, can be regarded as a combination of Token Masking and Token Deletion, so it is understandable that it performs so well.
