The famous Sesame Street

The names of pre-trained language models are mostly Sesame Street characters. This is obviously a deliberate move by whoever does the naming: they will even contort an acronym just to land on a Sesame Street name

The models shown in the figure above (except Big Bird, for which there was no such model at the time) all have one thing in common: they embed a word using the context of the whole sentence. Many network architectures can achieve this, such as LSTMs, self-attention layers, and tree-based models (which focus on grammar and generally perform poorly, doing well only when the grammatical structure is very strict)

Smaller Model

Pre-trained language models keep getting bigger and bigger, to the point where ordinary people can no longer run them. Making the model smaller is a bit like taking RMB earned in a first-tier city and spending it somewhere with far higher purchasing power: this is BERT for people on a budget. Examples include DistilBERT, TinyBERT, MobileBERT, Q8BERT, and ALBERT

Giving someone a fish is not as good as teaching them to fish: what exactly makes a model smaller? You can refer to Professor Li Hongyi's video on model compression and the article All The Ways You Can Compress BERT. The common methods are listed below; a minimal knowledge-distillation sketch follows the list

  • Network Pruning
  • Knowledge Distillation
  • Parameter Quantization
  • Architecture Design
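
Knowledge distillation is the method most directly tied to models like DistilBERT and TinyBERT: a small student is trained to match the softened output distribution of a large teacher. Below is a minimal PyTorch sketch of the classic distillation loss; the temperature and loss weighting are illustrative assumptions, not the settings of any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with the usual hard-label loss.
    T and alpha are illustrative, not tuned values."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 so its gradients keep a comparable magnitude
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```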

Network Architecture Improvements

In addition to model compression, the most popular line of work in recent years is redesigning the model architecture. For example, Transformer-XL can process very long sequences by carrying information across segments; Reformer and Longformer reduce the complexity of self-attention from $O(N^2)$ to $O(N \log N)$ or even lower
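
To make the complexity reduction concrete, here is a toy block-local attention in PyTorch: each token attends only within its own fixed-size block, so the cost grows linearly with sequence length rather than quadratically. This is a simplification for illustration only, not the actual Reformer or Longformer attention pattern (those add hashing buckets, sliding windows, and global tokens).

```python
import torch

def block_local_attention(q, k, v, block_size=64):
    """Attend only within non-overlapping blocks of `block_size` tokens.
    Cost is O(N * block_size) instead of O(N^2).
    Assumes seq_len is divisible by block_size (an illustrative simplification)."""
    b, n, d = q.shape
    nb = n // block_size
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    scores = torch.einsum("bnqd,bnkd->bnqk", q, k) / d ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", attn, v)
    return out.reshape(b, n, d)
```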

How to Fine-tune

Teacher Li Hongyi spent a lot of time on this part, but I personally feel much of it is fairly basic. For example, how do you classify an input sentence? This has been done a lot: the common approach is to take the output at [CLS] and feed it into a linear classification layer, or to average the outputs of all tokens and feed that average into the linear classification layer. Both methods aim at classification; the essential difference is what is used as the sentence's embedding. For that, refer to the Sentence-BERT paper or a detailed explanation of Sentence-BERT; it contains experiments on what makes a good sentence vector
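
As an illustration, here is a minimal classifier head supporting both pooling choices, written against the Hugging Face transformers API. The class name, model name, and pooling flag are my own; only the BertModel call and its last_hidden_state output are standard.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumes the Hugging Face transformers package

class BertSentenceClassifier(nn.Module):
    """Sketch: classify a sentence from BERT outputs, using either the [CLS]
    vector or the mean of all token vectors as the sentence embedding."""
    def __init__(self, num_classes, pooling="cls"):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.pooling = pooling
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                      # (batch, seq_len, hidden)
        if self.pooling == "cls":
            sent = h[:, 0]                             # output at the [CLS] token
        else:                                          # mean over real tokens only
            mask = attention_mask.unsqueeze(-1).float()
            sent = (h * mask).sum(1) / mask.sum(1)
        return self.classifier(sent)
```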

Extraction-based QA

Something seen less often is how to use pre-trained models for extraction-based QA (question answering) tasks

For example, if you feed a document and a question into the QA model, the model outputs two integers $s$ and $e$, meaning that the answer to the question is the span from the $s$-th word to the $e$-th word of the document, i.e. $\{d_s, \dots, d_e\}$

The way these two integers are obtained is also interesting. First, we learn two vectors (orange and blue in the figure above). One of them (orange) is dotted with the output at every position of the document, and the resulting scores go through a softmax to give a probability for each position. Taking the argmax of these probabilities gives the start position of the answer, e.g. $s = 2$

The end position $e$ is obtained in the same way: the other (blue) vector is dotted with the output at every position of the document, and the argmax after the softmax gives $e$. The final answer is the span $[s, e]$
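
A minimal PyTorch sketch of that QA head: two learned vectors are dotted with the encoder output at every document position, and the argmax over the softmaxed scores gives the start and end indices. Names and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """The two learned vectors are the 'orange' and 'blue' ones described above."""
    def __init__(self, hidden_size):
        super().__init__()
        self.start_vec = nn.Parameter(torch.randn(hidden_size))
        self.end_vec = nn.Parameter(torch.randn(hidden_size))

    def forward(self, doc_outputs):                   # (batch, doc_len, hidden)
        start_logits = doc_outputs @ self.start_vec   # (batch, doc_len)
        end_logits = doc_outputs @ self.end_vec
        # the softmax does not change the argmax, but mirrors the description above
        s = start_logits.softmax(-1).argmax(-1)       # start index of the answer
        e = end_logits.softmax(-1).argmax(-1)         # end index of the answer
        return s, e
```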

Getting back to the subject, how should a pre-trained language model be fine-tuned? One option is to freeze the pre-trained model and use it as a feature extractor, updating only the parameters of the task-specific model on top during training. The other option is not to freeze anything and update all parameters during training. In many of my own experiments the latter works better than the former, but the problem is that many pre-trained models are so large that 11 GB of GPU memory is often not enough, so the former has to be used instead
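
In code, the difference between the two options is just whether the encoder's parameters receive gradients. A hedged sketch, assuming `model` is something like the BertSentenceClassifier sketched earlier (learning rates are typical guesses, not recommendations):

```python
import torch

# Option 1: frozen feature extractor -- only the task head is trained
for p in model.bert.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

# Option 2: full fine-tuning -- every parameter is updated
# (usually works better, but needs far more GPU memory)
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```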

Combination of Features

We know BERT has many encoder layers, and the usual practice is to take the output of the last layer for downstream tasks, but is that actually optimal? Someone ran an experiment on an NER task, combining the outputs of different layers in various ways, with the results shown below

Xiao Han has an open-source project on GitHub called bert-as-service, whose goal is to use BERT to create word embeddings for your text. He experimented with various ways of combining these embeddings and shared some conclusions and the rationale on the project's FAQ page

Xiao Han believes that:

  1. The first layer is the embedding layer; because it carries no contextual information, the vector for the same word is identical across different contexts
  2. As we go deeper into the network, the word embeddings pick up more and more contextual information at each layer
  3. However, approaching the final layers, the word embeddings start to pick up information specific to BERT's pre-training tasks (MLM and NSP)
  4. So it makes sense to use the second-to-last layer; a short sketch of extracting and combining layer outputs follows
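
Getting at those layer outputs is straightforward with the transformers library. A minimal sketch; the model name and the "last four layers" combination are just examples, while the second-to-last-layer choice is what bert-as-service defaults to.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

enc = tok("BERT layer outputs can be combined.", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

hidden = out.hidden_states                       # tuple: embedding layer + 12 encoder layers
second_to_last = hidden[-2]                      # bert-as-service's default choice
last_four_cat = torch.cat(hidden[-4:], dim=-1)   # another common combination
```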

Why Pre-train Models?

Why do we use these pre-trained models? Obviously, we don’t have the money to train a larger model from scratch, so we can just use what someone else has trained

Of course, an EMNLP 2019 paper, Visualizing and Understanding the Effectiveness of BERT, analyzes the use of pre-trained models carefully from an academic perspective. It shows that pre-training greatly accelerates the convergence of the loss: without a pre-trained model, the loss is much harder to bring down. In other words, the pre-trained model provides a better initialization than random initialization

Another conclusion is that pre-training greatly improves a model's generalization ability. The figure above shows that when the model is given different initial parameters, the loss after training lands in different local minima. The sharper the local minimum, the worse the generalization, because the loss varies greatly with slight changes in the input; conversely, the flatter the local minimum, the better the generalization

ELMo

ELMo is a well-known bidirectional network. A traditional LSTM only reads the sentence from left to right, so the information used to predict the next token comes entirely from its left. To really use a token's context, we can also read the sentence from right to left, giving a BiLSTM. Even that is not quite enough: when the forward direction encodes $w_1, w_2, w_3, w_4$ it has not seen the end of the sentence, and when the backward direction encodes $w_5, w_6, w_7$ it has not taken the earlier part into account. So ELMo is not truly bidirectional at the lower layers; only at the upper layer, where the embeddings from the two directions are concatenated, does it see both sides at once
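
A toy PyTorch sketch of the concatenation idea. The real ELMo trains two separate character-aware language models and learns a weighted sum over layers; this only shows the forward and backward states being concatenated per token.

```python
import torch
import torch.nn as nn

class TinyBiLSTMEmbedder(nn.Module):
    """Run an LSTM left-to-right and another right-to-left, then concatenate
    the two hidden states for each token (dimensions are illustrative)."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embed(token_ids)
        out, _ = self.bilstm(x)                # (batch, seq_len, 2 * hidden)
        return out                             # forward/backward states concatenated
```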

BERT

With Transformer-based models (typically BERT), the self-attention mechanism lets every token see the whole context at once, and every pair of tokens can interact. The only thing left to do is to randomly MASK some tokens
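
A toy version of that masking step. Real BERT masks about 15% of tokens and, of those, replaces 80% with [MASK], 10% with random tokens, and keeps 10% unchanged; this sketch only does the simple all-[MASK] variant.

```python
import torch

def random_token_mask(input_ids, mask_token_id, p=0.15):
    """Hide ~p of the tokens; only masked positions contribute to the MLM loss
    (label -100 is the usual 'ignore' index for cross-entropy)."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < p
    labels[~mask] = -100                 # un-masked positions are not predicted
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id      # replace masked positions with [MASK]
    return corrupted, labels
```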

If you go back in history to when Word2vec was just starting the NLP revolution, you’ll see that CBOW’s training is almost the same as BERT’s. The main difference is that BERT’s range of attention is variable, while CBOW’s range is fixed

Whole Word Masking (WWM)

Is randomly masking a single token really good enough? In Chinese, a word is composed of multiple characters, and one character is one token. If we randomly mask a single token, the model may not need to learn much about semantic dependencies: it can often guess the token just from the character before or after it. So we raise the difficulty a bit: instead of masking one token, we mask a whole word (a span). The model then has to learn more semantics to recover the masked span; this is BERT-wwm. In the same spirit, the span can be extended further, to the phrase level or entity level (ERNIE).
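
A toy sketch of whole-word masking over WordPiece tokens: pieces that start with "##" belong to the previous word, so the whole group is masked or kept together. The probability and mask token are illustrative.

```python
import random

def whole_word_mask(tokens, mask_token="[MASK]", p=0.15):
    """Group WordPiece tokens into words, then mask whole words at once."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)            # continuation piece of the same word
        else:
            if current:
                words.append(current)
            current = [i]                # start of a new word
    if current:
        words.append(current)

    masked = list(tokens)
    for word in words:
        if random.random() < p:          # mask every piece of the chosen word
            for i in word:
                masked[i] = mask_token
    return masked
```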

SpanBERT

There is another BERT improvement called SpanBERT. It masks $N$ consecutive tokens at a time, where $N$ is sampled according to the probabilities shown in the figure above. The experiments show that this probabilistic choice of span length works better on some tasks

SpanBERT also proposes a training objective called the Span Boundary Objective (SBO). Normally we simply train the model to recover the masked tokens; SBO instead predicts what is inside the masked span from the outputs immediately to its left and right. As shown in the figure above, the outputs of $w_3$ and $w_8$ are fed into a subsequent network together with an index indicating which position inside the span we want to predict
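
For reference, the SpanBERT paper samples span lengths from a geometric distribution with p = 0.2, clipped at 10, which is what the probabilities in the figure correspond to. A one-line sketch:

```python
import numpy as np

def sample_span_length(p=0.2, max_len=10):
    """Span length ~ Geometric(p), clipped at max_len, as described in SpanBERT
    (the clipping produces the probabilities shown in the figure)."""
    return min(np.random.geometric(p), max_len)
```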

XLNet

A more detailed explanation of XLNet is available in this blog post. In short, XLNet argues that the training and test phases of BERT-style models are inconsistent (there are [MASK] tokens during training but none at test time), which may cause problems. Viewed autoregressively, XLNet shuffles the input order and then predicts the next token from left to right. Viewed as an autoencoder (like BERT), it predicts the word at the [MASK] position from the information to the left or right of it; unlike BERT, XLNet never actually feeds a [MASK] token as input

MASS / BART

BERT-style models lack the ability to generate sentences, so they are not well suited to Seq2Seq tasks; the MASS and BART models address exactly this weakness. We want the decoder's output to reconstruct the encoder's input, but it is important to corrupt the encoder's input to some degree: if nothing is damaged, the decoder can simply copy the encoder's input and may not learn anything useful

MASS does this by randomly masking parts of the input with [MASK] tokens. The output does not have to be the complete sentence; it only has to predict the [MASK] positions correctly

BART's paper proposes several corruption methods, including randomly masking the input sequence, deleting tokens outright, and randomly permuting it. For a more detailed explanation of BART, see this article
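
Toy versions of those corruptions are sketched below. The real BART uses span-level text infilling, sentence permutation, and document rotation; these one-liners only illustrate the flavor of each.

```python
import random

def corrupt_input(tokens, mask_token="[MASK]", p=0.15):
    """Three toy corruption schemes for an encoder input."""
    masked   = [mask_token if random.random() < p else t for t in tokens]  # MASS-style masking
    deleted  = [t for t in tokens if random.random() > p]                  # token deletion
    shuffled = random.sample(tokens, len(tokens))                          # random permutation
    return masked, deleted, shuffled
```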

UniLM

There is also a model called UniLM, which can act as an encoder, a decoder, or a Seq2Seq model. UniLM is a stack of Transformers trained in three ways simultaneously: like BERT as an encoder, like GPT as a decoder, and like MASS/BART as Seq2Seq. When used as Seq2Seq, the input is split into two segments: within the first segment all tokens can attend to one another, while in the second segment each token can only see the tokens to its left
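
A sketch of that Seq2Seq attention mask in PyTorch (True means "may attend"; the segment lengths are the only inputs, everything else is an illustrative assumption):

```python
import torch

def seq2seq_attention_mask(len1, len2):
    """First segment: fully bidirectional within itself.
    Second segment: sees all of segment 1 plus only its own left context."""
    n = len1 + len2
    mask = torch.zeros(n, n, dtype=torch.bool)     # rows = queries, cols = keys
    mask[:, :len1] = True                          # everyone may attend to segment 1
    causal = torch.tril(torch.ones(len2, len2)).bool()
    mask[len1:, len1:] = causal                    # causal attention within segment 2
    mask[:len1, len1:] = False                     # segment 1 never sees segment 2
    return mask
```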

ELECTRA

Training a model to predict tokens takes a lot of compute, so ELECTRA simplifies the task into a binary classification problem: for each word in the input, decide whether it has been randomly replaced

But the question is: how do you replace some words so that the sentence stays grammatically correct and not semantically weird? If a token is replaced with something obviously strange, the model spots it too easily and ELECTRA learns nothing useful. In the paper, another, relatively small BERT is used to generate the words at the masked positions. It should not be too good: if it were, it would output exactly the original words, which is not what we want. The architecture looks a bit like a GAN, but it is not one, because a GAN's generator is trained to fool the discriminator, whereas the small BERT here trains on its own, only to predict the masked positions, regardless of whether the downstream model's predictions are correct
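
A sketch of how the discriminator's targets can be built: wherever the small generator's sample happens to equal the original token, the label is "original", otherwise "replaced". The function name and shapes are assumptions; only the idea of per-token binary labels comes from the paper.

```python
import torch

def electra_targets(original_ids, generator_samples, mask_positions):
    """original_ids: (seq_len,) true token ids
    generator_samples: (num_masked,) ids sampled by the small generator
    mask_positions: (num_masked,) indices that were masked out."""
    corrupted = original_ids.clone()
    corrupted[mask_positions] = generator_samples
    labels = (corrupted != original_ids).long()   # 1 = replaced, 0 = original
    return corrupted, labels
```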

ELECTRA trains remarkably efficiently: for the same amount of pre-training compute its GLUE scores are much better than BERT's, and it needs only about 1/4 of the computation to reach XLNet's performance

T5

Pre-training a language model requires more resources than ordinary people can afford. Google's T5 paper is a demonstration of the company's vast financial and computing resources: it tries out essentially every pre-training method once and then delivers the conclusions, leaving little for anyone else to study

Reference

  • Notes | Pre-trained models of the BERT series
  • Li Hongyi: Human Language Processing
  • BERT and its family