BERT achieves strong results on many NLP tasks, but it is large and computationally expensive, and even bigger models such as RoBERTa and GPT-2 keep appearing. Because of their size, these models are hard to use on less powerful machines and poorly suited to real-time applications. As a result, there is a lot of research on compressing the BERT model, and model distillation is one of the more effective approaches. This article introduces two BERT compression methods based on model distillation.

1. Introduction

The figure above shows a number of Transformer-based models; the numbers below each model are its parameter count in millions, and you can see that the models keep getting bigger. The size of these models limits their real-world use for several reasons:

  • Training these models is expensive, and serving them at scale requires costly GPU servers.
  • Large models have long inference times, which rules them out for tasks with strict real-time requirements.
  • Many machine learning tasks need to run on end devices such as smartphones, where a lightweight model is required.

For these reasons, much research has focused on BERT model compression. Common compression methods include:

  • Model distillation: train a smaller model on the knowledge learned by a larger model, so that the smaller model generalizes like the larger one.
  • Quantization: lower the numerical precision of the large model's weights to shrink the model.
  • Pruning: remove connections in the model that have little effect.
  • Parameter sharing: share some parameters across the network to reduce the total parameter count.

ALBERT, introduced in the previous article (RoBERTa and ALBERT), is also a BERT compression method; it mainly uses parameter sharing and matrix factorization to compress BERT. However, ALBERT only reduces the number of parameters; it does not reduce inference time.

Next, two algorithms for compressing BERT with model distillation are introduced. The first is DistilBERT, which distills the 12-layer BERT-Base model into a 6-layer BERT model. The second distills the BERT model into a BiLSTM model.

2. Model Distillation

Model distillation is a model compression method proposed by Hinton in the paper “Distilling the Knowledge in a Neural Network”. Its main points are:

  • First, train a large model, called the teacher model.
  • The probability distribution output by the teacher model is then used to train a small model, called the student model.
  • When training the student model there are two kinds of labels: the soft label is the probability distribution output by the teacher model, while the hard label is the original one-hot label.
  • A small model trained by distillation learns the performance and generalization ability of the large model.

In supervised training, the model has a prediction target, usually a one-hot label, and during training it maximizes the probability (softmax over the logits) of the corresponding label. In the model's predicted distribution the correct class has the highest probability while the other classes have relatively small probabilities, but the probabilities of those low-probability classes still differ from one another. For example, the probability of predicting a dog as a wolf is higher than the probability of predicting it as a tree. To some extent, these differences reflect the generalization ability of the model: a good model is not one that merely fits the training set well, but one that generalizes.

Model distillation aims for the student model to learn the generalization ability of the teacher model, so the target used when training the student is the probability distribution output by the teacher model, also known as the soft target. Some distillation methods also use the original one-hot label during training, called the hard target.

To better learn the generalization ability of the teacher model, Hinton proposed softmax-temperature, given by the following formula:
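$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ are the logits and $T$ is the temperature (reconstructed from Hinton's paper).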

Softmax-temperature adds a parameter T to the standard softmax. The closer T is to 0, the closer the distribution is to one-hot; the closer T is to infinity, the closer the distribution is to uniform. In other words, the larger T is, the smoother the distribution. Choosing an appropriate T lets the student model better observe the diversity of the teacher model's class distribution. During training, the teacher and student models use the same T; at inference time the student sets T back to 1, recovering the standard softmax.
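As a concrete illustration, here is a minimal PyTorch-style sketch of a distillation loss with softmax-temperature; the function name, temperature, and soft/hard weighting `alpha` are illustrative choices rather than values from a specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target loss (teacher distribution at temperature T)
    with a hard-target cross-entropy loss on the one-hot labels."""
    # Soft targets: both distributions are smoothed with the same temperature T.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between teacher and student distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures,
    # as noted in Hinton's paper.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: standard cross entropy with the original labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```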

3. DistilBERT

The DistilBERT model was published by HuggingFace in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. The DistilBERT model is similar to the BERT model, but DistilBERT has only 6 layers compared to BERT-Base's 12, and only 66 million parameters compared to BERT-Base's 110 million. Despite having fewer parameters and layers, DistilBERT still performs well, retaining about 97% of BERT's performance on GLUE.
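For reference, the released DistilBERT checkpoint can be loaded from the HuggingFace transformers library; a quick way to check its size (assuming transformers and torch are installed):

```python
from transformers import AutoModel

# Load the pretrained DistilBERT encoder and count its parameters.
model = AutoModel.from_pretrained("distilbert-base-uncased")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 66M
```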

3.1 DistilBERT training

DistilBERT uses the KL divergence as the distillation loss, where Q is the distribution output by the student model and P is the distribution output by the teacher model:
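$$D_{KL}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

where $P$ is the teacher's distribution and $Q$ is the student's distribution, both computed with the same temperature $T$.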

DistilBERT's final loss function is a linear combination of the KL divergence (distillation loss) and the MLM (masked language modeling) loss. DistilBERT removes BERT's token-type embeddings and the NSP (next sentence prediction) task, keeps BERT's other mechanisms, and halves the number of layers.
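To make the combination concrete, here is a rough, self-contained sketch of such a loss; the temperature and the weights `alpha_kd` / `alpha_mlm` are illustrative, and the paper additionally adds a cosine embedding loss between teacher and student hidden states, omitted here for brevity.

```python
import torch.nn.functional as F

def distilbert_loss(student_mlm_logits, teacher_mlm_logits, mlm_labels,
                    T=2.0, alpha_kd=0.5, alpha_mlm=0.5):
    """Linear combination of the distillation (KL) loss and the MLM loss.
    Weights and temperature are illustrative, not the paper's exact values."""
    # Distillation loss over the vocabulary distributions at temperature T.
    kd = F.kl_div(
        F.log_softmax(student_mlm_logits / T, dim=-1),
        F.softmax(teacher_mlm_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Masked language modeling loss on the original token labels;
    # non-masked positions carry the ignore label -100 (a common convention).
    mlm = F.cross_entropy(
        student_mlm_logits.view(-1, student_mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    return alpha_kd * kd + alpha_mlm * mlm
```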

In addition, the DistilBERT authors use several optimization tricks, such as initializing the DistilBERT model with the teacher model's parameters and adopting training techniques from RoBERTa, such as large batches and dynamic masking.

3.2 DistilBERT results

DistilBERT outperforms ELMo on all of the datasets and even outperforms BERT on some of them. Overall, DistilBERT retains 97% of BERT's performance while having only about 60% of BERT's parameters.

The figure above compares the parameter counts and inference times of different models: DistilBERT has far fewer parameters than ELMo and BERT-Base, and its inference time is much shorter.

4. Distilling BERT into BiLSTM

Another paper, Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, distills the BERT model into a BiLSTM model, which it calls Distilled BiLSTM. That is, the teacher model is BERT and the student model is a BiLSTM. The paper proposes two student models: one for single-sentence classification, the other for matching a pair of sentences.

4.1 Distilled BiLSTM model

The figure above shows the first BiLSTM model, used for single-sentence classification. The word vectors of all the words in the sentence are fed into a BiLSTM, and the hidden vectors of the forward and backward LSTMs are concatenated and passed to a fully connected network for classification.
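As an illustration, here is a minimal PyTorch sketch of such a student model; the dimensions, the single LSTM layer, and the use of the final forward/backward hidden states are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Single-sentence student: embed -> BiLSTM -> concatenate the final
    forward/backward hidden states -> fully connected classifier."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.bilstm(embedded)       # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)
        return self.classifier(h)                 # logits over the classes
```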

The figure above shows the second BiLSTM model, used for matching two sentences. The hidden vectors output by the two BiLSTMs are h1 and h2, and the two vectors are combined before classification. The formula for combining h1 and h2 is as follows:
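(Reconstructed from the paper's concatenate-compare operation, where $\odot$ denotes elementwise multiplication:)

$$f(h_1, h_2) = [\,h_1;\ h_2;\ h_1 \odot h_2;\ |h_1 - h_2|\,]$$

The combined vector is then passed to the fully connected classifier.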

4.2 BiLSTM training

The loss function used to distill BERT into the BiLSTM model consists of two parts (a minimal sketch follows the list):

  • One part is the hard target: the cross entropy between the one-hot label and the probabilities output by the BiLSTM.
  • The other part is the soft target: the MSE between the logits output by the teacher model (BERT) and those output by the BiLSTM.
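A minimal sketch of this two-part objective, assuming the soft-target MSE is computed on the logits and using an illustrative weight `alpha`:

```python
import torch.nn.functional as F

def distilled_bilstm_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Hard target: cross entropy with the one-hot labels.
    Soft target: MSE between teacher (BERT) and student (BiLSTM) logits."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard_loss + (1 - alpha) * soft_loss
```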

During training, the dataset is too small for the student model to learn all of the teacher model's knowledge, so the authors propose three data augmentation methods to expand the data (a small sketch follows the list):

  • Masking: randomly replace a word with the [MASK] token, for example “I have a cat” becomes “I [MASK] a cat”.
  • POS-guided word replacement: replace a word with another word that has the same POS tag, for example “I have a cat” becomes “I have a dog”.
  • N-gram sampling: randomly sample n from 1 to 5, keep an n-gram from the sentence, and drop the other words.
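A simplified sketch of the three augmentation rules, assuming whitespace tokenization and a toy same-POS lookup in place of a real POS tagger; the paper applies these rules probabilistically to generate many synthetic examples:

```python
import random

def mask_word(tokens):
    """Masking: replace one randomly chosen word with [MASK]."""
    out = tokens[:]
    out[random.randrange(len(out))] = "[MASK]"
    return out

def pos_guided_replace(tokens, same_pos_words):
    """POS-guided replacement: swap one word for another with the same POS tag
    (same_pos_words maps a word to such alternatives, a toy stand-in for a tagger)."""
    out = tokens[:]
    i = random.randrange(len(out))
    candidates = same_pos_words.get(out[i], [])
    if candidates:
        out[i] = random.choice(candidates)
    return out

def ngram_sample(tokens):
    """N-gram sampling: pick n in [1, 5], keep a random n-gram, drop the rest."""
    n = min(random.randint(1, 5), len(tokens))
    start = random.randrange(len(tokens) - n + 1)
    return tokens[start:start + n]

# Example: "I have a cat" -> e.g. "I [MASK] a cat", "I have a dog", "have a"
tokens = "I have a cat".split()
print(mask_word(tokens))
print(pos_guided_replace(tokens, {"cat": ["dog", "bird"]}))
print(ngram_sample(tokens))
```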

4.3 Distilled BiLSTM experiment results

As shown above, the Distilled BiLSTM performs much better than a BiLSTM trained on its own, and on the SST and QQP datasets it even beats ELMo, indicating that the model learns part of BERT's generalization ability. However, Distilled BiLSTM still performs much worse than BERT, indicating that a lot of knowledge cannot be transferred to the BiLSTM.

The figure above shows the parameter counts and inference times. The Distilled BiLSTM has far fewer parameters than BERT-large, about 335 times fewer, and its inference is about 434 times faster, so the compression effect is quite significant.

5. Summary

DistilBERT achieves better results, while Distilled BiLSTM compresses the model further.

DistilBERT uses KL divergence to compute the soft-target loss, while Distilled BiLSTM uses MSE. The reason HuggingFace gives in its blog is that DistilBERT distills a language model while Distilled BiLSTM distills a downstream classification task, and the output space of a language model is much larger; in DistilBERT's setting, MSE over the logits may cause them to cancel out.

6. References

  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
  • Distilling the Knowledge in a Neural Network
  • HuggingFace blog