Abstract: This paper applies a one-shot NAS method to search for compressed BERT architectures.

This article is shared from the Huawei Cloud community post "[NAS Papers][Transformer][Pre-training Model] An Intensive Reading of NAS-BERT", by Su Dao.

NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search

Brief introduction:

The paper's code is not open source, but the paper is clear enough that a manual implementation should be feasible. BERT has too many parameters and its inference is too slow (although TensorRT 8.x already achieves good inference performance, with BERT-Large inference taking only 1.2 ms), but excellence is always what researchers pursue, so this paper applies a one-shot, weight-sharing NAS method to search over BERT architectures.

The methods covered include block-wise search, progressive shrinking, and performance approximation.


1. Definition of the Search Space

The ops in the search space include depthwise separable convolutions with kernel sizes [3/5/7], hidden sizes [128/192/256/384/512], MHA head counts [2/3/4/6/8], FFN intermediate sizes [512/768/1024/1536/2048], and an identity connection (also called a skip layer), for a total of 26 ops, as shown in the following figure:
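As a concrete sketch, the candidate set can be enumerated in Python. Pairing each hidden size with exactly one head count and one FFN size is my reading of the lists above (which yields the stated total of 26), not something the article spells out:

```python
# Enumerate the 26 candidate ops described above.
# Assumption: each hidden size is paired with one head count and one
# FFN intermediate size; separable convs combine every hidden size
# with every kernel size. 1 + 5 + 5 + 15 = 26.
HIDDEN = [128, 192, 256, 384, 512]
HEADS = [2, 3, 4, 6, 8]                 # one head count per hidden size
FFN = [512, 768, 1024, 1536, 2048]      # one FFN size per hidden size
KERNELS = [3, 5, 7]

def build_op_set():
    ops = [("identity",)]
    ops += [("mha", h, n) for h, n in zip(HIDDEN, HEADS)]
    ops += [("ffn", h, f) for h, f in zip(HIDDEN, FFN)]
    ops += [("sepconv", h, k) for h in HIDDEN for k in KERNELS]
    return ops

ops = build_op_set()
print(len(ops))  # 26
```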

Note that MHA and FFN are conventionally bound together: the first sub-layer is MHA and the second is FFN, which forms a standard Transformer block. This method breaks that convention on the Transformer block, yet still contains the structures of Transformer and BERT as special cases. The layers are chained one after another, and exactly one op is selected for each layer, as shown in the following figure

2. Training the Super-Network

【Block-wise Training + Knowledge Distillation (KD)】

(1) First, divide the super-network into N blocks

(2) Use the original BERT as the teacher model, and divide it into N blocks as well

(3) The input of the n-th block of the super-network (the student) is the output of the (n-1)-th block of the teacher model, and the mean squared error between the student block's output and the output of the n-th teacher block is used as the loss; in other words, the student block learns to predict the n-th teacher block's output

(4) The super-network is trained with single-path random sampling: at each step, one architecture (one op per layer) is sampled and trained
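A minimal sketch of single-path sampling as just described (op names and layer count are illustrative):

```python
import random

def sample_path(op_set, num_layers):
    """Single-path sampling: pick exactly one candidate op per layer,
    giving one chain-structured architecture from the super-network."""
    return [random.choice(op_set) for _ in range(num_layers)]

# Illustrative op names; the real op set has 26 entries.
path = sample_path(["identity", "mha", "ffn", "sepconv"], num_layers=12)
```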

(5) Since the hidden size of a student block may differ from that of the corresponding teacher block, the hidden inputs and outputs of the teacher block cannot be used directly as training data for the student block. To solve this, a learnable linear transformation layer is added at the input and the output of each student block, converting between hidden sizes so that they match the teacher block, as shown in the figure below
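Steps (3) and (5) can be sketched with NumPy as follows. The student block is a placeholder, the sizes are illustrative, and only the forward loss computation is shown; in training, the transforms would be updated jointly with the block:

```python
import numpy as np

rng = np.random.default_rng(0)

d_teacher, d_student, seq = 512, 256, 8  # illustrative sizes

# Learnable linear transforms at the block boundary (randomly
# initialized here; trained together with the student block).
W_in = rng.normal(0, 0.02, (d_teacher, d_student))
W_out = rng.normal(0, 0.02, (d_student, d_teacher))

def student_block(x):
    # Placeholder for a sampled student sub-block; a real block would
    # be MHA / FFN / separable-conv layers at width d_student.
    return np.tanh(x)

def blockwise_kd_loss(teacher_in, teacher_out):
    """MSE between the (transformed) student output and the teacher
    block output, as in step (3)."""
    x = teacher_in @ W_in           # teacher hidden -> student hidden
    y = student_block(x) @ W_out    # student hidden -> teacher hidden
    return np.mean((y - teacher_out) ** 2)

teacher_in = rng.normal(size=(seq, d_teacher))   # output of teacher block n-1
teacher_out = rng.normal(size=(seq, d_teacher))  # output of teacher block n
loss = blockwise_kd_loss(teacher_in, teacher_out)
```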

【Progressive Shrinking】

Because the search space is very large, the super-network needs to be trained efficiently. Progressive shrinking (hereinafter PS) can be used to speed up training and improve search efficiency. However, architectures cannot simply be eliminated by early performance: large architectures are hard to converge in the early stage of training and perform poorly at first, but that does not mean their representation ability is poor. This paper therefore sets a PS rule:

a^t denotes the largest architecture in the super-network, p(·) denotes the parameter count, l(·) denotes the latency, B is the number of interval buckets, and b is the current bucket index. An architecture a belongs to bucket b only if p_b > p(a) > p_{b-1} and l_b > l(a) > l_{b-1}; an architecture that satisfies no bucket's interval is excluded.
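Under an equal-interval reading of this rule (splitting (0, p(a^t)] and (0, l(a^t)] into B equal buckets is my assumption), bucket assignment could look like:

```python
def bucket_of(p_a, l_a, p_max, l_max, B):
    """Assign an architecture to bucket b in 1..B if BOTH its parameter
    count p_a and its latency l_a fall in the b-th of B equal intervals
    of (0, p_max] and (0, l_max]; return None (excluded) otherwise.
    The equal-interval split is an assumption, not stated in the text."""
    for b in range(1, B + 1):
        p_lo, p_hi = (b - 1) / B * p_max, b / B * p_max
        l_lo, l_hi = (b - 1) / B * l_max, b / B * l_max
        if p_lo < p_a <= p_hi and l_lo < l_a <= l_hi:
            return b
    return None  # excluded from the super-network
```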

The PS process: from each of the B buckets, sample E architectures, evaluate them on the validation set, and eliminate the R architectures with the largest loss; repeat this until only m architectures remain in each bucket
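The shrinking loop above can be sketched as follows; `eval_loss` is assumed to be a callable returning the validation loss of an architecture, and the stopping condition follows the description above:

```python
import random

def progressive_shrink(buckets, eval_loss, E, R, m):
    """Repeatedly sample E architectures per bucket, drop the R with the
    largest validation loss, until each bucket holds at most m
    architectures. A sketch of the PS loop described above."""
    while any(len(archs) > m for archs in buckets):
        for archs in buckets:
            if len(archs) <= m:
                continue
            sample = random.sample(archs, min(E, len(archs)))
            worst = sorted(sample, key=eval_loss, reverse=True)[:R]
            for a in worst:
                if len(archs) > m:  # never shrink below m
                    archs.remove(a)
    return buckets
```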

3. Model Selection

Build a lookup table containing latency, loss, parameter count, and the architecture encoding. Loss and latency are estimated by performance approximation (the specific evaluation methods can be found in the paper). Then take the top T architectures, run them on the validation set, and pick the best one.
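A sketch of this selection step; the table field names, the latency-budget filter, and the `validate` callback are illustrative assumptions:

```python
def select_model(table, latency_budget, T, validate):
    """Filter the lookup table by a latency budget, keep the T entries
    with the lowest approximated loss, then pick the winner by actual
    validation. Field names and the budget filter are illustrative."""
    candidates = [e for e in table if e["latency"] <= latency_budget]
    top_T = sorted(candidates, key=lambda e: e["loss"])[:T]
    return min(top_T, key=lambda e: validate(e["arch"]))

# Toy usage: "c" is over budget; "a" wins on actual validation.
table = [
    {"arch": "a", "latency": 5, "loss": 0.9},
    {"arch": "b", "latency": 8, "loss": 0.5},
    {"arch": "c", "latency": 20, "loss": 0.1},
]
best = select_model(table, latency_budget=10, T=2,
                    validate=lambda a: {"a": 0.4, "b": 0.6}[a])
```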

Experimental Results

1. Compared with the original BERT, results on the GLUE datasets improve to some extent:

2. Compared with other BERT variants, the results are also good:

Ablation experiments

1. Is PS effective?

Without PS, validation takes far longer (5 minutes with PS vs. 50 hours without), and the super-network is harder to converge, which hurts the ranking of architectures:

2. Should PS prune whole architectures or individual nodes (ops)?

The conclusion is that using PS to remove nodes directly is too coarse, and the results are not good:

3. Is two-stage distillation needed?

This paper explored distillation in both the pre-training stage and the fine-tuning stage, i.e. pre-train KD and fine-tune KD, with the following conclusions:

1. Pre-training distillation works better than fine-tuning distillation

2. Distilling in both stages together works best
