Every Meituan business has rich NLP scenarios, and building models for these scenarios requires substantial, costly annotation resources. Few-shot learning aims to train better models when data resources are scarce. This article surveys existing methods from the perspectives of active learning, data augmentation, semi-supervised learning, domain transfer, and ensemble learning + self-training, reports experiments in Meituan scenarios where measurable gains were achieved, and will hopefully help or inspire readers working on related problems.

1 Background

Every Meituan business has rich NLP scenarios, and building models for these scenarios requires substantial, costly annotation resources. Few-shot learning aims to train better models when data resources are scarce, and is judged mainly by the following two criteria:

  • Improving algorithm performance: with a fixed amount of labeled data, few-shot learning should raise the relevant metrics as much as possible.
  • Saving annotation data: to reach a given performance level, few-shot learning should require as little labeled data as possible.

Few-shot settings in NLP mainly involve the following three scenarios, and few-shot learning takes different measures for each:

  1. Sparse sample space (left of Figure 1): when there are few samples, the sample space is sparsely covered. Data augmentation aims to better exploit the relationships between samples/embeddings and improve the model's generalization.
  2. Samples concentrated in a local region of the space (middle of Figure 1): a domain often has only a small amount of labeled data but a large amount of unlabeled data. Depending on how the unlabeled data is used, we distinguish two approaches. The first is semi-supervised learning, where labeled and unlabeled samples are learned jointly during fine-tuning and the model's prediction consistency on unlabeled data is exploited. The second is ensemble learning + self-training, which fuses the predictions of multiple models on unlabeled data into pseudo-labeled data that is then added to training.
  3. Distribution shift between domains (right of Figure 1): a model that has fully learned the annotations of one domain cannot be applied directly to another domain because the sample spaces differ. Transfer learning aims to quickly learn a new domain after the knowledge of one domain has been fully absorbed.

Beyond the three scenarios above, there is also the question of how to select the most informative samples for manual annotation under a limited labeling budget (active learning). We therefore divide few-shot learning into the following types:

  • Data augmentation: data augmentation can be divided into sample augmentation and embedding augmentation. Sample augmentation was first widely used on image data in computer vision: simple operations on an image, such as rotating it or converting it to grayscale, do not change its semantics, and the existence of such semantics-preserving transformations made augmentation an important tool in computer vision research. In NLP, sample augmentation likewise tries to expand the text data without changing the gist of a sentence; the main methods include simple token replacement and generating similar sentences with pre-trained language models. The augmented data can then be learned from easy to hard in a curriculum-learning fashion. Embedding augmentation operates on the model's embedding layer and improves robustness by adding perturbations to, or interpolating between, embeddings.
  • Semi-supervised learning: supervised learning usually needs a large amount of labeled data, which is expensive to obtain, so using large amounts of unlabeled data to improve supervised learning is of great value. Semi-supervised deep learning has made great progress in recent years, especially in computer vision. The research focuses on how to construct unsupervised signals from unlabeled data and combine them with supervised learning; the main current approach is to build loss functions based on consistency regularization over unlabeled data.
  • Ensemble learning + self-training: the goal of supervised learning is a single stable model that performs well in all respects, but in practice we often only obtain several models with different preferences (weakly supervised models that do better on some aspects). Ensemble learning combines multiple such weak models in the hope of obtaining a better, more comprehensive strong model; its underlying idea is that even if one weak classifier makes a wrong prediction, the others can correct it. Ensembling several models directly at prediction time increases the online serving burden, so instead we use multiple models to predict a large amount of unlabeled data, select the predictions with high combined confidence, add them to the training set as pseudo-labels, and thus distill the strengths of multiple models into a single model. This is also a form of semi-supervised learning, but whereas the methods in the previous bullet emphasize prediction consistency on unlabeled data during fine-tuning, this one emphasizes fusing the predictions of multiple models, so we list it separately.
  • Few-shot learning / domain transfer: humans can learn new knowledge quickly once they have accumulated enough prior knowledge, and researchers hope machine learning models can do the same: after a model has learned a large amount of information from certain categories, it should be able to build a classifier for new categories from only a few labeled samples.
  • Active learning: active learning is an iterative process of machine learning with a human in the loop, in which the algorithm screens suitable candidates for manual annotation. The general idea is to use the model to find samples that are "hard" to classify, have humans confirm and review them, then retrain the supervised model with the newly annotated data, gradually improving the model.

2 Method Review

The pre-trained language model BERT has achieved excellent results on many NLP tasks. BERT is a deep bidirectional language representation model that stacks Transformer blocks into a multi-layer bidirectional encoder. We take BERT as the baseline model and fine-tune the pre-trained model on task-specific samples.
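For reference, here is a minimal sketch of this baseline fine-tuning loop with the Hugging Face transformers library; the `bert-base-chinese` checkpoint, the 8-class label space and all hyperparameters are illustrative assumptions rather than the exact production setup.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def collate(batch):
    texts, labels = zip(*batch)                       # batch: list of (text, label_id) pairs
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def finetune(train_pairs, epochs=3, batch_size=32):
    loader = DataLoader(train_pairs, batch_size=batch_size,
                        shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss                # cross-entropy on the [CLS] logits
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```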

2.1 Data Augmentation

Data augmentation can be divided into sample augmentation and embedding augmentation. Sample augmentation transforms the surface form of the text while keeping its semantics unchanged, e.g., back translation, synonym substitution, or random deletion. Embedding augmentation mainly includes Mixup and adversarial training; Mixup has a series of NLP variants such as SeqMix and Manifold Mixup. Data augmentation makes the model more robust, encourages it to focus on the semantics of the text, and makes it less sensitive to local noise. In few-shot scenarios, text augmentation can effectively improve both robustness and generalization.

2.1.1 Sample augmentation

2.1.1.1 Easy Data Augmentation (EDA)

EDA [1] replaces some words/phrases in a sentence using a knowledge base, with the following four operations (a code sketch follows the list):

  • Synonym Replacement (SR): randomly select n non-stop words from the sentence and replace each with a randomly chosen synonym from a thesaurus.
  • Random Insertion (RI): randomly pick a non-stop word in the sentence, pick a random synonym of it, and insert the synonym at a random position in the sentence; repeat n times.
  • Random Swap (RS): randomly choose two words in the sentence and swap their positions; repeat n times.
  • Random Deletion (RD): delete each word in the sentence independently with probability p.
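A minimal sketch of the four operations on a tokenized sentence; the `synonyms` dictionary is a stand-in for whatever thesaurus is actually used, and stop-word filtering is omitted for brevity.

```python
import random

def eda(tokens, synonyms, n=1, p=0.1):
    """Four EDA operations on a token list; `synonyms` maps word -> list of synonyms."""

    def synonym_replacement(toks):
        out = list(toks)
        candidates = [i for i, w in enumerate(out) if w in synonyms]
        for i in random.sample(candidates, min(n, len(candidates))):
            out[i] = random.choice(synonyms[out[i]])
        return out

    def random_insertion(toks):
        out = list(toks)
        for _ in range(n):
            with_syn = [w for w in out if w in synonyms]
            if not with_syn:
                break
            out.insert(random.randrange(len(out) + 1),
                       random.choice(synonyms[random.choice(with_syn)]))
        return out

    def random_swap(toks):
        out = list(toks)
        for _ in range(n):
            if len(out) < 2:
                break
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
        return out

    def random_deletion(toks):
        kept = [w for w in toks if random.random() > p]
        return kept or [random.choice(toks)]      # never delete the whole sentence

    return [synonym_replacement(tokens), random_insertion(tokens),
            random_swap(tokens), random_deletion(tokens)]
```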
2.1.1.2 Back Translation

Back translation is a data augmentation method that NLP borrows from machine translation; in essence it quickly produces translations to enlarge the data. The original text is translated into another language and then translated back into the original language. Because languages order information differently, back translation often yields new samples that differ noticeably from the original text.
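A sketch of back translation using publicly available MarianMT checkpoints from Hugging Face; the `Helsinki-NLP/opus-mt-*` model names are an assumption, and any translation engine or API can play the same role.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

zh_en_tok, zh_en = load("Helsinki-NLP/opus-mt-zh-en")   # assumed zh->en checkpoint
en_zh_tok, en_zh = load("Helsinki-NLP/opus-mt-en-zh")   # assumed en->zh checkpoint

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_length=128)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(texts):
    # zh -> en -> zh: the round trip keeps the meaning but usually changes
    # the surface form, yielding new training samples.
    return translate(translate(texts, zh_en_tok, zh_en), en_zh_tok, en_zh)
```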

2.1.1.3 Pre-trained language model
  • Contextual augmentation [2]: a trained language model randomly masks a word in the text, the text is fed into the language model, and the top-K words predicted at the masked position replace the original word, yielding K new texts.
  • LAMBADA [3] performs text augmentation with a language generation model. It builds on the pre-trained language model GPT, which captures the structure of language well enough to produce coherent sentences; the model is fine-tuned on the small task-specific dataset and then used to generate new sentences.

2.1.2 Using augmented samples

The methods above generate a batch of augmented text that is plentiful but noisy, whereas the original labeled data is scarce but clean. A NAACL 2021 paper [4] proposes learning from both in a curriculum.

Standard Data Augmentation

Train directly on the union of the raw data and the augmented data.

Curriculum Data Augmentation

First learn the clean labeled data, and only after acquiring some knowledge move on to the noisier augmented data. There are two schedules (a sketch of both follows the list):

  1. Two-stage: first train on the raw data; once the development set converges, continue training on the raw and augmented data together.
  2. Gradual: first train on the raw data, then add augmented data gradually in a linear schedule controlled by a difficulty/perturbation parameter $\tau$.
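A sketch of the two schedules; `train_epoch` and `dev_converged` are hypothetical helpers standing in for a real training loop and early-stopping check.

```python
import random

def two_stage(model, raw_data, aug_data, train_epoch, dev_converged, extra_epochs=3):
    # Stage 1: clean labeled data only, until the dev set converges.
    while not dev_converged(model):
        train_epoch(model, raw_data)
    # Stage 2: continue on the union of raw and augmented data.
    for _ in range(extra_epochs):
        train_epoch(model, raw_data + aug_data)

def gradual(model, raw_data, aug_data, train_epoch, num_epochs=10):
    # Linearly grow the share of (noisier) augmented data, controlled by tau in [0, 1].
    for epoch in range(num_epochs):
        tau = epoch / max(num_epochs - 1, 1)
        k = int(tau * len(aug_data))
        train_epoch(model, raw_data + random.sample(aug_data, k))
```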

2.1.3 Embedding augmentation

2.1.3.1 Mixup

Natural language is compositional: properly recombined sentences can still be understood. Consider the example below, where the two sentences on the left are original text and the understandable sample on the right [5] is produced by substituting/recombining some of their parts.

Mixup [6][7] proposes a more general vector-based augmentation: it samples any two training examples and constructs a mixed sample and mixed label as new augmented data, where $(x_i, y_i)$ and $(x_j, y_j)$ are the original samples and $(\hat{x}, \hat{y})$ is the new sample generated by recombination. When $\lambda$ is restricted to $\{0, 1\}$, the combinations in Figure 4 are produced.


$$\hat{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda)\, y_j$$

In the experiments, $\lambda$ is sampled from a Beta distribution:


$$\lambda \sim \text{Beta}(\alpha, \alpha)$$

Intuitively, Mixup requires that when the model's input is a linear combination of two inputs, its output is also the corresponding linear combination of the outputs the model would give each input separately; this effectively pushes the model towards an approximately linear system and prevents over-fitting, so the Mixup transformation itself can be regarded as a regularization technique. Mixup has several variants, including SeqMix and Manifold Mixup.
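A minimal sketch of the interpolation above on embedding batches and one-hot labels; training then uses a soft-label cross-entropy.

```python
import torch

def mixup(emb_i, emb_j, y_i, y_j, alpha=0.4):
    """Mix two batches of embeddings and one-hot labels.
    emb_*: (batch, ...) float tensors; y_*: (batch, num_classes) one-hot tensors."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # lambda ~ Beta(alpha, alpha)
    x_hat = lam * emb_i + (1 - lam) * emb_j
    y_hat = lam * y_i + (1 - lam) * y_j
    return x_hat, y_hat

def soft_cross_entropy(logits, soft_targets):
    # cross-entropy against the mixed (soft) labels y_hat
    return -(soft_targets * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```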

a. SeqMix

SeqMix [8] scores the sentences generated by Mixup with a discriminator that measures their perplexity, and keeps only low-perplexity sentences to reduce the influence of noisy ones. For token-level tasks such as named entity recognition, it also mixes subsequences rather than entire sentences.

  • Figure 5 (a): the original Mixup, which mixes the entire sequence to generate a new sequence.
  • Figure 5 (b): subsequence Mixup, which mixes valid subsequences and substitutes them back into the two original sequences, generating two new sequences.
  • Figure 5 (c): label-constrained subsequence Mixup, which mixes only subsequences whose labels belong to the same category and substitutes them back into the two original sequences, generating two new sequences.

b. Manifold Mixup

Manifold Mixup [9] generalizes the Mixup operation above to hidden features. Features carry higher-order semantic information, and interpolating in feature space can produce more meaningful samples. In a BERT-like model, a layer $k$ is randomly selected and the feature representations of that layer are mixed, as follows (a code sketch follows the list):

  • Randomly select a layer $k$ of the network (the input layer included).
  • Feed two batches of data into the network and forward-propagate them to layer $k$, obtaining hidden representations $(g_k(x), y)$ and $(g_k(x'), y')$.
  • Apply Mixup to generate new samples $(\tilde{g}_k, \tilde{y}) = (\text{Mixup}_{\lambda}(g_k(x), g_k(x')),\ \text{Mixup}_{\lambda}(y, y'))$.
  • Continue forward propagation to get the output.
  • Calculate the loss value and gradient.
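A sketch of one Manifold Mixup step, assuming the encoder is exposed as a plain list of layers that each map a hidden tensor to the next one (real BERT layers also take attention masks, which are omitted here for brevity).

```python
import random
import torch

def manifold_mixup_step(encoder_layers, classifier, x1, y1, x2, y2, alpha=0.4):
    """One Manifold Mixup forward pass.
    encoder_layers: list of modules (index 0 = embedding/input layer);
    x1, x2: two input batches; y1, y2: one-hot label batches."""
    k = random.randrange(len(encoder_layers))          # randomly chosen mixing layer
    lam = torch.distributions.Beta(alpha, alpha).sample()

    h1, h2 = x1, x2
    for layer in encoder_layers[:k + 1]:               # forward both batches to layer k
        h1, h2 = layer(h1), layer(h2)

    h = lam * h1 + (1 - lam) * h2                      # mix hidden states
    y = lam * y1 + (1 - lam) * y2                      # mix labels

    for layer in encoder_layers[k + 1:]:               # continue forward propagation
        h = layer(h)
    logits = classifier(h)
    loss = -(y * torch.log_softmax(logits, dim=-1)).sum(-1).mean()
    return loss
```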
2.1.3.2 Adversarial training

Adversarial training (AT) [10] constructs adversarial samples by adding small input perturbations that significantly increase the model's loss, and then trains a model that handles both the original and the adversarial samples well. If the model stays smooth even under adversarial noise, the whole network exhibits good consistency. When adversarial training is applied to a classifier, the corresponding loss (an extra term added to the original loss) is:


$$-\log p(y \mid x + r_{adv}; \theta), \quad \text{where } r_{adv} = \arg\min_{r,\ \|r\| \le \epsilon} \log p(y \mid x + r; \tilde{\theta})$$

where $x$ is the input sequence, $\theta$ the model parameters, $r$ a perturbation of the input, and $\tilde{\theta}$ indicates that the current model parameters are treated as constants, i.e., back-propagation does not update them while constructing the adversarial sample. $r_{adv}$ is approximated as:


$$r_{adv} = -\epsilon \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x \log p(y \mid x; \tilde{\theta})$$
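A sketch of applying this idea to a BERT-style classifier in the common FGM style, where the perturbation is built from the gradient of the loss and added to the word-embedding table; `model.get_input_embeddings()` follows the Hugging Face interface and the step structure is illustrative.

```python
import torch

def adversarial_step(model, batch, optimizer, epsilon=1.0):
    """One training step with an FGM-style adversarial perturbation on the
    word-embedding table (a common way to apply adversarial training to text)."""
    emb = model.get_input_embeddings().weight

    loss = model(**batch).loss                 # 1) clean forward/backward
    loss.backward()

    grad = emb.grad.detach()
    norm = grad.norm()
    if norm > 0 and not torch.isnan(norm):
        r_adv = epsilon * grad / norm          # step of size eps along the loss gradient
        emb.data.add_(r_adv)                   # 2) perturb embeddings
        adv_loss = model(**batch).loss         # 3) adversarial forward/backward
        adv_loss.backward()
        emb.data.sub_(r_adv)                   # 4) restore embeddings

    optimizer.step()                           # update on clean + adversarial gradients
    optimizer.zero_grad()
```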
2.1.3.3 Contrastive learning: R-Drop

Dropout is a commonly used regularization method for deep models. Regularized Dropout (R-Drop) [11] feeds the same sentence through the model twice with dropout and forces the output distributions of the two resulting submodels to be consistent. Concretely, for every training sample, R-Drop minimizes the KL divergence between the output distributions produced under different dropout masks.

The left part of Figure 6 shows each input sample $(x, y)$ passing through the model twice to produce two probability distributions; the right part shows that, due to the randomness of dropout, the two passes correspond to two different submodels.

Given training data $D = \{(x_i, y_i)\}_{i=1}^{n}$ and a model $P^w(y \mid x)$, the cross-entropy loss on the training data is:


$$L_{CE} = \frac{1}{n}\sum_{i=1}^{n} -\log P^w(y_i \mid x_i)$$

Different dropout masks yield two submodels $P_1^w(y \mid x)$ and $P_2^w(y \mid x)$; R-Drop minimizes the bidirectional KL divergence between their output distributions:


$$L_{KL}^i = \frac{1}{2}\left(D_{KL}\!\left(P_1^w(y_i \mid x_i)\,\|\,P_2^w(y_i \mid x_i)\right) + D_{KL}\!\left(P_2^w(y_i \mid x_i)\,\|\,P_1^w(y_i \mid x_i)\right)\right)$$

For a training example $(x_i, y_i)$, the final loss is:


$$L^i = L_{CE}^i + \alpha L_{KL}^i = -\log P_1^w(y_i \mid x_i) - \log P_2^w(y_i \mid x_i) + \frac{\alpha}{2}\left(D_{KL}\!\left(P_1^w(y_i \mid x_i)\,\|\,P_2^w(y_i \mid x_i)\right) + D_{KL}\!\left(P_2^w(y_i \mid x_i)\,\|\,P_1^w(y_i \mid x_i)\right)\right)$$
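A minimal sketch of this loss for a Hugging Face-style classifier whose forward pass returns `logits`; `alpha` is the KL weight.

```python
import torch.nn.functional as F

def r_drop_loss(model, batch, labels, alpha=4.0):
    """R-Drop: run the same batch twice (dropout gives two submodels) and add a
    symmetric KL term between the two output distributions."""
    logits1 = model(**batch).logits
    logits2 = model(**batch).logits            # second pass, different dropout mask

    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)

    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl
```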

2.2 Semi-supervised learning

In real business there is a large amount of unlabeled data, and annotating it requires substantial manpower. The core goal of semi-supervised learning is to train models with strong generalization ability from a small amount of labeled data plus a large amount of unlabeled data, and to use them to solve practical problems.

  • Input data: a large amount of unlabeled data and a small amount of labeled data from the same domain.

  • Semi-supervised learning mainly rests on the following principles:

  1. Consistency regularization: augmented data is fed into the model and its predictions are required to agree with the predictions on the original data, i.e., to be self-consistent.
  2. Entropy minimization: by the low-density separation assumption, a classifier's decision boundary should not pass through high-density regions of the marginal data distribution; in practice this is enforced by making the classifier produce low-entropy predictions on unlabeled data.
  3. Traditional regularization: standard techniques that improve generalization and prevent over-fitting, such as L2 regularization.
  • How to construct the consistency-regularization term is the focus of many semi-supervised models.
    • Temporal Ensembling, Mean Teacher and MixTemporal all exploit the ensembling of historical models, building consistency regularizers from the predictions or parameters of past models.
    • VAT builds the regularizer by adding perturbations to the embeddings.
    • MixMatch, MixText and UDA rely on sample augmentation: MixMatch uses image rotation, scaling, etc., while MixText and UDA depend on consistency between samples generated via back translation, so their results largely depend on the quality of the augmentation.

2.2.1 Temporal Ensembling

Temporal Ensembling [12] uses a temporal ensemble: the exponential moving average (EMA) of historical predictions serves as the pseudo-label of unlabeled data, against which a consistency regularizer with the current predictions is built.

  • Calculate cross entropy loss with labeled data.
  • For unlabeled data, the exponential moving average (EMA) of predictions from multiple past epochs is used as the reconstruction target of an MSE loss; this avoids the large error a single one-off prediction would introduce as the target and helps smooth the noise of individual predictions.
  • The final loss function is the weighted sum of cross entropy loss of labeled data and mean square error of unlabeled data.

where $x_i$ ranges over both labeled and unlabeled texts, $y_i$ is the label of a labeled example, $z_i$ is the current model's prediction for $x_i$, $\tilde{z}_i$ is the moving-average probability distribution of the predictions on unlabeled data from previous epochs, and $w(t)$ is the weight of the unlabeled MSE term. At the beginning of training $w(t) \rightarrow 0$ and the model mainly learns from labeled data; as iterations continue, $w(t)$ slowly increases and the model learns from the large amount of unlabeled data.
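A sketch of the accumulator and loss, assuming samples are indexed so their EMA targets can be looked up per batch; the ramp-up schedule of $w(t)$ is left to the caller.

```python
import torch
import torch.nn.functional as F

class TemporalEnsembling:
    """Keeps an EMA of per-sample predictions across epochs (a sketch; the bias
    correction and ramp-up follow the paper's spirit, not its exact code)."""
    def __init__(self, num_samples, num_classes, momentum=0.6):
        self.Z = torch.zeros(num_samples, num_classes)   # EMA accumulator
        self.momentum = momentum

    def targets(self, epoch):
        # bias-corrected EMA targets z_tilde
        return self.Z / (1.0 - self.momentum ** max(epoch, 1))

    def update(self, idx, probs):
        self.Z[idx] = self.momentum * self.Z[idx] + (1 - self.momentum) * probs.detach()

def temporal_loss(logits, labels, labeled_mask, ema_targets, w_t):
    # labels are only valid where labeled_mask is True
    probs = F.softmax(logits, dim=-1)
    ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    mse = F.mse_loss(probs, ema_targets)          # consistency with EMA pseudo-labels
    return ce + w_t * mse
```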

2.2.2 Mean Teacher

Mean Teacher [13] follows essentially the same idea as Temporal Ensembling, which has to keep a matrix of historical prediction probabilities. Mean Teacher instead maintains an exponential moving average (EMA) of the model parameters as a Teacher model, while the original model acts as the Student model; a consistency regularizer is built between the predictions of the Teacher and the Student.

  • In Temporal Ensembling, the target labels of unlabeled data come from a weighted average of the model's predictions over previous epochs; in Mean Teacher, they come from the predictions of the Teacher model.
  • Because the pseudo-labels are obtained by averaging model parameters, unlabeled information can be folded into the model at every step, instead of waiting until the end of an epoch as in Temporal Ensembling (see the sketch after this list).
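A sketch of the Teacher update performed after each optimizer step; the consistency loss between Student and Teacher predictions is computed as usual and omitted here.

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, decay=0.999):
    """Mean Teacher: after every optimizer step, move the teacher's parameters
    towards the student's with an exponential moving average."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)
```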

2.2.3 VAT

Virtual Adversarial Training (VAT) [14] differs from Temporal Ensembling in that the latter perturbs unlabeled data with data augmentation and Dropout, whereas VAT injects noise in the direction that changes the model output most sharply, i.e., adversarial noise. If the model stays smooth even under adversarial noise, the whole network exhibits good consistency. Virtual adversarial training extends adversarial training to the semi-supervised setting by adding a regularizer that forces the output distribution on a sample to match that of its perturbed version.

  • First, a random standard normal perturbation $d \sim \mathcal{N}(0, 1)$ is drawn for the unlabeled data and added to the embeddings, and the gradient of the KL divergence with respect to it is computed.
  • The gradient is then used to compute the adversarial perturbation, which is used for adversarial training.
  • The final loss is the cross-entropy loss on labeled data plus the adversarial consistency loss on unlabeled data (see the sketch after this list).
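A sketch of the virtual adversarial loss on unlabeled data, assuming a Hugging Face-style model that accepts `inputs_embeds` (the embeddings would be obtained via `model.get_input_embeddings()(input_ids)`); `xi`, `eps` and the single power-iteration step are illustrative choices.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, embeddings, attention_mask, xi=1e-6, eps=2.0, n_power=1):
    """Virtual adversarial loss on a batch of unlabeled embeddings (sketch)."""
    with torch.no_grad():                          # fixed target distribution p(y|x)
        p = F.softmax(model(inputs_embeds=embeddings,
                            attention_mask=attention_mask).logits, dim=-1)

    d = torch.randn_like(embeddings)               # random initial perturbation
    for _ in range(n_power):                       # power iteration to approximate r_adv
        d = xi * F.normalize(d, dim=-1)
        d.requires_grad_()
        logp_hat = F.log_softmax(model(inputs_embeds=embeddings + d,
                                       attention_mask=attention_mask).logits, dim=-1)
        adv_dist = F.kl_div(logp_hat, p, reduction="batchmean")
        d = torch.autograd.grad(adv_dist, d)[0].detach()

    r_adv = eps * F.normalize(d, dim=-1)           # adversarial perturbation
    logp_hat = F.log_softmax(model(inputs_embeds=embeddings + r_adv,
                                   attention_mask=attention_mask).logits, dim=-1)
    return F.kl_div(logp_hat, p, reduction="batchmean")
```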

2.2.4 MixMatch

MixMatch [15] is a holistic method that integrates data augmentation, Mixup, sharpening and other techniques; the two modules that matter most are Mixup and sharpening.

  • For each unlabeled example, $K$ augmented copies are generated; in the image domain they come from rotation, scaling, etc.
  • Given labeled data $X = (x_b, p_b)$ and unlabeled data $U = (u_b)$, the guessed label $q_b$ of an unlabeled example (Figure 9) is the average of the model's predictions over the example and its $K$ augmentations.
  • $X = (x_b, p_b)$ and $U = (u_b, q_b)$ are then mixed with Mixup.
  • The cross-entropy loss on labeled data and the consistency regularization loss on unlabeled data are computed.

2.2.5 MixText

MixText [16] follows the same overall idea as MixMatch. Unlabeled data is first augmented by back translation, the weighted sum of the predictions on the augmented data and the original data is used as the label of the unlabeled data, and then both labeled and unlabeled data go through Mixup.

  • Unlabeled data is augmented by back translation.
  • The weighted sum of the predictions on the augmented data and the original data is used as the label of the unlabeled data.
  • Both labeled and unlabeled data go through Mixup.
  • Learn the cross-entropy loss on labeled data and the consistency regularization on unlabeled data.

where $x_l$ is a labeled sample, $x_u$ an unlabeled sample, and $x_a$ the back-translated augmentation of the unlabeled sample. As in Manifold Mixup, Mixup is applied to the hidden representations at layer $m$ of the model, which can mine implicit relationships between sentences. By interpolating labeled and unlabeled data together, information from unlabeled sentences is exploited while the labeled sentences are being learned.

2.2.6 MixTemporal

MixTemporal draws on the ideas of MixMatch and MixText, but instead of obtaining predictions for unlabeled data through EDA or back translation, it follows the ensembling idea and uses the exponential moving average (EMA) of the predictions from multiple historical epochs as the pseudo-labels of unlabeled data. It needs no extra thesaurus or translation tool, so it is easier to implement.

The final loss consists of two parts: the cross-entropy loss after Mixup plus the consistency regularization loss on unlabeled data.

where $x_i$ is a labeled sample, $y_i$ its label, $z_i$ the current model's prediction for $x_i$, $\tilde{z}_i$ the moving-average probability distribution of the predictions on unlabeled data over multiple epochs, and $w(t)$ the weight of the unlabeled MSE term.

2.2.7 UDA

UDA [17], from Google, also uses consistency regularization. For images, UDA adopts the high-quality augmentation method RandAugment; for text, it uses back translation and non-core word replacement, where the TF-IDF value measures how important a word is to a text and decides whether it should be replaced, combined with back translation.

Training tricks (a sketch of the first two follows the list):

  • Confidence-based masking: only unlabeled examples whose predicted confidence exceeds a threshold contribute to the consistency term.
  • Sharpening predictions: make the predicted probability distribution more peaked (e.g., with a low softmax temperature).
  • Domain-relevance data filtering: remove data that is irrelevant to the target domain, again selected based on confidence.
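A sketch of the first two tricks combined into the unlabeled consistency term; the temperature and threshold values are illustrative.

```python
import torch
import torch.nn.functional as F

def uda_consistency_loss(logits_orig, logits_aug, temperature=0.4, threshold=0.8):
    """UDA-style consistency term (sketch): sharpen the prediction on the original
    unlabeled text and mask out low-confidence examples."""
    with torch.no_grad():
        target = F.softmax(logits_orig / temperature, dim=-1)     # sharpened target
        conf, _ = F.softmax(logits_orig, dim=-1).max(dim=-1)
        mask = (conf >= threshold).float()                        # confidence-based masking

    log_probs_aug = F.log_softmax(logits_aug, dim=-1)
    kl = F.kl_div(log_probs_aug, target, reduction="none").sum(dim=-1)
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```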

2.3 Ensemble learning + self-training

In supervised learning the goal is a single stable model that performs well in every respect, but reality is rarely so ideal and often only several models with different preferences can be obtained (weakly supervised models that do better on some aspects). Ensemble learning combines multiple such weak models in the hope of obtaining a better, more comprehensive strong model; its underlying idea is that even if one weak classifier makes a wrong prediction, the others can correct it. The greater the diversity among the combined models, the better the ensemble usually performs.

Self-training jointly exploits a small amount of labeled data and a large amount of unlabeled data: a classifier trained on the labeled data first predicts all the unlabeled data, the high-confidence predictions are kept as pseudo-labeled data, and the pseudo-labeled data is then combined with the manually labeled data to retrain the classifier. A sketch of both stages follows the list below.

  • Ensemble learning: train several different models, e.g., a BERT model, a Mixup model, and a semi-supervised model.
    • Each model predicts the label probability distribution of the unlabeled data pool $U$.
    • The soft predictions for pool $U$ are obtained as the weighted sum of the models' label distributions.
  • Self-training: train one model that absorbs the other models.
    • The Student model learns from the high-confidence soft predictions on data pool $U$.
    • The Student model serves as the final strong learner.
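A sketch of the two stages, assuming each model follows the Hugging Face interface and `weights` are the (manually chosen) ensemble weights; the Student is then trained on the labeled data plus the returned pseudo-labeled pairs with a soft-label cross-entropy.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_soft_labels(models, weights, unlabeled_loader):
    """Weighted average of several models' predicted distributions on pool U."""
    all_batches, all_probs = [], []
    for batch in unlabeled_loader:
        probs = sum(w * F.softmax(m(**batch).logits, dim=-1)
                    for m, w in zip(models, weights))       # weights assumed to sum to 1
        all_batches.append(batch)
        all_probs.append(probs)
    return all_batches, all_probs

def select_pseudo_labels(batches, probs_list, threshold=0.9):
    """Keep only high-confidence soft predictions as pseudo-labeled training data."""
    pseudo = []
    for batch, probs in zip(batches, probs_list):
        conf, _ = probs.max(dim=-1)
        keep = conf >= threshold
        if keep.any():
            pseudo.append(({k: v[keep] for k, v in batch.items()}, probs[keep]))
    return pseudo   # the Student model is trained on labeled data + these pairs
```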

2.4 Domain Transfer

Domain transfer mainly addresses settings with few samples and many categories.

MAML [18]: a widely cited method published in 2017; its training procedure is relatively complex and it was later surpassed by Meta-Baseline in several domains.

Meta-Baseline [19]: published in 2020, it greatly outperforms the former while remaining simple, hence the name "new baseline". Meta-Baseline first pre-trains a classifier on all base classes and then meta-learns a few-shot classification algorithm based on nearest centroids (a sketch follows the list below).

  • Classifier-Baseline: pre-train a classifier with cross-entropy loss on all base classes and remove the last FC layer to obtain the feature encoder $f_\theta$.
  • Meta-Baseline
    • Pre-training stage: the Classifier-Baseline model.
    • Meta-learning stage: given a few-shot task (N-way K-shot), compute the mean feature of each category $w_c = \frac{1}{|S_c|}\sum_{x \in S_c} f_\theta(x)$.
    • For each sample in the query set, cosine similarity to each category centroid is used as the classification score, which reduces intra-class variance.
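A sketch of the meta-learning stage for one N-way K-shot task; `f_theta` is the pre-trained encoder with its last FC layer removed, and `tau` is a scaling temperature (learnable in the paper, fixed here).

```python
import torch
import torch.nn.functional as F

def meta_baseline_logits(f_theta, support_x, support_y, query_x, n_way, tau=10.0):
    """Nearest-centroid classification with cosine similarity (Meta-Baseline sketch)."""
    z_support = f_theta(support_x)                           # (n_way * k_shot, dim)
    z_query = f_theta(query_x)                               # (n_query, dim)

    # w_c = mean of the support features belonging to class c
    centroids = torch.stack([z_support[support_y == c].mean(dim=0)
                             for c in range(n_way)])         # (n_way, dim)

    # cosine similarity between each query and each class centroid
    sims = F.normalize(z_query, dim=-1) @ F.normalize(centroids, dim=-1).t()
    return tau * sims            # scaled similarities used as classification logits
```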

2.5 Active learning

Active learning queries the most informative unlabeled samples with some algorithm and hands them to experts for labeling; the queried samples are then used to train the classification model and improve its accuracy. The unlabeled data surfaced by the model are "hard samples", and different definitions of "hard" give rise to many methods: a hard sample can be an ambiguous sample, i.e., the one the model finds hardest to distinguish, or the sample that would improve (change) the model the most, e.g., the one with the largest gradient. Compared with plain supervised learning, active learning focuses the model on these hard samples, obtaining a better model from fewer training samples.

  • The query strategy is the core of active learning, and uncertainty sampling is the most common one. Its key is how to characterize the uncertainty of a sample, typically with the following ideas:
    • Least confidence: select the sample whose most probable label has the lowest predicted probability. For example, with three classes, if two samples have predicted distributions (0.8, 0.1, 0.1) and (0.51, 0.31, 0.18), the second is "harder" to distinguish and therefore more worth labeling.
    • Margin sampling: select the samples most likely to be assigned to either of two classes, i.e., those whose top two class probabilities are closest; the model picks the sample with the smallest gap between its largest and second-largest predicted probabilities.
    • Entropy: entropy measures the uncertainty of a system; higher entropy means a more uncertain prediction, so the samples with the highest prediction entropy are selected for annotation.
  • Iteration procedure: the inputs are an initial small labeled set $L_0$, an unlabeled data pool $U$, and a deep learning model $M$.

    1. Initialize the labeled set $L \leftarrow L_0$.
    2. Train model $M$ on $L$ and predict on the unlabeled pool $U$.
    3. Use the query strategy to select samples from $U$ for annotation and add them to the labeled set $L$.
    4. Repeat steps 2-3 until the target accuracy is reached or the annotation budget is exhausted (a sketch of one query round follows).
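A sketch of one query round implementing the three uncertainty measures above; the loop over rounds, manual annotation and retraining happen outside this function.

```python
import torch
import torch.nn.functional as F

def uncertainty_scores(model, pool_loader, strategy="entropy"):
    """Score unlabeled samples; higher score = harder / more worth annotating."""
    scores = []
    model.eval()
    with torch.no_grad():
        for batch in pool_loader:
            probs = F.softmax(model(**batch).logits, dim=-1)
            if strategy == "least_confidence":
                scores.append(1.0 - probs.max(dim=-1).values)
            elif strategy == "margin":
                top2 = probs.topk(2, dim=-1).values
                scores.append(-(top2[:, 0] - top2[:, 1]))    # small margin = hard
            else:  # entropy
                scores.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1))
    return torch.cat(scores)

def active_learning_round(model, pool_loader, budget=500):
    scores = uncertainty_scores(model, pool_loader)
    return scores.topk(budget).indices     # send these samples for annotation,
                                           # then add them to L and retrain M
```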

3 Application Practice

Figure 16 shows fine-tuning based on BERT [20], covering the sentence-pair (inter-sentence relationship) task and the single-sentence classification task.

  1. Single-sentence classification: the NSP task in BERT pre-training makes the output at the "[CLS]" position summarize the whole sentence (pair); this output is used to fine-tune the model on labeled data and to give predictions. Typical single-sentence classification tasks include text classification and sentiment classification.
  2. Sentence-pair (inter-sentence relationship) task: two sentences are concatenated with "[SEP]" and the relationship between them is judged from the output at the "[CLS]" position. Typical sentence-pair tasks include natural language inference and semantic similarity judgment.

3.1 Experimental Results

We mentioned two evaluation criteria above: improving algorithm performance and saving annotation data. We ran fine-tuning experiments on Meituan business data and on public benchmark datasets.

3.1.1 Few-shot learning improves algorithm performance

Figures 17 and 18 list the results of four embedding-augmentation methods, two semi-supervised methods, and the ensemble learning + self-training model. We also tried other methods mentioned above, such as UDA and MixText, but they rely on external translation software and their results were unstable, so we mainly use the models above. The comparison is as follows:

  • Data augmentation: among the four embedding-augmentation results, adversarial training (AT) is the most stable and improves the model by about 1pp on average.
  • Semi-supervised learning: compared with data augmentation, whose results are stable, semi-supervised learning improves the model by 1.5pp-2pp, and by 2pp-4pp on the AFQMC dataset.
  • Ensemble learning + self-training: fusing the predictions of the single models on unlabeled data basically reaches or approaches the best single result and improves the model by 1.5pp-2pp on average.

3.1.2 Active learning reduces training samples

Active learning selects low-confidence samples as the next batch to annotate, which avoids labeling repeated or similar samples and reduces annotation cost; it can also be used to pick the cold-start samples when a new project starts annotating data from scratch.

  • On the current datasets, 500 samples selected by active learning reach the results of 1,000 randomly sampled ones on average, and 900 selected samples approach the results of 1,500 random samples.

3.2 Application in Meituan business

3.2.1 Medical aesthetics topic classification

Notes on Meituan and Dianping are divided into eight topic categories: novelty-seeking, store visits, reviews, real-person cases, treatment process, pitfall avoidance, effect comparison, and popular science. When a user taps a topic, the corresponding notes are returned; they surface on the encyclopedia and plan pages of the Meituan App and on the medical aesthetics channel of the Dianping App. This is a typical text classification task; with 2,989 training examples, few-shot learning reaches 89.24% accuracy.

3.2.2 Travel guide recognition

Travel guides are mined from UGC and notes to supply guide content, which powers the guide module under scenic-spot search by recalling notes that describe travel tips. This is a binary classification task; with 384 training examples, few-shot learning reaches 87% accuracy.

3.2.3 Medical aesthetics efficacy tagging

Notes from Meituan and Dianping are recalled by efficacy; the efficacy types include hydration, whitening, face slimming, wrinkle removal, etc., and the results go live on the medical aesthetics channel page. This is a sentence-pair task: the full corpus contains 1.04 million notes and there are 110 efficacy types to tag. With only 2,909 training examples, few-shot learning reaches 91.88% accuracy.

3.2.4 Medical aesthetics brand tagging

Upstream brand owners want publicity and marketing for their products, and content marketing is one of the mainstream, effective approaches. Brand tagging recalls, for each brand such as "Yifuquan" and "Shuweike", the notes that describe that brand in detail: brand words are first used for matching, and then the relevance between the brand word and the matched note is judged, i.e., whether the note describes the brand in detail or merely mentions its name, before the results go live in the medical aesthetics brand hall. This is a sentence-pair task with 103 brands; each of the 15 major brands has 64 labeled examples and the remaining brands have 5-8 each. With only 1,676 training examples, few-shot learning reaches 88.59% accuracy.

3.2.5 Other business applications

  • Authenticity of medical aesthetics reviews: fake reviews in the Dianping and Meituan medical aesthetics business seriously harm the user experience and need to be detected by a model. Annotation data was selected with active learning, and the model was optimized with data augmentation, semi-supervised learning, and ensemble learning + self-training. In the end the business side annotated only 1,757 examples, and the model reached 95.88% accuracy at detecting perceivable fake reviews, exceeding business expectations.
  • POI tag sentiment analysis: a sentence-pair task that judges the sentiment of content towards a Query (positive, negative, uncertain, or irrelevant). The existing model was trained on 10,000 examples; few-shot learning improved its accuracy by 0.63pp.
  • Text classification: a text classification task with 17 categories. The existing model was trained on 700 examples; few-shot learning improved its accuracy by 2.5pp.

4 Future Prospects

  1. Keep improving existing models and explore more of them. The current results still leave plenty of room for improvement, so the models need continuous exploration and refinement. We will also explore more domain-transfer models and apply them to business, so that the business side can reach the best results with the least data.
  2. Experiment on more task types. Experiments so far focus on single-sentence and sentence-pair classification; machine reading comprehension (MRC) and named entity recognition models are the next step.
  3. Explore domain transfer in depth and train general models. Because we work with many businesses, we have accumulated text classification and sentence-pair datasets in many domains. We hope to train a general model for the tasks of a domain so that a new business can reach good results with little data; for example, Facebook's EFL [21] reformulates text classification and sentence-pair tasks as a general textual entailment task, which can be transferred directly to new businesses.
  4. Build a few-shot learning platform. We are integrating the few-shot learning capabilities into the company's unified BERT platform, open to all business teams for flexible use. After deeper exploration of few-shot learning, we will try to build a dedicated few-shot learning platform offering more low-resource learning capabilities.

References

  • [1] Wei J, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[J]. arXiv preprint arXiv:1901.11196, 2019.
  • [2] Kobayashi S. Contextual augmentation: Data augmentation by words with paradigmatic relations[J]. arXiv preprint arXiv:1805.06201, 2018.
  • [3] Anaby-Tavor A, Carmeli B, Goldbraich E, et al. Not Enough Data? Deep Learning to the Rescue![J]. arXiv preprint arXiv:1911.03118, 2019.
  • [4] Wei J, Huang C, Vosoughi S, et al. Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning[J]. arXiv preprint arXiv:2103.07552, 2021.
  • [5] Andreas J. Good-enough compositional data augmentation[J]. arXiv preprint arXiv:1904.09545, 2019.
  • [6] Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017.
  • [7] Jiang D, Wang Y, Wang Y, et al.
  • [8] Zhang R, Yu Y, Zhang C. SeqMix: Augmenting active sequence labeling via sequence mixup[J]. arXiv preprint arXiv:2010.02322, 2020.
  • [9] Verma V, Lamb A, Beckham C, et al. Manifold mixup: Better representations by interpolating hidden states[C]//International Conference on Machine Learning. PMLR, 2019: 6438-6447.
  • [10] Miyato T, Dai A M, Goodfellow I. Adversarial training methods for semi-supervised text classification[J]. arXiv preprint arXiv:1605.07725, 2016.
  • [11] Liang X, Wu L, Li J, et al. R-Drop: Regularized Dropout for Neural Networks[J]. arXiv preprint arXiv:2106.14448, 2021.
  • [12] Laine S, Aila T. Temporal ensembling for semi-supervised learning[J]. arXiv preprint arXiv:1610.02242, 2016.
  • [13] Tarvainen A, Valpola H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results[J]. arXiv preprint arXiv:1703.01780, 2017.
  • [14] Miyato T, Maeda S, Koyama M, et al. Virtual adversarial training: a regularization method for supervised and semi-supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1979-1993.
  • [15] Berthelot D, Carlini N, Goodfellow I, et al. MixMatch: A holistic approach to semi-supervised learning[J]. arXiv preprint arXiv:1905.02249, 2019.
  • [16] Chen J, Yang Z, Yang D. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification[J]. arXiv preprint arXiv:2004.12239, 2020.
  • [17] Xie Q, Dai Z, Hovy E, et al. Unsupervised data augmentation for consistency training[J]. arXiv preprint arXiv:1904.12848, 2019.
  • [18] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//International Conference on Machine Learning. PMLR, 2017: 1126-1135.
  • [19] Chen Y, Wang X, Liu Z, et al. A new meta-baseline for few-shot learning[J]. arXiv preprint arXiv:2003.04390, 2020.
  • [20] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
  • [21] Wang S, et al. Entailment as Few-Shot Learner[J]. arXiv preprint arXiv:2104.14690, 2021.

Author’s brief introduction

Luo Ying, Xu Jun, Xie Rui, Wu Wei, and others, all from the Meituan Search and NLP Department / NLP Center.

Recruitment information

The Meituan Search and NLP Department / NLP Center is the core team responsible for the research and development of Meituan's artificial intelligence technology. Its mission is to build world-class natural language processing technology and services, relying on NLP, deep learning, knowledge graphs and other techniques to process Meituan's massive text data and provide intelligent text semantic understanding services for all of Meituan's businesses.

The NLP Center is looking for an expert in natural language processing/machine learning algorithms. Interested students can send their resumes to [email protected].


This article was produced by the Meituan technical team, and its copyright belongs to Meituan. You are welcome to reprint or use it for non-commercial purposes such as sharing and communication; please credit "Content reproduced from the Meituan technical team". It may not be reproduced or used commercially without permission; for any commercial use, please email [email protected] to request authorization.