Categories: Paper reading notes

Tags: Few-shot learning

Summary: This paper proposes a transfer-learning-based semi-supervised few-shot learning model, which makes full use of the information in the labeled base classes and the unlabeled new-class data.

Abstract

This paper proposes a transfer-learning-based semi-supervised few-shot learning model, which can make full use of the information in both the base classes and the new classes. It consists of three parts: 1. a feature extractor pre-trained on the base classes; 2. using the feature extractor to initialize the weights of the new-class classifier; 3. a semi-supervised learning method that further improves the classifier. The authors propose a new method called TransMatch, which uses imprinting and MixMatch to realize these three parts.

Introduction

The authors first summarize the two main schools of few-shot learning: meta-learning methods and transfer-learning methods.

Meta-learning adopts an episode training strategy. An episode is a batch-like mechanism: it is a subset sampled from the dataset that contains only a few classes with very few samples each, thus mimicking the test situation where very little annotated data is available. The annotated data in an episode is divided into two parts, the support set and the query set. The support set is used to build the model, and the query set is used to evaluate its performance.
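As a concrete illustration of the episode mechanism, here is a minimal sketch of N-way K-shot episode sampling; it assumes the dataset is a dict mapping class labels to lists of samples, and all names here are hypothetical rather than taken from the paper:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """Sample an N-way K-shot episode: a support set and a query set.

    `dataset` is assumed to be a dict {class_label: [samples, ...]}.
    """
    classes = random.sample(list(dataset.keys()), n_way)
    support, query = [], []
    for label in classes:
        samples = random.sample(dataset[label], k_shot + q_queries)
        support += [(x, label) for x in samples[:k_shot]]   # used to build the model
        query += [(x, label) for x in samples[k_shot:]]     # used to evaluate the model
    return support, query
```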

This paper is inspired by the transfer-learning approach: the authors pre-train a model on the base-class data, and then use this model to learn a classifier for the new classes with the help of the unlabeled new-class data.

Main contributions

1. A transfer-learning-based semi-supervised few-shot learning model is proposed, which can make full use of the labeled base-class data and the unlabeled new-class data.

2. A method called TransMatch is developed, which combines the advantages of transfer-learning-based few-shot learning and semi-supervised learning.

3. Extensive experiments are carried out on popular few-shot learning datasets, showing that the method can indeed make full use of the information in unlabeled data.

Related work

1. Few-shot learning

Work on few-shot learning can be divided into two categories: meta-learning-based methods and transfer-learning-based methods.

Meta-learning-based approach: meta-learning-based few-shot learning, also known as learning to learn, aims to learn a paradigm that can recognize new-class tasks from only a small number of samples. Meta-learning consists of two stages, the meta-training stage and the meta-testing stage. These are basically similar to the usual training and testing phases, except that the episode strategy is adopted during training and each class has only a few samples at test time. Meta-learning methods can be divided into two categories: 1. metric-based approaches; 2. optimization-based approaches.

The goal of the metric-based approach is to learn a good metric that measures the distance, or similarity, between support-set and query-set samples.

The purpose of the optimization-based approach is to design an optimization algorithm so that the information learned in training can be applied in the testing stage. (I think that is how we train models in general.)

Transfer-learning-based approach: transfer learning does not use the episode training strategy. It pre-trains the model on a large amount of labeled base-class data and then adapts the pre-trained model to the new-class tasks, which have only a few samples.

2. Semi-supervised learning

Semi-supervised learning can learn from both labeled and unlabeled data. It can be divided into two categories: consistency regularization methods and entropy minimization methods.

Consistency regularization works by adding noise or data augmentation and requiring the model's predictions to remain consistent.

The aim of the entropy minimization method is to reduce the entropy of the model's predictions on unlabeled data.

The MixMatch method used in this paper combines consistency regularization and entropy minimization, and achieves excellent performance.

3. Semi-supervised small sample learning

When the number of labeled samples in a new class is small, it is tempting to use unlabeled data to improve the performance of the model. This idea leads to semi-supervised few-shot learning, and there has been a lot of work in this area, but most of it is based on meta-learning. It is not straightforward to integrate the episode training strategy of meta-learning with semi-supervised learning methods. Meanwhile, transfer-learning methods can achieve performance comparable to meta-learning methods, which is the source of the authors' inspiration. Semi-supervised few-shot learning based on meta-learning has the following shortcomings: 1. its current performance is not optimal; 2. more powerful methods like MixMatch cannot be integrated; 3. directly using semi-supervised learning during testing may lead to worse performance.

Problem definition

Dataset $D_{base}$: the base-class dataset, in which each class contains many annotated samples. The classes it contains are called $C_{base}$. $D_{novel}$: the new-class dataset, in which each class contains only a few labeled samples, but the dataset also contains a large number of unlabeled samples. The classes it contains are called $C_{novel}$. The new classes and the base classes are disjoint.

The authors' goal is to learn a robust classifier that utilizes the small number of labeled samples and the large number of unlabeled samples in the new classes, using the base-class dataset $D_{base}$ as an auxiliary dataset.

Methods

The method proposed by the authors is to pre-train a model on the base-class data. The pre-trained model is then used as a feature extractor to extract features from the few labeled samples of the new classes. These features are used directly as the initial weights of the new-class classifier, which is then further fine-tuned.

Pre-training the feature extractor. The feature extractor is trained on data from the base classes. This serves the same purpose as pre-training in transfer learning: extract as much knowledge as possible from the base classes and then transfer it to the new classes.
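A minimal sketch of what this pre-training stage could look like in PyTorch, assuming a backbone that exposes its feature dimension as `out_dim` and a standard data loader over the labeled base classes (these choices are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

def pretrain_feature_extractor(backbone, num_base_classes, base_loader, epochs=90, lr=0.1):
    """Pre-train the feature extractor with cross-entropy on labeled base-class data."""
    head = nn.Linear(backbone.out_dim, num_base_classes)  # assumes the backbone exposes out_dim
    model = nn.Sequential(backbone, head)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in base_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return backbone  # the base-class head is discarded; only the features are reused
```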

Imprinting the classifier weights. $N$ classes are sampled from the new-class dataset $D_{novel}$, and $K$ labeled samples are sampled from each class, which forms the N-way K-shot problem. This part answers two questions: 1. how are the weights imprinted? 2. what does the classifier actually do? The core formula of weight imprinting is shown in Formula 1.


$$w_c=\frac{1}{K}\sum^{K}_{k=1}f^e(x^c_k)\tag{1}$$

The subscript $c$ denotes the $c$-th class, and $f^e$ denotes the feature extractor obtained in the previous stage. $x^c_k$ is the $k$-th sample of the $c$-th class. Obviously, this takes the average of the features of the N-way K-shot samples and uses this average as the classifier weight.
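A minimal sketch of the weight-imprinting step in Formula 1, assuming the features are L2-normalized before averaging (normalization is standard practice in imprinting, although the formula above does not show it explicitly):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def imprint_weights(feature_extractor, support_images, support_labels, n_way):
    """Compute w_c as the mean (normalized) feature of the K support samples of class c."""
    feats = F.normalize(feature_extractor(support_images), dim=1)  # f^e(x), unit length
    weights = torch.stack([
        feats[support_labels == c].mean(dim=0)  # average over the K shots of class c
        for c in range(n_way)
    ])
    return F.normalize(weights, dim=1)  # re-normalize the averaged prototypes
```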

The classifier actually computes a similarity, as shown in Formula 2.


$$f^{novel}(x)=[\cos(\theta(w_1,x)),\dots,\cos(\theta(w_N,x))]^{\prime}\tag{2}$$

$$f^{novel}(f^e(x))\tag{3}$$

It can be seen from Formulas 2 and 3 that the new-class classifier computes the cosine similarity between the feature of sample $x$ and the average K-shot feature of each class. The class with the highest similarity is the predicted class. However, this only provides an initial value for the classifier weights, which are further fine-tuned in the next stage.
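A minimal sketch of the cosine-similarity classifier of Formulas 2 and 3, initialized with the imprinted weights; the scale factor is an assumption not shown in Formula 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Scores a feature by its cosine similarity to each class weight w_1..w_N."""

    def __init__(self, imprinted_weights, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(imprinted_weights.clone())  # initialized by imprinting
        self.scale = scale  # temperature/scale factor; an assumption, not part of Formula 2

    def forward(self, features):
        # cos(theta(w_c, x)) for every class c, i.e. Formula 2 applied to f^e(x)
        cos = F.linear(F.normalize(features, dim=1), F.normalize(self.weight, dim=1))
        return self.scale * cos  # logits; softmax/argmax gives the predicted class
```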

Fine-tuning phase. The authors use the MixMatch approach to fine-tune the classifier. On the one hand, MixMatch has superior performance on semi-supervised learning tasks; on the other hand, it can make good use of unlabeled data. Denote a batch of labeled data as $\mathcal{L}=\{(x_i,p_i)\}^{B}_{i=1}$ and a batch of unlabeled data as $\mathcal{U}=\{x_u\}^{U}_{u=1}$.

The labels of the unlabeled data can be estimated by the imprinted classifier from the second part. First, each unlabeled sample is augmented to generate $M$ augmented versions, giving the set $\{x_{u,1},\dots,x_{u,M}\}$. These $M$ versions are fed into the same classifier, which produces $M$ different predictions. The average of these $M$ predictions is taken, as shown in Formula 4. Then a sharpening operation (with temperature $T=0.5$) is applied to reduce the entropy of the predictions on unlabeled data; the sharpened result is the final label estimate, as shown in Formula 5.


$$\bar{p}_u=\frac{1}{M}\sum^{M}_{i=1}f(x_{u,i})\tag{4}$$

$$p_u=\bar{p}_{u}^{1/T}\Big/\sum^{N}_{j=1}(\bar{p}_u)^{1/T}_j\tag{5}$$
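A minimal sketch of the label-guessing step of Formulas 4 and 5, assuming `augment` is a stochastic data augmentation and `classifier` maps a batch of images to class logits (both names are hypothetical):

```python
import torch

@torch.no_grad()
def guess_labels(classifier, unlabeled_images, augment, M=2, T=0.5):
    """Average the predictions over M augmentations (Formula 4), then sharpen them (Formula 5)."""
    preds = [classifier(augment(unlabeled_images)).softmax(dim=1) for _ in range(M)]
    p_bar = torch.stack(preds).mean(dim=0)              # Formula 4: mean prediction
    p_sharp = p_bar ** (1.0 / T)                        # raise to the power 1/T
    return p_sharp / p_sharp.sum(dim=1, keepdim=True)   # Formula 5: renormalize over the N classes
```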

The optimization objective includes two parts: a cross-entropy loss and a consistency regularization loss, as shown in Formula 6.


$$loss=-\frac{1}{|\mathcal{X}^{\prime}_1|}\sum_{(x,p)\in \mathcal{X}^{\prime}_1}p\,\log(f(x)) + \frac{1}{N|\mathcal{X}^{\prime}_2|}\sum_{(x,p)\in \mathcal{X}^{\prime}_2}\|p-f(x)\|^2_2\tag{6}$$

$f(\cdot)$ denotes the new-class classifier, which is used to predict the unlabeled data. MixMatch adopts the MixUp data augmentation method, that is, constructing mixed samples and mixed labels. First, $\mathcal{L}$ and $\mathcal{U}$ are concatenated (along axis 0) and shuffled, as shown in Formula 7. The result is called $\mathcal{W}$, which is then split into two parts, as shown in Formula 8. This yields two augmented datasets $\mathcal{X}^{\prime}_1$ and $\mathcal{X}^{\prime}_2$: $\mathcal{X}^{\prime}_1$ mixes the labeled set $\mathcal{L}$ with the first $|L|$ samples of $\mathcal{W}$, and $\mathcal{X}^{\prime}_2$ mixes $\mathcal{U}$ with the remaining $|U|$ samples of $\mathcal{W}$. Therefore, the label $p$ in Formula 6 is a mixed label. As for the $N$ in the second term of Formula 6, it is presumably the number of classes, so the squared error is averaged over the class dimension as in the original MixMatch loss.


$$\mathcal{W}=\text{Shuffle}(\text{Concat}(\mathcal{L},\mathcal{U}))\tag{7}$$

$$\mathcal{X}^{\prime}_{1}=\text{MixUp}\{\mathcal{L}_i,\mathcal{W}_i\}\qquad i\in (1,\dots,|L|) \\ \mathcal{X}^{\prime}_{2}=\text{MixUp}\{\mathcal{U}_i,\mathcal{W}_{i+|L|}\} \qquad i\in (1,\dots,|U|) \tag{8}$$
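A minimal sketch of the MixUp step of Formulas 7 and 8 and of the loss in Formula 6, assuming soft or one-hot label vectors for both batches and a Beta-distributed mixing coefficient as in MixMatch; `alpha` and the unlabeled-loss weight `lambda_u` are assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def mixup(x1, p1, x2, p2, alpha=0.75):
    """MixUp two batches; lam is clamped so the mix stays closer to the first batch (as in MixMatch)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)
    return lam * x1 + (1 - lam) * x2, lam * p1 + (1 - lam) * p2

def transmatch_loss(model, L_x, L_p, U_x, U_p, lambda_u=1.0):
    """Formulas 6-8: build W, split it, mix, and compute cross-entropy + consistency losses."""
    # Formula 7: concatenate the labeled and unlabeled batches along axis 0, then shuffle
    W_x = torch.cat([L_x, U_x], dim=0)
    W_p = torch.cat([L_p, U_p], dim=0)
    idx = torch.randperm(W_x.size(0))
    W_x, W_p = W_x[idx], W_p[idx]
    # Formula 8: mix L with the first |L| entries of W, and U with the remaining |U| entries
    B = L_x.size(0)
    X1_x, X1_p = mixup(L_x, L_p, W_x[:B], W_p[:B])
    X2_x, X2_p = mixup(U_x, U_p, W_x[B:], W_p[B:])
    # Formula 6: cross-entropy on X1', mean squared consistency loss on X2' (divided by N classes)
    N = L_p.size(1)
    loss_x = -(X1_p * F.log_softmax(model(X1_x), dim=1)).sum(dim=1).mean()
    loss_u = ((X2_p - model(X2_x).softmax(dim=1)) ** 2).sum(dim=1).mean() / N
    return loss_x + lambda_u * loss_u  # the weight on the second term is not shown in Formula 6
```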

Overview

1. The base-class dataset is used to pre-train a feature extractor. The feature extractor extracts the features of the new-class samples, and the average feature of each new class is used to imprint the weights of the new-class classifier.
2. The labeled samples and the unlabeled samples are merged and shuffled to form a new set $\mathcal{W}$. The labels of the unlabeled samples are estimated by the imprinted classifier.
3. The labeled set $\mathcal{L}$ and the first $|L|$ samples of $\mathcal{W}$ are combined by MixUp to obtain $\mathcal{X}^{\prime}_1$, while $\mathcal{U}$ and the remaining $|U|$ samples of $\mathcal{W}$ are combined to obtain $\mathcal{X}^{\prime}_2$.
4. The cross-entropy loss on $\mathcal{X}^{\prime}_1$ and the consistency regularization loss on $\mathcal{X}^{\prime}_2$ are computed with the imprinted classifier. With the loss, the gradients are computed and the model parameters are updated by back-propagation.
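Tying the previous snippets together, a rough sketch of one fine-tuning iteration under the same assumptions (it reuses the hypothetical `guess_labels` and `transmatch_loss` helpers sketched above):

```python
import torch
import torch.nn.functional as F

def finetune_step(feature_extractor, classifier, optimizer, labeled_batch,
                  unlabeled_images, augment, M=2, T=0.5):
    """One TransMatch-style fine-tuning step: guess labels, MixUp, compute Formula 6, update."""
    x_l, y_l = labeled_batch
    n_classes = classifier.weight.size(0)
    p_l = F.one_hot(y_l, n_classes).float()                        # hard labels for the labeled batch

    model = lambda imgs: classifier(feature_extractor(imgs))       # Formula 3: f^novel(f^e(x))
    p_u = guess_labels(model, unlabeled_images, augment, M=M, T=T) # Formulas 4-5

    loss = transmatch_loss(model, augment(x_l), p_l,
                           augment(unlabeled_images), p_u)         # Formulas 6-8
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```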

Appendix

What the MixMatch process is: zhuanlan.zhihu.com/p/66281890