Contents

A basic understanding

How does meta-learning differ from traditional machine learning?

The basic idea

MAML

What is the difference between MAML and pre-training?

1. Different loss functions

2. Different ideas of optimization

Advantages and characteristics of MAML

MAML working mechanism

MAML Application: A Toy Example

Reptile


A basic understanding

Meta-learning (Meta Learning) can also be thought of as "learning to learn".

How does meta-learning differ from traditional machine learning?

Zhihu blogger "Nan Youqiao" gave an easy-to-understand analogy for the difference between meta-learning and traditional machine learning, which I share here:

Training an algorithm is analogous to a student studying at school. Traditional machine learning trains a separate model for each subject, while meta-learning improves the student's overall learning ability, i.e., learning how to learn.

In school, some students get good grades in every subject, while others excel only in a few.

  • Doing well in all subjects indicates strong meta-learning ability: the student has learned how to learn and can quickly adapt to the learning tasks of different subjects.
  • A student who excels only in certain subjects has relatively weak meta-learning ability: results are good in some subjects but poor when switching to others, because the student cannot generalize from what was already learned.

Today's deep neural networks are similarly "lopsided": classification and regression use completely different network models, and even within classification, a network architecture designed for face recognition may not achieve high accuracy when applied to classifying ImageNet data.

 

There’s another difference:

  • Traditional deep learning methods learn everything from scratch, which is expensive in both compute and time.
  • Meta-learning instead emphasizes learning, from a number of different small-sample tasks, a model that discriminates and generalizes well to unknown samples and unknown categories.

The basic idea

Note up front: the figures below are from Professor Li Hongyi's lecture videos.

Figure 1

Interpretation of Figure 1:

Meta Learning is also known as “Learn to Learn”.

The training and test samples of meta-learning are organized by task. The model is trained on many different kinds of tasks, updating its parameters and acquiring learning skills, so that it can then learn new tasks better. For example, task 1 is speech recognition, task 2 is image recognition, ..., task 100 is text classification, and task 101 differs from the previous 100 tasks. The training tasks are those 100 different tasks, and the test task is the 101st.

Figure 2

Interpretation of Figure 2:

In machine learning, the training portion of the data is called the train set and the test portion the test set. Meta-learning is widely used in few-shot learning. In meta-learning, the training portion within each task is called the support set, and the test portion within each task is called the query set.

Note: in machine learning there is a single large dataset, split into two parts called the train set and the test set.

In meta-learning, however, there is more than one dataset: there are as many datasets as there are tasks, and each of those datasets is split into two parts, called the support set and the query set respectively.

Validation sets are not considered here.
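The per-task support/query split can be sketched in a few lines of code. This is a minimal illustration; the task generator, sample counts, and split sizes below are all assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task generator: each task is its own small labelled dataset.
def make_task(n_samples=20):
    x = rng.normal(size=(n_samples, 2))
    y = (x[:, 0] > 0).astype(int)
    return x, y

def split_support_query(x, y, n_support=5):
    # Within one task, the first n_support examples form the support set
    # (used to adapt the model) and the rest form the query set
    # (used to evaluate the adapted model).
    return (x[:n_support], y[:n_support]), (x[n_support:], y[n_support:])

tasks = [make_task() for _ in range(4)]   # one dataset per task
support, query = split_support_query(*tasks[0])
```

Every task gets its own support/query pair, unlike the single train/test split of ordinary machine learning.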

Figure 3

Interpretation of Figure 3:

 

Figure 3 shows how traditional deep learning operates, namely:

  1. Define a network architecture;
  2. Initialize the parameters;
  3. Update the parameters with an optimizer of your choice;
  4. Repeat the update over multiple epochs;
  5. Obtain the final output of the network.
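The steps above can be sketched as a minimal gradient-descent loop. This is a toy one-dimensional example; the quadratic loss and learning rate are illustrative choices, not part of any particular network:

```python
# Toy setup: one parameter, illustrative quadratic loss l(theta) = (theta - 3)^2.
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0               # step 2: initialize the parameter
lr = 0.1                  # a hand-chosen hyperparameter
for epoch in range(100):  # step 4: repeat the update over epochs
    theta -= lr * grad(theta)  # step 3: optimizer update

print(theta, loss(theta))  # theta converges toward 3.0, the minimizer
```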

What is the connection between meta-learning and traditional deep learning?

The items in the red box in Figure 3 are designed by hand, which is what we usually call "hyperparameters", and the goal of meta-learning is to learn or replace them automatically. Different choices of what to replace give rise to different meta-learning algorithms.

Figure 4.

Interpretation of Figure 4:

Figure 4 illustrates the principle of meta-learning.

In a neural network algorithm, a loss function is defined to evaluate the quality of the model; the meta-learning loss is obtained by adding up the test losses of N tasks. The test loss defined on the $n$-th task is $l^n$, so for N tasks the total loss is $L = \sum_{n=1}^{N} l^n$. This is the optimization objective of meta-learning.

Assume there are two tasks, Task 1 and Task 2. Training on Task 1 yields its loss $l^1$, and training on Task 2 yields its loss $l^2$. Adding the two gives the loss of the entire training procedure, $L = l^1 + l^2$, which is the formula in the upper right corner of Figure 4.

 

If meta-learning is still unclear, here is a more detailed explanation:

There are many meta-learning algorithms. Some sophisticated ones can output different neural network structures and hyperparameters for different training tasks, for example Neural Architecture Search (NAS) and AutoML, but most of these are too complex for ordinary practitioners to implement. MAML and Reptile, introduced in this article, are meta-learning algorithms that are easy to implement: they do not change the structure of the deep neural network, only its initialization parameters.

 

MAML

To understand the meaning and derivation of MAML's loss function, it is necessary to distinguish it from pre-training.

Interpretation of Figure 5:

We define the network's initialization parameters to be $\phi$, and denote by $\hat{\theta}^n$ the model parameters obtained after training on the $n$-th task. The total loss function of MAML is then $L(\phi) = \sum_{n=1}^{N} l^n(\hat{\theta}^n)$.

Figure 5

What is the difference between MAML and pre-training?

1. Different loss functions

The loss function of MAML is $L(\phi) = \sum_{n=1}^{N} l^n(\hat{\theta}^n)$.

The pre-training loss function is $L(\phi) = \sum_{n=1}^{N} l^n(\phi)$.

Intuitively, MAML evaluates the test loss after training on each task, whereas pre-training computes the loss directly at the initialization, without any task-specific training. See Figure 6.
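The two objectives can be compared concretely on a toy one-dimensional problem. The quadratic task losses, their centers, and the inner learning rate below are all illustrative assumptions:

```python
# Two hypothetical tasks with quadratic losses centered at different optima.
centers = [1.0, -1.0]

def task_loss(theta, c):
    return (theta - c) ** 2

def task_grad(theta, c):
    return 2.0 * (theta - c)

alpha = 0.25  # inner-loop learning rate (illustrative)
phi = 0.5     # candidate initialization

# Pre-training loss: evaluate phi directly on every task.
pretrain_loss = sum(task_loss(phi, c) for c in centers)

# MAML loss: take one gradient step per task first, then evaluate.
maml_loss = sum(
    task_loss(phi - alpha * task_grad(phi, c), c) for c in centers
)

print(pretrain_loss, maml_loss)  # 2.5 vs. 0.625: MAML's loss is smaller here
```

The same $\phi$ scores very differently under the two objectives, because MAML rewards initializations that adapt well rather than initializations that already fit.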

Figure 6.

2. Different ideas of optimization

Here is the description of the loss function I found most apt when I first saw it (zhuanlan.zhihu.com/p/72920138):

The secret of the loss function is that the initialization parameters govern the overall situation, while each task's parameters act independently.

Figure 7.

 

Figure 8.

As shown in Figure 7 and Figure 8:

In the figures above, the horizontal axis represents the network parameters and the vertical axis the loss function; the light green and dark green curves show how the losses of the two tasks vary with the parameters.

Assume the model parameters $\phi$ and $\hat{\theta}$ are all one-dimensional.

The intention of model pre-training is to find, from the outset, a $\phi$ that minimizes the sum of all tasks' losses, $L(\phi) = \sum_{n} l^n(\phi)$. This does not guarantee that every task trains to its best; as shown in the figure, it may converge to a local optimum. As can be seen from Figure 7, the loss value reaches its minimum at this $\phi$ according to the pre-training formula, but from there task 2 (light green) can only converge to the green dot on the left, a local minimum, while its global minimum lies to the right of $\phi$.

MAML's intention, by contrast, is to find an impartial $\phi$ from which the loss curves of both task 1 and task 2 can each descend to their global optima. As can be seen from Figure 8, when the loss value is minimized according to MAML's formula $L(\phi) = \sum_{n} l^n(\hat{\theta}^n)$, task 1 converges to the green dot on the left and task 2 to the green dot on the right, both of which are global minima.

Professor Li Hongyi offers a vivid analogy here: MAML is like choosing to pursue a doctoral degree, caring more about a student's future potential; model pre-training is like taking a job at a big company right after graduation and cashing in one's skills immediately, caring only about present performance. See Figure 9.

Figure 9.

Advantages and characteristics of MAML

See Figure 10. MAML:

  1. Is fast to compute;
  2. Limits each task's parameter update during meta-training to a single step (one-step);
  3. Allows multiple update steps at test time, when adapting to a new task;
  4. Suits situations where data is limited.

Figure 10.

MAML working mechanism

In the paper that introduced MAML, the algorithm is given as shown in Figure 11:

Figure 11.

A detailed explanation of each step follows (reference: zhuanlan.zhihu.com/p/57864886):

  • Require 1: the task distribution, i.e., a pool of tasks from which training tasks are randomly drawn.
  • Require 2: the step sizes, i.e., learning rates. MAML is based on a double gradient: each iteration performs two parameter updates, so there are two tunable learning rates.
  1. Randomly initialize the model parameters.
  2. The outer loop; one pass can be understood as one iteration or one epoch.
  3. Randomly sample several tasks to form a batch.
  4. Loop over each task in the batch.
  5. Compute the gradient of each parameter using the support set of one task in the batch. In an N-way K-shot setting there are N·K support samples (N-way K-shot means N classes, with K samples per class).
  6. The first gradient update.
  7. End of the loop over tasks for the first gradient update.
  8. The second gradient update, using the query set. After step 8 the model has finished training on this batch and returns to step 3 to sample the next batch.
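The steps above can be sketched on a toy family of one-dimensional quadratic tasks $l(\theta) = (\theta - c)^2$, where both the inner-loop gradient and the exact second-order outer-loop gradient are analytic. The losses, task centers, and learning rates below are illustrative stand-ins for real support/query losses:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05   # the two learning rates from Require 2

def inner_grad(theta, c):
    # Gradient of the illustrative task loss l(theta) = (theta - c)^2.
    return 2.0 * (theta - c)

phi = 2.0                 # step 1: initialize, deliberately off-center
for it in range(500):     # step 2: outer iterations
    batch = rng.uniform(-1.0, 1.0, size=4)  # step 3: sample a task batch
    meta_grad = 0.0
    for c in batch:       # step 4: loop over tasks in the batch
        # steps 5-6: one inner gradient step on the task's support loss
        theta = phi - alpha * inner_grad(phi, c)
        # step 8: gradient of the query loss w.r.t. phi; for this quadratic,
        # d(theta)/d(phi) = 1 - 2*alpha, so the second-order term is exact.
        meta_grad += 2.0 * (theta - c) * (1.0 - 2.0 * alpha)
    phi -= beta * meta_grad  # outer (meta) update of the initialization

print(phi)  # drifts toward 0, the center of the task distribution
```

Real MAML backpropagates through the inner update of a full neural network; the quadratic here just makes that second-order term a visible scalar factor.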

Here’s a more intuitive diagram of the MAML process:

Figure 12

Interpretation of Figure 12: starting from the initialization $\phi$, each sampled task takes its own inner gradient step to task-specific parameters, and the resulting query losses are then used to update $\phi$ itself.

MAML Application: A Toy Example

The goal of this toy example is to fit sinusoids $y = a\sin(x + b)$, where $a$ and $b$ are random numbers and each pair $(a, b)$ corresponds to one sinusoid. $K$ points are sampled from a sinusoid, and their horizontal and vertical coordinates form one task: the horizontal coordinate is the input of the neural network, and the vertical coordinate is its output.

By learning on many such tasks, we hope to obtain a set of initialization parameters for the neural network, so that given the $K$ points of a test task, the network can quickly learn to fit that task's sinusoid.

Figure 13

On the left, the neural network parameters are initialized with the ordinary fine-tune (pre-training) approach: the sum of all training tasks' loss functions is taken as the total loss, and the network parameters are updated directly against it. Because $a$ and $b$ can be set arbitrarily, the expected value of all possible sinusoids added together is 0. Therefore, to reach the global minimum of the summed training-task loss, the neural network outputs 0 no matter what coordinate is input.
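The claim that random sinusoids average out to zero is easy to check numerically. The sampling ranges for $a$ and $b$ below are illustrative (the MAML paper draws amplitude and phase from fixed intervals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many random sinusoids y = a * sin(x + b) and average them
# over a grid of x values.
a = rng.uniform(0.1, 5.0, size=(100_000, 1))
b = rng.uniform(0.0, 2 * np.pi, size=(100_000, 1))
x = np.linspace(-5.0, 5.0, 50)

mean_curve = (a * np.sin(x + b)).mean(axis=0)
print(np.abs(mean_curve).max())  # close to 0 at every x
```

So an initialization that minimizes the summed loss over all sinusoid tasks really is the flat zero function, exactly as the figure shows.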

On the right is the network trained by MAML. MAML's initialization (the green line) differs from the target (the orange line), but as fine-tuning progresses, the fit moves closer and closer to the orange line.

 

With regard to MAML introduced earlier, a question is raised:

When updating a training task's network, only one step is taken before the meta-network is updated. Why one step? Could it be multiple steps?

As mentioned in Professor Li Hongyi's course:

  • Updating only once is relatively fast; meta-learning involves many sub-tasks, and updating each of them many times would make training slow.
  • MAML wants the initialization parameters to perform well when fine-tuned on a new task. If a single update already yields good performance on new tasks, then training the meta-network toward that goal aligns the objective with the requirement.
  • When the initialization parameters are applied to a specific task, it is still possible to fine-tune multiple times.
  • Few-shot learning tends to have little data anyway.

Reptile

Reptile is similar to MAML. Its algorithm diagram is as follows:

Figure 14

In Reptile, each update samples a batch of tasks (batch size 1 in the figure) and applies several gradient-descent steps to each task, obtaining each task's corresponding parameters $\hat{\theta}$. The difference vector between $\hat{\theta}$ and the meta-parameters $\phi$ is then computed and used as the direction in which to update $\phi$. Iterating this over and over yields the global initialization parameters.

Its pseudocode is as follows:

Reptile, sampling a single training task at a time:

 

Reptile, sampling a batch of training tasks at a time:
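A minimal sketch of the Reptile update, using illustrative one-dimensional quadratic task losses $(\theta - c)^2$; the step sizes, inner-step count, and task distribution are all assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, eps = 0.1, 0.1  # inner learning rate and Reptile step size
k = 5                  # inner gradient steps per task (more than one is fine)

phi = 2.0              # meta-parameter, deliberately off-center
for it in range(500):
    c = rng.uniform(-1.0, 1.0)   # sample one training task
    theta = phi
    for _ in range(k):           # several inner gradient-descent steps
        theta -= alpha * 2.0 * (theta - c)
    # No gradient through the inner loop: move phi straight along the
    # difference vector (theta - phi), scaled by eps.
    phi += eps * (theta - phi)

print(phi)  # drifts toward 0, the center of the task distribution
```

Compared with the MAML sketch, there is no second derivative anywhere; the difference vector itself serves as the meta-update direction.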

 

In Reptile:

  • The training task's network can be updated multiple times.
  • Instead of computing second-order gradients as MAML does (which improves engineering efficiency), Reptile updates the meta-network parameters directly with the difference between the meta-network parameters and the trained task-network parameters, scaled by a single step-size parameter $\epsilon$.
  • In terms of results, Reptile performs about as well as MAML.

 

The above is an in-depth look at meta-learning. A follow-up may cover the mathematical derivation of MAML; interested readers, please leave a comment!


References

[1] zhuanlan.zhihu.com/p/72920138

[2] zhuanlan.zhihu.com/p/57864886

[3] zhuanlan.zhihu.com/p/108503451

[4] MAML paper: arxiv.org/pdf/1703.03…

[5] zhuanlan.zhihu.com/p/136975128