Abstract: This paper improves the optimization mechanism of the information bottleneck, addressing two problems: mutual information is difficult to estimate in high-dimensional space, and the optimization mechanism of the information bottleneck is inherently a trade-off.

This article is shared from the Huawei Cloud community post "[Cloud Resident Co-creation] Appreciating the beauty of it: a walkthrough of variational distillation for cross-modal person re-identification", by Qiming.

Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-identification

Information bottleneck research background

This report is divided into three parts. To make it easier to follow, we first introduce the background of the information bottleneck.

The concept of the "information bottleneck" was formally proposed by scholars around 2000. Its ideal goal is to obtain a minimal sufficient representation: extract all the discriminative information that is useful for the task, while filtering out the redundant information. In practice, deploying an information bottleneck simply amounts to optimizing the objective highlighted in red below:

So far the information bottleneck, as a representation learning method guided by information theory, has been widely used in many fields, including computer vision, natural language processing, and neuroscience. Some scholars have even used the information bottleneck to study the black-box problem of neural networks.

However, the information bottleneck has three disadvantages:

1. Its effectiveness depends heavily on the accuracy of mutual information estimation

Although the information bottleneck is conceptually appealing, its effectiveness depends heavily on the accuracy of mutual information estimation. A large body of theoretical analysis, as well as much current work in practice, shows that in high-dimensional space mutual information is actually very difficult to estimate.

From the expression above,

V stands for the observation, which you can think of as a high-dimensional feature map;

Z represents its representation, which can be understood as the low-dimensional representation obtained by information bottleneck compression.

Now we need to calculate the mutual information between them.

Theoretically, we need to know three distributions to calculate this mutual information (as shown in the figure above). Unfortunately, for the underlying distribution of the observation itself we only have a limited number of data points, from which the true distribution cannot be recovered, let alone the distribution of the latent variable Z.
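To make the estimation difficulty concrete, here is a minimal pure-Python sketch (not from the paper) of a naive plug-in histogram estimator for mutual information. It works tolerably in one dimension, but the number of bins, and hence the number of samples needed to fill them, grows exponentially with dimension, which is why such estimators break down for high-dimensional V and Z.

```python
import math
import random

random.seed(0)

def mi(a, b, bins=16):
    """Plug-in MI estimate from a 2-D histogram of samples a, b (1-D each)."""
    lo_a, hi_a, lo_b, hi_b = min(a), max(a), min(b), max(b)

    def idx(v, lo, hi):
        # map a value to a bin index, clamping the right edge
        return min(bins - 1, int((v - lo) / (hi - lo) * bins))

    joint = {}
    for va, vb in zip(a, b):
        key = (idx(va, lo_a, hi_a), idx(vb, lo_b, hi_b))
        joint[key] = joint.get(key, 0) + 1
    pa, pb = {}, {}
    for (i, j), c in joint.items():
        pa[i] = pa.get(i, 0) + c
        pb[j] = pb.get(j, 0) + c
    n = len(a)
    # MI of the empirical joint vs. product of its marginals (always >= 0)
    return sum(c / n * math.log((c / n) / ((pa[i] / n) * (pb[j] / n)))
               for (i, j), c in joint.items())

n = 20_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + 0.5 * random.gauss(0, 1) for x in xs]   # strongly dependent on xs
zs = [random.gauss(0, 1) for _ in range(n)]       # independent of xs

assert mi(xs, ys) > mi(xs, zs)   # the dependent pair yields a larger estimate
```

Even this 1-D case needs thousands of samples per bin to be reliable; for a 2048-dimensional feature map the required sample count is astronomically out of reach.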

So what if we guess, using a surrogate estimator in the solution space? That is not very feasible either: a mutual information estimator is likely to be a gimmick, because its reliability is not high, as a number of papers presented at ICLR last year demonstrated.

2. The tradeoff between predictive performance and simplicity is difficult

Another serious problem is that information bottleneck optimization is essentially a trade-off: the mechanism places the discriminability and the compactness of representations on the two sides of a scale (as shown above).

To eliminate redundant information, it inevitably sacrifices part of the discriminative information; but to keep more discriminative information, a fair amount of redundant information must also be preserved. This makes the original goal of the information bottleneck impossible to achieve.

Or look at it in terms of the optimization objective. Suppose we set a very large β, which means the model leans toward compression. Compression goes up, but the model does not preserve much discriminability.

Similarly, given a very small β (say 10^(-5)), the model is relatively more likely to accomplish the goal given by the first mutual information term, but it no longer cares about removing redundancy.

Therefore, in choosing β we are actually weighing the importance of the two goals under different tasks, which confirms the point made at the beginning of the article: information bottleneck optimization is essentially a trade-off.
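The trade-off can be sketched with a toy IB Lagrangian; the candidate names and numbers below are purely illustrative, not from the paper:

```python
# Each hypothetical candidate representation is summarized by a pair
# (task_loss, rate): task_loss stands in for lost discriminability,
# rate stands in for the retained I(z; v).
candidates = {
    "keep_everything": (0.1, 8.0),   # discriminative but highly redundant
    "balanced":        (0.4, 2.0),
    "compress_hard":   (1.5, 0.2),   # compact but loses discriminability
}

def best(beta):
    """Pick the candidate minimizing the IB Lagrangian: task_loss + beta * rate."""
    return min(candidates, key=lambda k: candidates[k][0] + beta * candidates[k][1])

# A tiny beta favors discriminability; a large beta favors compression.
assert best(1e-5) == "keep_everything"
assert best(10.0) == "compress_hard"
```

Whichever β one picks, the optimum slides along the same scale; no single setting keeps all the discriminative information while removing all the redundancy.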

3. Weakness for multi-view problems

Besides the two problems above, note that the information bottleneck can dichotomize the information relevant to a task only through the label the task provides; that is, it defines discriminative information (red part) and redundant information (blue part) according to whether the information helps the task.

However, when a task involves multi-view data, the information bottleneck has no principled basis for partitioning the information again from the multi-view perspective. As a result it is sensitive to view changes; in other words, it lacks the ability to handle multi-view problems.

Variational information bottleneck work introduction

After the traditional information bottleneck, let us introduce another landmark work: the Variational Information Bottleneck. A notable contribution of this work, published at ICLR 2017, is the introduction of "variational inference" (see figure below), which **converts mutual information into entropy.** Although this work did not solve the problems mentioned above, the idea inspired almost all subsequent related work.

Converting mutual information to entropy is a big step forward. However, there are still several shortcomings:

1. The trade-off between the discriminability and compactness of representations has not been solved

Unfortunately, the variational information bottleneck also fails to resolve the trade-off between discriminability and compactness in the optimization mechanism: the optimization balance still oscillates with β.

2. The validity of variational upper bound cannot be guaranteed

The second problem is that what the variational information bottleneck actually optimizes is an upper bound, and the validity of that bound is debatable: it requires an explicit distribution q(z) of the latent variable Z to approximate the underlying distribution p(z), which is hard to guarantee in practice.

3. Complex operations, such as reparameterization and sampling, are involved

Third, optimizing the variational objective involves many complex operations (such as reparameterization and sampling, which carry high uncertainty). These add fluctuations to the training process, making training unstable and complex.
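As a minimal pure-Python sketch of the reparameterization and sampling step the text refers to (names and shapes are illustrative, not the VIB authors' code):

```python
import math
import random

random.seed(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) as a
    deterministic function of (mu, log_var) plus external noise eps."""
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) summed over dimensions:
    the compression ("rate") term of the variational information bottleneck."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

mu, log_var = [0.0] * 8, [0.0] * 8        # already standard normal
assert kl_to_standard_normal(mu, log_var) == 0.0
z = reparameterize(mu, log_var)           # a fresh stochastic sample each call
assert len(z) == 8
```

The stochasticity of `reparameterize` is exactly the source of training fluctuation the text describes: every forward pass sees a different z even for the same input.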

The research methods

The problems above are common faults of information-bottleneck-style methods, and to some extent they hinder the practical application of the information bottleneck. So let us talk about a solution that essentially addresses all of them.

Sufficiency

First, we need the concept of sufficiency: Z contains all the discriminative information about Y.

It requires that the encoding process of the information bottleneck lose no discriminative information; that is, after V passes through the information bottleneck to Z, only redundant information may be eliminated. Of course, this is an idealized requirement (as shown in the figure above).

With the concept of sufficiency, we split the mutual information between the observation and its representation into the redundant information in blue and the discriminative information in red, and then obtain the result on the following line via the data processing inequality. This result matters because it shows that reaching the minimal sufficient representation, i.e. the optimal representation, requires three sub-processes.
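Since the figure is not reproduced here, the split described above can be written out. Under the Markov assumption y → v → z, the chain rule of mutual information gives:

```latex
I(v;z) \;=\; \underbrace{I(z;v\mid y)}_{\text{redundant}} \;+\; \underbrace{I(z;y)}_{\text{discriminative}},
\qquad I(z;y)\;\le\;I(v;y).
```

The inequality on the right is the data processing inequality: the representation can never hold more discriminative information than its observation.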

The first sub-process actually raises the upper limit on the total amount of discriminative information the representation Z can contain. Why? Because everything Z contains comes from its observation V; raising the amount of discriminative information in the observation also raises the upper limit for Z itself.

The second sub-process lets the representation Z approach that upper limit of discriminability. Together these two terms correspond to the sufficiency requirement.

The conditional mutual information in the third sub-process, as mentioned above, represents the redundant information contained in the representation, so minimizing this term corresponds to the compactness goal. Briefly, "conditional mutual information" here refers to the information in Z that relates only to V and is irrelevant to Y; in short, task-irrelevant redundant information. As the earlier variational information bottleneck already shows, the first sub-process in fact optimizes a conditional entropy: a cross entropy is computed between the initial feature map of the observation V and the label, and then optimized. This term is therefore consistent with the given task and needs no special treatment for the moment.

As for the other two optimization objectives, they are essentially equivalent. Notably, this equivalence implies that redundancy is eliminated in the very process of improving the discriminability of the representation. By pulling the two previously opposing objectives to the same side of the scale, the method directly rids the information bottleneck of the original trade-off problem, making an information bottleneck with a minimal sufficient representation theoretically feasible.

Theorem 1 and Lemma 1

Theorem 1: Minimizing I(v; y) − I(z; y) is equivalent to minimizing the difference between v and z in conditional entropy with respect to the task target y, that is:

min I(v; y) − I(z; y)  ⇔  min H(y|z) − H(y|v),

where the conditional entropy is defined as H(y|z) := −∫ p(z) dz ∫ p(y|z) log p(y|z) dy.
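Theorem 1's identity can be checked numerically on a small discrete example. The following sketch (not from the paper) builds a toy joint distribution p(y, v) and a deterministic compression z = f(v):

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy joint p(y, v): rows index y in {0,1}, columns index v in {0,1,2}.
p_yv = [[0.20, 0.10, 0.05],
        [0.05, 0.25, 0.35]]
f = [0, 1, 1]   # compression z = f(v): merge v=1 and v=2

p_y = [sum(row) for row in p_yv]
p_v = [sum(p_yv[y][v] for y in range(2)) for v in range(3)]
p_yz = [[sum(p_yv[y][v] for v in range(3) if f[v] == k) for k in range(2)]
        for y in range(2)]
p_z = [sum(p_yz[y][k] for y in range(2)) for k in range(2)]

H_y = entropy(p_y)
H_y_v = sum(p_v[v] * entropy([p_yv[y][v] / p_v[v] for y in range(2)])
            for v in range(3))
H_y_z = sum(p_z[k] * entropy([p_yz[y][k] / p_z[k] for y in range(2)])
            for k in range(2))

I_vy = H_y - H_y_v   # I(v; y)
I_zy = H_y - H_y_z   # I(z; y)

# Theorem 1's identity: I(v;y) - I(z;y) == H(y|z) - H(y|v)
assert abs((I_vy - I_zy) - (H_y_z - H_y_v)) < 1e-12
# Data processing inequality: the compressed z cannot beat v
assert I_zy <= I_vy + 1e-12
```

The identity holds for any joint distribution, since I(·; y) = H(y) − H(y|·); the toy numbers only make it concrete.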

Lemma 1: The representation Z is sufficient for the task target Y when the prediction Z makes about Y is the same as that of its observation V, namely when the KL divergence between the two predictive distributions vanishes.

To achieve the goal set above, it is also necessary to avoid estimating mutual information in high-dimensional space. To this end, the paper puts forward the key theorem and lemma above.

For easy comprehension, look at the logic diagram above. Theorem 1 directly transforms the optimization of the blue mutual information terms into a difference of conditional entropies. In other words, to achieve the two (blue) goals above, one can instead minimize the difference in conditional entropy.

Lemma 1, on this basis, converts the result into a KL divergence, which in essence is computed between two sets of logits.

In other words, in practice we only need to optimize this simple KL divergence to achieve both the sufficiency and the compactness of the representation. That is much simpler than the traditional information bottleneck.
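As a rough sketch of what this objective looks like (the names and the plain-Python setup are assumptions for illustration, not the paper's code), the VSD loss is just a KL divergence between the predictive distributions of v and z:

```python
import math

def softmax(logits):
    """Convert a logit vector to a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def vsd_loss(logits_v, logits_z):
    """Variational self-distillation: KL( p(y|v) || p(y|z) ) between the
    predictive distributions of the observation and its representation."""
    p_v, p_z = softmax(logits_v), softmax(logits_z)
    return sum(pv * math.log(pv / pz) for pv, pz in zip(p_v, p_z))

logits_v = [2.0, 0.5, -1.0]   # prediction head applied to observation v
logits_z = [1.8, 0.6, -0.9]   # prediction head applied to representation z
assert vsd_loss(logits_v, logits_v) == 0.0   # identical predictions -> zero loss
assert vsd_loss(logits_v, logits_z) > 0.0    # mismatched predictions are penalized
```

No mutual information estimator, no sampling: the loss is a deterministic function of two logit vectors, which is exactly the simplification the text claims.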

The network structure itself is simple: one encoder, one information bottleneck, and one KL divergence. Because of this form, the method is called Variational Self-Distillation, or VSD.

Compared with the original mutual-information-based optimization of the information bottleneck, VSD has three prominent advantages:

  1. No mutual information estimation, and more accurate fitting

  2. The trade-off in optimization is resolved

  3. No reparameterization, sampling, or other complicated operations are involved

Consistency

Only the information that is both discriminative and consistent across views is kept, to enhance the robustness of representations to view changes.

**Definition**: representations z1 and z2 satisfy inter-view consistency if and only if I(z1; y) = I(v1 v2; y) = I(z2; y).

With theorem 1 and lemma 1 in hand, the next task is to extend variational self-distillation into the context of multi-view learning.

As shown above, this is the basic framework. Two images x1 and x2 are fed into an encoder to obtain two original high-dimensional feature maps v1 and v2, which are then sent to the information bottleneck to obtain two compressed low-dimensional representations z1 and z2.

As shown in the figure above, this mutual information is between the observation and its representation within the same view. Note the difference from the split used in VSD: here the information is divided according to whether it reflects the commonality between views, rather than by discriminability versus redundancy, so the split becomes I(z1; v2) = I(v2; v1|y) + I(z1; y).

Then, according to whether the discriminability requirement is met, the shared inter-view information is divided a second time, yielding redundant information and discriminative information (as shown in the figure above).

To improve the robustness of the representation to view changes, and thus the accuracy of the task, we only need to retain I(z1; y) (in red), while I(v1; z1|v2) (blue) and I(v2; v1|y) (green) should be discarded. The optimization objectives are as follows:

Theorem 2: Given two sufficient observations v1 and v2, their corresponding representations z1 and z2 satisfy inter-view consistency if and only if: I(v1; z1|v2) + I(v2; z2|v1) ≤ 0 and I(v2; v1|y) + I(v1; v2|y) ≤ 0.

Theorem 2 illustrates the nature of inter-view consistency: in essence, it requires eliminating view-specific information and task-irrelevant redundant information so as to maximally promote the consistency of the representations.

The two methods

Remove view-specific information

Variational Mutual Learning (VML): minimize the JS divergence between the predictive distributions of z1 and z2 to eliminate the view-specific information they contain. The specific objective is as follows:
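A minimal sketch of such a JS-divergence term, assuming the two predictive distributions are already given as probability vectors (the names are illustrative, not the paper's code):

```python
import math

def kl(p, q):
    """KL divergence between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric in p and q, bounded by ln 2,
    which makes it a natural choice for mutual learning between two views."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p1 = [0.7, 0.2, 0.1]   # predictive distribution of z1
p2 = [0.1, 0.3, 0.6]   # predictive distribution of z2
assert js(p1, p1) == 0.0                       # agreement costs nothing
assert abs(js(p1, p2) - js(p2, p1)) < 1e-15    # symmetry
assert 0.0 < js(p1, p2) <= math.log(2)         # bounded
```

The symmetry matters here: unlike the one-sided KL used for distillation, neither view is treated as the teacher, so both representations are pushed toward a shared prediction.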

Eliminate redundant information

Variational Cross-Distillation (VCD, corresponding to the red in the figure above): within the retained view-consistent information, the discriminative information is purified, and the redundant information eliminated, by cross-optimizing the KL divergence between observations and the representations of the other view. The specific objective is as follows (the case of v1 and z1 is analogous):
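A hedged sketch of the cross-pairing, again over toy logits rather than the paper's actual model (all names are illustrative):

```python
import math

def softmax(logits):
    """Convert a logit vector to a probability vector (numerically stable)."""
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def vcd_loss(logits_v1, logits_z2, logits_v2, logits_z1):
    """Cross-distillation: each view's observation supervises the OTHER
    view's representation, so only information that is consistent across
    views (and discriminative for the task) is rewarded."""
    return (kl(softmax(logits_v1), softmax(logits_z2)) +
            kl(softmax(logits_v2), softmax(logits_z1)))

a = [1.0, 0.0, -1.0]   # view-1 logits
b = [0.9, 0.1, -0.8]   # view-2 logits
assert vcd_loss(a, a, b, b) == 0.0   # cross predictions already agree
assert vcd_loss(a, b, b, a) > 0.0    # disagreement across views is penalized
```

The crossing of indices (v1 with z2, v2 with z1) is the whole point: a representation can only match the other view's prediction by discarding what is specific to its own view.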

The diagram above shows the structure of the two processes. The information originally contained both specificity and consistency; it is first divided in two, and variational mutual learning (VML) eliminates the view-specific information. What remains is the consistency information in orange, itself made up of redundant information and discriminative information. At this point variational cross-distillation is applied to remove the redundant information (green) and retain only the discriminative information (red).

The experimental results

Next, let us analyze the experimental part of the article. To verify the effectiveness of the method, the three techniques above, variational self-distillation, variational cross-distillation, and variational mutual learning, are applied to cross-modal person re-identification.

Cross-modal person re-identification is a sub-problem of person re-identification. The core goal is to match a given pedestrian image with images in another modality. For example, for the infrared image marked in the green box below, we want to find the visible-light images of the same person in an image library, either retrieving visible light from infrared or infrared from visible light.

Framework overview

Overview of model architecture

The model consists of three independent branches, and each branch contains only one encoder and one information bottleneck. The specific structure is shown in the figure below.

The important thing to note here is that the upper and lower branches are single-modal: the orange part accepts and processes only infrared data, and the blue part only visible-light data. Since they involve no multiple views, they can be constrained with VSD.

The middle branch receives and processes data from both modalities during training. It is therefore trained jointly with VCD (variational cross-distillation) and VML (variational mutual learning).

Overview of loss functions

The loss function consists of two parts: the variational distillation proposed in the paper, and the training constraints most commonly used in re-ID. Note that VSD constrains only the single-modal branches, whereas VCD, in collaboration with VML, constrains the cross-modal branch.
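The composition of the total loss can be sketched as follows; the individual loss values and the weight `w` are placeholders for illustration, not values from the paper:

```python
def total_loss(ce, triplet, vsd, vcd, vml, w=1.0):
    """Overall training objective as described in the text: the usual re-ID
    constraints (cross-entropy + triplet) plus the variational distillation
    terms (VSD on single-modal branches, VCD + VML on the cross-modal one).
    The weight w is a hypothetical knob, not a value given in the paper."""
    return ce + triplet + w * (vsd + vcd + vml)

# Values chosen to be exactly representable, purely for illustration.
assert total_loss(1.0, 0.5, 0.25, 0.125, 0.125) == 2.0
```

When the distillation terms are zero, the objective reduces to the standard re-ID baseline, which is what the ablation tables compare against.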

Experimental benchmarks: SYSU-MM01 & RegDB

SYSU-MM01:

The dataset consists of 287,628 visible-light and 15,792 infrared images of 491 identities. The images of each identity were taken indoors and outdoors by six non-overlapping cameras.

The evaluation modes include all-search and indoor-search. Standard evaluation criteria were used for all experimental results in this paper.

RegDB:

The dataset includes 412 identities in total, each corresponding to ten visible-light and ten infrared images taken at the same time.

The evaluation modes include visible-to-infrared and infrared-to-visible. The final result is the average accuracy over ten runs, each carried out on a randomly divided evaluation set.

Results analysis

Work related to cross-modal person re-identification falls into four categories: network design, metric design, generative methods, and representation learning.

As the first work to explore representation learning for this task, the method outperforms competitors by a wide margin without involving generative processes or complex network structures. For the same reason, the variational distillation losses proposed in the paper can be easily integrated into methods of other categories to tap greater potential.

On another data set, we can see a similar result.

Next, we will select some representative ablation experiments to analyze the effectiveness of the method in practice.

Before we start, note that the dimension of the observation v is uniformly set to 2048, as is common in the re-ID community; the dimension of the representation defaults to 256; and the information bottleneck baselines uniformly use the GS mutual information estimator.

Ablation experiments: Variational distillation vs information bottleneck under single modal branching conditions

Here we do not consider the multi-view condition and are concerned only with the sufficiency of the representation.

As shown in the figure above, variational self-distillation provides a huge performance boost; 28.69 to 59.62 is a very intuitive jump, indicating that VSD effectively improves the discriminability of the representation, extracting more valuable information while removing a lot of redundancy.

Ablation experiments: Variational distillation vs information bottleneck under multimodal branching conditions

Now let us look at the multi-view results. Testing only the cross-modal branch, we find two phenomena:

First, the performance of variational distillation drops: it was 59 and is now 49. We speculate that the discarded modality-specific information is responsible. The middle branch retains only information shared by both modalities, so modality-specific information is discarded first. But that discarded modality-specific information also carries considerable discriminability, so satisfying modality consistency has a cost: the accuracy loss caused by the lost discriminability.

Second, the performance of the traditional information bottleneck changes little under the multi-modal condition: it was 28 and is now 24. We believe traditional information bottlenecks cannot really distinguish consistency from specificity, because they were not designed for multi-view problems and lack the capacity to handle them; hence the multi-view condition causes no significant fluctuation.

Ablation experiment: Variational distillation vs information bottleneck under three-branch condition

After adding the middle branch on top of the two single-modal branches, the overall performance of the model is basically unchanged. We can draw the following conclusions:

Information is retained as long as it satisfies the discriminability requirement.

The middle branch holds information that satisfies two requirements at once, one of which is the discriminability requirement; this means the information stored in the middle branch is actually a subset of the information stored in the top and bottom branches.

For the information bottleneck, by contrast, the improvement brought by the three branches is quite obvious, because none of its branches can hold all the discriminative information, let alone cope with multiple views.

Ablation experiment: comparison of "sufficiency" under different compression rates

Let us look at the effect of the representation's compression rate on performance. Following the uniform re-ID convention, the original dimension of the feature map is 2048.

We vary the dimension of the representation z and observe the overall performance of the model. When the dimension is below 256, performance increases with the dimension. We speculate that when the compression rate is too high, no matter how strong the model is, there are simply not enough channels to store sufficient discriminative information, which easily leads to insufficiency.

However, when the dimension exceeds 256, performance begins to decline. Our view is that the extra channels allow part of the redundant information to be retained, which reduces overall discriminability and generalization. This is the redundancy phenomenon.

To better display the differences between methods, we project the different feature spaces onto one plane with t-SNE (as shown in the figure below).

We first analyze sufficiency, i.e. the comparison between VSD and the traditional information bottleneck. The superscripts "V" and "I" stand for visible and infrared data, while the subscript "sp" stands for view-specific, i.e. features derived from the single-modal branches.

The feature space of the traditional information bottleneck can only be called chaotic, indicating that the model cannot clearly separate the categories of different identities; in other words, the loss of discriminative information is serious. VSD is the complete opposite: although considerable differences remain between the feature spaces of the two modalities (because a large part of the preserved discriminative information is modality-specific), almost every cluster is clear and distinct, indicating that with VSD the model meets the sufficiency requirement much better.

In the figure below, the subscript "sh" indicates features from the shared (multimodal) branch, and the superscripts "V" and "I" still denote visible and infrared data points.

The feature space of the information bottleneck remains chaotic under the multi-view condition; without labels it would be virtually impossible to tell which of the two plots is single-modal and which is multi-modal. This confirms the earlier point: traditional information bottlenecks are simply not equipped to handle multiple views.

The feature space produced by variational cross-distillation is a little looser than VSD's (the consistency requirement inevitably costs some discriminative information), but the feature spaces of the two modalities overlap to a very high degree, showing that the method is effective at extracting consistency information.

Next, we project data of the different modalities into the same feature space; orange and blue represent infrared and visible-light data points respectively.

We can see that, with the help of variational cross-distillation, the feature spaces of the different modalities are almost identical. Comparison against the information bottleneck results directly illustrates the effectiveness of variational cross-distillation.

Reproduction code

Performance comparison: PyTorch vs. MindSpore

The model is trained with both PyTorch and MindSpore, while performance testing extracts features from each model and feeds them to the same data and officially supported test files, so the comparison of results is fair.

We can see that the model produced by MindSpore outperforms the PyTorch one on both the baseline and the overall framework (the experiment in the lower-right corner was only run halfway), in terms of both accuracy and training time.

If you are interested in MindSpore, visit:

www.huaweicloud.com/product/mod…
