Author | Feng

Compiled by | Duibai's Algorithm House

Editor’s Note:

Many people are familiar with contrastive learning, but the role of the temperature coefficient may be less clear.

Hello everyone, I am Duibai.

The temperature coefficient in contrastive learning is a mysterious parameter. Most papers use a small temperature coefficient by default for self-supervised contrastive learning (e.g. 0.07 or 0.2), yet they rarely explain why a small value is used, how the temperature coefficient affects the learning process, or what role it actually plays.

Today, I would like to introduce a CVPR 2021 paper from my school on the temperature coefficient of the contrastive loss, which explains the specific role of the temperature coefficient and explores the learning mechanism of contrastive learning. I believe that after reading it, you will be one step ahead of others on this competitive road.

First, a summary of the paper's findings:

1. The contrastive loss is a loss function with a hard-negative self-discovery property, which is crucial for learning high-quality self-supervised representations. A loss function without this property greatly degrades the performance of self-supervised learning. The effect of focusing on hard samples is that negatives that have already been pushed far away do not need to be pushed further; instead, the loss concentrates on separating the negatives that are still close, so that the resulting representation space becomes more uniform.

2. The function of the temperature coefficient is to adjust how much attention is paid to hard samples: the smaller the temperature coefficient, the more the loss focuses on separating this sample from the negatives most similar to it. The author analyzes and experiments on the temperature coefficient in depth, and uses it to explain how contrastive learning acquires useful representations.

3. There exists a uniformity-tolerance dilemma in the contrastive loss. A small temperature coefficient focuses more on separating the hard samples that are similar to this sample, and thus tends to yield a more uniform representation. However, hard negatives are often highly similar to this sample, for example different instances of the same category; that is, many hard negatives are actually potential positives. Forcing such hard samples apart too aggressively destroys the underlying semantic structure being learned.

The paper analyzes the function of the temperature coefficient theoretically and verifies it experimentally.

1. The contrastive loss pays more attention to hard negative samples

The form of the InfoNCE loss widely used in contrastive self-supervised learning is as follows:
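
A sketch of the standard InfoNCE form being referred to, assuming s_{i,j} denotes the similarity between the anchor x_i and instance x_j (the notation and the equation number are assumptions for readability):

\[ \mathcal{L}(x_i) = -\log \frac{\exp(s_{i,i}/\tau)}{\exp(s_{i,i}/\tau) + \sum_{k \neq i} \exp(s_{i,k}/\tau)} \qquad \text{(Eq. 1)} \]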

Here τ is the temperature coefficient. Intuitively, this loss function requires the i-th sample to be as similar as possible to its augmented copy (the positive sample) and as dissimilar as possible to all other instances (the negative samples). However, many losses can satisfy this requirement, for example the following one in its simplest form:
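
One such simple loss, written under the same notation with an assumed weighting coefficient λ (the exact form used in the paper may differ slightly):

\[ \mathcal{L}_{\text{simple}}(x_i) = -s_{i,i} + \lambda \sum_{k \neq i} s_{i,k} \qquad \text{(Eq. 2)} \]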

However, in actual training, using it as the loss function works very poorly. The paper compares the performance of the contrastive loss (Eq. 1) and the simple loss (Eq. 2), with the temperature coefficient set to 0.07:

The above results show that the contrastive loss far outperforms the simple loss on all datasets. Exploring this, the author finds that the contrastive loss is a hard-sample self-discovering loss function, which is the key difference from the simple loss. As can be seen from Eq. 2, the simple loss penalizes all negative-sample similarities with the same weight (its gradient is the same for every negative similarity). The contrastive loss, on the other hand, automatically penalizes more strongly the negative samples that are closer, i.e. more similar, to the anchor. This can be seen by computing the penalty gradient of the contrastive loss (Eq. 1) with respect to the similarities of different negative samples:

Gradient with respect to the positive-sample similarity:
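
Differentiating Eq. 1 as written above gives (a reconstruction under that equation's assumptions):

\[ \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,i}} = -\frac{1}{\tau} \sum_{k \neq i} P_{i,k} \qquad \text{(Eq. 3)} \]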

Gradient with respect to the negative-sample similarities:
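
Under the same assumptions:

\[ \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,j}} = \frac{1}{\tau} P_{i,j}, \quad j \neq i \qquad \text{(Eq. 4)} \]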

where:
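
Here P_{i,j} denotes the softmax probability that follows from Eq. 1 as written above:

\[ P_{i,j} = \frac{\exp(s_{i,j}/\tau)}{\exp(s_{i,i}/\tau) + \sum_{k \neq i} \exp(s_{i,k}/\tau)} \]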

The denominator of P_{i,j} is the same for all negative samples, so the larger the similarity s_{i,j} is, the larger the numerator and hence the gradient term. That is, the contrastive loss gives more similar (harder) negative samples a larger gradient pushing them away from the anchor. Different negative samples can be thought of as like point charges at different distances from the anchor: the closer the charge, the greater the Coulomb repulsion; the farther away, the smaller the repulsion. The contrastive loss behaves the same way, and this property helps the features spread out uniformly on the hypersphere.
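
As a small numerical sketch (the similarity values are made up and this is not the paper's code), the gradient weights from Eq. 4 can be computed to confirm that more similar negatives receive larger penalties, while the simple loss of Eq. 2 weights every negative equally:

import numpy as np

def negative_grad_weights(pos_sim, neg_sims, tau):
    # Gradient magnitude of the contrastive loss w.r.t. each negative similarity,
    # i.e. P_{i,j} / tau from Eq. 4, computed with a numerically stable softmax.
    logits = np.concatenate(([pos_sim], neg_sims)) / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[1:] / tau

pos_sim = 0.95
neg_sims = np.array([0.9, 0.5, 0.1, -0.3])  # hypothetical negatives, hardest first

print(negative_grad_weights(pos_sim, neg_sims, tau=0.07))
# Almost all of the penalty falls on the hardest negative (similarity 0.9).
print(negative_grad_weights(pos_sim, neg_sims, tau=10.0))
# At a large temperature the weights become nearly uniform, like the simple loss.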

To verify that the gap between the contrastive loss and the simple loss in the table above really comes from the contrastive loss's hard-sample self-discovery property, the author also applies an explicit hard-negative mining algorithm to the simple loss: the 4096 most similar samples are selected as negatives, and the simple loss of Eq. 2 is used as the loss function. As the following table shows, the simple loss with explicit hard-negative mining improves greatly, even surpassing the contrastive loss with temperature coefficient 0.07.
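
A rough sketch of this mining baseline (the function name, the λ weighting, and the per-anchor formulation are assumptions, not the paper's implementation):

import numpy as np

def simple_loss_hard_mined(pos_sim, neg_sims, k=4096, lam=1.0):
    # Simple loss (Eq. 2) evaluated only on the k most similar (hardest) negatives.
    neg_sims = np.asarray(neg_sims)
    hardest = np.sort(neg_sims)[-min(k, len(neg_sims)):]  # explicit hard-negative mining
    return -pos_sim + lam * hardest.sum()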

2. The effect of the temperature coefficient

In addition to the hard-sample self-discovery property introduced above, by inspecting Eq. 3 and Eq. 4 we can easily see that the absolute value of the loss gradient with respect to the positive sample equals the sum of the absolute values of the gradients with respect to all negative samples, i.e.:
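
Written out under the notation of Eqs. 3 and 4:

\[ \left| \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,i}} \right| = \sum_{j \neq i} \left| \frac{\partial \mathcal{L}(x_i)}{\partial s_{i,j}} \right| \]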

Given this observation, the authors define a relative penalty intensity for the j-th negative sample:
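
This quantity, as it follows from Eqs. 3 and 4 (the symbol r_j is assumed here), is:

\[ r_j = \frac{\left|\partial \mathcal{L}(x_i)/\partial s_{i,j}\right|}{\left|\partial \mathcal{L}(x_i)/\partial s_{i,i}\right|} = \frac{\exp(s_{i,j}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau)} \]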

Over all j ≠ i, the r_j form a Boltzmann probability distribution whose entropy increases strictly as the temperature coefficient increases (strictly, as long as the negative similarities are not all equal). Thus the temperature coefficient determines the entropy of this distribution. If the similarities are sorted from large to small into an order statistic, the entropy determines how steep the distribution is. The figure below shows the relationship between the negative-sample penalty gradient and the similarity under different temperature coefficients. When the temperature coefficient is very small, e.g. the blue line at 0.07, the penalty gradient rises dramatically as the similarity increases. As the temperature coefficient grows, the entropy of the relative gradients increases and the distribution gradually approaches a uniform one, e.g. the green line in the figure; attention to highly similar negatives then gradually decreases.

The above demonstrates the function of the temperature coefficient: it determines how much attention the contrastive loss pays to hard negative samples. The larger the temperature coefficient, the more equally all negatives are treated, and the hardest negatives are no longer singled out. The smaller the temperature coefficient, the more the loss focuses on the negative samples with the highest similarity to this sample, giving them larger gradients so that they are pushed away from the positive sample.
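
To make this concrete, here is a small sketch (with made-up similarity values, not the paper's data) of how the entropy of the relative-penalty distribution grows as the temperature increases:

import numpy as np

def relative_penalties(neg_sims, tau):
    # Boltzmann distribution r_j over the negative similarities at temperature tau.
    logits = np.asarray(neg_sims) / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

neg_sims = [0.9, 0.6, 0.3, 0.0, -0.4]  # hypothetical negative similarities
for tau in (0.07, 0.2, 0.5, 1.0, 10.0):
    r = relative_penalties(neg_sims, tau)
    print(f"tau={tau:5.2f}  r={np.round(r, 3)}  entropy={entropy(r):.3f}")
# Small tau concentrates the penalty on the hardest negative (low entropy);
# large tau spreads it almost uniformly (high entropy), as analyzed above.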

To explain the effect of the temperature coefficient more concretely, the author works out two extreme cases: the temperature coefficient tending to zero and tending to infinity.

When the temperature coefficient approaches 0:
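
For small τ, a log-sum-exp approximation of Eq. 1 gives (a reconstruction; the paper's exact statement may differ in constants):

\[ \mathcal{L}(x_i) \approx \frac{1}{\tau} \max\!\left( \max_{k \neq i} s_{i,k} - s_{i,i},\; 0 \right) \]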

As can be seen, the contrastive loss then degenerates into a loss function that focuses only on the hardest negative sample. And as the temperature coefficient approaches infinity:
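
Similarly, for large τ (with N the number of instances in the denominator sum; again a reconstruction, constants may differ from the paper):

\[ \mathcal{L}(x_i) \approx \log N + \frac{1}{N\tau} \left[ \sum_{k \neq i} s_{i,k} - (N-1)\, s_{i,i} \right] \]

which, up to the constant and the scale factor, has the same form as the simple loss of Eq. 2.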

At this point, the contrastive loss weights all negative samples equally; that is, it loses the hard-sample property. Interestingly, as the temperature coefficient approaches infinity, the loss becomes the simple loss described earlier (up to a constant and a scale factor).

Through these two limit cases, the author shows that as the temperature coefficient increases the contrastive loss tends to treat all negatives "equally", and as it decreases the loss attends only to the hardest negatives; in this way the temperature regulates the attention paid to negative samples.

3. The uniformity-tolerance dilemma

Based on this exploration of the effect of the temperature coefficient, the author further points out a potential problem in contrastive learning, namely the uniformity-tolerance dilemma.

A smaller temperature coefficient pays more attention to hard samples, so it more easily produces a uniform representation space. Uniform features are very important for representation learning, as shown in an ICML 2020 paper. On the other hand, since unsupervised learning has no true category labels, contrastive learning generally treats all samples except this sample as negatives.

In this case, negative samples that are very similar to the positive sample are likely to be potential positives. For example, the image most similar to the current picture of an apple is usually another apple. Paying too much attention to such hard negatives then destroys the semantic information the network has already learned, especially later in training. As training proceeds, the information captured by the network gets closer and closer to the true semantics, so hard negatives are more and more likely to be potential positives. One implication is therefore that the temperature coefficient could be increased with the number of iterations, which may be future work by the author. A good temperature coefficient should thus be a compromise between uniformity and tolerance.

Uniformity and tolerance under different temperature coefficients are quantified and visualized in the figure above.

4. Experimental verification

The following figure gives the experimental verification of the temperature coefficient. The red box marks the similarity of the positive sample, and the horizontal axis shows the similarity distribution of the 10 most similar samples. It can be seen that the smaller the temperature coefficient, the larger the similarity gap between the positive sample and the hardest negative, indicating that a smaller temperature coefficient separates the hardest negatives more strongly. This experiment supports the earlier theoretical analysis.

On the other hand, the author also finds the optimal temperature coefficient for different datasets. The green bars in the figure below show the performance of the loss as a function of the temperature coefficient. In addition, the author tests a contrastive loss with explicit hard-sample mining and finds that, once explicit hard-negative mining is adopted, the correlation between performance and temperature coefficient weakens: when the temperature coefficient exceeds an appropriate value, the performance of the model trained with this loss is essentially stable.

5. Summary

In this paper, the author tries to understand the specific properties and behavior of the unsupervised contrastive loss. The author first shows that the contrastive loss is a hardness-aware loss function, and verifies that this awareness of hard samples is an indispensable property of the contrastive loss: if the loss lacks it, performance deteriorates severely even with many negative samples. The author also studies the effect of the temperature coefficient and finds that it controls the degree of hard-sample awareness, which gives rise to the uniformity-tolerance dilemma. Overall, this paper reveals some useful properties and phenomena of contrastive learning, and it should inspire more researchers to design better losses and algorithms.

Machine Learning / Deep Learning Algorithm Discussion Group

A machine learning / contrastive learning discussion group has been set up! If you want to join, add my WeChat ID directly: Duibai996.

When adding, please note your nickname + school/company. The group gathers many people from academia and industry; everyone is welcome to exchange algorithm experience, and casual daily chat is fine too.

Finally, welcome to follow my WeChat official account, Duibainotes, which tracks the frontiers of machine learning such as NLP, recommender systems, and contrastive learning. I also share my entrepreneurship experience and reflections on life. Students who want to communicate further can also add my WeChat to discuss technical problems with me. Thank you!