
In my opinion, without a rigorous mathematical argument, it is not enough to reason about the meaning of a formula or a symbol purely by intuition. So, after consulting some references, I wrote up the role of $\tau$ in a separate article: Understanding of the parameter $\tau$ in Contrastive Loss

To sum up, I find SimCSE's method really clever, because judging whether two sentences are similar is highly subjective. Take "I like Beijing" and "I don't like Beijing": are these two sentences similar? A model is like a newborn child. If you teach it that these two sentences are similar, it believes they are similar; if you teach it that they are not, it believes they are not. At that point, the model's performance or accuracy has little to do with the training process or the model structure. What really determines the model's predictions is people, or rather the data annotated by people.

But if you ask anyone what the difference is between "I like Beijing" and "I like Beijing", I doubt any reasonable person would say they differ. SimCSE's way of generating positive samples via Dropout can be regarded as the minimal form of data augmentation: the original sentence and the generated sentence have exactly the same semantics, yet their embeddings differ. This removes the need for human annotation and keeps the samples objective.
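To make this concrete, here is a minimal sketch (not the paper's implementation) using a toy encoder with dropout: passing the same sentence through it twice while dropout is active yields two different embeddings with identical semantics, which is exactly the SimCSE positive pair. The token ids and dimensions below are made up for illustration.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Toy sentence encoder with dropout (stand-in for BERT)."""
    def __init__(self, vocab_size=30522, hidden=768, p=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.dropout = nn.Dropout(p)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, token_ids):
        h = self.dropout(self.embed(token_ids))  # dropout mask differs on every call
        return self.proj(h).mean(dim=1)          # mean-pool into a sentence vector

encoder = ToyEncoder()
encoder.train()                                  # keep dropout enabled, as SimCSE does

sent = torch.tensor([[101, 1045, 2066, 7211, 102]])  # the same sentence, encoded twice
z1, z2 = encoder(sent), encoder(sent)
print(torch.allclose(z1, z2))                    # False: two distinct embeddings, same meaning
```

In actual SimCSE training, the two embeddings of the same sentence form a positive pair and the other sentences in the batch serve as negatives.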

Alignment and Uniformity

The goal of contrastive learning is to learn a high-quality semantic representation space from data, so how do we evaluate the quality of this representation space? Wang and Isola (2020) proposed two metrics to measure the quality of contrastive learning: alignment and uniformity. Alignment computes the expected distance between the embeddings of a positive pair $x_i$ and $x_i^+$:


$$\ell_{\text{align}} \triangleq \mathop{\mathbb{E}}\limits_{(x, x^+)\sim p_{\text{pos}}} \Vert f(x) - f(x^+)\Vert^2\tag{2}$$

while uniformity measures how uniformly the generated vectors are distributed overall:


$$\ell_{\text{uniform}} \triangleq \log \mathop{\mathbb{E}}\limits_{x, y \overset{i.i.d.}{\sim} p_{\text{data}}} e^{-2\Vert f(x)-f(y)\Vert^{2}}\tag{3}$$
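The two formulas translate directly into a few lines of PyTorch. The sketch below assumes `x`, `x_pos`, and the other embeddings are L2-normalized sentence vectors of shape `(N, d)`; the random data at the end is only there to make the example runnable.

```python
import torch
import torch.nn.functional as F

def align_loss(x, x_pos):
    # Eq. (2): expected squared distance between positive-pair embeddings
    return (x - x_pos).norm(p=2, dim=1).pow(2).mean()

def uniform_loss(x):
    # Eq. (3): log of the average Gaussian potential over all pairwise distances
    return torch.pdist(x, p=2).pow(2).mul(-2).exp().mean().log()

# Illustrative usage with random embeddings normalized onto the unit hypersphere
x = F.normalize(torch.randn(128, 768), dim=1)
x_pos = F.normalize(x + 0.01 * torch.randn_like(x), dim=1)
print(align_loss(x, x_pos).item(), uniform_loss(x).item())
```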

We want both metrics to be as low as possible: on the one hand, positive samples should stay close to each other; on the other hand, the semantic vectors should be spread as uniformly as possible over the hypersphere, because the uniform distribution has the highest information entropy, and the more uniform the distribution, the more information is preserved. The author randomly sampled 100,000 sentences from Wikipedia to fine-tune BERT and evaluated on the STS-B dev set. The experimental results are shown in the table below:

None is the random Dropout method proposed by the author; the other methods modify $x_i^+$ on top of None. We can see that adding explicit data augmentation degrades the model's performance to varying degrees, and the method closest to Dropout is deleting a single word. However, even deleting one word cannot bring further gains; the author ran an experiment to demonstrate this, as shown in the figure below: