Article source | Turbine Cloud community (Turbine Cloud, a resource-sharing platform focused on the AI industry)

The original address | Dropout

The original author | Mathor


The most popular NLP paper of the first half of 2021 was SimCSE: Simple Contrastive Learning of Sentence Embeddings. The name SimCSE stands for Simple Contrastive Sentence Embedding.

Sentence Embedding

Sentence Embedding has long been a hot topic in NLP, mainly because of its wide range of applications and its role as the cornerstone of many downstream tasks. There are many ways to obtain a sentence vector; common ones include directly taking the output at the [CLS] position as the sentence vector, or summing/averaging the outputs of all tokens. However, all of the above methods have been shown to suffer from the Anisotropy problem. Roughly speaking, during training the dimensions of the Word Embeddings become inconsistently scaled, so the resulting sentence vectors cannot be compared directly.
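As a quick illustration of the two common pooling strategies mentioned above, here is a minimal sketch with made-up tensor shapes (my own example, not tied to any particular model), assuming we already have the token-level outputs of an encoder:

```python
import torch

# Pretend these are the token-level outputs of an encoder:
# a batch of 2 sentences, 6 tokens each, hidden size 8 (made-up shapes)
last_hidden_state = torch.randn(2, 6, 8)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding

# Strategy 1: take the output at the [CLS] position (assumed to be token 0)
cls_embedding = last_hidden_state[:, 0]

# Strategy 2: mean-pool over the real (non-padding) tokens
mask = attention_mask.unsqueeze(-1).float()                     # (2, 6, 1)
mean_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape, mean_embedding.shape)  # both (2, 8)
```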

At present, the most popular methods to solve this problem are:

  1. Linear transformations: BERT-flow, BERT-whitening. These two are more like post-processing: they alleviate the anisotropy problem by applying certain transformations to the sentence vectors extracted by BERT
  2. Contrastive learning: SimCSE. The idea of contrastive learning is to pull similar samples closer together and push dissimilar samples apart, thereby improving the model's sentence representation ability

Unsupervised SimCSE

SimCSE uses self-supervised learning to improve sentence representations. Since SimCSE has no labeled data (it is unsupervised), it treats each sentence as being similar to itself. Bluntly put, it essentially trains a contrastive learning model with (self, self) as positive pairs and (self, others) as negative pairs. Of course, it is not quite that simple: if you just take two identical copies of a sample as a positive pair, generalization suffers. Normally we use some form of data augmentation to make the two samples of a positive pair look different, but in NLP, how to do data augmentation is itself a problem. SimCSE provides an extremely simple and elegant solution: just use Dropout as the data augmentation!
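To make this concrete, here is a minimal, self-contained sketch (a toy encoder of my own, not the BERT encoder used in the paper) showing that two forward passes over the same input produce different embeddings when Dropout is active:

```python
import torch
import torch.nn as nn

# Toy "encoder": an embedding layer followed by Dropout and mean pooling.
# This only illustrates the effect of Dropout, not SimCSE's real encoder.
class ToyEncoder(nn.Module):
    def __init__(self, vocab_size=100, dim=16, p=0.1):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.dropout = nn.Dropout(p)

    def forward(self, token_ids):
        h = self.dropout(self.emb(token_ids))  # a different dropout mask on every call
        return h.mean(dim=1)                   # mean-pool tokens into a sentence vector

encoder = ToyEncoder()
encoder.train()  # keep Dropout active

tokens = torch.randint(0, 100, (2, 5))  # a "batch" of 2 sentences, 5 tokens each
h0 = encoder(tokens)  # first pass
h1 = encoder(tokens)  # second pass, different dropout mask

# Same sentences, same weights, but the embeddings differ -> a "free" positive pair
print(torch.allclose(h0, h1))                                 # False (with high probability)
print(torch.nn.functional.cosine_similarity(h0, h1, dim=-1))  # high, but not exactly 1
```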

Specifically, $N$ sentences are passed through an Encoder with Dropout, yielding the vectors $\boldsymbol{h}_1^{(0)}, \boldsymbol{h}_2^{(0)}, \dots, \boldsymbol{h}_N^{(0)}$; then the same sentences are passed through the Encoder again (this time with a different random Dropout mask), producing $\boldsymbol{h}_1^{(1)}, \boldsymbol{h}_2^{(1)}, \dots, \boldsymbol{h}_N^{(1)}$. We can then treat $(\boldsymbol{h}_i^{(0)}, \boldsymbol{h}_i^{(1)})$ as a (slightly different) positive pair, with the training objective (1):

$$\ell_i = -\log \frac{e^{\text{sim}(\boldsymbol{h}_i^{(0)}, \boldsymbol{h}_i^{(1)})/\tau}}{\sum_{j=1}^N e^{\text{sim}(\boldsymbol{h}_i^{(0)}, \boldsymbol{h}_j^{(1)})/\tau}}\tag{1}$$

where $\text{sim}(\boldsymbol{h}_1, \boldsymbol{h}_2)=\frac{\boldsymbol{h}_1^T\boldsymbol{h}_2}{\Vert \boldsymbol{h}_1\Vert \cdot \Vert \boldsymbol{h}_2\Vert}$. In fact, if you ignore the $-\log$ and the $\tau$ in equation (1), what remains looks very much like $\text{Softmax}$. The paper sets $\tau = 0.05$. As for what $\tau$ actually does, I saw some explanations online:

  1. If you use cosine similarity directly as the logits fed into $\text{Softmax}$, then since cosine similarity lies in $[-1,1]$, the range is too small for $\text{Softmax}$ to produce a large enough gap between positive and negative samples, and the model ends up insufficiently trained. Dividing by a small enough temperature $\tau$ amplifies the values and corrects this (see the small numerical demo after this list)
  2. The hyperparameter $\tau$ makes the model focus its updates on hard negative examples and penalize them accordingly: the harder the negative, i.e. the closer it is to $\boldsymbol{h}_i^{(0)}$, the larger the penalty it receives. This makes intuitive sense: after dividing $\text{sim}(\boldsymbol{h}_i^{(0)}, \boldsymbol{h}_j^{(1)})$ by $\tau$, the negative samples whose similarity is closer to 1 are amplified the most and dominate the loss
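Below is a small numerical demo of the first point (my own toy numbers, not from the paper): dividing cosine similarities by a small $\tau$ before the softmax sharpens the resulting distribution considerably.

```python
import torch
import torch.nn.functional as F

# Hypothetical cosine similarities of one anchor against 4 candidates, all in [-1, 1]
cos_sims = torch.tensor([0.9, 0.7, 0.3, -0.2])

print(F.softmax(cos_sims, dim=0))         # no temperature: fairly flat, roughly [0.37, 0.30, 0.20, 0.12]
print(F.softmax(cos_sims / 0.05, dim=0))  # tau = 0.05: sharply peaked on the most similar candidate
```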

In my opinion, without a rigorous mathematical argument, it is not enough to reason about the meaning of a formula or a symbol purely by intuition. So, after consulting some material, I wrote up the role of $\tau$ in a separate article: Understanding of the parameter $\tau$ in Contrastive Loss

To sum up, I find SimCSE's method really clever, because it is very subjective for humans to judge whether two given sentences are similar. For example: "I like Beijing" and "I don't like Beijing", are these two sentences similar? A model is like a newborn child: if you teach it that these two sentences are similar, it thinks they are similar; if you teach it that they are not, it thinks they are not. At that point, the performance or accuracy of the model has little to do with the training process or the model structure; what really determines the model's predictions is people, or rather the data annotated by people

But if you ask anyone whether there is any difference between "I like Beijing" and "I like Beijing", I don't think a normal person would say they differ. SimCSE's way of generating positive samples with Dropout can be regarded as the minimal form of data augmentation: the semantics of the original sentence and the generated one are exactly the same, only the resulting Embeddings differ. This avoids the need for human annotation and keeps the positive samples objective

Alignment and Uniformity

The goal of contrastive learning is to learn a high-quality semantic representation space from the data, so how do we evaluate the quality of this representation space? Wang and Isola (2020) proposed two metrics for measuring the quality of contrastive learning: alignment and uniformity, where alignment computes the average distance between $x_i$ and $x_i^+$:

$$\ell_{\text{align}} \triangleq \mathop{\mathbb{E}}\limits_{(x, x^+)\sim p_{\text{pos}}} \Vert f(x) - f(x^+)\Vert^2$$

while uniformity measures how uniformly the generated vectors are distributed overall:

$$\ell_{\text{uniform}} \triangleq \log \mathop{\mathbb{E}}\limits_{x, y \overset{i.i.d.}{\sim} p_{\text{data}}} e^{-2\Vert f(x) - f(y)\Vert^2}$$
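As a concrete illustration, here is my own minimal sketch (not the evaluation code from the paper) of how these two metrics could be computed for a batch of normalized embedding pairs:

```python
import torch

def alignment(x, x_pos):
    """Average squared distance between positive pairs (lower is better)."""
    return (x - x_pos).norm(dim=1).pow(2).mean()

def uniformity(x, t=2):
    """Log of the average Gaussian potential over all pairs (lower is better)."""
    sq_dists = torch.pdist(x, p=2).pow(2)  # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()

# Toy example with random unit vectors standing in for sentence embeddings
x = torch.nn.functional.normalize(torch.randn(128, 768), dim=1)
x_pos = torch.nn.functional.normalize(x + 0.01 * torch.randn(128, 768), dim=1)

print(alignment(x, x_pos))  # small -> positive pairs are close
print(uniformity(x))        # more negative -> embeddings more uniformly spread
```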

We hope that both of these metrics are as low as possible: on the one hand, positive samples should be close to each other; on the other hand, the semantic vectors should be spread as uniformly as possible over the hypersphere, because a uniform distribution has the highest information entropy, and the more uniform the distribution, the more information is retained. The authors randomly selected 100,000 sentences from Wikipedia to fine-tune BERT and evaluated on the STS-B dev set. The experimental results are shown in the following table:

None is the random-Dropout-only method proposed by the authors; the other methods modify $x_i^+$ on top of None. We can see that adding explicit data augmentation degrades the model performance to varying degrees. The method closest to Dropout in effect is deleting one word, but even deleting one word does not bring much improvement; the authors ran an experiment to demonstrate this, shown in the figure below:

Connection to Anisotropy

In recent years, many works have pointed out the anisotropy problem in the distribution of semantic vectors generated by language models. Before discussing why contrastive learning can alleviate the anisotropy of word vectors, let us first understand what anisotropy is. Concretely, suppose our word vectors are 2-dimensional; if the unit lengths of the basis vectors along the different dimensions are not equal, the space is anisotropic.

For example, in the figure below, the basis vectors are non-orthogonal and anisotropic (their unit lengths are not equal). Computing cosine similarity from the coordinates gives $\cos(x_1, x_2) = 0$ and also $\cos(x_1, x_3) = 0$. Geometrically, however, $x_1$ and $x_3$ are actually more similar; yet according to the computed results, $x_2$ and $x_3$ are equally similar to $x_1$. This is what anisotropy causes
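Since I cannot reproduce the figure here, below is a toy numerical illustration of the same idea (my own construction with hypothetical coordinates): cosine similarity computed naively on raw coordinates ignores the geometry induced by a non-orthogonal, unequal-length basis.

```python
import torch
import torch.nn.functional as F

# A non-orthogonal, anisotropic basis for a 2-D space (unequal lengths, not perpendicular)
e1 = torch.tensor([1.0, 0.0])
e2 = torch.tensor([2.0, 1.0])

# Coordinates of three vectors expressed in that basis (hypothetical example)
c1, c2, c3 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]), torch.tensor([0.0, -1.0])

# Cosine similarity computed naively on coordinates treats the basis as orthonormal
print(F.cosine_similarity(c1, c2, dim=0), F.cosine_similarity(c1, c3, dim=0))  # both 0

# The actual vectors in Euclidean space tell a different story
B = torch.stack([e1, e2])           # each row is a basis vector
v1, v2, v3 = c1 @ B, c2 @ B, c3 @ B
print(F.cosine_similarity(v1, v2, dim=0), F.cosine_similarity(v1, v3, dim=0))  # clearly different
```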

The authors of SimCSE prove that, when the number of negative samples tends to infinity, the training objective of contrastive learning can be asymptotically expressed as:

$$-\frac{1}{\tau}\mathop{\mathbb{E}}\limits_{(x, x^+)\sim p_{\text{pos}}}\left[f(x)^Tf(x^+)\right] + \mathop{\mathbb{E}}\limits_{x\sim p_{\text{data}}}\left[\log \mathop{\mathbb{E}}\limits_{x^-\sim p_{\text{data}}}\left[e^{f(x)^Tf(x^-)/\tau}\right]\right]\tag{4}$$

To explain this a bit: for convenience, call $\frac{1}{\tau}\mathop{\mathbb{E}}\limits_{(x, x^+)\sim p_{\text{pos}}}\left[f(x)^Tf(x^+)\right]$ the first term, and $\mathop{\mathbb{E}}\limits_{x\sim p_{\text{data}}}\left[\log \mathop{\mathbb{E}}\limits_{x^-\sim p_{\text{data}}}\left[e^{f(x)^Tf(x^-)/\tau}\right]\right]$ the second term.

Our ultimate goal is to make equation (4) as small as possible; concretely, if the first term is larger and the second term is smaller, the overall value becomes small. A large first term means the similarity between positive pairs is large; a small second term means the similarity between negative pairs is small, which is exactly the model behavior we ultimately hope to see

Now let us try to go from equation (1) to equation (4). Notice that in fact $f(x) = \boldsymbol{h}_i^{(0)}$, $f(x^+) = \boldsymbol{h}_i^{(1)}$, and $f(x^-) = \boldsymbol{h}_j^{(1)}$

From here on there is no strict equality, only equivalence or proportionality relations. For example, starting from the original $\text{sim}(\boldsymbol{h}_1, \boldsymbol{h}_2)=\frac{\boldsymbol{h}_1^T\boldsymbol{h}_2}{\Vert \boldsymbol{h}_1\Vert \cdot \Vert \boldsymbol{h}_2\Vert}$, we omit the denominator (the vectors are assumed normalized) and replace the sum with an expectation, which yields the asymptotic form in equation (4).

We can further derive a lower bound for the second term via Jensen's inequality:

$$\mathop{\mathbb{E}}\limits_{x\sim p_{\text{data}}}\left[\log \mathop{\mathbb{E}}\limits_{x^-\sim p_{\text{data}}}\left[e^{f(x)^Tf(x^-)/\tau}\right]\right] = \frac{1}{m}\sum_{i=1}^m\log\left(\frac{1}{m}\sum_{j=1}^m e^{\boldsymbol{h}_i^T\boldsymbol{h}_j/\tau}\right) \ge \frac{1}{\tau m^2}\sum_{i=1}^m\sum_{j=1}^m \boldsymbol{h}_i^T\boldsymbol{h}_j\tag{5}$$

The first part (the equality) is easy to understand: the expectation is rewritten as an average over the samples, and $f(x)$ and $f(x^-)$ are written back as $\boldsymbol{h}_i, \boldsymbol{h}_j$. For those who are not familiar with Jensen's inequality, here is the basic statement: for a convex function $f(x)$, if $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, then

$$f\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i)$$

Back to the proof of equation (5): since $\log$ is concave (equivalently, $-\log$ is convex), taking $\frac{1}{m}$ as $\lambda_i$ and $e^{\boldsymbol{h}_i^T \boldsymbol{h}_j/\tau}$ as $x_i$, Jensen's inequality (with the direction reversed for a concave function) gives

$$\frac{1}{m}\sum_{i=1}^m\log\left(\frac{1}{m}\sum_{j=1}^m e^{\boldsymbol{h}_i^T\boldsymbol{h}_j/\tau}\right) \ge \frac{1}{m}\sum_{i=1}^m\frac{1}{m}\sum_{j=1}^m \frac{\boldsymbol{h}_i^T\boldsymbol{h}_j}{\tau} = \frac{1}{\tau m^2}\sum_{i=1}^m\sum_{j=1}^m \boldsymbol{h}_i^T\boldsymbol{h}_j$$

So, after this long detour, recall that our ultimate goal is to optimize, or minimize, the second term of equation (4). Let $\mathbf{W}$ be the sentence embedding matrix corresponding to $\{x_i\}_{i=1}^m$, i.e. its $i$-th row is $\boldsymbol{h}_i$. Then optimizing the second term essentially minimizes an upper bound of the largest eigenvalue of $\mathbf{W}\mathbf{W}^T$. Why? Because $\text{Sum}(\mathbf{W}\mathbf{W}^T)=\sum_{i=1}^m \sum_{j=1}^m\boldsymbol{h}_i^T \boldsymbol{h}_j$. If we assume the $\boldsymbol{h}_i$ have been normalized, then the diagonal elements of $\mathbf{W}\mathbf{W}^T$ are all 1, and $\text{tr}(\mathbf{W}\mathbf{W}^T)$, which equals the sum of its eigenvalues, is a constant. According to Merikoski (1984), if all elements of $\mathbf{W}\mathbf{W}^T$ are positive, then $\text{Sum}(\mathbf{W}\mathbf{W}^T)$ is an upper bound on the largest eigenvalue of $\mathbf{W}\mathbf{W}^T$. So when we minimize the second term, we indirectly minimize the largest eigenvalue of $\mathbf{W}\mathbf{W}^T$, which implicitly flattens the singular spectrum of the embedding space, i.e. makes the distribution of the embeddings more uniform
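As a quick numerical sanity check of the bound used above (my own toy example, not from the paper), one can verify on random normalized embeddings that the sum of all entries of $\mathbf{W}\mathbf{W}^T$ indeed upper-bounds its largest eigenvalue when all entries are positive:

```python
import torch

m, d = 8, 16
# Random embeddings, normalized to unit length; drawn from [0, 1) so that
# all entries of W W^T are positive and Merikoski's bound applies
W = torch.nn.functional.normalize(torch.rand(m, d), dim=1)

gram = W @ W.T                                   # W W^T, diagonal is all ones
largest_eig = torch.linalg.eigvalsh(gram).max()  # largest eigenvalue
total_sum = gram.sum()                           # Sum(W W^T)

print(largest_eig <= total_sum)  # True: the sum upper-bounds the largest eigenvalue
```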

So far, I think I have explained the core content of SimCSE clearly enough. As for the supervised learning part of the original paper, I will not go into it here, because it essentially just changes the definition of positive and negative sample pairs

Results

The experiments in the original paper are very rich; readers can read the original text carefully. Here is just a brief look at the comparison experiments

There is not much to analyze in the overall results: from the results above we can see that SimCSE achieves SOTA on multiple datasets. Moreover, the authors find that adding an MLM pre-training objective on top of the original training objective, i.e. optimizing the two losses jointly as $\ell + \lambda \ell^{\text{MLM}}$, improves the model by preventing SimCSE from forgetting token-level knowledge. This was a bit surprising to me: after all these operations the model can extract sentence-level features fairly well, yet the token-level knowledge gets forgotten again; it really is robbing Peter to pay Paul

Code

Although SimCSE theoretically feeds a batch of sentences into two Encoders (which differ only in their Dropout masks), in practice we duplicate all the samples in the batch and pass them through a single Encoder once. Suppose the initial input consists of the two sentences $[A, B]$; we first copy them into $[A, A, B, B]$, and the Encoder then produces the vectors $[\boldsymbol{h}_A^{(0)}, \boldsymbol{h}_A^{(1)}, \boldsymbol{h}_B^{(0)}, \boldsymbol{h}_B^{(1)}]$. The question now is: what are our labels?

Clearly, given $\boldsymbol{h}_A^{(0)}$ and $\boldsymbol{h}_A^{(1)}$, their label is 1; given $\boldsymbol{h}_B^{(0)}$ and $\boldsymbol{h}_B^{(1)}$, their label is also 1; all other pairs are 0. We can therefore draw the following table (the positions labeled 1 are where the same sentence appears with different Embeddings):

|                          | $\boldsymbol{h}_A^{(0)}$ | $\boldsymbol{h}_A^{(1)}$ | $\boldsymbol{h}_B^{(0)}$ | $\boldsymbol{h}_B^{(1)}$ |
|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|
| $\boldsymbol{h}_A^{(0)}$ | 0 | 1 | 0 | 0 |
| $\boldsymbol{h}_A^{(1)}$ | 1 | 0 | 0 | 0 |
| $\boldsymbol{h}_B^{(0)}$ | 0 | 0 | 0 | 1 |
| $\boldsymbol{h}_B^{(1)}$ | 0 | 0 | 1 | 0 |

The table above can be converted into the label vector $[1, 0, 3, 2]$. Assuming the original batch has 4 sentences, i.e. 8 sentences after duplication, then following the same arrangement we obtain the label vector $[1, 0, 3, 2, 5, 4, 7, 6]$. We just need to generate labels according to this rule; the rule is quite obvious, so I will not explain it further

```python
import torch
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def SimCSE_loss(pred, tau=0.05):
    ids = torch.arange(0, pred.shape[0], device=device)
    y_true = ids + 1 - ids % 2 * 2  # [1, 0, 3, 2, ...], the label rule described above
    similarities = F.cosine_similarity(pred.unsqueeze(1), pred.unsqueeze(0), dim=2)
    # Mask the diagonal, i.e. exclude each sample's similarity with itself
    similarities = similarities - torch.eye(pred.shape[0], device=device) * 1e12
    similarities = similarities / tau
    return torch.mean(F.cross_entropy(similarities, y_true))

pred = torch.tensor([[0.3, 0.2, 2.1, 3.1],
                     [0.3, 0.2, 2.1, 3.1],
                     [-1.79, -3, 2.11, 0.89],
                     [-1.79, -3, 2.11, 0.89]])
SimCSE_loss(pred.to(device))
```
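For completeness, here is a short sketch of my own (with a hypothetical `encoder`, not the training code from the paper) showing how a real batch would be duplicated and encoded once before calling the loss above, as described at the start of this section:

```python
# Sketch only: `encoder` is a hypothetical sentence encoder (e.g. a BERT wrapper)
# that maps a batch of token-id tensors to one embedding per sentence, with Dropout active.
def train_step(encoder, token_ids, tau=0.05):
    # [A, B] -> [A, A, B, B]: each sentence appears twice and gets a different Dropout mask
    doubled = token_ids.repeat_interleave(2, dim=0)
    pred = encoder(doubled)            # shape: (2 * batch_size, hidden_dim)
    return SimCSE_loss(pred, tau=tau)  # reuses the loss defined above
```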

References

  • SimCSE: Simple Contrastive Learning of Sentence Embeddings


  • Are Chinese tasks still SOTA? We added some experiments to SimCSE

  • SimCSE paper interpretation

  • SimCSE contrastive learning: who needs text augmentation? I just need to Dropout twice

  • Zhang Junlin: Research progress of contrastive learning

  • SimCSE paper super analysis

  • Super detailed knowledge points on contrastive learning and SimCSE

  • BERT vector anisotropy?