
One of the most popular NLP papers of the first half of 2021 was SimCSE: Simple Contrastive Learning of Sentence Embeddings. The name SimCSE is short for Simple Contrastive learning of Sentence Embeddings.

Sentence Embedding

Sentence embedding has long been a hot topic in NLP, mainly because it has a wide range of applications and serves as the cornerstone of many downstream tasks. There are many ways to obtain a sentence vector; common ones include directly taking the output at the [CLS] position as the sentence vector, or averaging the outputs of all tokens. However, all of these methods have been shown to suffer from the anisotropy problem. Roughly speaking, the dimensions of the word embeddings become inconsistently scaled during training, so the resulting sentence vectors cannot be compared directly.
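As a quick sketch of the two pooling strategies just mentioned (assuming a Hugging Face style encoder whose output provides `last_hidden_state`; the helper names here are made up for illustration):

```python
import torch

# last_hidden_state: (batch, seq_len, hidden) token vectors from BERT
# attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
def cls_pooling(last_hidden_state: torch.Tensor) -> torch.Tensor:
    return last_hidden_state[:, 0]                      # vector at the [CLS] position

def mean_pooling(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).float()         # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)      # sum over real tokens only
    return summed / mask.sum(dim=1)                     # average of the word vectors
```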

At present, the most popular methods to solve this problem are:

  1. Linear transformations: BERT-flow, BERT-whitening. These two are more like post-processing: they alleviate the anisotropy problem by applying a transformation to the sentence vectors extracted from BERT, as sketched in the example after this list
  2. Contrastive learning: SimCSE. The idea of contrastive learning is to pull similar samples closer together and push dissimilar samples apart, thereby improving the model's sentence representations
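To make the post-processing idea in item 1 concrete, here is a minimal whitening sketch (not an official implementation; `sentence_vecs` is a hypothetical array of BERT sentence vectors):

```python
import numpy as np

def whitening(vecs: np.ndarray) -> np.ndarray:
    """Map sentence vectors to an isotropic space: zero mean, identity covariance."""
    mu = vecs.mean(axis=0, keepdims=True)      # (1, d) mean vector
    cov = np.cov((vecs - mu).T)                # (d, d) covariance matrix
    u, s, _ = np.linalg.svd(cov)               # SVD of the (symmetric) covariance
    W = u @ np.diag(1.0 / np.sqrt(s))          # whitening matrix
    return (vecs - mu) @ W                     # whitened sentence vectors

# sentence_vecs: (num_sentences, hidden_dim) embeddings from BERT (hypothetical)
# whitened = whitening(sentence_vecs)
```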

Unsupervised SimCSE

SimCSE uses self-supervised learning to improve sentence representations. Since SimCSE has no labeled data (it is unsupervised), each sentence is treated as similar to itself. Put bluntly, it essentially trains the model with $(\text{self}, \text{self})$ as a positive pair and $(\text{self}, \text{other})$ as a negative pair. Of course, it is not quite that simple: if the two samples of a positive pair were exactly identical, the model's generalization ability would suffer. Typically, some form of data augmentation is used to make the two samples of a positive pair look different, but in NLP it is not obvious how to do data augmentation well. SimCSE provides an extremely simple and elegant solution: just use Dropout as the data augmentation!
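Here is a minimal sketch of the "Dropout as data augmentation" trick, assuming a Hugging Face style BERT encoder and simple [CLS] pooling (both are assumptions for illustration, not necessarily the paper's exact setup):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep Dropout active so two forward passes give two different "views"

sentences = ["I like natural language processing.", "The weather is nice today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

# Encode the same batch twice; only the random Dropout masks differ
h0 = encoder(**batch).last_hidden_state[:, 0]  # (N, d) [CLS] vectors, first pass
h1 = encoder(**batch).last_hidden_state[:, 0]  # (N, d) [CLS] vectors, second pass
# (h0[i], h1[i]) is a positive pair; (h0[i], h1[j]) with i != j serve as negatives
```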

Specifically, $N$ sentences are fed through an Encoder (with Dropout active) to obtain vectors $\boldsymbol{h}_1^{(0)},\boldsymbol{h}_2^{(0)},\dots,\boldsymbol{h}_N^{(0)}$, and then fed through the Encoder again (this time with a different random Dropout) to obtain vectors $\boldsymbol{h}_1^{(1)},\boldsymbol{h}_2^{(1)},\dots,\boldsymbol{h}_N^{(1)}$. We can then treat $(\boldsymbol{h}_i^{(0)},\boldsymbol{h}_i^{(1)})$ as a (slightly different) positive pair, with the training objective


$$\ell_i=-\log \frac{e^{\text{sim}(\boldsymbol{h}_i^{(0)},\boldsymbol{h}_i^{(1)})/\tau}}{\sum_{j=1}^N e^{\text{sim}(\boldsymbol{h}_i^{(0)},\boldsymbol{h}_j^{(1)})/\tau}}\tag{1}$$
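A minimal PyTorch sketch of Equation (1) with in-batch negatives might look like this, given the two sets of vectors `h0` and `h1` from the sketch above (the temperature default follows the paper; everything else is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def simcse_loss(h0: torch.Tensor, h1: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: (h0[i], h1[i]) is positive, (h0[i], h1[j]) are negatives."""
    h0 = F.normalize(h0, dim=-1)
    h1 = F.normalize(h1, dim=-1)
    sim = h0 @ h1.T / tau                               # (N, N) cosine similarities scaled by 1/tau
    labels = torch.arange(sim.size(0), device=sim.device)  # diagonal entries are the positives
    return F.cross_entropy(sim, labels)                 # mean of -log softmax, as in Eq. (1)
```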

In Equation (1), $\text{sim}(\boldsymbol{h}_1,\boldsymbol{h}_2)=\frac{\boldsymbol{h}_1^T\boldsymbol{h}_2}{\Vert \boldsymbol{h}_1\Vert \cdot \Vert \boldsymbol{h}_2\Vert}$ is the cosine similarity. In fact, if you ignore the $-\log$ and the $\tau$, the rest of Equation (1) looks very much like $\text{Softmax}$. The paper uses $\tau=0.05$. As for the role of $\tau$, I have read some explanations online:

  1. If cosine similarity is used directly as the logits fed into $\text{Softmax}$, then because cosine similarity lies in $[-1,1]$, the range is too small for $\text{Softmax}$ to produce a large enough gap between positive and negative samples. The result is that the model is not trained sufficiently, so the values need to be amplified by dividing by a small enough temperature $\tau$
  2. The hyperparameter $\tau$ focuses the model's updates on the hard negative samples and penalizes them accordingly: the harder a negative is, i.e., the closer $\boldsymbol{h}_j^{(1)}$ is to $\boldsymbol{h}_i^{(0)}$, the larger the penalty it receives. This makes intuitive sense: since $\text{sim}(\boldsymbol{h}_i^{(0)},\boldsymbol{h}_j^{(1)})$ is divided by $\tau$, negative samples whose similarity is close to 1 are amplified by $1/\tau$ and come to dominate the loss (see the small numeric check after this list)
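To see the amplification effect numerically, here is a tiny check with made-up similarity values:

```python
import torch
import torch.nn.functional as F

# logits for one anchor: positive, hard negative, easy negative
sims = torch.tensor([0.9, 0.8, 0.1])

print(F.softmax(sims, dim=0))         # without tau: roughly [0.42, 0.38, 0.19], nearly flat
print(F.softmax(sims / 0.05, dim=0))  # with tau=0.05: roughly [0.88, 0.12, 0.00];
                                      # the hard negative keeps real weight, the easy one vanishes
```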