
SimCSE: Simple Contrastive Learning of Sentence Embeddings


Tianyu Gao, Xingcheng Yao, Danqi Chen

Paper link: arxiv.org/abs/2104.08…

Code link: github.com/princeton-n…

What is text augmentation? All I need is dropout, twice.

This paper introduces the idea of contrastive learning into sentence embedding and is a rare high-quality paper that refreshes SOTA on both unsupervised and supervised semantic similarity. Many experts have already interpreted it in detail, so here I only take brief notes and collect their explorations of the task to broaden my own thinking.

Notes on experts' interpretations:

Xi Xiaoyao: mp.weixin.qq.com/s/BpbI_S9lX…

This paper presents a simple contrastive learning framework, SimCSE, for learning sentence representations. Both dropout + contrastive learning and NLI + contrastive learning turn out to be very beneficial for sentence representation learning. SimCSE significantly refreshes the STS leaderboard and sets a new SOTA. Dropout and NLI data are things we have long been familiar with, yet the authors view them from a brand-new perspective, connect them with contrastive learning, achieve a very significant improvement, and give a reasonable explanation of why it works.

Zhihu interpretation: zhuanlan.zhihu.com/p/368353121

Bilibili video interpretation: www.bilibili.com/video/BV1oQ…

Su Jianlin's experiments on Chinese tasks: www.spaces.ac.cn/archives/83…

First, I will introduce contrastive learning and why it works.

Contrastive learning

What is contrastive learning?

For an introduction, see "A Primer on Contrastive Learning" published by the Institute of Applied Mathematics, Chinese Academy of Sciences.

Supervised contrastive learning

Contrastive learning is a machine learning technique that learns the general features of a dataset by teaching the model which data points are similar and which are different. Supervised contrastive learning is a supervised method and therefore demands high-quality labeled samples; it usually achieves better results than ordinary supervised learning. The video below visualises the process.

(Video demonstration embedded in the original post; view it there.)

In most real-world scenarios, we do not have labels for every sample. In medical imaging, for example, obtaining labeled samples is so difficult that professionals have to spend countless hours manually classifying and segmenting images to create labels. While generating clean labeled datasets is expensive, we generate large amounts of unlabeled data all the time. Self-supervised learning is a method that lets us learn from this unlabeled data: by setting the learning objective appropriately, we obtain supervision from the data itself.

In NLP, Word2vec and the Masked Language Model are typical examples of self-supervised learning. There are also many examples in CV, such as cutting a picture into patches and predicting the relationships between them, essentially solving a jigsaw puzzle.

Self-supervised learning

Self-supervised contrastive learning is broadly similar to supervised contrastive learning; the difference is that self-supervised contrastive learning constructs its sample pairs via data augmentation, while supervised contrastive learning constructs contrastive samples from manual labels.

Therefore, the core difficulty of self-supervised contrastive learning is constructing high-quality contrastive samples. In CV, positive samples are generally generated by cropping, rotation, Gaussian noise, masking, color jittering and similar operations. In NLP, noise is usually added by back-translation or by inserting, deleting and replacing characters to generate positive samples. Due to the discrete nature of text, it is generally difficult to generate good label-invariant augmented samples.
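As a toy illustration (purely hypothetical, not from the paper), here is what a simple word-deletion augmentation might look like; the SimCSE experiments later compare against exactly this kind of augmentation:

```python
import random

# Toy sketch (illustrative only): generate a "positive" view of a sentence
# by randomly deleting a small fraction of its words, one of the classic
# text-augmentation strategies SimCSE is later compared against.
def word_deletion(sentence: str, p: float = 0.1) -> str:
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else sentence

print(word_deletion("a man is playing the guitar on stage"))
```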

How do we learn to classify correctly from these contrastive sample pairs? That is where the self-supervised contrastive loss comes in:

$$\mathcal{L}_{\text{contrastive}} = \mathbb{E}_{(x, y) \sim p_{\text{pos}},\; \{x_i^{-}\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}} \left[ -\log \frac{e^{f(x)^{\top} f(y)/\tau}}{e^{f(x)^{\top} f(y)/\tau} + \sum_{i=1}^{M} e^{f(x_i^{-})^{\top} f(x)/\tau}} \right]$$

Here $f(\cdot)$ is an encoder that maps samples to a low-dimensional space (or the surface of a low-dimensional hypersphere), $(x, y)$ is a positive pair drawn from $p_{\text{pos}}$, each $x_i^{-}$ is a sample drawn from the data distribution $p_{\text{data}}$, and $(x, x_i^{-})$ are the negative pairs.

The contrastive loss pulls the features of a positive pair close to each other (pull), while pushing apart the features of randomly sampled negative pairs (push). In the section on alignment and uniformity below, we will discuss the properties of the contrastive loss in detail.

What is the relationship between the contrastive loss and the softmax loss? The loss above is just a softmax cross-entropy over $M + 1$ candidates: for each sample $x$, identify the correct "class", namely its positive $y$, from among $\{y, x_1^{-}, \dots, x_M^{-}\}$. Unlike a standard softmax classifier, the $M$ negatives here are drawn anew for every sample $x$, so the candidate set is different each time.
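To make the softmax view concrete, here is a minimal PyTorch sketch (an illustrative implementation, not SimCSE's official code) of the in-batch contrastive loss written as a cross-entropy:

```python
import torch
import torch.nn.functional as F

# Sketch: the in-batch contrastive (InfoNCE) loss is just a softmax
# cross-entropy where, for the i-th sample, the "correct class" is its own
# positive among the N candidates in the batch.
def info_nce(h1, h2, temperature=0.05):
    # h1, h2: (N, d) embeddings of the two views, rows aligned as positive pairs
    h1, h2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = h1 @ h2.T / temperature          # (N, N) cosine similarity matrix
    labels = torch.arange(h1.size(0))      # positive of i sits on the diagonal
    return F.cross_entropy(sim, labels)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```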

Alignment: Two samples of a positive pair should map to neighboring feature vectors, and the feature is invariant to the noise factor.

Uniformity: feature vectors should be roughly uniformly distributed on the unit hypersphere, preserving as much information about the data as possible.

Alignment and uniformity correspond to the familiar notions of intra-class compactness and inter-class separability in softmax-based losses; they are not new concepts. But this paper gives both concepts a precise mathematical formulation and shows that the contrastive loss optimizes them.

Why does contrastive learning work?

This comes from the paper "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere".

It points out that contrastive learning is useful mainly because it optimizes two goals:

  1. Positive examples are kept close together.
  2. The representations of random samples should be scattered over the hypersphere.

The two goals are measured with alignment and uniformity respectively.


Alignment: Calculates the expected vector distance between positive example pairs

The more similar the samples are, the higher the degree of alignment is. Because an alignment is measured by distance, a smaller distance indicates a higher degree of alignment.
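For reference, the alignment loss as defined in that paper (with $p_{\text{pos}}$ the distribution of positive pairs and $\alpha > 0$, typically 2) is:

$$\ell_{\text{align}} \triangleq \mathbb{E}_{(x, y) \sim p_{\text{pos}}} \left\| f(x) - f(y) \right\|_2^{\alpha}$$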


Uniformity: evaluates how uniformly all the data are distributed in the representation space. The more uniform, the more information is retained.

Imagine taking any two samples x and y from the data and looking at their representations: we want them to be far apart. The farther apart they are on average, the more uniform the distribution. So the lower the uniformity loss value, the better.
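The corresponding uniformity loss from the same paper is the log of the average pairwise Gaussian potential between i.i.d. samples (with $t > 0$, typically 2):

$$\ell_{\text{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}} \; e^{-t \left\| f(x) - f(y) \right\|_2^{2}}$$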

SimCSE also uses these two metrics to assess the generated sentence vectors and shows that the semantic space of text behaves the same way: the lower the alignment value and the lower the uniformity value, the higher the quality of the vector representation and the higher the Spearman correlation coefficient on the STS tasks.

Personal understanding: the sentence semantic space behaves the same way. Like word vectors / KGE, SimCSE shapes embeddings through expectations over distances in an abstract space, which shows that reasoning about distance and distribution in semantic space is an effective and interpretable approach.

SimCSE

SimCSE has two variants: Unsupervised SimCSE and Supervised SimCSE. The main difference lies in how the positive and negative examples for contrastive learning are constructed. Here is how each is built.

Unsupervised SimCSE

Unsupervised SimCSE adds dropout noise to the input, assuming that the noised input remains close to the original input in semantic space. Its positive and negative examples are constructed as follows:

Positive example: given an input $x_i$, encode it twice with the pre-trained language model, obtaining two vectors $h_i^{z_i}$ and $h_i^{z_i'}$ (produced by two different dropout masks $z_i$ and $z_i'$); these form the positive pair.

Negative examples: using in-batch negatives, any other input $x_j$ sampled in the same batch serves as a negative example for $x_i$.

Training objective function:
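For a batch of $N$ sentences, the unsupervised SimCSE objective for $x_i$ given in the paper (with cosine similarity $\mathrm{sim}(\cdot, \cdot)$ and temperature $\tau$) is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}\left(h_i^{z_i},\, h_i^{z_i'}\right)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}\left(h_i^{z_i},\, h_j^{z_j'}\right)/\tau}}$$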

The following figure shows an example of Unsupervised SimCSE:

How do I create a Dropout Mask?

For Unsupervised SimCSE, the core question is how the dropout masks are generated. When I first read this, I was amazed that dropout alone could be used to such good effect. The original text reads as follows:

In other words, we pass the same input sentence to the pre-trained encoder twice: by applying the standard dropout twice, we obtain two different embeddings as "positive pairs".

This works because BERT samples a different dropout mask at random on every forward pass. So SimCSE does not need to change the original BERT at all: simply feed the same sentence to the model twice, and the two resulting vectors are the outputs under two different dropout masks. These two vectors are then paired as a positive example. (Really that simple.)
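To make this concrete, here is a minimal sketch (illustrative only, not the official training code; the checkpoint name and [CLS] pooling are assumptions) of obtaining two dropout-perturbed embeddings of the same sentence:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active so each forward pass samples a new mask

inputs = tokenizer("A man is playing guitar.", return_tensors="pt")
with torch.no_grad():
    h1 = model(**inputs).last_hidden_state[:, 0]  # [CLS] vector, dropout mask 1
    h2 = model(**inputs).last_hidden_state[:, 0]  # [CLS] vector, dropout mask 2

# The two vectors differ only because of dropout; they form a positive pair.
print(torch.nn.functional.cosine_similarity(h1, h2).item())  # high, but < 1.0
```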

Supervised SimCSE

The paper also proposes Supervised SimCSE, which constructs positive and negative examples for contrastive learning from annotated data. To explore which annotated data are most conducive to sentence vector learning, the authors experiment with a variety of datasets and find that NLI data work best. The following uses NLI data as an example to introduce Supervised SimCSE.

Supervised SimCSE introduces the NLI task to supervise the contrastive learning process. The assumption is that if two sentences have an entailment relationship, their sentence vectors should be close; if two sentences contradict each other, their vectors should be farther apart. Entailment pairs and contradiction pairs in NLI therefore correspond to positive pairs and (hard) negative pairs in contrastive learning. Positive and negative examples in Supervised SimCSE are constructed as follows:

Positive examples: sentence pairs with the entailment relation in NLI. Negative examples: a) in-batch negatives and b) sentence pairs with the contradiction relation in NLI (hard negatives).

Training objectives:
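With the NLI triples $(x_i, x_i^{+}, x_i^{-})$ (premise, entailment hypothesis, contradiction hypothesis), the supervised SimCSE objective in the paper extends the denominator with the contradiction hard negatives:

$$\ell_i = -\log \frac{e^{\mathrm{sim}\left(h_i,\, h_i^{+}\right)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}\left(h_i,\, h_j^{+}\right)/\tau} + e^{\mathrm{sim}\left(h_i,\, h_j^{-}\right)/\tau} \right)}$$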

The experimental results

Is dropout better than traditional data augmentation?

The following figure compares the Spearman correlation on the STS-B validation set of Unsupervised SimCSE (line 1: None) against common data augmentation methods. Crop k% randomly deletes a span of k% of the length, word deletion k% randomly deletes k% of the words, delete one word removes only a single word, and MLM 15% randomly replaces 15% of the words. The dropout rate used in the table is 0.1 (the authors compared different dropout rates and p = 0.1 works best).

The experimental results clearly show that SimCSE is far superior to the other data augmentation methods. My understanding is that traditional data augmentation directly changes the original input, so after encoding, the distance between the augmented data and the original data in semantic space is larger than when dropout is used.

Dropout and contrastive learning

To understand why dropout works, the author visualizes how alignment and uniformity evolve during training under different methods.

The figure above compares how $\ell_{\text{align}}$ and $\ell_{\text{uniform}}$ change during training (sampled every 10 training steps) for different data augmentation / dropout settings. "Fixed 0.1" means p = 0.1 with the same dropout mask used for both encodings; for "Fixed 0.1" and "No dropout", the sentence representations of a positive pair are exactly identical.

It can be seen that as training proceeds, unsup. SimCSE's $\ell_{\text{uniform}}$ decreases steadily; although its $\ell_{\text{align}}$ does not drop noticeably, its initial value is already relatively low. The chart further demonstrates that SimCSE works because it keeps alignment low while steadily lowering (improving) uniformity.

What about semantic text similarity?

SimCSE is evaluated on the STS (Semantic Textual Similarity) tasks, with Spearman's correlation as the evaluation metric.
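For clarity, the evaluation protocol amounts to scoring each sentence pair by the cosine similarity of its embeddings and correlating those scores with the human labels; a minimal sketch with placeholder data:

```python
# Sketch of the STS evaluation (hypothetical data): embed both sentences of
# each pair, score them by cosine similarity, then report Spearman's
# correlation against the gold similarity labels.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# embeddings_a / embeddings_b: model outputs for the two sides of each pair
embeddings_a = np.random.rand(5, 768)    # placeholder vectors
embeddings_b = np.random.rand(5, 768)
gold_scores = [0.2, 4.8, 2.5, 3.1, 1.0]  # human-annotated STS scores (0-5)

pred_scores = [cosine(a, b) for a, b in zip(embeddings_a, embeddings_b)]
corr, _ = spearmanr(pred_scores, gold_scores)
print(f"Spearman correlation: {corr:.4f}")
```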

The table compares various approaches to building sentence vectors, from simple averaging of GloVe vectors to the previous SOTA, BERT-flow and BERT-whitening. As can be seen, SimCSE achieves significant improvements across all encoders and settings. For example, in the unsupervised setting with BERT-base and RoBERTa-base, it improves over BERT-whitening (avg.) by 7.96% and 14.77%, respectively.

In addition, different sentence representation models are compared in terms of $\ell_{\text{align}}$ and $\ell_{\text{uniform}}$, together with their results on the STS tasks:

As can be seen:

  • Avg. BERT embeddings have a low $\ell_{\text{align}}$ but a high $\ell_{\text{uniform}}$;
  • In contrast, post-processing methods for BERT such as BERT-flow and BERT-whitening lower $\ell_{\text{uniform}}$, but their $\ell_{\text{align}}$ becomes very high;
  • Unsup. SimCSE and SimCSE keep both values low, and their STS results are also better.

This shows that $\ell_{\text{align}}$ and $\ell_{\text{uniform}}$ need to be considered together: only when both values are low is the sentence representation learned by the model well suited to the STS tasks.


Transfer learning effect

In addition to the evaluation on the STS tasks, the trained sentence vectors are also transferred to seven downstream tasks.

SimCSE does not show a significant advantage in transfer learning. The authors' explanation is that the sentence-level training objective does not directly benefit transfer tasks. To improve transfer performance, they also try training the MLM loss together with the contrastive loss, which brings a small improvement (the rows marked w/ MLM in the table above).

Open-source code

Code link: github.com/princeton-n…

The pre-trained language model in this paper has been integrated into HuggingFace and can be called directly through the API, just like the BERT model.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
```
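A quick usage sketch (a hypothetical snippet, not taken from the repository): embed two sentences with the released checkpoint and compare them by cosine similarity of the pooler outputs.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

sentences = ["A man is playing the guitar.", "Someone is playing an instrument."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).pooler_output  # (2, hidden_size)

# Cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```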

Su Jianlin's experiments on Chinese tasks

kexue.fm/archives/83…

Inspired by BERT-flow, Su Jianlin devised the "BERT-whitening" method, which briefly became the new SOTA for semantic similarity. But soon at least two new papers appeared on arXiv with results clearly better than BERT-whitening. The first is Generating Datasets with Pretrained Language Models, which uses templates to construct similarity data pairs from GPT2-XL in an unsupervised way and then trains a similarity model. Personally, although the idea is inspiring to some degree and the results are decent, the cost and uncertainty of reproducing it are too high. The other is SimCSE: Simple Contrastive Learning of Sentence Embeddings; the proposed SimCSE significantly outperforms BERT-flow and BERT-whitening on English data, and the method is particularly simple.

So, does SimCSE work just as well in Chinese? Can it greatly improve Chinese semantic similarity? Su Jianlin ran a supplementary experiment. Source code: https://github.com/bojone/SimCSE

Results of Su Jianlin's experiments:

Each cell takes the form "a / b / c", where a is the raw result without any post-processing, b is the result with BERT-whitening (no dimensionality reduction), and c is the result with SimCSE. If c is higher than b, the cell is shown in green, otherwise in red; so the more green, the more SimCSE outperforms BERT-whitening.

In terms of results, except for the outlier PAWSX, SimCSE indeed beats BERT-whitening decisively, by more than 10 points on some tasks, and on BQ it even does better than the supervision-trained SimBERT. The fact that a model like SimBERT, which has already been trained with supervision, can be further improved shows that SimCSE is indeed powerful. (As for why PAWSX behaves "differently", the article "Unsupervised semantic similarity: which is stronger? We did a more comprehensive evaluation" gives a brief analysis.)

Meanwhile, it can also be seen that under SimCSE, the first-last-avg pooling that works better for BERT-flow and BERT-whitening shows no advantage; taking the [CLS] vector directly works better. Surprisingly, the Pooler output ([CLS] passed through the Dense layer) is unexpectedly poor.

Since BERT-whitening is only a linear transformation, Su also tested whether SimCSE alone can reproduce the effect of such a linear transformation: the encoder weights are frozen, a Dense layer without an activation function is appended, and only that last Dense layer is trained with the SimCSE objective. It turns out that SimCSE under this constraint is not as good as BERT-whitening. This means SimCSE needs to fine-tune the encoder to be effective, and it also suggests that BERT-whitening may capture something SimCSE does not; perhaps some combination of the two would work even better.

While studying contrastive learning I also came across Su Jianlin's new post: SimBERTv2 is here! RoFormer-Sim, a model fusing retrieval and generation. SimBERT = BERT + UniLM + contrastive learning; RoFormer-Sim = RoFormer + UniLM + contrastive learning + BART + distillation.