The paper

BERT and RoBERTa have achieved state-of-the-art (SOTA) results on sentence-pair regression tasks such as Semantic Textual Similarity (STS). However, both require feeding the two sentences into the network together, which incurs a huge computational overhead: finding the most similar pair among 10,000 sentences requires about 50 million ($C_{10000}^2 = 49{,}995{,}000$) inference computations, roughly 65 hours on a V100 GPU. This structure makes BERT unsuitable for semantic similarity search, as well as for unsupervised tasks such as clustering.

A common approach to clustering and semantic search is to map each sentence into a vector space so that semantically similar sentences end up close together. Generally, there are two ways to obtain such sentence vectors:

  1. Average all token output vectors
  2. Use the output vector at the [CLS] position

However, experiments by researchers at UKP found that neither method produces good results on Semantic Textual Similarity (STS) tasks; even GloVe vectors significantly outperform plain BERT sentence embeddings (see the first rows of the results table in the paper).
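For reference, here is a minimal sketch of how these two plain-BERT sentence embeddings are typically extracted, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; this is illustrative only, not the paper's code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["A soccer game with multiple males playing.",
             "Some men are playing a sport."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)          # out.last_hidden_state: (batch, seq_len, hidden)

# Method 1: average all token output vectors (padding tokens masked out)
mask = enc["attention_mask"].unsqueeze(-1).float()          # (batch, seq_len, 1)
mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# Method 2: use the output vector at the [CLS] position (token 0)
cls_emb = out.last_hidden_state[:, 0]
```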

The authors of Sentence-BERT (SBERT) modify pre-trained BERT by using **Siamese and triplet networks** to generate sentence embeddings that carry semantic meaning. Semantically similar sentences can then be found with cosine similarity, Manhattan distance, Euclidean distance, and so on. SBERT reduces the 65 hours of BERT/RoBERTa mentioned above to about 5 seconds (with the cosine similarity computation taking about 0.01 seconds) while maintaining accuracy. This allows SBERT to be used for new tasks such as clustering and semantic-similarity-based information retrieval.
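To see where the speed-up comes from: once every sentence has been encoded exactly once, finding the most similar pair reduces to cheap vector operations. A hypothetical sketch (the embedding matrix here is random and only stands in for real SBERT embeddings):

```python
import torch
import torch.nn.functional as F

# Suppose `emb` holds one precomputed SBERT embedding per sentence: (10000, 768)
emb = torch.randn(10000, 768)        # placeholder for real embeddings

normed = F.normalize(emb, dim=1)     # unit-length rows
sim = normed @ normed.T              # (10000, 10000) cosine similarities
sim.fill_diagonal_(-1.0)             # ignore self-similarity

# Most similar partner for each sentence
best_score, best_idx = sim.max(dim=1)
```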

Model introduction

Pooling strategy

SBERT adds a pooling operation on top of the BERT/RoBERTa output to generate a fixed-dimensional sentence embedding. Three pooling strategies are compared in the experiments:

  1. CLS: directly use the output vector at the [CLS] position as the sentence vector
  2. MEAN: average all token output vectors and use the result as the sentence vector
  3. MAX: take the dimension-wise maximum over all token output vectors as the sentence vector

The experimental comparison of the three strategies is reported in the paper's ablation study.

The results show that MEAN works best, so the MEAN strategy is used by default in subsequent experiments.
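A minimal sketch of the three pooling strategies over BERT's token outputs (token embeddings of shape (batch, seq_len, hidden) plus an attention mask); this mirrors the description above rather than the authors' exact implementation:

```python
import torch

def pool(token_embeddings, attention_mask, mode="mean"):
    """token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()
    if mode == "cls":
        return token_embeddings[:, 0]                        # [CLS] position
    if mode == "mean":
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    if mode == "max":
        # Set padded positions to -inf so they never win the max
        masked = token_embeddings.masked_fill(mask == 0, float("-inf"))
        return masked.max(dim=1).values
    raise ValueError(f"unknown pooling mode: {mode}")
```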

Model structure

In order to fine-tune BERT/RoBERTa, Siamese and triplet network structures are used to update the parameters so that the generated sentence embeddings carry more semantic information. The concrete network structure depends on the available training data. The following structures and objective functions are evaluated.

Classification Objective Function

For classification problems, the author concatenates the sentence embeddings $u$, $v$, and the element-wise difference $|u-v|$, then multiplies the result by a trainable weight matrix $W_t \in \mathbb{R}^{3n \times k}$, where $n$ is the sentence embedding dimension and $k$ is the number of labels:


$$o = \mathrm{softmax}(W_t[u; v; |u-v|])$$

The loss function is CrossEntropyLoss

Note: the original paper writes the formula as $\mathrm{softmax}(W_t(u, v, |u-v|))$; I personally prefer $[\,;\,]$ to denote vector concatenation.
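A sketch of this classification objective under the definitions above (the dimension values are illustrative; note that PyTorch's CrossEntropyLoss expects unnormalized logits, so the explicit softmax is only needed at prediction time):

```python
import torch
import torch.nn as nn

n, k = 768, 3                              # embedding dim, number of labels
W_t = nn.Linear(3 * n, k, bias=False)      # trainable weight W_t in R^{3n x k}
criterion = nn.CrossEntropyLoss()

def classification_loss(u, v, labels):
    """u, v: (batch, n) sentence embeddings; labels: (batch,) class ids."""
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # [u; v; |u-v|]
    logits = W_t(features)                 # softmax(logits) gives o at inference
    return criterion(logits, labels)
```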

Regression Objective Function

The regression objective computes the cosine similarity between the two sentence embeddings $u$ and $v$, and the loss function is the mean squared error (MSE).
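A minimal sketch of this regression objective; the gold similarity scores are assumed to already be rescaled to the cosine-similarity range (that rescaling choice is mine, not stated here):

```python
import torch
import torch.nn.functional as F

def regression_loss(u, v, gold_scores):
    """u, v: (batch, n) sentence embeddings; gold_scores: (batch,) similarity targets."""
    cos = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(cos, gold_scores)    # MSE between predicted and gold similarity
```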

Triplet Objective Function

For more information about the Triplet Network, see my article Siamese Network & Triplet Network. Given an anchor sentence $a$, a positive sentence $p$, and a negative sentence $n$, the triplet loss tunes the network so that the distance between $a$ and $p$ is as small as possible and the distance between $a$ and $n$ is as large as possible. Mathematically, we minimize the following loss function:


$$\max(\|s_a - s_p\| - \|s_a - s_n\| + \epsilon,\ 0)$$

where $s_x$ denotes the embedding of sentence $x$ and $\|\cdot\|$ denotes a distance metric. The margin $\epsilon$ ensures that $s_a$ is at least $\epsilon$ closer to $s_p$ than to $s_n$. In the experiments, the Euclidean distance is used as the distance metric and $\epsilon$ is set to 1.
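A sketch of the triplet objective with Euclidean distance and $\epsilon = 1$, matching the description above:

```python
import torch

def triplet_loss(s_a, s_p, s_n, epsilon=1.0):
    """s_a, s_p, s_n: (batch, n) embeddings of anchor, positive, negative."""
    d_ap = torch.norm(s_a - s_p, p=2, dim=-1)   # Euclidean distance anchor-positive
    d_an = torch.norm(s_a - s_n, p=2, dim=-1)   # Euclidean distance anchor-negative
    return torch.clamp(d_ap - d_an + epsilon, min=0.0).mean()
```

PyTorch's built-in `nn.TripletMarginLoss` implements the same hinge-style formula.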

Model training details

For training, the author combines the SNLI (Stanford Natural Language Inference) and Multi-Genre NLI (MultiNLI) datasets. SNLI contains 570,000 human-annotated sentence pairs labeled contradiction, entailment, or neutral. MultiNLI extends SNLI with the same format and labels and contains 430,000 sentence pairs drawn from a range of spoken and written text genres.

The entailment relation describes an inferential relationship between two texts, where one serves as the Premise and the other as the Hypothesis. If the Hypothesis can be inferred from the Premise, the Premise is said to entail the Hypothesis. For example:

| Sentence A (Premise) | Sentence B (Hypothesis) | Label |
| --- | --- | --- |
| A soccer game with multiple males playing. | Some men are playing a sport. | entailment |
| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |
| A man inspects the uniform of a figure in some East Asian country. | The man is sleeping. | contradiction |

In the experiments, the author fine-tunes SBERT with a 3-way softmax classification objective, a batch size of 16, the Adam optimizer, and a learning rate of 2e-5.
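As a usage reference, the authors' sentence-transformers library packages this training setup; below is a minimal sketch along those lines. The example data and the warm-up step count are placeholders of mine, not values from this post.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# BERT encoder + MEAN pooling, as described above
word_embedding = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# Placeholder NLI-style examples; real training uses the SNLI + MultiNLI pairs
train_examples = [
    InputExample(texts=["A soccer game with multiple males playing.",
                        "Some men are playing a sport."], label=0),  # e.g. 0 = entailment
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3-way softmax classification objective with cross-entropy loss
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,                  # placeholder value
    optimizer_params={"lr": 2e-5},     # learning rate 2e-5, as in the post
)
```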

Ablation study

To perform ablation studies on different aspects of SBERT and better understand their relative importance, classification models were trained on the SNLI and MultiNLI datasets, and regression models were trained on the STS benchmark dataset. For the pooling strategy, MEAN, MAX, and CLS are compared; for the classification objective function, different vector combination methods are compared.

The results show that the pooling strategy has a relatively small effect, while the vector combination strategy matters more, with $[u, v, |u-v|]$ working best.

References

  • Paper reading | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  • Richer Sentence Embeddings using Sentence-BERT — Part I
  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
  • Sentence-BERT: a Siamese network that can quickly compute sentence similarity