1. datasets

We test the text-similarity model on two datasets, ATEC and LCQMC. Let's look at an example from each:

Both datasets consist of Chinese sentence pairs with a binary label (1 = the two sentences have the same intent, 0 = they do not). The ATEC pairs come from Ant Financial customer service and revolve around the Huabei (花呗) credit service, for example questions about monthly installment repayment, unbinding a bank card, closing Huabei, or changing the installment limit. The LCQMC pairs are open-domain questions, for example near-duplicate questions about failed verification and visitor status, where to eat, or the last four digits of a number.
| dataset | Number (train/dev/test) | Train distribution (0:1) |
| --- | --- | --- |
| ATEC | 11/22/33 | 75435 : 16870 |
| LCQMC | 11/22/33 | 100192 : 138574 |

Judging from the class distribution, ATEC is an extremely imbalanced dataset.

2. solutions

There are several common solutions to this kind of problem. First, the data is preprocessed: sentences are tokenized with jieba and stop words are filtered out using the Baidu stop-words list (link).
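A minimal preprocessing sketch, assuming the stop-word list has been saved locally as `baidu_stopwords.txt` (one word per line); the file name and helper functions are illustrative, not the project's actual code:

```python
# Tokenize with jieba and drop stop words from the Baidu stop-word list.
import jieba

def load_stopwords(path="baidu_stopwords.txt"):
    # Assumed local copy of the Baidu stop-word list, one word per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(sentence, stopwords):
    # jieba.lcut returns a list of tokens; keep only non-stop-word tokens.
    return [tok for tok in jieba.lcut(sentence) if tok not in stopwords]

stopwords = load_stopwords()
print(preprocess("蚂蚁花呗这个月的分期账单怎么还款", stopwords))
```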

2.1 Transformer + Siamese-Net

Experiment 1: GloVe embeddings, dim = 300, token-level

| dataset | F1 score | precision | aveTime |
| --- | --- | --- | --- |
| ATEC | 0.82157 | 0.82157 | |
| LCQMC | 0.50272 | 0.50272 | |

Analysis: on ATEC the predictions collapsed; every pair was predicted as a negative example.

Experiment 2: BERT embeddings

Experiment 3: randomly initialized embeddings

2.2 ALBERT

The ALBERT model is fine-tuned directly. Evaluation logs for the different epoch settings:

```
# Setting 1, epochs = 1.0
eval_accuracy = 0.81263494
eval_loss = 0.41541135
global_step = 3730
loss = 0.41508695

# Setting 2, epochs = 2.0
eval_accuracy = 0.78366095
eval_loss = 1.7134513
global_step = 7461

# Setting 3
eval_accuracy = 0.788774
eval_loss = 2.2002842
global_step = 55960
```

| dataset | F1 score | precision | epoch | aveTime |
| --- | --- | --- | --- | --- |
| LCQMC | 0.84032 | 0.84032 | 1 | |
| LCQMC | 0.844 | 0.844 | 5 | |
| LCQMC | 0.84192 | 0.84192 | 10 | |
| LCQMC | 0.80304 | 0.80304 | 15 | |

Why does performance get worse as the number of epochs increases?

Model description

ALBERT stands for "A Lite BERT": a lighter, and also upgraded, version of BERT. Compared with BERT, ALBERT introduces three major changes (a sketch of the factorized embedding follows the list):

  1. Factorized Embedding parameterization
  2. Cross-layer Parameter Sharing
  3. Replacing the NSP task with the SOP (sentence-order prediction) task, an inter-sentence coherence loss
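A toy sketch of change 1, the factorized embedding parameterization, with illustrative sizes rather than ALBERT's exact configuration: instead of a single V x H embedding matrix, a V x E lookup is followed by an E x H projection, cutting embedding parameters from V*H to V*E + E*H.

```python
# Factorized embedding parameterization sketch (sizes are made up for illustration).
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # V x E lookup
        self.project = nn.Linear(embed_dim, hidden_dim)        # E x H projection up to hidden size

    def forward(self, token_ids):
        # Look up the small embedding, then project to the transformer's hidden size.
        return self.project(self.word_embed(token_ids))

emb = FactorizedEmbedding()
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 768])
```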

code · paper

Disadvantages:

Long inference time

2.3 ALBERT + Siamese-Net link

Model description

The structure of this model is very simple: BERT serves as the encoder inside a Siamese (twin-network) structure, so fine-tuning on the specific task directly updates BERT's weights. The resulting sentence vectors can then be compared directly with cosine similarity, as shown in the figure. Here we use the ALBERT model as the encoder.
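A minimal sketch of this Siamese fine-tuning setup, using a Hugging Face `transformers` checkpoint (`bert-base-chinese` as a stand-in for the ALBERT weights actually used); mean pooling and the MSE-on-cosine loss are illustrative choices, not necessarily the project's:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode(sentences):
    # Shared encoder: the same weights process both sides of the pair.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # mean pooling -> (B, H)

# One toy training step on a single pair; label 1 = same meaning, 0 = different.
a = encode(["怎么解绑花呗的银行卡"])
b = encode(["花呗如何解除绑定银行卡"])
label = torch.tensor([1.0])
sim = nn.functional.cosine_similarity(a, b)
loss = nn.functional.mse_loss(sim, label)
loss.backward()  # gradients flow back into the shared encoder weights
```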

| dataset | F1 score | precision | checkpoint | aveTime |
| --- | --- | --- | --- | --- |
| LCQMC | 0.80712 | 0.80712 | 149228 | |
| LCQMC | 0.71472 | 0.71472 | 7461 | |
| LCQMC | 0.71472 | 0.71472 | 37307 | |

A guide to building a Siamese-Net with ALBERT: code

2.4 SimNet

Model description

SimNet is trained with a pairwise hinge loss: $\max\{0,\ \mathrm{margin} - (S(Q, D^+) - S(Q, D^-))\}$, where $S(Q, D^+)$ and $S(Q, D^-)$ are the model's scores for a positive and a negative query-document pair; the loss pushes the positive score above the negative one by at least the margin.

code · document
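A small sketch of the pairwise hinge loss above, where `s_pos = S(Q, D+)` and `s_neg = S(Q, D-)` are model scores and `margin=0.1` is an arbitrary illustrative value:

```python
import torch

def pairwise_hinge_loss(s_pos, s_neg, margin=0.1):
    # max(0, margin - (S(Q, D+) - S(Q, D-))), averaged over the batch
    return torch.clamp(margin - (s_pos - s_neg), min=0).mean()

s_pos = torch.tensor([0.9, 0.6])
s_neg = torch.tensor([0.2, 0.7])
print(pairwise_hinge_loss(s_pos, s_neg))  # loss is non-zero only where the ranking is violated
```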

2.5 MatchPyramid

Summary of the idea: inspired by image recognition, MatchPyramid first computes the similarity between the tokens at every position of the two input sentences to build a similarity matrix, then applies a CNN to extract features from this matrix, followed by a max-pooling layer and a softmax classifier.
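A rough sketch of the MatchPyramid idea with made-up sizes (one convolution layer, embedding dim 128); this is not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MatchPyramid(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # CNN over the similarity matrix
        self.pool = nn.AdaptiveMaxPool2d((4, 4))                # max pooling to a fixed size
        self.fc = nn.Linear(8 * 4 * 4, num_classes)             # logits; softmax applied at loss time

    def forward(self, sent_a, sent_b):
        ea, eb = self.embed(sent_a), self.embed(sent_b)          # (B, La, D), (B, Lb, D)
        sim = torch.einsum("bld,bmd->blm", ea, eb).unsqueeze(1)  # similarity matrix (B, 1, La, Lb)
        feat = self.pool(torch.relu(self.conv(sim)))
        return self.fc(feat.flatten(1))

model = MatchPyramid()
a = torch.randint(0, 10000, (2, 7))
b = torch.randint(0, 10000, (2, 9))
print(model(a, b).shape)  # torch.Size([2, 2])
```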

2.6 PolyEncoder

Model description

The paper refers to the Siamese-Net style as a bi-encoder and the fully interactive ALBERT/BERT style as a cross-encoder. The Poly-encoder is a compromise between the two that balances prediction speed and accuracy.

The network structure is still symmetric. The encoder on the right encodes the candidate into a single vector $y_{cand}$. The encoder on the left encodes the input $Q$ into $m$ vectors at the top layer, $(y^1_{ctxt}, \dots, y^m_{ctxt})$, which are then combined with $y_{cand}$ through an attention operation; the weights $w_i$ can be thought of as scores computed relative to the current candidate. That is:

$$y_{ctxt} = \sum_{i} w_i\, y^i_{ctxt}, \qquad (w_1, \dots, w_m) = \mathrm{softmax}\big(y_{cand} \cdot y^1_{ctxt}, \dots, y_{cand} \cdot y^m_{ctxt}\big)$$

For a candidate $A$, $y_{cand}$ can be cached in advance, so at prediction time only the query $Q$ needs to be encoded to obtain the $m$ top-layer vectors and then $y_{ctxt}$; the cosine similarity between $y_{ctxt}$ and $y_{cand}$ is then the $Q$–$A$ similarity.
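A toy sketch of the scoring step just described, with made-up dimensions (hidden size 64, m = 4 context vectors); it is not the paper's implementation:

```python
import torch

def poly_encoder_score(y_ctxt_m, y_cand):
    # y_ctxt_m: (m, H) top-layer context vectors for the query
    # y_cand:   (H,)   cached candidate vector
    w = torch.softmax(y_ctxt_m @ y_cand, dim=0)   # attention weights w_i relative to this candidate
    y_ctxt = (w.unsqueeze(1) * y_ctxt_m).sum(0)   # weighted sum -> final context vector
    return torch.dot(y_ctxt, y_cand)              # Q-A similarity score

y_ctxt_m = torch.randn(4, 64)
y_cand = torch.randn(64)
print(poly_encoder_score(y_ctxt_m, y_cand))
```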

There remains the question of how to obtain these $m$ vectors. The author mainly tests two strategies:

  1. Directly take the first $m$ outputs of the top layer, as shown in Figure 1.
  2. Learn $m$ codes $(c_1, \dots, c_m)$, each of which attends over all the top-layer outputs to produce the corresponding $y^i_{ctxt}$, as shown in Figure 2.

Paper · code · reference 1

2.7 ColBERT (SIGIR 2020)

1. Model introduction

ColBERT also uses a Siamese structure, split into a query encoder and a doc encoder. The model keeps the token-level embeddings produced by the encoders and measures similarity by computing the interaction between every Query token and every Doc token.

2. Model structure

  1. Query and Doc are encoded separately

The same BERT model encodes both Query and Doc, so to distinguish them the input sequences are prefixed with [Q] and [D] respectively. BERT acts as the encoder, a CNN layer performs a dimension reduction on BERT's hidden-layer output, and Normalize applies L2 normalization so that cosine similarity can later be computed as a dot product.
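A rough sketch of this encoding step under assumed components: a shared BERT from Hugging Face `transformers`, a linear down-projection standing in for the dimension-reduction layer, and L2 normalization; the [Q]/[D] markers are inserted here as plain text rather than dedicated vocabulary tokens:

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
project = nn.Linear(bert.config.hidden_size, 128)    # dimension reduction of the hidden states

def encode(text, marker):
    batch = tokenizer(f"{marker} {text}", return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (1, L, 768) token-level embeddings
    return F.normalize(project(hidden), dim=-1)       # L2-normalize so dot product = cosine

E_q = encode("what is huabei installment", "[Q]")
E_d = encode("huabei lets you repay in monthly installments", "[D]")
print(E_q.shape, E_d.shape)
```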

Query input sequences shorter than the fixed length are padded with special [mask] tokens, which the authors call Query Augmentation; it has a noticeable effect on the experimental results. The authors interpret its effectiveness as letting BERT produce query-conditioned embeddings at the mask positions, so the model learns to expand the query with new terms or to re-weight existing terms by importance.

For Doc, punctuation tokens are filtered out.

  2. Late Interaction

Finally, the similarity $S_{q,d}$ is computed: for each Query token embedding, take its maximum similarity over all Doc token embeddings, and sum these maxima over the Query tokens to get the final score

$$S_{q,d} = \sum_{i} \max_{j} E_{q_i} \cdot E_{d_j}^{\top}$$

(since the matrices were L2-normalized first, these dot products are in fact cosine similarities).
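A small sketch of this late-interaction MaxSim score, assuming `E_q` and `E_d` are already L2-normalized token-embedding matrices (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def maxsim_score(E_q, E_d):
    # E_q: (Lq, D) query token embeddings; E_d: (Ld, D) doc token embeddings
    sim = E_q @ E_d.T                       # (Lq, Ld) cosine similarities (unit-norm vectors)
    return sim.max(dim=1).values.sum()      # max over doc tokens, summed over query tokens

E_q = F.normalize(torch.randn(8, 128), dim=-1)
E_d = F.normalize(torch.randn(120, 128), dim=-1)
print(maxsim_score(E_q, E_d))
```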

In addition, the paper introduces a method to retrieve directly from the complete collection using faiss, which the authors call end-to-end top-k retrieval:

  1. faiss-based candidate filtering
  2. re-ranking

Faiss is Facebook's open-source toolkit for efficient similarity search and clustering of dense vectors.
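A minimal faiss sketch of the candidate-filtering step, using an exact inner-product index; the vector dimension and collection size are made up for illustration:

```python
import numpy as np
import faiss

dim = 128
doc_vecs = np.random.rand(10000, dim).astype("float32")
faiss.normalize_L2(doc_vecs)                 # unit vectors, so inner product = cosine

index = faiss.IndexFlatIP(dim)               # exact inner-product search
index.add(doc_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)        # top-10 candidates for later re-ranking
print(ids[0])
```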

3. Results analysis and conclusions

  • First, let's look at the overall results. The figure above compares ColBERT with four main families of approaches:
  1. Bag-of-words (BOW) based methods such as BM25.
  2. Efficient NLU models based on BOW.
  3. Neural network Matching Models.
  4. Language Model Pretraining for IR Based on BERT.

On the **effectiveness (MRR@10)** dimension, introducing pre-trained language models brings a large improvement. On the **retrieval performance (latency)** dimension, fully interactive models represented by BERT are significantly worse, while the ColBERT proposed in this paper achieves a good trade-off between the two dimensions.

See the figure above. If the end-to-end method is used, the results are better, but latency increases a lot; in terms of cost it is not cost-effective.

On another dataset (TREC CAR), BM25 + ColBERT also yields good results.

In the figure above, the authors conduct ablation experiments on ColBERT and conclude:

  1. Query augmentation brings a significant improvement.
  2. ColBERT's late interaction clearly beats using a single [CLS] vector for the dot product.
  3. Maximum similarity (MaxSim) is significantly better than average similarity.
  4. Searching end-to-end over the complete collection is better than the pipelined setting (retrieve the top 1000 with BM25 first, then re-rank).
  5. ColBERT needs to encode every Doc with BERT and store the resulting doc representations as an index. The figure above shows the storage required for different dimensions and bytes/dim, along with the corresponding MRR@10. Setting the dimension m = 24 with 2 bytes per dimension clearly reduces storage while still giving good results.

【Reference】 paper · code

2.8 Text Matching with Richer Alignment Features

paper

3. Evaluation metrics

Precision@K and Average Precision@K: Average Precision@K averages the precision at the positions of each correct hit, from the first correct hit up to the K-th; MAP@K (Mean Average Precision@K) is its mean over all queries.
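A small sketch of Precision@K and Average Precision@K for a single query, assuming `relevant` is the set of correct ids and `ranked` the retrieved list (here AP is normalized by the number of hits found, per the description above):

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved items that are correct.
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    # Average of the precision values at each correct hit up to rank k.
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranked, relevant, 5))          # 0.4
print(average_precision_at_k(ranked, relevant, 5))  # (1/2 + 2/4) / 2 = 0.5
```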

code

4. Other content

Measurement of Text Similarity: Text_matching (GitHub) · A Survey of Semantic Similarity · A Survey of Text Matching Techniques