Based on MS MARCO, Microsoft's large-scale real-world reading comprehension dataset, the Meituan Search and NLP Center proposed a BERT-based solution for the text retrieval task, DR-BERT, which was the first model to break 0.4 on the official evaluation metric MRR@10.

This article shares our practical experience with the DR-BERT algorithm on the text retrieval task, and we hope it inspires and helps readers working on retrieval and ranking.

Background

Improving machine reading comprehension (MRC) and open-domain question answering (QA) capabilities is an important goal in natural language processing (NLP). Many breakthroughs in artificial intelligence have been built on large, publicly available datasets. In computer vision, for example, object classification models trained on ImageNet have surpassed human performance. Similarly, in speech recognition, large speech corpora have enabled deep learning models to greatly improve recognition accuracy.

In recent years, more and more MRC and QA datasets have emerged to improve models' natural language understanding. However, these datasets are flawed in various ways, such as insufficient data volume or reliance on manually constructed queries. To address these problems, Microsoft released MS MARCO (Microsoft MAchine Reading COmprehension) [1], a reading comprehension dataset built from large-scale real-world data. The dataset is based on real search queries from the Bing search engine and the Cortana intelligent assistant, and contains 1 million queries, 8 million documents and 180,000 human-edited answers. On top of MS MARCO, Microsoft defined two tasks: one is, given a query, to retrieve and rank all documents in the dataset, which is a document retrieval and ranking task; the other is to generate an answer given a query and its relevant documents, which is a QA task. At Meituan, document retrieval and ranking algorithms are widely used in search, advertising, recommendation and other scenarios. Moreover, running QA directly over all candidate documents is prohibitively slow: QA must rely on a ranking stage to filter out the top documents, so the performance of the ranking algorithm directly affects the performance of QA. For these reasons, we focused on the document retrieval and ranking task of MS MARCO.

Since its launch in October 2018, MS MARCO has attracted many participants from industry and academia, including Alibaba DAMO Academy, Facebook, Microsoft, Carnegie Mellon University and Tsinghua University. On Meituan's MT-BERT pre-training platform [14], we proposed a BERT-based solution for this text retrieval task, called DR-BERT (Enhancing BERT-based Document Ranking Model with Task-adaptive Training and OOV Matching Method). DR-BERT was the first model to break 0.4 on the official evaluation metric MRR@10, and it topped the leaderboard from May 21, 2020 (the submission date) to August 12, 2020. The organizers also tweeted their congratulations, as shown in Figure 1 below. The core innovations of DR-BERT include domain-adaptive pre-training, two-stage model fine-tuning, and two OOV (out-of-vocabulary) matching methods.

Related work

Learning to Rank

In information retrieval, many learning-to-rank models were proposed early on to solve the document ranking problem, including LambdaRank [2] and AdaRank [3]. These models rely heavily on manually constructed features. As deep learning became popular in machine learning, researchers proposed many neural ranking models, such as DSSM [4] and KNRM [5]. These models map queries and documents into a continuous vector space and compute their similarity with neural networks, thus avoiding tedious manual feature engineering.

Depending on the learning objective, ranking models can be divided into pointwise, pairwise and listwise approaches; Figure 2 illustrates the three. The pointwise approach directly predicts the relevance score of each query-document pair. Although easy to implement, it ignores the fact that ranking depends more on the relative relationship between different documents. With this in mind, the pairwise approach turns ranking into pairwise comparisons of documents: given a query, each document is compared with the others to determine which is better. In this way, the model learns the relative relationships between documents.

However, the pairwise approach has two problems. First, it optimizes pairwise comparisons rather than the ordering of many documents, which differs from the actual goal of ranking. Second, randomly sampling document pairs easily introduces bias into the training data. To remedy these problems, the listwise approach extends the pairwise idea and directly learns over ranked lists. Depending on the form of the loss function, researchers have proposed a variety of listwise models. For example, ListNet [6] models the ranked list with the top-one probability distribution over documents and optimizes a cross-entropy loss; ListMLE [7] optimizes a maximum-likelihood objective; SoftRank [8] directly optimizes the ranking metric NDCG. Most studies show that listwise learning yields better ranking results than pointwise and pairwise methods.
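For concreteness, the ListNet objective mentioned above can be written as follows; this is the standard top-one-probability formulation from the ListNet paper [6], reproduced here for reference rather than taken from the original post. Given predicted scores s_j and relevance labels y_j for the documents of one query:

```latex
% Top-one probabilities induced by scores and labels, and the ListNet cross-entropy loss
P_s(j) = \frac{\exp(s_j)}{\sum_{k} \exp(s_k)}, \qquad
P_y(j) = \frac{\exp(y_j)}{\sum_{k} \exp(y_k)}, \qquad
\mathcal{L}_{\mathrm{ListNet}} = - \sum_{j} P_y(j) \, \log P_s(j)
```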

BERT

Since Google proposed BERT [9] in 2018, pre-trained language models have achieved great success in natural language processing, reaching SOTA results on a wide range of NLP tasks. BERT is essentially an encoder based on the Transformer architecture. The key to its success is that the self-attention mechanism in the multi-layer Transformer extracts semantic features at different levels, giving the model strong semantic representation power. As shown in Figure 3, BERT training consists of two parts: pre-training on a large-scale corpus, and fine-tuning for a specific task.

In information retrieval, many researchers have also begun using BERT for ranking tasks. For example, [10] and [11] experimented with BERT on MS MARCO and greatly outperformed the best neural ranking models of the time; [10] uses a pointwise learning method, while [11] uses a pairwise one. Although these efforts achieved good results, they did not exploit the comparison information inherent in ranking. Building on this observation, we combine BERT's semantic representation ability with listwise ranking and achieve substantial improvements.

Model introduction

Task description

Candidate pre-screening based on DeepCT

Because MS MARCO is so large, directly computing the relevance between a query and every document with a deep neural network would be far too slow. Therefore, most ranking models adopt a two-stage approach: the first stage retrieves the top-K candidate documents, and the second stage re-ranks these candidates with a deep neural network. We use the BM25 algorithm for first-stage retrieval. BM25 relies on bag-of-words term statistics such as TF-IDF, which cannot capture the contextual semantics of each word. To alleviate this, DeepCT [12] first encodes each document with BERT and then outputs an importance score for every term. Thanks to BERT's strong semantic representation ability, the importance of words within a document can be measured well. As Figure 4 below shows, the darker a word, the higher its importance; for example, "stomach" is more important in the first document.
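As an illustration of the first retrieval stage (a toy sketch, not the authors' production code), BM25 candidate screening could look like the following, using the open-source rank_bm25 package; the corpus and query are made up:

```python
# Toy first-stage BM25 retrieval; corpus, query and K are illustrative only.
from rank_bm25 import BM25Okapi

corpus = [
    "stomach pain can be caused by indigestion or gastritis",
    "the history of the bing search engine",
    "home remedies for an upset stomach and nausea",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "what causes stomach pain".split()
scores = bm25.get_scores(query)              # one BM25 score per document
top_k = bm25.get_top_n(query, corpus, n=2)   # top-K candidates passed to the BERT re-ranker
print(scores, top_k)
```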

The DeepCT training objective is as follows:
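The formula image from the original post is not reproduced here. For reference, the DeepCT paper [12] defines the regression target as the query term recall and trains with a per-token squared error (notation follows that paper):

```latex
% Query term recall of term t in document d, and the per-token regression loss,
% as defined in the DeepCT paper [12]; \hat{y}_{t,d} is the score predicted by BERT.
\mathrm{QTR}(t, d) = \frac{|Q_{d,t}|}{|Q_d|}, \qquad
\mathcal{L} = \sum_{t \in d} \left( \hat{y}_{t,d} - \mathrm{QTR}(t, d) \right)^2
```

where Q_d is the set of queries for which d is a relevant document and Q_{d,t} is the subset of those queries containing term t.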

Domain-adaptive pre-training

Our model is based on BERT, but the corpus BERT was originally pre-trained on differs in domain from the corpus of the current task. We confirmed this by analyzing the top 10,000 high-frequency words in both corpora: the top 10,000 high-frequency words of MS MARCO differ from those of BERT's original pre-training corpus by more than 40%. It is therefore necessary to further pre-train BERT on in-domain text. Since MS MARCO is itself a large-scale corpus, we can directly use its document contents for this pre-training. In this first stage, we continued pre-training on MS MARCO with the MLM and NSP objectives.
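A minimal sketch of such continued pre-training, assuming the Hugging Face transformers and datasets libraries, might look like this; the file path and hyperparameters are illustrative assumptions, and only the MLM objective is shown for brevity (the original work also uses NSP):

```python
# Illustrative continued (domain-adaptive) pre-training on MS MARCO passages.
# "msmarco_passages.txt" (one passage per line) and all hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")   # MLM head only

dataset = load_dataset("text", data_files={"train": "msmarco_passages.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-msmarco-dapt",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```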

Two-stage fine-tuning

Figure 5 shows the structure of our proposed fine-tuning model, which consists of two stages: pointwise fine-tuning and listwise fine-tuning.

Question-type-aware pointwise fine-tuning

In the first fine-tuning stage, our goal is to establish the relationship between questions and documents through pointwise training. We take the query-document pair as input, encode it with BERT, and match the question against the document. Since the matching pattern between a question and a document is strongly related to the question type, we believe the question type also needs to be considered at this stage. Therefore, we encode the question, the question type and the document together with BERT to obtain a deeply interacted semantic representation. Specifically, we concatenate the question type T, the question Q and the i-th document Di into one input sequence, as shown in the following formula:

The score ri is optimized with a cross-entropy loss. Through this training stage, the model learns different matching patterns for different question types, so it can be called type-adaptive fine-tuning.
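A minimal sketch of this stage under assumed details (the field contents, model name and binary-relevance setup are illustrative, not the authors' exact implementation):

```python
# Question-type-aware pointwise scoring: type T, question Q and document D_i are
# concatenated into one BERT input and scored with a cross-entropy loss.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

question_type = "DESCRIPTION"                       # e.g. one of MS MARCO's query types
question = "what causes stomach pain"
document = "Stomach pain can be caused by indigestion, gastritis, or an ulcer."

# Type and question on one segment, document on the other; BERT adds [CLS]/[SEP] itself.
enc = tokenizer(f"{question_type} {question}", document,
                truncation=True, max_length=256, return_tensors="pt")
label = torch.tensor([1])                           # 1 = relevant, 0 = irrelevant

out = model(**enc, labels=label)                    # cross-entropy over the relevance score r_i
out.loss.backward()
```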

Listwise fine-tuning

To let the model directly learn comparisons between documents in a ranked list, we further fine-tune it with a listwise method. Specifically, during training, for each question we sample N+ positive and N- negative documents from the candidate document set D as input. Note that hardware constraints prevent us from feeding all candidate documents into the model at once, which is why we resort to random sampling during training.
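A sketch of one possible listwise loss over the sampled documents (an assumed ListNet-style formulation, not necessarily the authors' exact loss):

```python
# Listwise loss for one question: softmax over the scores of the sampled N+ positives
# and N- negatives, cross-entropy against the normalized binary relevance labels.
import torch
import torch.nn.functional as F

def listwise_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """scores: (num_docs,) BERT relevance scores; labels: (num_docs,) binary relevance."""
    log_probs = F.log_softmax(scores, dim=-1)        # distribution over the sampled list
    target = labels.float() / labels.float().sum()   # uniform mass on the positives
    return -(target * log_probs).sum()               # ListNet-style cross-entropy

# Example with N+ = 2 positives and N- = 3 negatives
scores = torch.tensor([2.1, 1.7, 0.3, -0.5, 0.1], requires_grad=True)
labels = torch.tensor([1, 1, 0, 0, 0])
listwise_loss(scores, labels).backward()
```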

As for why we use two-stage fine-tuning, there are two main reasons:

  1. We found that first learning the relevance between the question and the document and then learning the ranking features works better than learning the ranking features directly.
  2. MS MARCO is under-annotated: many documents that are actually relevant to a question are not labeled as 1, and this label noise tends to make the model overfit. The first-stage model is used to filter such noise out of the training data so that cleaner data supervises the second-stage fine-tuning.

Solving the OOV mismatch problem

BERT tokenizes words with the WordPiece method in order to reduce the vocabulary size and alleviate the out-of-vocabulary (OOV) problem. WordPiece splits OOV words that are not in the vocabulary into fragments, as shown in Figure 6: the question contains the word "bogue" and the document contains the word "bogus". WordPiece splits "bogue" into "bog" and "##ue", and "bogus" into "bog" and "##us". Although "bogue" and "bogus" are unrelated words, the shared fragment "bog" produced by WordPiece leads to a spuriously high relevance score.

To solve this problem, we propose an exact-match feature over the original whole words. An "exact match" means that a word appears in both the question and the document. Exact matching is an important technique in information retrieval and machine reading comprehension, and previous studies show that many reading comprehension models benefit from this feature. Specifically, during fine-tuning we construct an exact-match feature for each word indicating whether the word appears in both the question and the document. Before encoding, we map this feature to a vector and combine it with the original embedding:
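A sketch of how such a feature could be embedded and added to the token embeddings (the layer sizes and helper names are assumptions for illustration):

```python
# Exact-match feature: a binary flag per token (does its original whole word occur in
# both question and document?) embedded and added to the token embedding before encoding.
import torch
import torch.nn as nn

class ExactMatchEmbedding(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.match_embedding = nn.Embedding(2, hidden_size)   # 0 = no match, 1 = exact match

    def forward(self, token_embeddings: torch.Tensor, match_flags: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden); match_flags: (batch, seq_len) in {0, 1}
        return token_embeddings + self.match_embedding(match_flags)

def exact_match_flags(question_words, document_words):
    """1 for each document word whose whole form also appears in the question, else 0."""
    question_set = {w.lower() for w in question_words}
    return [1 if w.lower() in question_set else 0 for w in document_words]
```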

In addition, we propose a word restoration mechanism, shown in Figure 7, which merges the subtoken representations produced by WordPiece segmentation to further mitigate the OOV mismatch problem. Specifically, we combine a word's subtoken representations by average pooling and use the result as the hidden-layer input, and we mask the hidden-layer positions of the non-leading subtokens. It is worth noting that the word restoration mechanism also helps prevent overfitting. Because MS MARCO's annotations are sparse, many actually relevant documents are not labeled as 1 and the model easily overfits such false negatives; word restoration acts here somewhat like a dropout mechanism.
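A sketch of the word restoration step under assumed implementation details (the word_ids alignment and the masking convention are illustrative):

```python
# Word restoration: average-pool the subtoken representations of each original word,
# keep the pooled vector at the word's first subtoken position, mask the remaining positions.
import torch

def restore_words(hidden_states: torch.Tensor, word_ids: list):
    """hidden_states: (seq_len, hidden) subtoken outputs from BERT.
    word_ids: per-subtoken index of the original word (None for special tokens),
    e.g. as returned by a fast tokenizer's word_ids()."""
    restored = torch.zeros_like(hidden_states)
    keep_mask = torch.zeros(hidden_states.size(0), dtype=torch.bool)
    for word in {w for w in word_ids if w is not None}:
        positions = [i for i, w in enumerate(word_ids) if w == word]
        pooled = hidden_states[positions].mean(dim=0)   # average-pool the word's subtokens
        restored[positions[0]] = pooled                 # leading subtoken carries the word
        keep_mask[positions[0]] = True                  # non-leading positions remain masked
    return restored, keep_mask
```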

Summary and outlook

The sections above have given a detailed introduction to our proposed DR-BERT model. DR-BERT mainly adopts task-adaptive pre-training and two-stage fine-tuning, and additionally introduces a word restoration mechanism and an exact-match feature to improve matching for OOV words. Experiments on the large-scale MS MARCO dataset fully verify the effectiveness of the model. We hope this article is helpful and inspiring to readers.

References

  • [1] Payal Bajaj, Daniel Campos, et al. 2016. “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset” NIPS.
  • [2] Christopher J. C. Burges, Robert Ragno, et al. 2006. “Learning to Rank with Nonsmooth Cost Functions” NIPS.
  • [3] Jun Xu and Hang Li. 2007. “AdaRank: A Boosting Algorithm for Information Retrieval”. SIGIR.
  • [4] Po-Sen Huang, Xiaodong He, et al. 2013. “Learning deep structured semantic models for web search using clickthrough data”. CIKM.
  • [5] Chenyan Xiong, Zhuyun Dai, et al. 2017. “End-to-end neural ad-hoc ranking with kernel pooling”. SIGIR.
  • [6] Zhe Cao, Tao Qin, et al. 2007. “Learning to rank: from pairwise approach to listwise approach”. ICML.
  • [7] Fen Xia, Tie-Yan Liu, et al. 2008. “Listwise Approach to Learning to Rank: Theory and Algorithm”. ICML.
  • [8] Mike Taylor, John Guiver, et al. 2008. “SoftRank: Optimising Non-Smooth Rank Metrics”. In WSDM.
  • [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv preprint arXiv:1810.04805.
  • [10] Rodrigo Nogueira and Kyunghyun Cho. 2019. "Passage Re-ranking with BERT". arXiv preprint arXiv:1901.04085.
  • [11] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. "Multi-Stage Document Ranking with BERT". arXiv preprint arXiv:1910.14424.
  • [12] Zhuyun Dai and Jamie Callan. 2019. "Context-Aware Sentence/Passage Term Importance Estimation for First Stage Retrieval". arXiv preprint arXiv:1910.10687.
  • [13] Hiroshi Mamitsuka. 2017. “Learning to Rank: Applications to Bioinformatics”.

  • [14] Yang Yang, Jia Hao, et al. "Meituan BERT's Exploration and Practice".

About the authors

Xing Wu, Hong Yin, King Kong, Fu Zheng, Wu Wei, et al., all from Meituan's AI Platform / Search and NLP Center.

Special thanks to Professor Jin Beihong, a researcher at the Institute of Software, Chinese Academy of Sciences, for her guidance and help during the MARCO competition and the writing of this article.
