A new academic term has begun: a large number of students are entering colleges and universities, while another group is about to graduate, and no student escapes the baptism of the thesis. Almost every college student has to face literature retrieval and plagiarism checking. Writing a high-quality paper requires a large reserve of information up front, and literature retrieval is an important way to obtain that information.

Guided by customer needs, the Wanfang Data knowledge service platform integrates hundreds of millions of high-quality knowledge resources from around the world. Relying on strong data collection capabilities and advanced information processing and retrieval technology, Wanfang provides high-quality information resource products for decision-makers, scientific researchers, and innovators. Today we will talk about how Wanfang used Baidu PaddleNLP to upgrade its paper retrieval system.

Business background

The core problem of the Wanfang paper retrieval system is text matching: given a user's query terms, the system must quickly find similar literature among hundreds of millions of knowledge resources using a retrieval matching algorithm.

In this task, the correlation between query terms and documents is directly reflected in the ranking of the results page, and ranking accuracy directly affects the efficiency of users' search decisions and their search experience. It is therefore very important to quickly and accurately capture the deep semantic correlation between queries and documents.

However, in the face of massive data and frequent user search requests, achieving both speed and accuracy poses many challenges for the Wanfang literature retrieval system:

  • Difficulty 1 — Little annotated data: due to limited human resources, it is impossible to annotate the system's massive data by hand. How can massive unsupervised data be used to automatically generate weakly supervised data?
  • Difficulty 2 — Semantic similarity is hard to compute accurately: how can the similarity between user queries and literature be calculated precisely?
  • Difficulty 3 — Poor retrieval latency: facing massive resources and growing user demand, finding relevant literature quickly and efficiently is also a big challenge.

Beyond the retrieval scenario, text similarity calculation is also the core method behind paper plagiarism checking and similar-paper recommendation. In these businesses we went through a long exploration, ultimately settling on PaddlePaddle. Thanks to PaddleNLP's rich Chinese pre-trained models and its model selection and deployment capabilities for industrial scenarios, we efficiently built an end-to-end industrial-grade text vector learning and computing environment, and upgraded the academic retrieval system in multiple respects.

Technical selection and project practice

PaddlePaddle provides powerful product features and technical support for industrial practice. Building on the rich cutting-edge pre-trained models in PaddleNLP, we used Paddle Serving for rapid server-side deployment, solving the pain points of putting the system into actual business use.

Training data labels are constructed with the high-quality Chinese pre-trained word embeddings provided by PaddleNLP, and algorithm accuracy is greatly improved by combining SimCSE and Sentence-BERT, text matching pre-trained models deeply optimized by PaddlePaddle.

For model performance, we adopted multi-threaded data preprocessing, model compression, and TensorRT deployment. Choosing mature development tools greatly reduces the difficulty of applying deep learning technology in industry.

▲ Overall architecture diagram of technical solution

As shown above, the overall architecture of our technical solution mainly includes three parts: constructing data, model selection, and industrial deployment.

  • Constructing data

Wanfang's business has accumulated massive unsupervised data but little annotated data. We used the high-quality open-source Chinese pre-trained word vectors from PaddleNLP to quickly build weakly supervised similar-text matching data, saving substantial manual annotation cost.
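As a sketch of that labeling idea: average pre-trained word vectors into a sentence vector, compute cosine similarity between candidate pairs, and keep only confidently similar or dissimilar pairs as weak labels. The toy vectors, thresholds, and helper names below are all illustrative, not Wanfang's actual pipeline:

```python
import numpy as np

# Hypothetical toy word vectors; in practice these would come from a
# pre-trained Chinese embedding table such as those shipped with PaddleNLP.
word_vectors = {
    "deep":     np.array([0.9, 0.1, 0.0]),
    "learning": np.array([0.8, 0.2, 0.1]),
    "neural":   np.array([0.85, 0.15, 0.05]),
    "networks": np.array([0.7, 0.3, 0.1]),
    "cooking":  np.array([0.0, 0.1, 0.9]),
    "recipes":  np.array([0.1, 0.0, 0.8]),
}

def sentence_vector(tokens):
    """Average the word vectors of a tokenized title."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weak_label(tokens_a, tokens_b, pos_thresh=0.9, neg_thresh=0.3):
    """Label a pair as positive (1), negative (0), or None (discard)."""
    sim = cosine(sentence_vector(tokens_a), sentence_vector(tokens_b))
    if sim >= pos_thresh:
        return 1
    if sim <= neg_thresh:
        return 0
    return None  # drop ambiguous pairs rather than add label noise

print(weak_label(["deep", "learning"], ["neural", "networks"]))  # -> 1
print(weak_label(["deep", "learning"], ["cooking", "recipes"]))  # -> 0
```

Discarding the ambiguous middle band is what keeps the weak labels clean enough to train on without manual review.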

To further improve the quality of this data, we also adopted the unsupervised semantic matching model SimCSE.
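The core of unsupervised SimCSE is a contrastive (InfoNCE) objective: the same sentence is encoded twice with different dropout masks, and the two views must be closer to each other than to any other sentence in the batch. A minimal NumPy sketch of that loss, with random vectors standing in for real encoder outputs:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.05):
    """Unsupervised SimCSE objective: for each sentence i, the second
    dropout-perturbed view z2[i] is the positive; all other sentences
    in the batch serve as in-batch negatives."""
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                    # (batch, batch)
    # softmax cross-entropy with the diagonal as the target class
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))                 # 4 sentences, 8-dim embeddings
noise = rng.normal(scale=0.01, size=emb.shape)
loss = info_nce_loss(emb, emb + noise)        # two "dropout views" of a batch
print(loss)
```

The small perturbation stands in for dropout noise; a correctly aligned batch yields a much lower loss than a misaligned one, which is the signal the encoder is trained on.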

In addition, the Wanfang search system has accumulated a large amount of user behavior log data (browsing, clicking, reading, downloading, etc.), from which we also screened out a large amount of supervised data from the business perspective.

SimCSE Reference: github.com/PaddlePaddl…

  • Model selection

For text similarity calculation we had tried literal matching, Word2Vec, FastText, and other methods, but none of them learned sufficiently precise semantic representations of text. Knowing Baidu's rich technical accumulation in search scenarios, we turned to PaddleNLP, which integrates a series of pre-trained semantic models such as ERNIE and BERT and provides a systematic solution for search scenarios.

In recent years, pre-trained language models represented by BERT and ERNIE have become the mainstream models for NLP tasks.

Sentence-BERT uses a twin (siamese) network structure, fine-tuning on top of a BERT model, and adopts the DSSM-style twin-tower design, which fits our business scenario well, so we chose it as our baseline model.
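A rough sketch of the twin-tower idea: one shared-weight encoder maps both query and document into the same vector space, and relevance is the cosine similarity of the two vectors. Because documents are encoded independently of the query, their vectors can be pre-computed offline. The toy encoder, dimensions, and names below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(16, 8))   # one shared set of projection weights

def encode(token_vectors):
    """Shared-weight "tower": pool token vectors, then project.
    Both the query and the document pass through this same encoder,
    which is the twin/bi-encoder structure Sentence-BERT uses
    (with a real BERT stack in place of this toy projection)."""
    pooled = token_vectors.mean(axis=0)   # mean pooling over tokens
    z = np.tanh(pooled @ W)               # toy stand-in for BERT layers
    return z / np.linalg.norm(z)          # unit length -> dot = cosine

def match_score(query_tokens, doc_tokens):
    return float(encode(query_tokens) @ encode(doc_tokens))

query = rng.normal(size=(3, 16))                        # 3 query tokens
doc = query + rng.normal(scale=0.05, size=query.shape)  # near-duplicate doc
other = rng.normal(size=(5, 16))                        # unrelated document

print(match_score(query, doc), match_score(query, other))
```

The key property for retrieval is that `encode` never sees the query and document together, so document vectors can be batch-computed and indexed ahead of time.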

Compared with the FastText model, the matching effect of Sentence-BERT improved by 70%, and the overall user experience improved greatly.

After the literature in the database is encoded in advance by Sentence-BERT into document vectors, the open-source vector database Milvus is used to build an index for fast recall of similar vectors, reducing the retrieval system's response time.
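Conceptually, the recall step is a top-k nearest-neighbor search over the pre-computed document vectors. The brute-force NumPy scan below shows the logic; Milvus replaces this scan with approximate indexes (e.g. IVF or HNSW) that scale to hundreds of millions of vectors. All names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
# Offline: document vectors pre-computed by the matching model and indexed.
doc_vectors = rng.normal(size=(1000, 64))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def recall_top_k(query_vec, k=5):
    """Brute-force cosine top-k over the whole collection; a vector
    database like Milvus answers the same query via an approximate
    index instead of a full scan."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vectors @ q
    top = np.argpartition(-scores, k)[:k]       # k best, unordered
    return top[np.argsort(-scores[top])]        # sorted by similarity

# Online: a query vector close to document 123 should recall it first.
query = doc_vectors[123] + rng.normal(scale=0.01, size=64)
print(recall_top_k(query))
```

Because only the query encoding and the index lookup happen at request time, the heavy Sentence-BERT computation over the corpus is entirely offline.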

Sentence-BERT reference: github.com/PaddlePaddl…

Semantic indexing policy Reference: github.com/PaddlePaddl…

  • Industrial deployment

Online retrieval systems in particular must respond quickly. Sentence-BERT's 12-layer Transformer structure, with its large number of parameters and heavy computation, poses a huge challenge for real-time response when deployed online.

To meet the performance requirements of the online business, we used Paddle Inference combined with Paddle Serving.

Without loss of accuracy, we compressed Sentence-BERT from 12 layers to 6, combined this with optimizations such as TensorRT acceleration, and achieved a QPS of 2600, exceeding expectations.
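The source does not say exactly how the 12-to-6 layer compression was done; one common strategy is skip-layer initialization, where student layer i is initialized from teacher layer 2i and the smaller model is then fine-tuned or distilled to recover accuracy. A trivial sketch of that mapping:

```python
# Illustrative 12 -> 6 layer compression by skip-layer initialization.
# This is one common approach, not necessarily the exact method used here.
NUM_TEACHER_LAYERS = 12
NUM_STUDENT_LAYERS = 6

# Student layer i takes its initial weights from teacher layer 2i,
# so the student spans the full depth of the original network.
layer_map = {i: 2 * i for i in range(NUM_STUDENT_LAYERS)}
print(layer_map)   # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}
```

Halving the depth roughly halves the per-query Transformer computation, which is what makes the TensorRT-accelerated 2600 QPS figure reachable.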

Extension: the overall retrieval solution

The overall PaddleNLP retrieval solution we drew on above mainly includes three parts: domain pre-training, semantic matching, and semantic indexing.

  • Domain pre-training continues the pre-training of a general model on in-domain data, so that the model learns more domain knowledge.
  • The semantic matching module provides a ranking model scheme for retrieval systems that have high-quality supervised data. To address the high cost and small volume of high-quality annotated data, it also has a built-in R-Drop data augmentation strategy that further improves the ranking model in small-data scenarios, helping the retrieval system achieve better results.
  • The semantic indexing module provides unsupervised (SimCSE) and supervised semantic indexing schemes for the two data scenarios. Even without supervised data, the unsupervised scheme can improve the recall of the retrieval system.

To meet the high performance requirements of industrial deployment, the inference pipeline also provides high-performance prediction based on FasterTransformer with an easy-to-use Python API, allowing us to quickly put the model into actual business use.

In future Wanfang business, we will use the R-Drop data augmentation strategy and FasterTransformer to further cope with continuously growing user demand.

If you'd like to learn more about this solution, follow PaddleNLP to catch up on its latest features, or join the discussion:

GitHub Repo: github.com/PaddlePaddl…