Introduction: Multi-path recall is the strategy of using several different strategies, features, or simple models to each recall part of the candidate set, and then merging those candidate sets for use by the downstream ranking model. This article introduces how multi-path recall on the OpenSearch platform can substantially improve search quality.

Background

The term "multi-path recall" refers to the strategy of using different strategies, features, or simple models to each recall part of the candidate set, and then merging these candidate sets together for use by the subsequent ranking model.

Alibaba Cloud OpenSearch is a one-stop intelligent search development platform built on the large-scale distributed search engine developed in-house at Alibaba. It currently provides search services for Alibaba Group's core businesses, including Taobao and Tmall. OpenSearch already offers text retrieval, and search quality can be improved considerably through word segmentation of the text query and query-analysis processing. However, some scenarios place especially high demands on search quality. Education photo search (searching for a question by photographing it), for example, differs markedly from traditional web or e-commerce search in two ways: first, the queries are unusually long; second, the query is text produced by OCR, so it may contain recognition errors. One solution is to keep optimizing query processing (QP) and strengthen its text-handling ability. Another is to introduce vector recall, which retrieves documents by computing distances in a vector space, as a supplement to text recall.
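As a minimal illustration of the idea (a sketch, not OpenSearch's actual implementation), vector recall scores each document by its closeness to the query vector, here using cosine similarity, and returns the nearest documents regardless of term overlap:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def vector_recall(query_vec, doc_vecs, top_k=2):
    # score every document vector against the query and keep the top_k closest
    scored = [(doc_id, cosine_similarity(query_vec, v)) for doc_id, v in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.3, 0.0],
}
print(vector_recall([1.0, 0.0, 0.0], docs))  # → ['doc_a', 'doc_c']
```

A production engine such as Proxima uses approximate nearest-neighbor indexes rather than this brute-force scan, but the recall criterion — vector-space distance instead of term matching — is the same.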

Feature value

For long queries, long-tail queries, non-standard queries, and similar scenarios where text-based retrieval recalls inaccurate or too few results, supplementing it with vector recall can effectively improve result quality and also provides the ability to broaden recall.

OpenSearch provides the algorithmic and engineering capabilities for multi-path recall, offers multi-path recall functionality tailored to the differing requirements of different industries, and has been productized and applied in practice by users across multiple industries. Its advantages are as follows:

1. Flexible algorithm capabilities that support optimizing text vectorization for the characteristics of different industries, balancing effectiveness and performance;

2. Support for Cava scripts, providing more flexible customized ranking and scoring;

3. Support for analyzers both with and without a model, providing vector recall for users without algorithm expertise as well as for users who have it;

4. Compared with open-source products, OpenSearch has clear advantages in search accuracy and latency, reducing search latency from the seconds typical of open-source systems to tens of milliseconds.

Multi-path recall architecture diagram

Multiple queries

OpenSearch supports multiple queries: query policies can be configured to issue both a text query and a vector query, or only a text query, or only a vector query. If the text-vectorization feature is configured, OpenSearch vectorizes the text query to generate the vector query, and ranks the two result sets together after recall.
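The merge step after the two queries return can be sketched as follows. `merge_candidates` is a hypothetical helper, not part of the OpenSearch API; it unions the two candidate lists while deduplicating by document id, so the ranking stage sees each document once:

```python
def merge_candidates(text_hits, vector_hits):
    """Union the text-recall and vector-recall candidate lists,
    keeping first-seen order and dropping duplicate doc ids."""
    seen, merged = set(), []
    for doc_id in text_hits + vector_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

print(merge_candidates(["d1", "d2"], ["d2", "d3"]))  # → ['d1', 'd2', 'd3']
```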

Vector analyzer

OpenSearch supports many types of vector analyzers, mainly industry-generic vector analyzers, industry-customized vector analyzers, and general-purpose vector analyzers (64-, 128-, and 256-dimensional). The general-purpose vector analyzer requires users to convert data into vectors themselves and store them in a DOUBLE_ARRAY field, which suits customers with strong algorithm capabilities.
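For the general-purpose analyzer, preparing the vector field is the user's responsibility. A minimal client-side sketch (the helper name and the normalization choice are assumptions, not OpenSearch requirements) might pad or truncate each embedding to the fixed dimension of the index and L2-normalize it, so that inner-product search behaves like cosine search:

```python
import math

def prepare_double_array(embedding, dim=64):
    """Hypothetical client-side helper: fit an embedding to the fixed
    dimension a DOUBLE_ARRAY index field expects, then L2-normalize."""
    # pad with zeros or truncate to exactly `dim` components
    vec = (list(embedding) + [0.0] * dim)[:dim]
    # L2-normalize (guard against an all-zero vector)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

v = prepare_double_array([3.0, 4.0], dim=4)
print(v)  # → [0.6, 0.8, 0.0, 0.0]
```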

Query analysis

Algorithm engineers customized the vector models for different industries. Taking the education industry as an example:

The optimizations specific to education search are:

  • The BERT model uses StructBERT, developed by Alibaba's DAMO Academy, with a model customized for the education industry
  • The vector retrieval engine uses Proxima, also developed by DAMO Academy, which is far more accurate and faster than open-source alternatives
  • Training data can be accumulated from the customer's search logs, so the effect improves continuously
  • The semantic vector query is rewritten so that text terms participate only in RANK (score calculation) rather than in recall, improving the quality of the top recalled results.
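The last point — text terms scoring but not gating recall — can be sketched as follows. The function and its weighting are illustrative assumptions, not OpenSearch's actual formula: candidates are recalled purely by vector score, and term overlap with the query only adds to the final ranking score:

```python
def rerank_with_text_terms(recalled, vector_scores, doc_terms, query_terms,
                           text_weight=0.3):
    """Hypothetical sketch: text terms never filter the candidate set;
    they only contribute a bonus to the ranking score."""
    def score(doc_id):
        # fraction of query terms present in the document
        overlap = len(set(query_terms) & doc_terms[doc_id]) / max(len(query_terms), 1)
        return vector_scores[doc_id] + text_weight * overlap
    return sorted(recalled, key=score, reverse=True)

# doc "a" has weaker vector score but matches both query terms, so it wins
order = rerank_with_text_terms(
    recalled=["a", "b"],
    vector_scores={"a": 0.5, "b": 0.6},
    doc_terms={"a": {"x", "y"}, "b": set()},
    query_terms=["x", "y"],
)
print(order)  # → ['a', 'b']
```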

Ranking customization

OpenSearch exposes two ranking stages: basic ranking and business ranking, i.e., rough ranking and fine ranking. The fine-ranking stage supports Cava scripts, which accommodate users' ranking requirements more flexibly.

In the multi-path recall process, OpenSearch ultimately performs a unified ranking, which currently supports internal ranking and ranking by a precision model's score. Internal ranking returns the multi-path recall results directly, ordered from highest to lowest score. Model-based ranking requires the user to provide model information; the multi-path recall results are then ordered by the model's scores.
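A rough sketch of the two modes (the function is a hypothetical illustration, not the platform's implementation): internal ranking sorts by the recall score each path produced, while model ranking re-scores every candidate with a user-supplied scoring function:

```python
def unified_rank(candidates, model_score=None):
    """candidates: list of (doc_id, recall_score) merged from all recall paths.
    Without a model, rank by recall score (internal ranking); with one,
    rank by the user model's score (model-based ranking)."""
    if model_score is None:
        return [doc for doc, _ in sorted(candidates, key=lambda p: p[1], reverse=True)]
    return sorted((doc for doc, _ in candidates), key=model_score, reverse=True)

merged = [("a", 0.2), ("b", 0.9)]
print(unified_rank(merged))                      # → ['b', 'a']
user_model = {"a": 1.0, "b": 0.0}                # a stand-in for model scoring
print(unified_rank(merged, user_model.get))      # → ['a', 'b']
```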

Multi-path recall practice cases

E-commerce/retail search

Community Forum Search

Comparison of top-result title quality before and after adoption


This article is original content from Alibaba Cloud and may not be reproduced without permission.