The road to reliability optimization for Alibaba's complex search system

Background

Search is a core part of the transaction pipeline on an e-commerce platform, and its availability directly affects transaction efficiency. Xianyu search is one of Xianyu's key systems, and it is both highly complex and very large. In addition, all of Xianyu's shopping-guide scenarios depend on search, so the stability and reliability of the search service effectively determine the availability of most Xianyu business scenarios. Ensuring that the search service stays stable and highly available is therefore a major challenge.

As a core system of Xianyu, Xianyu search has the following notable characteristics:

Large data volume: billions of Xianyu items are connected to the engine, and hundreds of millions of them are valid items in the engine;
Large index: Xianyu items are unstructured, so we work with the algorithm team to predict and extract valuable structured information and index it. Hundreds of index fields have been created, and the index data of the whole engine is at the terabyte level;
Heavy incremental traffic: the QPS of incremental messages reaches hundreds of thousands on a normal day and millions at peak;
Complex queries: many special business scenarios have demanding and complex query conditions, such as grouped statistics over recalled results (GROUP), aggregation/scattering/deduplication, and compound keyword queries;
High real-time requirements: all Xianyu items are second-hand goods with a seller inventory of 1. Items are listed and delisted frequently, so engine data must be updated in near real time;
Intelligent extension: because Xianyu items are unstructured, the engine must support intelligent plug-in extensions and allow close cooperation with algorithm developers in order to guarantee recall quality and relevance.

Given these characteristics of the Xianyu item search engine, this article describes in detail the work Xianyu search has done on high availability, in the hope that readers will find it useful.

Xianyu search overall architecture

Before introducing the stability guarantees themselves, we need a brief overview of the Xianyu search technology stack.

We compared a number of external open-source search engines, and none of them fully supports the requirements listed in the background. Xianyu therefore uses Ha3, the latest search engine platform developed by Alibaba. Ha3 is an efficient, intelligent and powerful search engine that fully meets the needs of Xianyu search. Elasticsearch, a near-real-time search engine based on Lucene, is also a popular open-source option, but it falls well short of Ha3 in support for algorithm extensions and in real-time capability. On a data set of 12 million documents, Ha3 achieves 4 times the QPS of Elasticsearch at one quarter of the query latency. In large-scale data scenarios, the performance and stability of Elasticsearch are far behind those of Ha3.

01 Xianyu search system operation flow

The figure below shows the structure of the Xianyu search system, which is divided into an online flow and an offline flow.

Index building process

Index building is what we call the offline flow. Its executor, BuildService①, builds plain-text item data from different storage types into index files in the search engine's format. The original item data comes in two kinds. One is the full item data kept in storage, which is produced periodically (usually daily) by DUMP②. The other is real-time change data: after item information changes, the business system synchronizes it to Swift③. Searcher④ updates the index, which is eventually distributed to the online service.

Search query process

Search queries are what we call the online flow. The Xianyu search service, application A, initiates the search request, and the service capabilities are orchestrated by SP⑤. SP first calls the QP⑥ algorithm service to predict the user's intent and obtain auxiliary ranking information. It then combines the results returned by QP with the query parameters from the business system and sends a query request to the Ha3 search engine. Inside Ha3, the QueryService⑦ component Qrs⑧ receives the query and distributes it to the QueryService Searchers, which perform inverted-index recall, statistics, conditional filtering, document scoring and sorting, and summary generation. Finally, Qrs merges the results returned by the Searchers and returns them to SP, and SP post-processes them and returns them to the business system.
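The online flow described above is essentially a scatter-gather pipeline. The sketch below illustrates that shape only. It is not Ha3's API: `Searcher`, `predict_intent`, `qrs_query` and `sp_search` are hypothetical names, and substring matching with a fixed boost stands in for real inverted-index recall and plug-in scoring.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Doc:
    item_id: int
    title: str
    score: float = 0.0


class Searcher:
    """Hypothetical index shard holding part of the documents."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, keyword, boost_terms, top_k):
        hits = []
        for doc in self.docs:
            if keyword in doc.title:  # stand-in for inverted-index recall
                score = 1.0 + sum(0.5 for t in boost_terms if t in doc.title)
                hits.append(Doc(doc.item_id, doc.title, score))
        # Each shard returns only its own top_k.
        return heapq.nlargest(top_k, hits, key=lambda d: d.score)


def predict_intent(keyword):
    """Hypothetical QP step: return auxiliary ranking hints for a keyword."""
    hints = {"phone": ["iphone"]}
    return hints.get(keyword, [])


def qrs_query(searchers, keyword, boost_terms, top_k):
    """Hypothetical Qrs step: fan out to every shard, then merge partial results."""
    partial = []
    for searcher in searchers:
        partial.extend(searcher.search(keyword, boost_terms, top_k))
    return heapq.nlargest(top_k, partial, key=lambda d: d.score)


def sp_search(searchers, keyword, top_k=2):
    """Hypothetical SP step: call QP first, then query the engine."""
    boost_terms = predict_intent(keyword)
    merged = qrs_query(searchers, keyword, boost_terms, top_k)
    return [doc.item_id for doc in merged]
```

Because every shard returns at most `top_k` hits, the merge step never handles more than `top_k` times the number of shards, which is what keeps the fan-out cheap.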

02 Xianyu search team composition

The Xianyu search operation and maintenance organization is complex and involves many teams working together.

At the bottom, the Ha3 search engine team provides the core engine capabilities. It is mainly responsible for building and maintaining the core capabilities of Ha3, and it provides and maintains the engine operation and maintenance platform and the real-time engine search service. Next, the algorithm team customizes Ha3 to optimize the user's search experience. It uses algorithm models to understand Xianyu's unstructured items and to predict and extract structured information for the item index, monitors and maintains the QP cluster service, and develops Ha3 ranking plug-ins, which it uses to run bucketed experiments on recalled data and to verify tuning. Finally, our business engineering team connects the whole search flow end to end and monitors and maintains the availability of the entire search link. It maintains the data connected to search, manages access to the Ha3 search engine, orchestrates the SP search service and designs reasonable query plans, and it develops and maintains Xianyu's unified online search query service. This article describes how to guarantee the stability of a complex search business system from the perspective of the business engineering team.

Stability management

01 Deployment Architecture Optimization

Independent Gateway Deployment

Ha3 exposes its search service through SP as an HTTP API. In a search scenario as complex as Xianyu's, if every upper-layer business assembled SP HTTP parameters itself, every upstream business would need to understand SP's query syntax, which would raise development costs sharply. If the SP syntax were ever adjusted or upgraded incompatibly, every upper-layer business would also have to change its logic. Such a design is unreasonable. To fully decouple business systems from the search system and make the search service easier to use, Xianyu search exposes a simple, consistent distributed service through a unified business search gateway. The gateway talks to SP and shields upstream business systems from SP details.

Initially, the Xianyu search service was built together with many unrelated business scenarios inside one large underlying application. For a service module that requires high stability, this deployment mode carries significant risks: 1. Service modules affect one another. The code is coupled to some degree, and the modules also compete for machine resources. 2. The application is too large, which seriously hurts collaboration efficiency and code quality. The Xianyu search service was therefore moved to an independent container group. The new application A is dedicated to Xianyu search. It acts as an independent gateway through which all businesses use the search service, and it connects to the downstream SP search service, so that services are isolated and stable. The figure below shows the deployment before and after the change.
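The decoupling argument can be made concrete with a small sketch. Upstream callers fill in a stable request object, and only the gateway knows how to translate it into an SP query string. The SP parameter names used here (`q`, `src`, `start`, `hit`, `filter`) are invented for illustration, since the real SP syntax is not described in this article.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class SearchRequest:
    """Stable, business-facing request. Callers never see SP syntax."""
    keyword: str
    biz_source: str
    category_id: Optional[int] = None
    page: int = 1
    page_size: int = 20


def to_sp_query(req: SearchRequest) -> str:
    """Translate a request into a hypothetical SP query string."""
    params = {
        "q": req.keyword,
        "src": req.biz_source,
        "start": (req.page - 1) * req.page_size,  # offset of the first hit
        "hit": req.page_size,
    }
    if req.category_id is not None:
        params["filter"] = f"category_id={req.category_id}"
    return urlencode(params)
```

If the SP syntax changes, only `to_sp_query` has to change, while every upstream caller keeps building the same `SearchRequest`.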

Multi-room disaster recovery (DR) deployment

Initially, the Ha3 search engine behind the Xianyu item search service was deployed in only one equipment room. If a serious problem occurred in that room, upstream services were severely affected and an outage could result. For this reason, both the online and offline clusters of the Xianyu item search engine were deployed with disaster recovery. Before going into detail, we first need a general understanding of the DUMP process of the Ha3 engine.

As shown in the figure above, the DUMP process of the Ha3 engine consists of the following steps:

Prepare source data: evaluate the business requirements and prepare the data that needs to be connected to the engine. Most business data lives in DB tables, with a few ODPS⑨ offline tables. Most of the data provided by the algorithm team lives in ODPS offline tables;
DUMP data: on the operations platform provided by the Ha3 engine team, selected fields of these tables are connected to the search engine that has been created. During subsequent DUMP runs, the Ha3 offline engine pulls the connected fields and builds an engine-side mirror table for each source table. In this step, the UDF tools provided by the engine team can be used to clean and filter the data;
Data merge: the engine joins all mirror tables on the primary key we specify, producing a single wide table that the engine uses to build indexes. After the join, the data can be cleaned and filtered again through UDFs, and only data that passes validation enters the wide table;
Create or update the index: the Ha3 offline engine uses BuildService to rebuild the index from the wide table, aligned with the index schema that we specified on the Ha3 operation and maintenance platform.
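The merge and index-building steps above can be sketched in a few lines. This is a toy model under stated assumptions, not Ha3 or BuildService code: tables are lists of dictionaries, `join_wide_table` and `build_inverted_index` are hypothetical helpers, and the `keep` callback plays the role of a UDF filter.

```python
from collections import defaultdict


def join_wide_table(primary_key, *tables):
    """Merge rows from several mirror tables that share the same primary key."""
    wide = defaultdict(dict)
    for table in tables:
        for row in table:
            wide[row[primary_key]].update(row)
    return dict(wide)


def build_inverted_index(wide_table, field, keep=lambda row: True):
    """Build a term -> sorted doc id mapping for one text field.

    Rows for which `keep` returns False are filtered out before indexing.
    """
    index = defaultdict(set)
    for doc_id, row in wide_table.items():
        if not keep(row):
            continue
        for term in str(row.get(field, "")).lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

For example, joining an item table with an algorithm tag table on `item_id` and indexing only rows whose `status` is `online` yields an index that contains no delisted items.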

These steps can be triggered manually from the Ha3 operation and maintenance platform. When they finish, a new index is generated. Once the new index cluster is serving, the online real-time module switches the query service to it, completing one index update. We call this whole process a “full” build.

After a full build completes, when item information changes and the corresponding data table has real-time updates enabled (we call this the incremental function; it is implemented for DB and ODPS tables through binlog and Swift message notifications), the offline DUMP engine detects the change and updates the item in the corresponding mirror table. The change then propagates upward through the same offline DUMP steps until the corresponding data in the engine index is updated, or until it is discarded by system rules along the way. We call this real-time update process “incremental”. There is also a second incremental channel: algorithm engineers can push updates directly into the Ha3 index through Swift incremental messages, bypassing the DUMP process.

The number of Xianyu items has grown rapidly and now reaches the billions, with hundreds of index fields connected. Because Xianyu items are unstructured, only a small share of those fields serve business use directly. Most are indexes added by the algorithm team, such as large volumes of extracted tag data and vector data. As a result, the DUMP logic of the Xianyu item search engine is complicated, the total index volume is extremely large, and the incremental message volume is very high. Combined with the single-inventory nature of Xianyu items, this makes the real-time requirement for data updates very high and puts heavy pressure on stability.

Index data is the core of a search engine. If bad data enters the index, or newly changed data fails to reach the index, the quality of the search service suffers directly. While the engine was deployed in a single room, unstable factors sometimes caused a full DUMP to fail, or caused increments to lag or even stop. Once a DUMP problem occurs, the engine is very hard to repair, and in many cases the only fix is to run a full build again. Because the Xianyu item index is so large, a full build usually takes more than half a day, so there was no way to stop the bleeding quickly, and the business impact was large.

The search engine is therefore deployed in two rooms (M and N) that back each other up. The two offline DUMP rooms use the same engine configuration and the same data sources and produce the same index data for the two online rooms. The share of online traffic sent to each room can be adjusted in real time. When room M has an unrecoverable problem, traffic is switched to room N, automatically or manually, to stop the bleeding quickly, and the problem in room M is then fixed step by step.
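The traffic split and failover between rooms M and N can be pictured as weighted routing with a health flag. The sketch below is only an illustration of that idea. `RoomRouter` and its methods are hypothetical, and a real deployment would do this in the traffic layer rather than in application code.

```python
import random


class RoomRouter:
    """Routes queries between rooms by adjustable weights, skipping unhealthy rooms."""

    def __init__(self, weights):
        self.weights = dict(weights)  # e.g. {"M": 50, "N": 50}
        self.healthy = {room: True for room in self.weights}

    def set_weight(self, room, weight):
        """Adjust the traffic share of one room at runtime."""
        self.weights[room] = weight

    def mark_unhealthy(self, room):
        self.healthy[room] = False

    def mark_healthy(self, room):
        self.healthy[room] = True

    def pick(self, rng=random):
        rooms = [room for room in self.weights if self.healthy[room]]
        if not rooms:
            raise RuntimeError("no healthy room available")
        weights = [self.weights[room] for room in rooms]
        if sum(weights) == 0:
            # The only healthy rooms carry no configured traffic: fail over anyway.
            weights = [1] * len(rooms)
        return rng.choices(rooms, weights=weights, k=1)[0]
```

Setting the weights to `{"M": 100, "N": 0}` models the normal single-room mode, and marking M unhealthy then sends every query to N even though its configured share is zero.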

The figure below shows the final deployment of the search engine rooms.

Although deploying two rooms increases machine cost, it brings the following benefits in addition to the disaster recovery advantages described above:

Engine releases previously lacked an effective grayscale process. Now, when the search engine undergoes a major change or upgrade, or a high-risk release, it can first be beta-tested on a small share of traffic in one room and released to the other room only after data comparison and verification pass;
Under normal conditions a single room can carry all search query traffic. During big promotions or large-scale events, both rooms serve traffic at the same time, which doubles the capacity of the search service and avoids frequent scaling up and down of a single room;
During performance evaluation, a room that carries no traffic can be stress-tested. Even if the stress test brings that room down, online services are not affected.

02 Traffic Isolation

As described in the independent gateway deployment section above, Xianyu search exposes a simple, consistent distributed service through a unified business search gateway for upper-layer Xianyu businesses. A shared microservice inevitably raises the question of how to handle upstream businesses with different priorities and reliability requirements. The Xianyu search service supports a wide variety of upstream businesses. To measure and manage the traffic and service quality of each business scenario uniformly, every upstream business must apply for a business source identifier before it accesses the Xianyu search service. This identifier accompanies the entire life cycle of a search query. It is used directly in log collection, so that monitoring and alarms can be set up per business and the health of each business can be observed in real time (the figure below shows a simple monitoring view). It also allows traffic control and degradation to be applied to specific businesses.
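One way to picture per-source traffic control is a rate limiter keyed by the business source identifier. The sketch below is a simplified fixed-window limiter, assuming one quota per source and rejecting sources that never registered. `SourceLimiter` and the source names are hypothetical, and this is not how Sentinel or the Xianyu gateway is actually implemented.

```python
import time
from collections import defaultdict


class SourceLimiter:
    """Fixed-window request limiter keyed by business source."""

    def __init__(self, limits, default_limit=0, clock=time.monotonic):
        self.limits = limits                # source -> max requests per second
        self.default_limit = default_limit  # quota for unregistered sources
        self.clock = clock
        self.window = {}                    # source -> (window second, count)
        self.rejected = defaultdict(int)    # per-source counter for monitoring

    def allow(self, source):
        limit = self.limits.get(source, self.default_limit)
        second = int(self.clock())
        start, count = self.window.get(source, (second, 0))
        if start != second:
            start, count = second, 0        # a new one-second window begins
        if count >= limit:
            self.rejected[source] += 1
            self.window[source] = (start, count)
            return False
        self.window[source] = (start, count + 1)
        return True
```

The `rejected` counter is the kind of per-source signal that can feed the business-level monitoring mentioned above, and lowering one source's quota to zero degrades that business without touching the others.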

03 Hierarchical monitoring system

For a system that demands high stability, it is especially important to detect problems when they occur, or even before they occur, so that they can be tracked and handled in real time and kept from spreading. The main means at present is to build and keep improving a multi-dimensional monitoring and alarm system.

Engine base services monitoring

Monitoring makes it possible to discover problems quickly, and if its granularity is fine enough, to locate them quickly as well. However, false alarms and missed alarms do occur. Effective monitoring must therefore be tailored to each business system: identify the key links, monitor them from every dimension with no blind spots, and set reasonable alerting rules.

A complete log collection module has been built on the online and offline flows of the Xianyu search engine and on the core links of important upstream applications. Key indicators are monitored precisely and have alerting configured, so that any problem is detected in time. The figure below shows the core logs and monitoring alarms of the search service.

Online service monitoring that simulates user behavior

As mentioned above, the Xianyu search index is large, many teams must cooperate, and the search flow is complex. With the help of the algorithm team, a great deal of AI recognition is applied to Xianyu's unstructured items. Xianyu items are also all single-inventory, which places high demands on the engine's real-time performance. The disaster recovery schemes described earlier help, but real-time behavior needs closer observation: we want to know promptly whether data is accurate, whether updates are delayed, how long the delay is, and other health information. The solution is real-time monitoring and alerting at the business level. We pick category K, a category whose items sell well and are updated frequently. In the Xianyu business back end, a Jenkins job runs at a fixed interval (the interval can be adjusted in real time). Simulating a polling user, it sends a search query with category K as both keyword and category, sorted by item index update time in descending order, and recalls the first page of matching items. It then compares the update time of the recalled items with the current system time. If the difference exceeds a threshold (also adjustable in real time), data updates are seriously delayed and an alarm is sent.
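The comparison at the heart of this probe is small enough to sketch. The function below assumes the scheduled job has already run the search and collected the update times of the first page of results. `check_freshness` is a hypothetical name, and treating an empty recall as a delay is an assumption of this sketch rather than something stated in the article.

```python
from datetime import datetime, timedelta


def check_freshness(recalled_update_times, now, threshold):
    """Return (is_delayed, lag) for the newest document on the first page.

    An empty recall is reported as delayed with an unknown lag, because the
    probe category is expected to always have recent items.
    """
    if not recalled_update_times:
        return True, None
    newest = max(recalled_update_times)
    lag = now - newest
    return lag > threshold, lag
```

A caller would raise an alarm whenever the first element of the result is true, and could log the lag as a time series to see delays building up before they cross the threshold.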

04 Stress testing

Full-link stress testing

The search service and its upstream business systems were adapted for full-link stress testing, and large volumes of stress-test data were constructed from real online user requests. Without affecting normal online service, we verify the system capacity and resource allocation of the link under a very-high-traffic model, find the link's performance bottlenecks, and verify network equipment and cluster capacity.

Single-link stress testing of the engine

Ha3 online search flow: online service performance can be stress-tested by replaying online search traffic recorded during peak hours. Ha3 offline flow: the incremental performance of the engine DUMP can be stress-tested by replaying Swift incremental messages from a given period.

05 Grayscale Release

The unstructured nature of Xianyu items makes algorithmic support indispensable. Throughout our development cycle we have worked closely with two algorithm teams and a considerable number of algorithm engineers. This has brought leapfrog progress to Xianyu search, but it has also created great challenges for team collaboration and development efficiency.

With the algorithm teams, the engine team and the business engineering team, the search project has a very large development group, and every week many new algorithm models, engine changes and business modules need to go online. Putting large numbers of logic changes online directly causes many problems. First, at the code level, even though changes are fully tested in the pre-release environment, it is hard to guarantee that no edge case was missed. Even with complete pre-release test coverage, the online environment differs from pre-release, and heavy online traffic may expose hidden code problems. Second, even if the code has no quality problems, when all features are bundled and all logic is mixed together, it becomes very hard to evaluate the effect of any single module after it goes online. This matters especially for algorithm model optimization and for trials of new business patterns, both of which need detailed metric feedback to guide the next optimization step. There is therefore an urgent need for a grayscale experiment system that can coordinate and isolate the modules of the whole search business, evaluate the effect of each module independently, and improve collaboration efficiency so that each module can try, fail and iterate quickly.

To solve these problems, the business engineering team developed an experiment management system for grayscale scheduling and management of search experiments. Its functions are shown in the figure above. It has the following characteristics:

Experiments are flexible and convenient to configure: one experiment can contain multiple experiment components, one component can be used in multiple experiments, and one component can contain multiple experiment buckets;
The experiments of each page module can be controlled in real time in the system, including switching experiments on and off and adjusting the relationships between experiments;
Experiment tracking points are embedded along the whole search link, and reports are produced for the various experiment data;
The statistics are connected to the Xianyu portal and to Babel Tower, where the experiment curves of each metric can be viewed per bucket;
Faster experiment iteration improves algorithm and business efficiency, enables quick trial and error, and accelerates the growth of search transaction conversion.
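A common way to split traffic into experiment buckets is deterministic hashing, sketched below. The article does not say how the Xianyu experiment system assigns buckets, so this is an assumption for illustration. `assign_bucket` is a hypothetical helper, and salting the hash with the experiment name is one way to keep different experiments independent of each other.

```python
import hashlib


def assign_bucket(user_id, experiment, buckets):
    """Deterministically map a user to one bucket of an experiment.

    `buckets` is a list of (name, percent) pairs whose percents sum to 100.
    """
    if sum(percent for _, percent in buckets) != 100:
        raise ValueError("bucket percents must sum to 100")
    key = f"{experiment}:{user_id}".encode("utf-8")
    slot = int(hashlib.md5(key).hexdigest(), 16) % 100  # stable slot in 0..99
    upper = 0
    for name, percent in buckets:
        upper += percent
        if slot < upper:
            return name
    raise AssertionError("unreachable when percents sum to 100")
```

Because the assignment depends only on the user and the experiment name, the same user always lands in the same bucket, and the bucket name can be written into the tracking points so that reports can be split per bucket.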

06 Emergency Plan

Based on evaluation, analysis or experience, plans should be prepared in advance for the key points where emergencies are likely or possible in the search service. When certain conditions are met, multi-dimensional, multi-level automatic degradation and rate limiting take effect, or manual intervention is configured.

Whenever an online problem is found, the first step is to stop the bleeding quickly so that the problem does not spread. Automatic plans detect problems and trip circuit breakers on their own, but we still need to watch the system closely to guard against a rebound. If a rebound occurs and has a large business impact, we intervene manually and execute the degradation plan. Once the bleeding has stopped, the specific cause is investigated in detail. If the root cause cannot be determined quickly and a change or release coincided with the problem, that change or release is rolled back immediately.

For dependent services at every level of the system, circuit breaking with degradation and system load protection are used. Sentinel[4], a resource call control component developed by Alibaba, is now open source. Alternatively, the Hystrix degradation and rate-limiting tool can be used.
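The circuit-breaking behavior described above can be illustrated with a minimal breaker. This sketch is neither the Sentinel nor the Hystrix API. `CircuitBreaker` is a hypothetical class that opens after a number of consecutive failures, serves a fallback while open, and lets one trial call through after a cool-down.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures and degrades to a fallback while open."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: do not touch the downstream
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

In a search gateway the fallback might return cached or simplified results, so that a failing dependency such as an intent-prediction call degrades relevance instead of failing the whole query.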

07 Troubleshooting

The Xianyu search link is connected to the Alibaba search troubleshooting platform. For each real-time search query, the platform shows in detail the input parameters and output data of every step, which makes it convenient to investigate and follow up on problems and to compare data.

It visually displays the recalled data of each experiment bucket under each query condition, making it easy to compare experiment effects. Detailed information on each recalled item can be inspected, including business data, algorithm tag data and the score given to the item by each engine plug-in. Item index information can also be looked up by item ID, seller ID or seller nickname, so that the item's detailed data in the engine index can be checked. If the data differs from what is expected, the abnormal state produced by the offline DUMP processing logic can be queried with one click. With this troubleshooting platform, the running state of the engine and the state of the search recall link can be grasped intuitively, which plays an important role in finding root causes quickly and fixing problems promptly.

Summary and Prospect

This article described how Xianyu guarantees the stability of its search engine service in complex scenarios, covering architecture and deployment, isolation, capacity assessment, and risk perception and control. These measures stably support more than 20 online search business scenarios. Online problems are discovered and recovered from quickly, and more than 50 risks have been predicted and avoided ahead of time. The user experience of the search service has improved greatly, and Xianyu search has gone a full year without a failure.

With the measures above, the stability of the Xianyu search system is well protected. We will continue to invest in making search more available and easier to use, so that upstream businesses run more smoothly. We hope this article offers some ideas and inspiration.

This article is from “Ali Technology”, a partner of the cloud community. For related information, follow “Ali Technology”.