
Baidu’s search system is one of Baidu’s oldest and largest systems, and it is deeply rooted in people’s daily lives. There is an interesting habit: many people use Baidu search to check whether their network connection is working. This habit shows that in most people’s minds Baidu search stands for “stability”, and that is indeed the case. Why does the Baidu search system achieve such high availability? What technology lies behind it? Previous technical articles have rarely covered this. Taking the familiar Baidu search system as its subject, this article introduces the techniques used in its availability governance for “stability problem analysis”, using history as a thread to describe the difficulties encountered when analyzing stability problems, how they were overcome, and the innovations made along the way. We hope it brings some inspiration to readers and sparks resonance and discussion among like-minded people.

The full text contains 7741 words and is expected to take 17 minutes to read.

Chapter 1 Dilemma

In a large-scale microservice system, the absence of failures is merely good luck. Never expect failures not to occur; they must be treated as the norm. The basic pattern from the occurrence of a failure to its resolution can be abstracted as follows.

Availability governance is improved mainly from three angles: 1. strengthening the system’s resilience; 2. improving stop-loss measures, making them more effective and faster; 3. speeding up root-cause localization and removal.

Each of the three points above is a topic in its own right. Due to limited space, this article covers only point 3.

Fault localization and resolution in the Baidu search system is very difficult, and may be among the most challenging work in the company. The difficulty shows up in the following aspects.

An extremely complex system vs. extremely strict availability requirements

The Baidu search system is divided into an online part and an offline part. The offline system crawls resources from the entire Internet every day and builds the index database, producing three kinds of important data: the inverted index, the forward index, and summaries. Based on this data, the online system receives users’ queries and finds the content they want at extremely high speed. See the figure below.

The Baidu search system is extremely large. Let’s get a sense of its scale with a few numbers:

The Baidu search system runs on hundreds of thousands of machines distributed across several large regions; the search service comprises hundreds of services; the data it holds reaches tens of petabytes; hundreds of thousands of changes go online every day; routine faults come in hundreds of varieties; hundreds of engineers take part in its development; and the system handles billions of user search requests every day.

Although the system is huge, Baidu’s availability requirements are extremely strict. The availability of the Baidu search system exceeds five nines. What does that mean? Measured in service uptime, five nines of availability allows the system to be unavailable for only a little more than five minutes per year, and six nines for only about half a minute per year. In other words, Baidu search essentially never stops.
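As a quick sanity check on those figures, the downtime budget implied by “N nines” can be computed directly. A minimal sketch in Python:

```python
# Back-of-envelope downtime budget for "N nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(nines: int) -> float:
    """Minutes of allowed downtime per year at an availability of N nines."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

print(downtime_budget_minutes(5))       # ~5.26 minutes/year (five nines)
print(downtime_budget_minutes(6) * 60)  # ~31.5 seconds/year (six nines)
```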

When a query reaches the Baidu search system, it passes through tens of thousands of nodes. The figure below shows only a tiny fraction of the nodes a single query traverses, roughly one thousandth of them. On such a complex path, the probability that every node behaves normally is extremely small; abnormality is the norm.

Such system complexity means that collecting and analyzing data from the fault scene is itself a huge project.

Various kinds of stability problems

Baidu search has always pursued a five-character motto: “comprehensive”, “fresh”, “fast”, “accurate”, and “stable”. Day-to-day faults mainly concern “fast” and “stable”, and fall into three categories:

  1. PV loss faults: the most serious kind, in which query results are not returned to the user on time and correctly.

  2. Search effect faults: the expected page does not appear in the search results; or it is not ranked in a reasonable position; or the search results page responds slowly.

  3. Capacity faults: for various external or internal reasons, the redundancy required for high availability cannot be guaranteed, or the capacity watermark crosses the critical point and the system risks collapse, and the situation is not estimated, detected, or fixed in time.

Behind these varied problems, what remains constant is the need to abstract the requirements of data collection and processing, together with manual analysis experience, into automated form.

Chapter 2 Introduction and Localization: Breaking the Deadlock


Before 2014, fault localization and troubleshooting struggled for lack of data. At that time there were two kinds of data available: the online logs of the search services, and some scattered metrics. On one hand, these two kinds of data were not precise enough, were used inefficiently, and left blind spots in problem tracing; on the other hand, their use was highly manual with very little automation. Here is an example.

To analyze rejection problems, a script deployed on the central controller first captured the logs of each module for a single PV and displayed them on a rejection analysis platform (a relatively powerful rejection analysis tool at the time), as shown in the figure below; engineers then read the captured raw logs by hand. Although this process had some automation, the PV collection volume was small and the data insufficient, so many rejection causes could not be located accurately. The flat display of data relied on experienced engineers to interpret it, and analysis efficiency was extremely low.

Of these, the blind spots and inefficiency of problem tracing were the more urgent. Tracing without blind spots calls for collecting more observable data. Obtaining this data is easy in a non-production environment, where the resulting loss of query speed can be tolerated, but that loss is not affordable in production. Guided by the paper “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”, we built the kepler1.0 system, which samples queries and produces call chains and some annotations (KV data generated during query processing that is not part of the call chain). At the same time, we improved our metrics system based on the industry’s open-source Prometheus solution. Both delivered great value immediately after going online and opened up the imagination for building and applying observability in the search system.
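To make the sampling idea concrete, here is a minimal sketch of how a Dapper-style tracer might decide, per query, whether to emit call chain data. The hash-based decision, the rate, and the force flag (covering the forced sampling for rejections mentioned later) are illustrative assumptions, not kepler1.0’s actual implementation:

```python
import hashlib

SAMPLE_RATE = 0.01  # e.g. 1% of queries emit full trace data

def should_trace(query_id: str, force: bool = False) -> bool:
    """Decide once, at the entry service, whether this query is traced.

    The decision is derived deterministically from the queryID so every
    service on the call path reaches the same conclusion without extra
    coordination; `force` models forced sampling for rejected queries.
    """
    if force:
        return True
    digest = hashlib.md5(query_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < SAMPLE_RATE * 10_000

# The sampled flag is then propagated downstream together with the queryID,
# e.g. as an RPC header, so child services emit spans only when it is set.
```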

2.1 kepler1.0 overview

The following figure shows the system architecture.

Phase mission: kepler1.0 was to improve the observability of the search system, building from 0 to 1 on mature open-source solutions combined with in-house components to quickly fill the observability gap, so that given a queryID one could query the call chain of the query and the logs of the service instances on its path.

Introduced: as the architecture of kepler1.0 shows, it follows Zipkin closely in its data path, storage architecture, and so on.

Localization: when Zipkin was introduced, the data collection SDK supported only C++. To meet the observability needs of non-C++ modules while containing the maintenance cost of multi-language SDKs and the intrusiveness of tracing, a resident process collects trace data through a log output format compatible with the C++ SDK; this is the log collection module in the figure.
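A minimal sketch of what such a resident log-collecting process might look like. The “TRACE {json}” line format is a hypothetical stand-in for the agreed log output format; the point is that any module able to write a log line in that format can feed the same pipeline without a per-language SDK:

```python
import json
import time

def follow(path):
    """Tail a service log file, yielding new lines as they are appended."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

def collect(path):
    """Parse trace records that services print in an agreed log format."""
    for line in follow(path):
        if line.startswith("TRACE "):
            span = json.loads(line[len("TRACE "):])
            yield span  # forward to the trace collection pipeline
```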

2.2 Initial exploration of a general metrics collection scheme

The following figure shows the system architecture.

Phase mission: around 2015, Search began exploring container co-location for its large online service clusters. At that time the company’s monitoring system had weak support for multi-dimensional metric aggregation, and traditional capacity management based on machine-level metrics could not meet the needs of container co-location scenarios.

Introduced: the mature open-source metrics stack was brought into the search online co-located clusters. A container metrics exporter conforming to the Prometheus protocol was implemented, supporting Prometheus’s flexible multi-dimensional query interface and Grafana’s rich visualization, thereby building the underlying data system on which capacity management of the search online co-located clusters depends.

Localization: the container Prometheus exporter is connected to the search online PaaS system and exports service meta-information as Prometheus labels, enabling indexing and aggregation of container metrics by that meta-information and satisfying capacity management needs in co-location scenarios. This correlation between metrics and PaaS meta-information was a major outcome of the initial exploration of a cloud-native metrics system.
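Below is a minimal sketch of such an exporter using the Python prometheus_client library. The label names and the paas_meta()/read_cpu_cores() helpers are hypothetical placeholders for the real PaaS API and container counters:

```python
import time
from prometheus_client import Gauge, start_http_server

CPU_USAGE = Gauge(
    "container_cpu_usage_cores",
    "CPU cores used by the container",
    ["app", "service", "instance_id", "idc"],  # PaaS meta exported as labels
)

def paas_meta():
    # Hypothetical stand-in for a call to the PaaS system's metadata API.
    return {"app": "search-bs", "service": "ranker",
            "instance_id": "bs-0-17", "idc": "hz"}

def read_cpu_cores():
    # Hypothetical stand-in for reading the container's cgroup counters.
    return 3.2

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    meta = paas_meta()
    while True:
        CPU_USAGE.labels(**meta).set(read_cpu_cores())
        time.sleep(15)
```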

2.3 Initial application effect

Scenario 1: Rejection and effect problems

Pain points at this stage: manual analysis depended heavily on logs, and specific queries had to be retrieved precisely from massive call chain and log data. Scanning the logs of online machines over SSH was inefficient, and the disk I/O it generated on online service machines posed a stability risk.

Solution: normal random sampling was supplemented by forced sampling for rejections and reproducible effect problems, and the call chain and logs could be queried directly from the platform by queryID for manual analysis, which basically met the tracing needs of this stage.

Scenario 2: Speed problem

Pain points at this stage: there was only log data, without the precise timestamps of a call chain; the call chain generated by a query is long, the fan-out is large, and the logs are scattered and hard to collect, so reconstructing the complete timeline of a query from logs was nearly impossible. Speed optimization was therefore a black box.

Solution: precise timestamps were added to the call chain, making full timeline reconstruction of a query possible. Through the call chain, optimization opportunities such as long-tail phases at the program level or hot instances at the scheduling level could be found. On this basis, improvement projects such as asynchronous TCP connect and removing blocking operations from business callbacks were incubated and landed.
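As a small illustration of what precise span timestamps enable, the sketch below (with made-up numbers) finds the phase that dominates a single query’s latency:

```python
from dataclasses import dataclass

@dataclass
class Span:
    span_id: str
    service: str
    start_us: int  # microsecond timestamps recovered from the call chain
    end_us: int

    @property
    def cost_us(self) -> int:
        return self.end_us - self.start_us

def slowest_phase(spans: list[Span]) -> Span:
    """Pick the span that dominates a query's latency.

    With precise timestamps on every span, the long-tail phase of a single
    slow query can be read off directly instead of guessed from scattered logs.
    """
    return max(spans, key=lambda s: s.cost_us)

# Example: a fan-out where one leaf accounts for most of the end-to-end time.
spans = [
    Span("0",   "frontend", 0,      180_000),
    Span("0.a", "rank",     5_000,  60_000),
    Span("0.b", "index",    5_000,  175_000),  # the long-tail phase
]
print(slowest_phase(spans).service)  # -> "index"
```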

Scenario 3: Capacity problem

Pain points at this stage: multi-dimensional metrics were lacking (no container metrics, and metrics were disconnected from the PaaS system); there were no effective means of aggregation, processing, combination, comparison, mining, and visualization.

Solution: a container-level multi-dimensional metrics collection system was built for search online services, providing an important data source for container capacity management applications and taking a first step toward a cloud-native metrics system. The figure below is a screenshot of the consumption audit function driven by container metrics after the project went online.

Chapter 3 Innovation: Unleashing Application Value


Although kepler1.0 and Prometheus opened the door to observability, their capabilities made it hard to extract more value at low cost.

3.1 Motivation

The collection built on open-source solutions could not, within acceptable resource cost and latency, achieve the data coverage required by the scale of the search services and their traffic. This limited how completely stability problems could be solved, and the effect was especially severe for problems at the level of search results, such as abnormal results that cannot be reproduced reliably and unexpected recall problems for key results at the index level.

Whether stability problems actually get solved is always the starting point and the end goal of observability work, and uncompromising data construction is always the top priority. Starting in 2016, Search led a series of observability innovations and pushed them to the point where such problems could be solved.

3.2 Full collection

Because the search system is so large, kepler1.0 could support a sampling rate of at most 10%. In practice this created a tension between resource cost and the thoroughness of problem solving.

(1) Most faults in the search system are at query granularity. Many cases cannot be reproduced reliably, yet we need to analyze why the historical search results of a particular query were abnormal. At the time, only the backed-up logs could satisfy the need to trace back any historical query, but collecting them was expensive. Moreover, many queries were not hit by kepler1.0’s sampling, so no detailed tracing data existed for them and analysis could not even begin. Being able to see the tracing and logging of any specific historical query was something almost every engineer wanted.

(2) The company’s internal storage service had poor cost-effectiveness and maintainability. Covering the above cases by raising the sampling rate would have required enormous resources, which was not feasible in practice.

The industry had no good answer to this tension, so we built the kepler2.0 system through our own innovation. The system decouples the two kinds of data, tracing and logging, and optimizes each to the extreme according to its characteristics, obtaining full-query tracing and logging at extremely low resource overhead and negligible added latency: tens of petabytes of logs per day and hundreds of billions of call chains can be queried in seconds. This solves most troubleshooting problems.

3.2.1 Full log index

First, we introduce the full log index, which corresponds to the log index module in the figure above.

The logs of the search services are already backed up on the online machines for a considerable period of time. Previous solutions focused on shipping the log text to a separate side system, ignoring the fact that the online cluster is itself a ready-made, zero-cost store for that log text. We therefore proposed a new scheme whose core design idea can be summed up in one phrase: index in place.

In Beidou, a log index is defined by a 4-tuple called a location, consisting of IP (the machine holding the log), inode (the file holding the log), offset (the log’s offset in the file), and length (the log’s length). These four fields take 20 bytes in total and depend only on the number of logs, not on their length, giving a low-cost index over massive logs. The log-indexer module (deployed on the search online service machines) collects the logs, builds the location index for the raw logs, and stores the index on the disk of the container where the logs reside.

The following figure shows the logical format of the log index stored locally in Beidou.

At query time, the inode, offset, and length are sent to the machine at the indexed IP (that is, the machine holding the raw log). The log-reading module on that machine returns the raw log from the inode, offset, and length in O(1) time, avoiding any file scan. This cuts unnecessary CPU and I/O consumption and reduces the impact of log queries on the stability of the production services.
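A minimal sketch of the location idea. The field widths below are a plausible packing that adds up to the 20 bytes mentioned above, not Beidou’s exact on-disk layout, and the inode-to-path resolution on the serving machine is omitted:

```python
import socket
import struct

# A 20-byte "location" entry: ip(4) + inode(8) + offset(4) + length(4).
LOCATION_FMT = "!4sQII"
assert struct.calcsize(LOCATION_FMT) == 20

def pack_location(ip: str, inode: int, offset: int, length: int) -> bytes:
    """Serialize one index entry; its size depends only on the log count."""
    return struct.pack(LOCATION_FMT, socket.inet_aton(ip), inode, offset, length)

def read_log(path: str, offset: int, length: int) -> bytes:
    """Return one log entry by a direct seek+read, with no file scan (O(1))."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# On the machine identified by `ip`, the log-reading module resolves the
# inode to the backed-up log file (omitted here) and serves the entry with
# a single seek.
```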

Besides the location index, we also support flexible secondary indexes on fields with business meaning, such as search terms and user IDs, so that when a queryID cannot be obtained, tracing can still proceed from the information in those indexes. As for how the indexes are used, beyond log queries we also built a streaming architecture on top of index pushing to support streaming log analysis.

One more question: when querying the logs of a query, do we still need to broadcast the request to all instances? The answer is no. Using the callgraph full call chain described below, we determine which instances hold the query’s logs, so requests are sent point-to-point and broadcasting is avoided.

3.2.2 Full call chain

In Dapper’s scheme there are two kinds of data: call chains and annotations. On re-examination, we found that annotations are essentially logging and can be expressed through logging, while the call chain not only satisfies the needs of problem analysis but also has a compact, uniform format that is easy to generate and compress, allowing very cost-effective use of resources. The callgraph system (the red part of the kepler2.0 architecture) was therefore born, carrying the simplest and purest possible data. The core mission of the full call chain is to store the call chain data of every search query and to query it efficiently at a reasonable resource cost.

In tracing’s logical data model, the core element of the call chain is the span. A span consists of four parts: the span_id of the parent node, the span_id of the current node, the ip:port of each child node the current node accesses, and the start and end timestamps.

The core technical innovations of the full call chain are two: (1) a self-developed span_id derivation algorithm, and (2) a compression scheme customized to the data’s characteristics. Compared with kepler1.0, storage cost is reduced by about 60%. The two techniques are described below.

3.2.2.1 The span_id derivation algorithm

Note: the figure below shows two spans, 0 and 1. Each span consists of a client side and a server side, and each box is a piece of data actually written to the trace system’s storage.

Left: the kepler1.0 random-number algorithm. To join the client and server halves of a span and to reconstruct the parent-child relationships between spans, the server side of every span must also store parent_span_id, so the two spans actually require four pieces of data to be written to storage.

Right: the kepler2.0 derivation algorithm. The span_id starts from 0 at the root node; for each downstream call, the caller accumulates the downstream instance’s IP onto its own span_id to form the child’s span_id and passes it downstream, and the downstream instance continues the accumulation recursively. This guarantees that all span_ids within a query are unique. An instance only needs to store its own span_id and the downstream IPs it calls, and the client and server halves of a span can be reconstructed by the algorithm. Only two pieces of data need to be written, and parent_span_id no longer needs to be stored, which saves storage space and makes full call chain collection feasible.

In the right-hand figure, the call chain from ip1:port1 to ip2:port2 simulates accessing the same instance ip2:port2 several times. This scenario is common in the search business (for example, a query may request the same ranking service instance twice at the fusion layer, and at the scheduling layer an upstream retry after a downstream exception may land on the same instance). The derivation algorithm still guarantees that the span_ids generated within the query are unique, which ensures the integrity of the call chain data.
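The sketch below illustrates one way such a derivation could work. The string encoding and the per-call sequence number used to keep repeated calls to the same instance unique are assumptions for illustration, not kepler2.0’s actual format:

```python
from collections import defaultdict

class SpanIdGenerator:
    """Illustrative span_id derivation in the spirit described above.

    Each node starts from the span_id it received and derives a child
    span_id for every downstream call by appending the downstream
    instance's ip:port. A per-call sequence number keeps repeated calls
    to the same instance (retries, multiple requests) unique.
    """
    def __init__(self, my_span_id: str = "0"):
        self.my_span_id = my_span_id
        self._calls = defaultdict(int)

    def derive(self, downstream: str) -> str:
        seq = self._calls[downstream]
        self._calls[downstream] += 1
        return f"{self.my_span_id}.{downstream}#{seq}"

# Root of the query:
root = SpanIdGenerator("0")
a = root.derive("ip2:port2")  # "0.ip2:port2#0"
b = root.derive("ip2:port2")  # "0.ip2:port2#1" (second call, still unique)
# Each instance stores only its own span_id and the downstream ip:port it
# called; the client and server halves of a span can be re-joined from that.
```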

3.2.2.2 Data Compression

Multiple compression techniques are combined according to the characteristics of the data (a small encoding sketch follows the list below).

(1) Business level: compression is customized to the characteristics of the business data, rather than blindly applying a general-purpose algorithm.

(a) Timestamps: deltas relative to a base value plus the pfordelta algorithm are used. For fan-out services, the timestamps of the many child nodes are compressed so that only the first start timestamp and offsets from it are stored. In the search online scenario of high fan-out and short latency, storing offsets saves about 70% compared with storing two full timestamps per child.

(b) IP: the search intranet service IPs all lie in the 10.0.0.0/8 range, so the leading byte (always 10) is omitted and only the last 3 bytes are stored, saving 25% per IP address.

(2) Protobuf level: Protobuf is used for the final persistent storage at the business layer, and its serialization features are used flexibly to save storage.

(a) varint: all integers are stored as variable-length varints instead of fixed-length 64-bit values, so IPs, ports, and timestamp offsets smaller than 64 bits waste no storage.

(b) packed: the packed encoding is enabled for the repeated IP and timestamp fields. By default packed is off, so every element of a repeated field carries its own field tag, which is very wasteful. For a fan-out link with an average fan-out of 40, turning on packed saves about 25% of the storage (roughly 40 bytes of field tags).
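The sketch below illustrates the flavor of this encoding with a simple base-plus-delta scheme and protobuf-style varints (the real system uses pfordelta and protobuf’s packed encoding; the numbers are made up):

```python
def encode_varint(n: int) -> bytes:
    """Protobuf-style varint: small non-negative integers take fewer bytes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamps(base_us: int, starts_us: list[int]) -> bytes:
    """Store fan-out child start times as varint deltas from a base timestamp.

    For high-fan-out, short-latency calls the deltas are tiny, so each child
    costs a few bytes instead of a full 8-byte timestamp.
    """
    out = bytearray(encode_varint(base_us))
    for ts in starts_us:
        out += encode_varint(ts - base_us)
    return bytes(out)

# 40 children starting within ~2 ms of the first request:
base = 1_700_000_000_000_000
children = [base + 50 * i for i in range(40)]
blob = encode_timestamps(base, children)
print(len(blob), "bytes vs", 8 * (len(children) + 1), "bytes uncompressed")
```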

Finally, the logical format (above) and physical format (below) of a span are as follows:

3.2.3 Application Scenario Benefits

3.2.3.1 Time travel: a key result for a particular historical query had an unexpected recall problem at the index database level

Because the recall layer’s index database is the largest service cluster in Search, kepler1.0 supported only a 0.1% sampling rate on the index database services, which made effect problems involving a particular index database type or shard very hard to trace. Full call chain collection resolves this dilemma.

A real case: for the PC search query “Hangzhou”, the Baidu Encyclopedia result was not shown. First, a tool was used to find that the URL of the expected result lives on shard 9 of index database A. Then the full call chain showed that shard 9 was missing from every request this query sent to database A (the shard timed out, was retried, and was still dropped by the scheduling policy), and that all replicas of the shard had failed to serve. After the service was repaired, the expected result was recalled normally.

3.2.3.2 Chain analysis: stateful services create “secondary victims”

The complexity that stateful services add to effect problems: take the most common case, a cache service. Without a cache, the queryID of the abnormal query is enough to locate the cause through its call chain and logs. With a cache, a query that hits a dirty cache entry is only the “victim”: its own call chain and logs cannot locate the root cause, which requires the call chain and logs of the earlier query that wrote the cache entry. We call that query the “troublemaker”.

The limitation of kepler1.0: its sampling is random and proportional, so whether the “troublemaker” and the “victim” hit the sample are independent events. Since the troublemaker comes first, by the time the victim’s effect problem is noticed we can no longer go back in time and trigger sampling for the troublemaker, so the trace chain between the two queries breaks along the time dimension and tracing hits a dead end.

How kepler2.0 breaks through: on top of “vertical correlation” (the full call chain and logs within a single query), the full call chain enables “horizontal correlation”, supporting chained tracing of multiple related queries along the time axis. When the cache is written, the traceID of the current query is recorded in the cached result, so the “troublemaker” can be found from the queryID carried in the cache entry, and with the full call chain the reason it wrote a dirty entry can then be analyzed and located. The user interface is also tailored for time-series tracing: for example, the write-cache queryID in the log is highlighted in red, and clicking it jumps directly to that query’s call chain and log query page.
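A minimal sketch of the horizontal-correlation idea: the cache entry carries the identity of the query that wrote it, so a later “victim” query can jump straight to the “troublemaker”. Field and function names here are illustrative, not the real system’s interfaces:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    results: list
    writer_query_id: str  # traceID of the query that filled this entry
    written_at: float = field(default_factory=time.time)

CACHE: dict[str, CacheEntry] = {}

def write_cache(key: str, results: list, query_id: str) -> None:
    # The "troublemaker" leaves its identity behind when it writes the cache.
    CACHE[key] = CacheEntry(results, writer_query_id=query_id)

def read_cache(key: str):
    entry = CACHE.get(key)
    # When the reading query ("the victim") shows an effect anomaly, its own
    # trace plus entry.writer_query_id is enough to jump to the trace and
    # logs of the query that actually wrote the (possibly dirty) result.
    return entry
```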

Summary

So far, extreme data construction has removed the blind spots of troubleshooting, and the efficiency of problem analysis has become the main bottleneck. In the next part we will show how Baidu Search abstracts manual analysis experience into automated, intelligent fault handling so as to safeguard the stability of Baidu search. To be continued, stay tuned…

Authors | ZhenZhen, LiDuo, XuZhiMing

