
1. Background

Baidu's advertising business system is a service-oriented distributed system. Its interfaces are called tens of billions of times (PV) per day, producing TB-level monitoring data, which poses great challenges for the design of the monitoring system. Quantile values are highly sensitive to interface performance and are of great value in performance analysis.

1.1 What are quantile values?

A quantile value is the value at a given percentage rank within a data set. For example, when the 360 security software prompts at startup that "your computer beats 80% of users nationwide", it means your startup time ranks in the fastest 20% of all computers in the country.

1.2 Why quantile values?

Quantile values are very important in interface performance analysis, because many extreme requests are concentrated above the 99th percentile: they are few in number but have a large impact, and they cannot be observed through averages. The 99% of normal requests average out the extreme 1%, making the system appear responsive, while those few extreme requests give 1% of users a very poor experience.
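As a toy illustration with hypothetical numbers: if 98 of 100 requests take 10 ms and 2 take 2000 ms, the mean looks healthy while the p99 exposes the tail.

```java
import java.util.Arrays;

public class TailLatencyDemo {
    public static void main(String[] args) {
        // Hypothetical sample: 98 fast requests (10 ms) and 2 extreme requests (2000 ms).
        double[] latencies = new double[100];
        Arrays.fill(latencies, 10.0);
        latencies[98] = 2000.0;
        latencies[99] = 2000.0;

        double avg = Arrays.stream(latencies).average().orElse(0);
        Arrays.sort(latencies);
        // Nearest-rank p99: the value at rank ceil(0.99 * n) in the sorted sample.
        double p99 = latencies[(int) Math.ceil(0.99 * latencies.length) - 1];

        System.out.printf("average = %.1f ms, p99 = %.1f ms%n", avg, p99);
        // average = 49.8 ms, p99 = 2000.0 ms: the mean hides the tail.
    }
}
```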

2. Common quantile calculation schemes

2.1 Stream computing

Data samples are collected in real time and uploaded to Spark or Flink clusters for stream computing.

  • Advantages: fully accurate (up to 100%) quantile values can be obtained;

  • Disadvantages: importing the full data into a computing cluster at the scale of tens of billions of requests consumes huge resources and costs a lot. Not suitable for monitoring scenarios.

2.2 Offline Computing

In offline computing, data is imported directly into the data warehouse, and quantile values are then computed in batches by scheduled jobs.

APM online monitoring has strict real-time requirements and does not fit an offline computing architecture.

2.3 Computation on the load-testing side

Load-testing tools such as JMeter and LoadRunner collect data samples during a load test and sort and compute quantiles in real time.

  • Advantages: high precision, strong real-time performance;

  • Disadvantages: limited to a single application during the load-testing phase; the scale is limited, and there is no support for long-term trend analysis.

2.4 Ad hoc query computation

Monitoring tools such as Prometheus collect data samples from each instance and compute quantile values in real time at query time.

  • Advantages: saves computing resources; computes on demand; highly flexible;

  • Disadvantages: it consumes storage resources, especially at the scale of tens of billions of requests. The choice of storage data structure greatly affects resource cost and ad hoc query response time, and has a decisive impact on technical feasibility.


3. Divide-and-conquer computing architecture

3.1 Divide and conquer

Divide and conquer is the basic idea of big data computing, but how to merge the partial results is the key problem for quantiles. The nature of quantile values means they cannot simply be computed piecewise and then merged. Just as averages cannot be averaged again, averaging the quantile values of multiple instances of the same application has no mathematical meaning and cannot represent the quantile values of the whole cluster.

Since quantile values themselves cannot be merged, can the raw data be merged? Of course it can. And since raw data can be merged, can a summary (digest) of the raw data be merged? Yes, it can.

A digest is an abstraction of the distribution of the original data sample, equivalent to the original sample within a bounded error. Common data digest algorithms include histograms, T-Digest, the GK algorithm, and so on.
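As a minimal sketch of why digests merge while quantiles do not, the fixed-bucket histogram below (hypothetical bucket bounds, not the production structure) supports per-sample updates, bucket-wise merging, and approximate quantile queries:

```java
/** A minimal fixed-bucket histogram digest (illustrative only). */
public class HistogramDigest {
    private final double[] upperBounds; // bucket upper bounds, ascending
    private final long[] counts;
    private long total;

    public HistogramDigest(double[] upperBounds) {
        this.upperBounds = upperBounds;
        this.counts = new long[upperBounds.length];
    }

    /** Record one sample into its bucket. */
    public void add(double value) {
        int i = 0;
        while (i < upperBounds.length - 1 && value > upperBounds[i]) i++;
        counts[i]++;
        total++;
    }

    /** Bucket-wise merge: this is what makes digests associative. */
    public void merge(HistogramDigest other) {
        for (int i = 0; i < counts.length; i++) counts[i] += other.counts[i];
        total += other.total;
    }

    /** Approximate quantile: the upper bound of the bucket holding the q-th sample. */
    public double quantile(double q) {
        long rank = (long) Math.ceil(q * total), seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) return upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }
}
```

Merging two histograms and then querying gives the same answer as building one histogram over the union of samples, which is exactly the associativity the divide-and-conquer scheme relies on.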

3.2 Aggregation at the collection end

Based on data digests, each instance maintains a data distribution (a digest) locally, updates it every time a request-time sample is obtained, and thus aggregates locally. The locally aggregated data distributions are uploaded to the data warehouse at regular intervals.

This is the first-level aggregation: it compresses the millions of requests a single instance serves per hour into a data distribution occupying tens of kilobytes, greatly reducing the data size and making the tens-of-billions scenario tractable.

3.3 Merging at the convergence layer

The data distributions uploaded by the instances are stored in the data warehouse without further computation. Only when a user queries the quantile values of an interface are all the data distributions under that interface selected and merged into a single distribution, from which the quantile values are then calculated.

This is the second-level aggregation: assuming an interface has 100 instances collected hourly, there are only 2,400 data distributions per day, and merging 2,400 distributions completes in about 0.1 s.

Two-level aggregation spreads the computing overhead across tens of thousands of instances, with little impact on business performance and no additional computing resources. Combined with the flexibility of ad hoc computation, it achieves second-level query latency and fits the APM scenario very well.

4. Concrete implementation

The overall architecture is shown as follows:

▲ Overall Architecture

The process is divided into the following steps:

  1. "Interface interception": intercept each interface call to obtain its response time.

  2. "On-end aggregation": merge the intercepted response times into the local data distribution. When the aggregation interval elapses, upload the local data distribution to the data warehouse and then clear it.

  3. "Secondary aggregation": in the data warehouse, the data distributions uploaded by all instances are aggregated a second time by interface name, merging them into one total distribution per interface, from which the interface's overall quantile values are calculated.

Step 1. Interface interception

In the interface interception phase, every interface of the application is intercepted, and monitoring logic runs before and after the interface executes in order to measure its response time. The response time is then fed into the data distribution for the subsequent aggregation logic. The specific logic is shown below:

▲ Interface Interception

Interface interception can be implemented via manual instrumentation or Java bytecode enhancement.
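A minimal manual-instrumentation sketch (the production system may use bytecode enhancement instead; LocalAggregator.recordSample is a hypothetical hook standing in for the on-end aggregation described in Step 2):

```java
import java.util.function.Supplier;

public class TimedInvoker {
    /** Wraps an interface call, measures its response time, and feeds the
     *  sample to the local aggregator. */
    public static <T> T invokeTimed(String interfaceName, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            LocalAggregator.recordSample(interfaceName, elapsedMs);
        }
    }
}
```

Usage would look like `TimedInvoker.invokeTimed("getUserInfo", () -> userService.getUserInfo(id))`; bytecode enhancement simply injects the equivalent timing logic without touching business code.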

Step 2. On-end aggregation

The data distribution is a digest of the interface response-time samples and is used to approximate quantile values in the later steps. It requires a data structure with the following characteristics:

  • Describes the distribution of the original data sample

  • Accepts new data samples and updates the distribution accordingly

  • Obeys the associative law, i.e., multiple data distributions can be merged into one total distribution

  • Can be used to estimate approximate quantile values of the original data sample

  • Uses constant memory, with controllable accuracy

There are many concrete implementations of such data distributions, such as histograms, T-Digest, Q-Digest, and the GK algorithm. Quantile accuracy depends on the chosen algorithm; in practice we chose the T-Digest structure.
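Assuming the open-source t-digest Java library (com.tdunning:t-digest), a digest that receives samples and answers quantile queries looks roughly like this:

```java
import com.tdunning.math.stats.TDigest;

public class TDigestDemo {
    public static void main(String[] args) {
        // Compression factor ~100 trades memory for accuracy; higher = more precise.
        TDigest digest = TDigest.createMergingDigest(100);
        for (double sampleMs : new double[]{12, 15, 11, 230, 14, 13, 1900}) {
            digest.add(sampleMs); // update the distribution with one response time
        }
        System.out.println("p99 ≈ " + digest.quantile(0.99));
    }
}
```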

After each interface call is intercepted, the response time of that call is obtained and used to update the interface's data distribution. The process is as follows:

▲ Update distribution

Each application instance exposes several (tens to hundreds of) interfaces, and we record a separate response-time distribution for each one. We use a table-like data structure to store the distributions of the various interfaces. The aggregate items store the following data:

Instance meta information:
  • App name: App1

  • Instance: instance1 (10.20.30.40)

  • Aggregation start time: 2020-10-10 18:00:00

  • Aggregation period: 1H

  • Next submission time: 2020-10-10 19:00:00

Aggregate information: one row per interface, each holding the interface name and its data distribution.

When a data sample is collected, find the row for the interface in the table above by interface name (creating a new row if absent), then merge the sample (i.e. the response time of this call) into that interface's data distribution. The process is as follows:

▲ Aggregation
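A sketch of the per-interface table and its update path (again assuming com.tdunning:t-digest; LocalAggregator matches the hypothetical hook used in the interception sketch above):

```java
import com.tdunning.math.stats.TDigest;
import java.util.concurrent.ConcurrentHashMap;

public class LocalAggregator {
    // One digest per interface name: the "table" of aggregate items.
    static final ConcurrentHashMap<String, TDigest> table = new ConcurrentHashMap<>();

    /** Find the row for the interface (creating it if absent) and merge in the sample. */
    public static void recordSample(String interfaceName, double responseTimeMs) {
        TDigest digest = table.computeIfAbsent(interfaceName,
                name -> TDigest.createMergingDigest(100));
        synchronized (digest) { // t-digest instances are not thread-safe
            digest.add(responseTimeMs);
        }
    }
}
```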

The application instance thus locally records the response-time distribution of each interface. Each time the submission time arrives, the entire contents of the table are sent to the data warehouse and the local table is cleared. The process is as follows:

▲ Upload
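A sketch of the periodic flush, under the "aggregation period: 1H" setting from the meta information above (uploadToWarehouse is a hypothetical transport call; serialization is covered next):

```java
import com.tdunning.math.stats.TDigest;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PeriodicUploader {
    public static void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            // Snapshot then clear; samples arriving in between may be lost,
            // which is an acceptable simplification for a monitoring sketch.
            Map<String, TDigest> snapshot = new HashMap<>(LocalAggregator.table);
            LocalAggregator.table.clear();
            snapshot.forEach(PeriodicUploader::uploadToWarehouse);
        }, 1, 1, TimeUnit.HOURS);
    }

    static void uploadToWarehouse(String interfaceName, TDigest digest) {
        // Serialize + compress + encode, then send (see the serialization sketch below).
    }
}
```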

The data distribution is binary data in memory and must be serialized for upload. The following serialization practices are generally recommended (a combined sketch follows the list):

  • "Use little-endian byte order": data warehouses are mostly implemented in C/C++. If you choose a warehouse implemented in C/C++, serialize in little-endian format: since the x86 architecture is little-endian, keeping the serialized data in a consistent format makes C/C++ parsing easy. If the warehouse is implemented in Java or another big-endian environment (such as Hive), little-endian is not required. In practice we use Apache Doris, the data warehouse initiated by Baidu, and developed the quantile secondary-aggregation operator in C++.

  • "Compression": interface response-time samples tend to be strongly skewed, so the data-distribution structure contains long runs of zeros. Enabling compression significantly reduces data volume and saves network transmission and storage resources. GZIP, BZIP2, and similar algorithms are suitable; in practice we use GZIP, with a compression ratio of about 30%-40%.

  • "Encoding": because the digest data itself is binary, encoding it into text via BASE64 or a similar scheme makes it easier to transfer, store, and debug. In practice BASE64 inflates the volume by about 20%, but with storage being relatively cheap this is an acceptable trade-off.
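A combined sketch of the three recommendations, assuming the t-digest library's asBytes/byteSize serialization API (whether the resulting byte layout matches a particular C++ reader is an assumption to verify against the warehouse side):

```java
import com.tdunning.math.stats.TDigest;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class DigestCodec {
    /** Little-endian serialization + GZIP + BASE64, as recommended above. */
    public static String encode(TDigest digest) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(digest.byteSize())
                                   .order(ByteOrder.LITTLE_ENDIAN);
        digest.asBytes(buf); // the library writes through the buffer, honoring its byte order

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(buf.array(), 0, buf.position());
        }
        return Base64.getEncoder().encodeToString(compressed.toByteArray());
    }
}
```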

Step 3. Secondary aggregation

The data warehouse stores the response-time distributions of all interfaces in the distributed system. The rough table structure is as follows:

To calculate quantile value P of an interface's response time, we perform a secondary aggregation over the data-distribution items uploaded by all instances and compute P from the merged distribution. The result then represents the interface's overall quantile value across the entire application cluster (all instances).

▲ Secondary aggregation

Here, the quantile value P is specified by the user at query time. That is, any quantile can be computed on demand rather than being chosen from a preset list. With our T-Digest configuration, about 1 GB of distribution data is stored per day, and the maximum accuracy reaches about 0.1%.
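A sketch of the query-time merge (folding centroids one by one is a safe way to combine t-digests with the com.tdunning library; the production version is a C++ aggregation operator inside Apache Doris, as noted above):

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;
import java.util.List;

public class SecondaryAggregation {
    /** Merge the per-instance digests of one interface, then answer an arbitrary quantile. */
    public static double quantileAcrossInstances(List<TDigest> instanceDigests, double q) {
        TDigest merged = TDigest.createMergingDigest(100);
        for (TDigest d : instanceDigests) {
            for (Centroid c : d.centroids()) {
                merged.add(c.mean(), c.count()); // fold each centroid into the total
            }
        }
        return merged.quantile(q); // q is supplied by the user at query time
    }
}
```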


5. Technical advantages

The overall architecture is simple, high-performance, cost-controllable, easy to maintain, stable, and low-risk, meeting the quantile-calculation needs of online system monitoring in a fast and economical way. It also serves as a good reference for other big-data quantile scenarios.

