The invention relates to a system and a method based on real time division digit calculation

Performance analysis and other scenarios have a strong demand for real time division bits. In the calculation of cumulative duration, the duration of different time periods can be simply added up, while the quantile cannot first calculate the quantile values under different dimensions and then aggregate them directly. This feature brings great challenges to real-time calculation. Based on TDigest data structure, we use high-performance storage such as Redis and Doris to pre-calculate the quantile value index of all possible queries, which can not only calculate the index quickly, but also guarantee the query efficiency. This system has already carried out the output of Baidu’s internal kernel performance, network performance and other business scenarios, and can effectively meet the needs of business analysis of high timeliness.

Full text 3663 words, estimated reading time 10 minutes.

I. Problem description and technical challenges

In practical work, we found that many business scenarios for a numeric indicators in real time the demand of the quantile statistics, general requirements calculation result has a high accuracy and low computational delay, realize the development of this kind of demand for data RD work bring certain challenges, the main technical challenges include the following three aspects:

Unable to sort the full amount of data: Because the data is processed item by item in the real-time computing scenario, the full amount of data cannot be sorted, and then the quantile of the full amount of data cannot be obtained

The calculation logic is complex and the calculation delay is high: even in the scenario where sorting can be done, the sorting operation with high complexity will also bring high calculation delay, which cannot meet the low delay requirements of real-time calculation

The quantile results cannot be aggregated: the results of two calculated quantiles cannot be directly accumulated and combined to get a new result as the result of summation, which poses a challenge to the way the quantile results are stored

To solve the above problems, based on the TDigest data structure, we implement the quantile calculation method in the real-time computing environment, encapsulate as the basic components and provide API interfaces, which can provide real-time and accurate quantile calculation under different business scenarios (kernel performance, search performance, PUSH, etc.).

II. Infrastructure and solutions

In this section, we will introduce the quantile calculation method under the streaming computing scenario from the common data structure for calculating the quantile, our infrastructure for implementing the quantile calculation, and the solution:

2.1 Common data structures for quantiles

TDigest calculates the quantile

TDigest is a simple, fast, accurate and parallelizable approximate percentile algorithm used by Spark, ES, Kylin and other systems. The core idea of TDigest is to aggregate discrete data points into multiple different centroid by clustering method, and calculate the quantile by linear interpolation method, which is the simplest interpolation algorithm.

Generally speaking: the traditional method is to sort the discrete data, and the score number is directly received in the sorting result. TDigest, on the other hand, clusters the discrete data into multiple centroid, and then “sorts” the centroid approximately, and finally obtains the quantile by interpolation method.

As shown in the above, the discrete data points (colorless data points) for a number of different clustering center of mass data points (color), each of which around the center of mass of data points determines the weight of the center of mass of the size of the center of mass of the (figure), finally, to sort all of the mass center, you can use the linear interpolation method to calculate the corresponding quantile, The distance and weight relationship between data points and the center of mass are shown in the figure below.

In particular, there is an important compression parameter at the time each TDigest is created, which is used to trade off computational accuracy against space complexity:

When the larger COMPRESSION parameter was set, the more centroid was obtained by clustering, and the higher quantile accuracy was obtained by the difference method

When the larger COMPRESSION parameter is set, the larger storage space occupied by TDigest data structure will be, and the higher space complexity of quantile calculation will be

Setting an appropriate compression parameter can not only improve the calculation accuracy, but also reduce the storage space as far as possible, so as to meet the actual requirements of the business

In order to help you select appropriate parameters when calculating the quantile, we choose the data volume of millions (i.e., the quantile of 100W random variables). The calculation accuracy and space complexity under different parameters are shown in the following table:

According to the data shown in the above table, we will make the following three explanations:

The MergingDigest data structure was used in this test. The space occupied by this structure was related to the value of compression parameter, and had nothing to do with the statistical data amount.

With the increase of data volume, the value of compression should be appropriately increased, which can effectively improve the accuracy of calculation

2.2 Infrastructure built by quantile

Due to the fact that real-time quantile calculation is a common statistical method, similar requirements will be proposed in many business scenarios, and different quantiles will be calculated for the statistical indicators concerned by the demand side.

In order to save the labor cost and shorten the time cycle of iterative development, we encapsulate the common basic components based on the TDigest data structure, so as to realize the development of real time division bit statistics quickly under different business scenarios.

As shown in the figure above, the infrastructure and execution of the common components of real time bit calculation are divided into the following key steps:

1) Read the raw data requiring statistical quantiles from the upstream business party

2) According to the grouping rules required by the business side, aggregate the aggregated data into TDigest data structure according to the grouping, and store the aggregated results in Redis, or merge them with the corresponding data already in Redis, so as to obtain accurate calculation results

3) Get the calculated result of the quantile from the TDigest structure and return it upward

Above all, we provide the API by encapsulating the basic components and upward, to realize the general, flexible, and the application of the transparent quantile calculation method, to ensure real-time performance at the same time, achieve high accuracy, low space complexity quantile computation, is now in the performance platform, search, and PUSH factory ground application among multiple business requirements

2.3 Overall implementation plan

Based components, based on the introduction of real time digits in most business scenarios in the factory, usually from the message queue for applications submitted to the original data, through a series of analytical and calculated, and the calculated results stored in OLAP engines or DB such Doris, total demand query and generate corresponding reports, this is a generic solution.

Based on the above analysis, we can obtain the basic architecture of a quantile real-time computation operation, and its architecture model is shown in the figure below:

As shown in the figure above, in an on-site environment, the basic architecture commonly used for a real-time-bit calculation task consists of the following key steps:

1) Read the basic data reported by the business party from the message queue, and analyze the data according to the business logic;

2) Using the FlatMap method, expand one data into multiple fields (detailed content will be introduced in Section 3);

3) Group the data by different keys according to the query dimension of the business design

4) Merge the data of each key into a TDigest data structure

5) Merge the aggregated data with the data stored in Redis, and write the merged results back to Redis

6) Finally, according to the data aggregation structure, the corresponding quantile is obtained from the corresponding TDigest structure of each group

Through the above steps, the calculation method of real time hours with high real-time performance, high accuracy and low space complexity can be realized, which can meet the requirements of most real time hours services. In more business scenarios, appropriate adjustments may be required according to the actual requirements.

3. Solve the problem that quantiles cannot aggregate

3.1 Description of problem

In a real business requirement, we might need to retrieve the quantiles of statistics by different time, query dimensions, and so on. However, the results of the calculated two quantiles cannot be aggregated.

For example, for the user visit time of Shubai APP, we can SUM the visit time of each hour in a certain day, so as to obtain the total visit time of that day. However, if we record the 80 quantiles of the length of access in each hour, we cannot aggregate these quantiles, that is, we cannot calculate the 80 quantiles of the length of access in that day. This phenomenon is called the “incompatibilities” of the quantile

Therefore, in practical applications, if the business needs to perform any aggregation, query and other operations on the indicator quantiles in different times and dimensions, it will pose new technical challenges for the calculation and storage of quantiles.

3.2 Quantile aggregation scheme

According to the above problem, we put forward in advance according to all the query dimension calculation solution polymerization, the possible query dimension for each kind of combination, we calculate in advance quantile and store, so in the process of query retrieval polymerization corresponding query dimension calculation results directly, in solves the quantile problem “is polymerized” at the same time, It also avoids the time cost caused by repeated aggregate calculation, shortens the query time and improves the user experience.

Next, let’s take a simple example to explain the concrete method of aggregation calculation. Let’s assume that in a business scenario, there are three fields in the query dimension that users are interested in: APP version (APP \_version), manufacturer, and client operating system version (OS \_version). Therefore, users may use 2^3=8 aggregate query methods for any combination of dimensions. Therefore, we count the quantiles separately by permutation and combination of all possible aggregate query methods in enumeration. Assume that part of the data read from upstream is shown in the following table:

In addition, if an aggregate query is performed on a certain field, and the value of the field is marked as the keyword “ALL”, then this data corresponds to 2^3=8 possible aggregate query modes. In order to simulate 8 different permutations and combinations of dimensions, we use binary permutations and combinations to make each field strictly correspond to a bit in binary data: if the value of this bit is 0, then the field content is the original value reported (the actual value in the above table); If the value of this bit is 1, then the value of the corresponding field is marked as the keyword “ALL”. In addition, the corresponding relation of each bit from right to left in the binary data to the field is:

Bit 1 corresponds to OS \_version

Position 2 corresponds to Manufacturer

Bit 3 corresponds to app\_version

As a result, how the aggregate query for any field is arranged is shown in the following table:

In this way, we enumerate all possible dimension combination queries through binary permutations and combinations. In the actual calculation process, the FlatMap operator of the streaming calculation can be used to expand a single data into multiple data according to the above perperation and combination mode, and then group and aggregate the data, calculate the quantile, and store the final calculation results in the storage engine such as Doris for users to query. At this point, all possible aggregated query methods have been included in the calculation results, and the business party can directly query the final quantile results according to needs without additional aggregated calculation operations, which can effectively improve the query efficiency and guarantee the user experience.

Four, conclusion

The above content is a brief introduction of the core technology, infrastructure and technical difficulties of the real time division bit calculation method from a macro perspective. If you have any questions or suggestions, please feel free to communicate.

This author | son Yang, responsible for baidu performance platform development of real-time data, main research direction for the intelligent prediction of flow calculation, etc

Recruitment information

The R&D Department of Baidu APP Technology Platform is responsible for the construction of Baidu APP and Baijiahao technology platform, and is also responsible for the construction of a series of benchmark platforms such as PUSH, message, interaction, transaction, log, performance, audit and B terminal. Welcome to join us and look forward to your arrival!

Whether you are back end, front end, or big data, there are a number of positions waiting for you here, welcome to submit your resume, [contact information of the same name public account Baidu Geek said, input the internal push] Baidu APP Technology Platform R&D Department is looking forward to your joining!

Read the original

|The invention relates to a system and a method based on real time division digit calculation