Hello, I’m Li Xiaobing.

Elasticsearch is a distributed, open-source search and analytics engine that can perform not only full-text search but also aggregation analysis.

Today we take a look at percentile analysis, a common kind of aggregation analysis. Given N values sorted in ascending order, the value at the p% position is called the p-th percentile. For example, the familiar TP99 metric, which Elasticsearch can compute, is the 99th percentile of request latency: 99% of all requests complete within that latency.
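To make the definition concrete, here is a minimal exact computation using the nearest-rank method (the class name and sample data are illustrative, not from Elasticsearch):

```java
// Exact nearest-rank percentile for a small in-memory data set.
import java.util.Arrays;

public class ExactPercentile {
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // nearest-rank: the value at position ceil(p/100 * N), 1-based
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] latencies = {75, 80, 120, 150, 200, 250, 310, 420, 500, 600};
        System.out.println("TP99 = " + percentile(latencies, 99)); // -> 600.0
    }
}
```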

Approximate algorithms

Percentile analyses such as TP99 are easy to perform when the data volume is small or the data is stored in one place. However, as the data volume keeps growing, aggregation analysis has to make a trade-off among three aspects: data volume, accuracy, and real-time performance; only two of the three can be satisfied at once.

As you can see, there are three options:

  • Limited data: choose accuracy and real-time performance, but give up on handling larger data volumes; for example, MySQL performing statistical analysis on single-node data.
  • Offline computation: choose large data volume and high accuracy at the cost of real-time performance; for example, Hadoop can run exact analyses over petabytes of data, but they may take a long time.
  • Approximate computation: choose large data volume and real-time performance while sacrificing some accuracy (say, 0.5%), still delivering reasonably accurate results.

Elasticsearch currently supports two approximate aggregations: cardinality and percentiles.

Cardinality computes the cardinality of a field, that is, the number of distinct or unique values of that field. It is based on the HyperLogLog (HLL) algorithm: HLL first hashes the data, then estimates the cardinality probabilistically from the bit patterns of the hash results. For details on the HLL algorithm, see Redis HyperLogLog in Detail.
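As a rough illustration of that idea, here is a toy HLL sketch (my own simplification, not Elasticsearch's code; the small- and large-range corrections from the HLL paper are omitted):

```java
// Toy HyperLogLog: hash each value, route it to a register by its top bits,
// and track the longest run of leading zeros seen in the remaining bits.
public class HllSketch {
    static final int B = 14;          // register index bits
    static final int M = 1 << B;      // 2^14 registers
    final byte[] registers = new byte[M];

    // callers are expected to pass a good 64-bit hash of the original value
    void offerHash(long hash) {
        int idx = (int) (hash >>> (64 - B));   // top B bits pick a register
        long rest = hash << B;                 // drop the index bits
        int rank = Math.min(64 - B + 1, Long.numberOfLeadingZeros(rest) + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    double estimate() {
        double sum = 0;
        for (byte r : registers) sum += Math.pow(2, -r);
        double alpha = 0.7213 / (1 + 1.079 / M);   // standard bias constant
        return alpha * M * M / sum;                // raw HLL estimate
    }
}
```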

Percentiles are the focus of this article.

Percentiles

Elasticsearch can use a percentiles aggregation to analyze the percentiles of the latency field in the logs index, as sketched below.
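A sketch of issuing that aggregation with the Java high-level REST client (assuming Elasticsearch 7.x; the aggregation name latency_percentiles is mine):

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.Percentiles;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class LatencyPercentiles {
    static void run(RestHighLevelClient client) throws java.io.IOException {
        SearchRequest request = new SearchRequest("logs");
        request.source(new SearchSourceBuilder()
            .size(0)   // we only want the aggregation, not the matching docs
            .aggregation(AggregationBuilders
                .percentiles("latency_percentiles")
                .field("latency")));

        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        Percentiles result = response.getAggregations().get("latency_percentiles");
        System.out.println("TP99 = " + result.percentile(99));
    }
}
```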

Percentiles returns a preset list of percentiles by default: [1, 5, 25, 50, 75, 95, 99]. They cover the values people commonly care about: extreme percentiles at both ends of the range and a few in the middle. In the original example's results, the lowest percentile sits around 75 ms of latency and the highest around 600 ms, while the average latency is around 200 ms.
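If the defaults aren't what you need, the percents can be overridden; a small sketch under the same assumptions as above:

```java
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.PercentilesAggregationBuilder;

public class CustomPercents {
    static PercentilesAggregationBuilder custom() {
        return AggregationBuilders.percentiles("latency_percentiles")
            .field("latency")
            .percentiles(50, 95, 99, 99.9);   // compute only these percentiles
    }
}
```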

As with cardinality, calculating percentiles requires an approximation algorithm.

For a small data set, maintaining a sorted list of all values in memory is enough to compute any percentile, but that approach is impractical when billions of values are spread across dozens of nodes.

Percentiles therefore uses the TDigest algorithm, an approximation algorithm whose accuracy varies across percentiles: the more extreme the percentile, such as the 1st or 99th, the more accurate it is compared with the 50th. This is a welcome property, because most people care mainly about the extreme percentiles.

TDigest

TDigest is a simple, fast, accurate, parallelizable approximate percentile algorithm used by systems such as Elasticsearch, Spark, and Kylin. TDigest has two main implementation strategies: a buffer-and-merge algorithm and an AVL tree clustering algorithm.

TDigest uses sketching, a technique common in approximation algorithms: it depicts the features of an entire data set with only part of the data, much like an everyday pencil sketch, which differs from the real thing yet resembles it closely enough to show its characteristics.

Let's take a look at how TDigest works. Suppose we have 500 numbers between -30 and 30; we can represent this data set with its probability density function (PDF). The y value at a point on the curve is the probability density of the corresponding x value in the data set, and the total area under the curve is exactly 1, so the curve describes how the data is distributed (the familiar bell-shaped diagram of the normal distribution is such a function).

Given the PDF of a data set, a percentile of the data set can also be expressed as an area under the PDF: the 75th percentile is the x-coordinate at which the accumulated area reaches 0.75.
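In symbols, writing $f$ for the PDF and $x_p$ for the p-th percentile:

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1, \qquad \int_{-\infty}^{x_p} f(x)\,dx = \frac{p}{100}$$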

The points on the PDF curve correspond to the data in the data set. When the data set is small, we can use all the points to compute the function; when it is large, we must represent the full data set with a much smaller amount of data.

To do so, we group the data, putting adjacent values into the same group, and replace each group with two numbers: its mean and its weight (the count of values in the group). This pair is called a centroid, and the centroids are then used to compute the PDF; this is the heart of the TDigest algorithm.
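In code, a centroid can be as simple as this (a minimal sketch; the real t-digest classes carry more bookkeeping):

```java
// One centroid: a group of adjacent values summarized by mean and weight.
class Centroid {
    double mean;   // average of the values assigned to this group
    long weight;   // how many values the group represents

    Centroid(double mean, long weight) { this.mean = mean; this.weight = weight; }

    // Fold one new value into the group, updating the running mean.
    void add(double x) {
        weight += 1;
        mean += (x - mean) / weight;
    }
}
```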

Plotting each centroid with its mean as the x value and its weight as the y value yields a set of points from which the data set's PDF can be roughly drawn.

Then, to compute a percentile, we only need to find the centroid that covers that position in the cumulative weights; its mean is the approximate percentile value.
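A simplified sketch of that lookup, reusing the Centroid class above (the real implementation interpolates between neighboring centroids instead of returning a single mean):

```java
import java.util.List;

public class CentroidQuantile {
    // centroids must be sorted by mean; q is in [0, 1]
    static double quantile(List<Centroid> sortedByMean, double q) {
        double total = 0;
        for (Centroid c : sortedByMean) total += c.weight;

        double target = q * total;   // how much weight lies below the q-quantile
        double seen = 0;
        for (Centroid c : sortedByMean) {
            seen += c.weight;
            if (seen >= target) return c.mean;   // the covering centroid's mean
        }
        return sortedByMean.get(sortedByMean.size() - 1).mean;
    }
}
```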

Obviously, the larger a centroid's weight, the more data it represents and the more information is lost, hence the lower the accuracy. Centroids that are too large lose too much precision, while centroids that are too small consume too much memory and other resources, defeating the high real-time performance an approximation algorithm is supposed to offer.

Therefore, TDigest uses the compression ratio (the higher it is, the more data each centroid represents) to control how much data each centroid stands for in percentile calculations. Centroids near both ends are kept smaller and therefore more accurate, while centroids in the middle are allowed to grow larger, which is exactly why the 1st or 99th percentile is more accurate than the 50th.
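For reference, the t-digest paper makes this precise with a scale function, where $\delta$ is the compression parameter; a centroid spanning the quantile range $[q_{\text{left}}, q_{\text{right}}]$ is only allowed to grow while $k(q_{\text{right}}) - k(q_{\text{left}}) \le 1$:

$$k(q) = \frac{\delta}{2\pi} \arcsin(2q - 1)$$

Because $k$ changes fastest near $q = 0$ and $q = 1$, centroids at the extremes stay small while those near the median can grow large.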

Source code analysis

Elasticsearch uses t-digest, an open-source implementation of TDigest whose GitHub address is github.com/tdunning/t-… Elasticsearch's TDigestState class wraps this t-digest implementation.

The t-digest library provides two implementations of the TDigest algorithm:

  • MergingDigest, corresponding to the buffer-and-merge algorithm described above
  • AVLGroupTree, corresponding to the AVL tree clustering algorithm

MergingDigest is used when the data set is already sorted; it computes the centroids directly from the compression ratio. AVLGroupTree, by contrast, first uses an AVL tree to locate the centroid nearest to each incoming value before computing the centroids.
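Both variants are reachable through factory methods on the library's TDigest base class; the compression value 100 below is an arbitrary choice:

```java
import com.tdunning.math.stats.TDigest;

public class DigestFactories {
    static TDigest merging = TDigest.createMergingDigest(100);   // buffer-and-merge variant
    static TDigest avlTree = TDigest.createAvlTreeDigest(100);   // AVL-tree clustering variant
}
```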

MergingDigest's implementation is relatively simple. True to the buffer-and-merge name, MergingDigest buffers incoming data in two arrays, tempWeight and tempMean, that represent centroids, and merges the buffered data with the saved centroids: if the weight limit would be exceeded, a new centroid is created; otherwise the current centroid's mean and weight are updated.
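A simplified sketch of the buffer-and-merge idea (my own condensation, not MergingDigest's actual code; a fixed weight budget stands in for the quantile-dependent limit, and the real algorithm repeatedly re-merges new buffers with the existing centroids):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BufferAndMerge {
    // Sort the buffered values, then sweep once, growing the current centroid
    // until its weight budget is exhausted.
    static List<double[]> compress(List<Double> buffer, double weightBudget) {
        Collections.sort(buffer);
        List<double[]> centroids = new ArrayList<>();   // {mean, weight} pairs
        for (double x : buffer) {
            double[] last = centroids.isEmpty() ? null : centroids.get(centroids.size() - 1);
            if (last != null && last[1] + 1 <= weightBudget) {
                last[1] += 1;                        // grow the current centroid
                last[0] += (x - last[0]) / last[1];  // update its running mean
            } else {
                centroids.add(new double[]{x, 1});   // start a new centroid
            }
        }
        return centroids;
    }
}
```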

Compared with MergingDigest, AVLGroupTree has one extra step: it searches an AVL balanced binary tree for the centroid closest to the incoming value. After finding the nearest centroid, AVLGroupTree likewise merges the two, checks whether the weight exceeds the upper limit, and then either modifies the existing centroid or creates a new one.
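A sketch of that nearest-centroid step, substituting a TreeMap keyed by centroid mean for the library's AVL tree:

```java
import java.util.Map;
import java.util.TreeMap;

public class NearestCentroid {
    final TreeMap<Double, Long> centroids = new TreeMap<>();   // mean -> weight

    void add(double x, long weightLimit) {
        Map.Entry<Double, Long> below = centroids.floorEntry(x);
        Map.Entry<Double, Long> above = centroids.ceilingEntry(x);
        // pick whichever existing centroid lies closest to x
        Map.Entry<Double, Long> nearest =
            (below == null) ? above
          : (above == null) ? below
          : (x - below.getKey() <= above.getKey() - x) ? below : above;

        if (nearest != null && nearest.getValue() + 1 <= weightLimit) {
            long w = nearest.getValue() + 1;                   // merge x in
            double mean = nearest.getKey() + (x - nearest.getKey()) / w;
            centroids.remove(nearest.getKey());
            centroids.put(mean, w);
        } else {
            centroids.put(x, 1L);                              // open a new centroid
        }
    }
}
```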

Next, let's look at the add method of AVLGroupTree (the full listing is in the t-digest repository).

When Elasticsearch processes a data set, it keeps adding values from the data set into the centroids by calling add, and then calls quantile to compute a percentile.
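Putting it together with the real library API (add and quantile are actual t-digest methods; the data here is made up):

```java
import com.tdunning.math.stats.TDigest;
import java.util.Random;

public class TDigestDemo {
    public static void main(String[] args) {
        TDigest digest = TDigest.createMergingDigest(100);
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(rnd.nextGaussian());   // stream values in one at a time
        }
        // quantile(q) takes q in [0, 1]; 0.99 is the 99th percentile
        System.out.println("p50 ≈ " + digest.quantile(0.50));
        System.out.println("p99 ≈ " + digest.quantile(0.99));
    }
}
```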

Afterword

Thanks for following programmer Li Xiaobing; upcoming articles will keep covering data storage, data analysis, and distributed systems. In the next article we'll return to the implementation of Elasticsearch's other aggregation operations.

You're also welcome to drop by my personal blog.

References

  • blog.bcmeng.com/post/tdiges…
  • blog.bcmeng.com/pdf/TDigest…
  • github.com/tdunning/t-…
  • op8867555.github.io/posts/2018-…