background

The monitoring module of the OpsMind low-code development platform has undergone extensive optimization work in its distributed layer to support Bilibili's monitoring data management scenarios. To better understand the space usage of each metric and the time-series distribution across storage nodes, we need to estimate the number of time series (the cardinality) of each metric, with some error allowed, so that OpsMind can balance the load across storage nodes.

Why HyperLogLog

OpsMind metrics are fully compatible with Prometheus (OpsMind extends Prometheus). For a monitored metric, the number of its time series can be regarded as the number of distinct combinations of its labels (the fingerprint of the label set). When counting the series set of each metric, deduplication is the key step. When the series set is small, this is trivially implemented with a HashMap (in Golang, a map can be used directly for deduplication). However, in most monitoring scenarios, as the number of monitored objects grows, the number of label combinations grows too, even reaching millions. With the HashMap approach, server-side memory usage climbs accordingly. Take the Golang map as an example, with the fingerprint (uint64) of the labels as the key: when the number of keys reaches 1,000,000 (100w), the map occupies about 7.6 MB, so 10,000 (1w) such metrics add up to an overall footprint of about 74 GB. The number of metrics managed at Bilibili is far more than 10,000 (1w), so a Golang map is clearly unusable for deduplicating time series.

// Fingerprint is the uint64 hash of a metric's label combination.
type Fingerprint uint64

// MetricLabelsSet deduplicates time series by their label fingerprints.
type MetricLabelsSet map[Fingerprint]struct{}

At this point, the main problem to solve is to reduce memory usage (space complexity) as far as possible. Another approach is bitmap-based cardinality counting. The principle of a bitmap is simple: individual bits mark the presence of elements. In the metric cardinality statistics scenario, we can hash the Fingerprint, take the value modulo the length of the bit array, and set the bit at that position to deduplicate. Storing the presence of 1,000,000 (100w) fingerprints this way requires at least 122 KB of memory, so 10,000 (1w) metrics occupy at least 1 GB. Although a bitmap greatly reduces the memory footprint compared to a HashMap, 1 GB (in reality much more) is still too much. In practice a large bit array has to be reserved for every metric, so metrics with small cardinality waste most of it and resource utilization is low. Even with structures that improve utilization (such as Roaring Bitmap), the resource cost remains too high at Bilibili's data volume. A minimal sketch of the bitmap approach is shown below.
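Here is a rough Go sketch of this bitmap deduplication, assuming a fixed 1,000,000-bit array to match the 100w example above; the type and function names are ours for illustration, not OpsMind's.

package main

import "fmt"

const bitmapBits = 1000000 // 1,000,000 bits ≈ 122 KB

// Bitmap marks fingerprint presence with one bit per slot.
type Bitmap []uint64

func NewBitmap() Bitmap {
	return make(Bitmap, (bitmapBits+63)/64)
}

// Add sets the bit for fp's slot and reports whether fp was unseen.
// Note: distinct fingerprints can collide after the modulo, which is
// one source of error in this scheme.
func (b Bitmap) Add(fp uint64) bool {
	pos := fp % bitmapBits
	word, mask := pos/64, uint64(1)<<(pos%64)
	seen := b[word]&mask != 0
	b[word] |= mask
	return !seen
}

func main() {
	b := NewBitmap()
	fmt.Println(b.Add(12345)) // true: first time seen
	fmt.Println(b.Add(12345)) // false: duplicate
}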

Can the memory footprint be reduced further while efficiency is preserved? Can space utilization keep improving? The answer is yes, with the algorithm described in this article: HyperLogLog (an alternative is the Bloom Filter, which is beyond the scope of this article).

HyperLogLog algorithm introduction

HyperLogLog is a probabilistic sketch algorithm, and its statistical error is acceptable in the metric cardinality statistics scenario. Its principle can be understood through an everyday Bernoulli experiment: flip a coin, where the probability of heads is 1/2, and keep flipping until heads appears; the number of flips is denoted $k$. This whole run of flips until the first heads is one Bernoulli process. Performing $n$ Bernoulli processes yields $n$ flip counts $k_1, k_2, \dots, k_n$, whose maximum is denoted $k_{max}$.

The following conclusions can be drawn:

  1. In every one of the $n$ Bernoulli processes, the number of flips is not greater than $k_{max}$.
  2. Among the $n$ Bernoulli processes, at least one process takes exactly $k_{max}$ flips.

For conclusion 1, the probability that the number of flips in every one of the $n$ Bernoulli processes is not greater than $k_{max}$ is:

$$P_n(X \le k_{max}) = \left(1 - \frac{1}{2^{k_{max}}}\right)^n$$

For conclusion 2, the probability that at least one of the $n$ Bernoulli processes takes at least $k_{max}$ flips is:

$$P_n(X \ge k_{max}) = 1 - \left(1 - \frac{1}{2^{k_{max}-1}}\right)^n$$

Then we can get:

  • When $n$ is much smaller than $2^{k_{max}}$, $P_n(X \ge k_{max}) \approx 0$, so conclusion 2 almost certainly does not hold;
  • When $n$ is much larger than $2^{k_{max}}$, $P_n(X \le k_{max}) \approx 0$, so conclusion 1 almost certainly does not hold.

So when both conclusions do hold, $n$ must be on the order of $2^{k_{max}}$, and we can use $2^{k_{max}}$ to estimate the value of $n$:

$$\hat{n} = 2^{k_{max}}$$

In other words, knowing the maximum number of flips observed in any round lets us estimate the total number of rounds. This is the core of the HyperLogLog algorithm: it tracks, over the binary representations of the hashed elements it stores, the maximum position of the first 1 bit, and uses that maximum to estimate the cardinality of the stored set. Here the position of the first 1 bit is analogous to the number of flips before heads in the coin-toss scenario, and the cardinality being estimated is analogous to the number of rounds $n$. A small illustration of this analogy follows.
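As a small illustration (a hypothetical helper, not part of OpsMind; Go's math/bits package does the bit counting):

package main

import (
	"fmt"
	"math/bits"
)

// firstOnePos returns the 1-based position of the first 1 bit, counting
// from the most significant bit; it plays the role of the flip count k.
// Note: it returns 65 for x == 0.
func firstOnePos(x uint64) int {
	return bits.LeadingZeros64(x) + 1
}

func main() {
	fmt.Println(firstOnePos(0x2000000000000000)) // 3: two leading zeros, then a 1
}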

But go back to the coin-flip scenario: if heads happens to appear only on the 3rd flip of the very first round, plugging $k_{max} = 3$ into the formula predicts a total of $2^3 = 8$ rounds, which is obviously a large error. To reduce this error, HyperLogLog introduces bucket averaging and bias correction: the hashed values are distributed into $m$ buckets, the maximum first-1 position $R_j$ is tracked separately in each bucket, and the harmonic mean over all buckets is taken. The formula is as follows:

$$DV_{HLL} = \alpha_m \cdot m^2 \cdot \left( \sum_{j=1}^{m} 2^{-R_j} \right)^{-1}$$

where $m$ is the number of buckets and $\alpha_m$ is the correction constant.

The value of $\alpha_m$ is:

/* bias-correction constant alpha_m, chosen by bucket count m */
switch (m) {
    case 16:
        alpha = 0.673;
        break;
    case 32:
        alpha = 0.697;
        break;
    case 64:
        alpha = 0.709;
        break;
    default:
        alpha = 0.7213 / (1 + 1.079 / m);
}

For the metric cardinality statistics scenario, we choose $m = 16384$ buckets (as Redis does), each 6 bits long. The fingerprint of a series is hashed into a uint64 number; the first 14 bits determine the bucket index (since $2^{14} = 16384$), and the remaining 50 bits serve as the Bernoulli process, whose first-1 position is recorded in the bucket. The memory occupied by all the buckets is $16384 \times 6$ bits $= 12$ KB. A sketch of this layout is shown below.
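To make the scheme concrete, here is a minimal dense-HLL sketch in Go. It is illustrative only: the hash choice (FNV-1a), the type names, and the use of a full byte per bucket (rather than the packed 6 bits) are our assumptions, and the small/large-range bias corrections that production implementations such as Redis apply are omitted.

package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const (
	p = 14     // bits of the hash used for the bucket index
	m = 1 << p // 16384 buckets
)

// HLL holds one max first-1 position per bucket. For simplicity each
// bucket is a full uint8 here; 6 bits per bucket suffice in practice.
type HLL struct {
	buckets [m]uint8
}

// Add hashes the element, routes it by the first 14 bits, and records
// the position of the first 1 bit in the remaining 50 bits.
func (h *HLL) Add(data []byte) {
	f := fnv.New64a()
	f.Write(data)
	x := f.Sum64()
	idx := x >> (64 - p)                           // first 14 bits: bucket index
	rank := uint8(bits.LeadingZeros64(x<<p|1)) + 1 // first-1 position in the rest
	if rank > h.buckets[idx] {
		h.buckets[idx] = rank
	}
}

// Count returns the raw estimate alpha_m * m^2 / sum_j 2^(-R_j).
func (h *HLL) Count() float64 {
	alpha := 0.7213 / (1 + 1.079/float64(m))
	sum := 0.0
	for _, r := range h.buckets {
		sum += math.Pow(2, -float64(r))
	}
	return alpha * float64(m) * float64(m) / sum
}

func main() {
	var h HLL
	for i := 0; i < 100000; i++ {
		h.Add([]byte(fmt.Sprintf("series-%d", i)))
	}
	fmt.Printf("estimated cardinality: %.0f\n", h.Count()) // close to 100000
}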

To reduce resource usage further, some HLL implementations introduce a sparse storage structure alongside the dense one: values start in the sparse representation and are converted into the dense structure once the sparse form reaches a certain size. Interested readers can look up the relevant implementations themselves.

conclusion

In metric cardinality statistics scenarios, using HLL to estimate the number of time series keeps the memory footprint low and stable while keeping the estimation error within an acceptable range.