Suppose we need to design a scheme for counting page UV (unique visitors) with Redis. What are the options? One solution that comes to mind easily is to use a Set to store the ID of every user who visits the page: the Set structure deduplicates naturally, so the size returned by scard is exactly the page UV. However, when the page UV is very large, storing it in a Set wastes a lot of space. Redis provides the HyperLogLog data structure for exactly this kind of counting problem: it saves memory while keeping the result within an acceptable error.

HyperLogLog in Redis

HyperLogLog in Redis provides two commands, pfadd and pfcount: pfadd adds elements to the count and pfcount retrieves the count. For the page-UV scenario above, each time a user visits the page we add the user ID with the pfadd command, and finally we execute pfcount to obtain the page UV.

HyperLogLog algorithm introduction

In fact, HyperLogLog is an algorithm that provides approximate, rather than exact, deduplicated counting. It has the following characteristics:

  1. It can count huge amounts of data with very little memory: the HyperLogLog implementation in Redis needs only 12 KB of memory to count up to 2^64 elements.
  2. The count has some error, but the overall error rate is low: the standard error is 0.81%.
  3. The error can be further reduced by setting auxiliary correction factors.

Bernoulli test

The Bernoulli trial is a concept from probability theory; its classic illustration is flipping a coin.

A coin has two sides, so each flip comes up heads or tails with probability 1/2. The probability of getting heads on the first flip is 1/2, the probability of needing exactly 2 flips to see the first heads is 1/4, and in general the probability that the first heads appears on the k-th flip is 2^(-k). Now suppose we keep flipping a coin until it comes up heads and record that as one complete experiment. It might take a single flip, or it might take four; however many flips it takes, the experiment ends as soon as heads appears. Such an experiment is called a Bernoulli trial.
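To make the setup concrete, here is a minimal Python simulation of the experiment described above (the function name `bernoulli_trial` is just illustrative):

```python
import random

def bernoulli_trial(rng: random.Random) -> int:
    """Flip a fair coin until heads comes up; return the number of flips."""
    flips = 1
    while rng.random() < 0.5:  # treat < 0.5 as tails: flip again
        flips += 1
    return flips

rng = random.Random(42)
counts = [bernoulli_trial(rng) for _ in range(100_000)]

# As predicted, roughly half of all trials end on the very first flip.
share_one_flip = counts.count(1) / len(counts)
```

Repeating the experiment many times lets you check the 2^(-k) probabilities empirically: about a quarter of trials should take exactly two flips, an eighth three flips, and so on.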

Now consider n Bernoulli trials, and suppose the i-th trial takes k_i flips: the first trial takes k_1 flips, and so on, up to the n-th trial with k_n flips. Among these n trials there must be a maximum number of flips, k_max.

Finally, combining this with the maximum likelihood estimation method, one finds an estimation relationship between n and k_max: n × 2^(-k_max) = 1, that is, n = 2^(k_max).
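The relationship n ≈ 2^(k_max) can be checked with a small simulation (a sketch; the variable names are my own). Note how noisy a single round is, which motivates the optimization in the next section:

```python
import random

def flips_until_heads(rng: random.Random) -> int:
    """One Bernoulli trial: number of fair-coin flips until the first heads."""
    flips = 1
    while rng.random() < 0.5:
        flips += 1
    return flips

rng = random.Random(7)
n = 1024  # the true number of trials we want to recover

# k_max over n trials, then the estimate n ~= 2^(k_max)
k_max = max(flips_until_heads(rng) for _ in range(n))
estimate = 2 ** k_max
```

Because the estimate can only be a power of two and k_max is dominated by a single extreme trial, one round is typically off by a factor of two or more.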

Optimizing the estimate

We call the n Bernoulli trials above one round of the experiment. When n is large enough, the margin of error is relatively small, but it is still not small enough. If instead we run multiple rounds, take the k_max of each round, average those values, and then estimate n from that average, the error shrinks further. This gives the LogLog estimation formula:

DV_LL = constant × m × 2^R̄

Here DV_LL corresponds to the estimate of n; constant is a correction factor whose exact value is not fixed and can be set according to the actual situation; m is the number of rounds; and R̄ (R with a bar over it) is the average of the per-round maxima: (k_max_1 + … + k_max_m) / m.

This optimization of increasing the number of rounds and then averaging the k_max values is LogLog's approach. HyperLogLog differs from LogLog in that it does not use the arithmetic mean but the harmonic mean. The advantage of the harmonic mean over the arithmetic mean is that it is less susceptible to large outliers.
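A small numeric illustration of that robustness (the numbers are made up): with one outlier among otherwise similar values, the arithmetic mean is dragged far away while the harmonic mean barely moves.

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    # the reciprocal of the average of reciprocals
    return len(xs) / sum(1 / x for x in xs)

# Four per-round maxima where one round produced a large outlier.
values = [2, 2, 2, 100]
am = arithmetic_mean(values)  # 26.5 -- pulled far up by the outlier
hm = harmonic_mean(values)    # ~2.65 -- stays near the typical value
```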

Program implementation

We already know that, in the coin-toss setting, n can be estimated from the k_max observed across a round of Bernoulli trials.

Converting data to a bit string

A hash function converts the data to a bit string. For example, the input 5 converts to 101. The purpose of this is to mimic a coin flip, where 0 is tails and 1 is heads. If a number is eventually converted to 10010000, then reading from right to left, the position where the first 1 appears corresponds to the flip on which heads first came up. Based on the estimation conclusion above, we can estimate how many distinct values have been stored from the largest position k_max at which the first 1 appears among the stored values, namely n = 2^(k_max).
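This "position of the first 1, reading from the right" can be written as a short helper (a sketch; the name `first_one_position` is my own):

```python
def first_one_position(bits: int) -> int:
    """1-based position, counting from the right, of the lowest set bit."""
    if bits == 0:
        raise ValueError("no 1 bit present")
    pos = 1
    while not (bits >> (pos - 1)) & 1:
        pos += 1
    return pos

# 0b10010000: reading right to left, the first 1 is at position 5,
# i.e. the "first heads" appeared on the 5th flip, so k for this value is 5.
k = first_one_position(0b10010000)
```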

Bucketing

Bucketing is how the rounds are realized: each bucket plays the role of one round. Concretely, when the input data is hashed to a bit string, part of the bits is used to determine the bucket number first, and then the position of the first 1 in the remaining bits is recorded into that bucket.

Redis HyperLogLog implementation

In Redis's HyperLogLog, the data is divided into 2^14 buckets. Each bucket has 6 bits, so the memory occupied is 2^14 × 6 / 8 / 1024 = 12 KB.

When we execute the command `pfadd key value`, the value is first converted to a 64-bit string by a hash function. The first 14 bits are used to identify the bucket (matching the 2^14 buckets), and the remaining 50 bits are used to find the position of the first 1 (counting from right to left). Because that position is at most 50, 6 bits per bucket are enough to store it.

Different values are set into different buckets. If two values fall into the same bucket but the positions of their first 1 differ, the new value is compared with the value already stored in the bucket: if it is larger, it replaces the old one; otherwise the bucket keeps its original value.
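Putting the last two sections together, a simplified Python sketch of what happens on `pfadd` might look like this. This is not Redis's actual code: SHA-256 serves as a stand-in 64-bit hash, and the names are my own.

```python
import hashlib

def bucket_and_rank(value: str):
    """Split a 64-bit hash into a 14-bit bucket index and the position
    (1-based, from the right) of the first 1 in the remaining 50 bits."""
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    bucket = h >> 50              # top 14 bits pick one of the 2^14 buckets
    rest = h & ((1 << 50) - 1)    # low 50 bits play the coin-flip role
    rank = 1
    while rank <= 50 and not (rest >> (rank - 1)) & 1:
        rank += 1
    return bucket, rank           # rank is at most 51, so 6 bits suffice

registers = [0] * (1 << 14)       # 2^14 buckets of (conceptually) 6 bits

for user in ("user:1", "user:2", "user:3"):
    b, r = bucket_and_rank(user)
    registers[b] = max(registers[b], r)  # replace only if the new rank is larger
```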

In the end, each of the 2^14 buckets holds the maximum position k_max at which the first 1 appeared. When pfcount is executed, the estimation described earlier is applied to these buckets to calculate how many distinct values have been added to the key; the statistic is roughly 2^14 × 2^(average k_max), with HyperLogLog using the harmonic mean and a correction factor rather than a plain average.
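As an end-to-end check, here is a minimal HyperLogLog sketch in Python. Again, this is not Redis's code: SHA-256 stands in for the 64-bit hash, the names `mini_pfadd`/`mini_pfcount` are made up, and `ALPHA` is the standard correction constant from the HyperLogLog paper. Adding 100,000 distinct values and estimating with the harmonic mean lands close to the true count:

```python
import hashlib

M = 1 << 14                        # 2^14 buckets, as in Redis
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard correction factor for large M

def mini_pfadd(registers, value: str) -> None:
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    bucket = h >> 50               # first 14 bits: bucket number
    rest = h & ((1 << 50) - 1)     # remaining 50 bits: the "coin flips"
    rank = 1
    while rank <= 50 and not (rest >> (rank - 1)) & 1:
        rank += 1
    registers[bucket] = max(registers[bucket], rank)

def mini_pfcount(registers) -> float:
    # harmonic mean of 2^register across buckets, scaled by ALPHA * M^2
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

regs = [0] * M
for i in range(100_000):
    mini_pfadd(regs, f"user:{i}")

estimate = mini_pfcount(regs)      # should land close to the true 100,000
```

Redis additionally applies corrections for very small and very large cardinalities, and packs the registers into actual 6-bit slots; this sketch skips both.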



Welcome to check out my open source project: a lightweight HTTP invocation framework for SpringBoot.