Introduction to HyperLogLog

Redis added the HyperLogLog structure in version 2.8.9

HyperLogLog is an algorithm for cardinality statistics, that is, for counting the number of distinct (deduplicated) elements in a set

As long as the number of input elements does not exceed 2^64, a HyperLogLog needs at most 12 KB of memory to compute the cardinality. The structure uses an approximation algorithm with a standard error of 0.81%.
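The following is a minimal sketch of checking both properties, assuming a local Redis server on the default port and the redis-py client (these assumptions are not part of the original text):

    # Add 100,000 distinct elements, then compare the estimate with the true
    # count and inspect the key's memory footprint.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.delete("hll:demo")                      # hypothetical key name for the demo

    TRUE_COUNT = 100_000
    for start in range(0, TRUE_COUNT, 1000):  # add elements in batches of 1,000
        r.pfadd("hll:demo", *(f"user:{i}" for i in range(start, start + 1000)))

    estimate = r.pfcount("hll:demo")
    print(f"estimate={estimate}, error={abs(estimate - TRUE_COUNT) / TRUE_COUNT:.4%}")
    # MEMORY USAGE reports the key's footprint; a dense HyperLogLog stays
    # around 12 KB no matter how many elements have been added.
    print(f"memory={r.memory_usage('hll:demo')} bytes")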

Approaches to counting cardinality

Using an ordinary set

Storing every element in a set gives an exact count, but memory consumption grows quickly as more elements are counted and resource usage becomes heavy, so this approach is not suitable for large-scale statistics
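For contrast, here is a minimal sketch of exact counting with an ordinary Redis set via redis-py (a local Redis server and the key name set:demo are assumptions for illustration):

    # SADD deduplicates, SCARD returns the exact cardinality, but the key's
    # memory grows with every distinct element stored.
    import redis

    r = redis.Redis()
    r.delete("set:demo")
    for start in range(0, 100_000, 1000):
        r.sadd("set:demo", *(f"user:{i}" for i in range(start, start + 1000)))

    print(r.scard("set:demo"))          # exact count: 100000
    print(r.memory_usage("set:demo"))   # grows roughly linearly with the element count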

Bitmap

A bit array indicates whether each element is present: each element corresponds to one bit, so N possible elements require N bits in total. This greatly reduces the memory footprint, and bit operations are fast.

To count the cardinality of 100 million elements, about 100,000,000 / 8 / 1024 / 1024 ≈ 12 MB of memory is needed, which is a significant reduction in memory use. However, 12 MB is the cost of counting a single object; counting 10,000 such objects would need nearly 120 GB, so bitmaps are also not widely used in big-data scenarios

Unlike HyperLogLog, though, the bitmap still gives an exact count
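A minimal sketch of the bitmap approach with Redis's SETBIT and BITCOUNT via redis-py (it assumes each element can already be mapped to an integer offset, such as a numeric user ID):

    # Each element sets one bit; duplicates just set the same bit again, so
    # BITCOUNT returns the exact number of distinct offsets seen.
    import redis

    r = redis.Redis()
    r.delete("bitmap:demo")
    for user_id in (3, 42, 42, 99_999):
        r.setbit("bitmap:demo", user_id, 1)

    print(r.bitcount("bitmap:demo"))    # exact count: 3
    # Memory is bounded by the largest offset:
    # 100,000,000 offsets / 8 bits per byte / 1024 / 1024 ≈ 12 MB.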

Probabilistic algorithms

Probabilistic algorithms currently used for cardinality counting include:

  • Linear Counting (LC): an early cardinality estimation algorithm. LC is not outstanding in terms of space complexity; in fact, its space complexity is the same as that of the simple bitmap method above, O(Nmax), only with a smaller constant factor.
  • LogLog Counting (LLC): LLC saves much more memory than LC; its space complexity is only O(log2(log2(Nmax))).
  • HyperLogLog Counting (HLL): HLL is an optimization and improvement of LLC. With the same space complexity, its cardinality estimation error is smaller than that of LLC

The algorithm in plain English

Suppose we hash each element of a data set to an 8-bit string. The probability of getting 00000111 is very low; in other words, hashes that start with a long run of zeros are rare. The probability of five leading zeros in a row is 1/32, so when we do observe such a string, we can estimate that the cardinality of the data set is about 32
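A toy illustration of this intuition in plain Python (this is not Redis's actual implementation, which splits the hash across many registers and combines them with a harmonic mean to reduce the variance):

    # Hash every element, track the longest run of leading zeros seen so far,
    # and use 2^(longest run) as a crude cardinality estimate.
    import hashlib

    def leading_zeros(h: int, bits: int = 32) -> int:
        return bits - h.bit_length() if h else bits

    max_run = 0
    for i in range(50_000):
        h = int(hashlib.md5(f"element-{i}".encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_run = max(max_run, leading_zeros(h))

    print(f"longest run of leading zeros: {max_run}, crude estimate: {2 ** max_run}")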

Application scenarios

Characteristics of suitable scenarios

  • The statistics do not need to be exact: HyperLogLog is an approximation algorithm with a 0.81% standard error
  • The amount of data is huge; for small data sets HyperLogLog is overkill and wastes space
  • Only the deduplicated cardinality of the set is needed: HyperLogLog has no way to know the specific contents of the set, so it naturally cannot return the set's elements

Example Application Scenarios

  • Counting the number of distinct IP addresses
  • Counting UVs (unique visitors); a minimal sketch follows this list
  • Counting the number of distinct terms users search for each day
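A minimal sketch of the UV case with redis-py; the key scheme uv:<date> is only an illustration, not a Redis convention:

    import redis

    r = redis.Redis()

    def record_visit(user_id: str, day: str) -> None:
        # One HyperLogLog per day; repeat visits by the same user are deduplicated.
        r.pfadd(f"uv:{day}", user_id)

    def daily_uv(day: str) -> int:
        return r.pfcount(f"uv:{day}")

    record_visit("user:1", "2024-05-01")
    record_visit("user:1", "2024-05-01")    # duplicate visit, count unchanged
    record_visit("user:2", "2024-05-01")
    print(daily_uv("2024-05-01"))           # approximately 2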

Commands

PFADD

PFADD key element [element …]

Time complexity: O(1) per element added

Function description: adds the specified elements to the HyperLogLog

Return value

  • If both key and element are specified, return 1 if the HyperLogLog’s approximate cardinality changed after the command executed, and 0 otherwise
  • If only key is specified and the key already exists, 0 is returned; otherwise the key is created and 1 is returned

Example: PFADD hll a b c d e f g
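A short redis-py sketch of the return values described above (a local Redis server is assumed):

    import redis

    r = redis.Redis()
    r.delete("hll")
    print(r.pfadd("hll", "a", "b", "c", "d", "e", "f", "g"))  # 1: the estimate changed
    print(r.pfadd("hll", "a", "b"))                           # 0: nothing new was observed
    print(r.pfadd("hll"))                                     # key only: 0 here, since "hll" already exists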

PFCOUNT

PFCOUNT key [key …]

Time complexity

  • When called with a single key, the complexity is O(1) with a very small average constant time
  • When called with multiple keys, the complexity is O(N), where N is the number of keys, and the constant time is considerably larger

Function description: computes the deduplicated cardinality of the union of the elements observed by all the given keys

Return value: approximate cardinality count (standard error approx. 0.81%)

Example: PFCOUNT hll hll2
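A redis-py sketch of single-key versus multi-key PFCOUNT (a local Redis server is assumed):

    import redis

    r = redis.Redis()
    r.delete("hll", "hll2")
    r.pfadd("hll", "a", "b", "c")
    r.pfadd("hll2", "c", "d", "e")

    print(r.pfcount("hll"))             # approximately 3
    print(r.pfcount("hll", "hll2"))     # cardinality of the union: approximately 5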

Special notes

As a side effect of calling this command, the HyperLogLog may be modified, because its last 8 bytes encode the most recently computed cardinality for caching purposes. So PFCOUNT is technically a write command

Performance is also excellent when PFCOUNT is called with a single key, because PFCOUNT uses this cache to remember the previously computed cardinality, and the cached value rarely has to be recomputed since most PFADD operations do not update any registers

When PFCOUNT is called with multiple keys, the HyperLogLogs must be merged on the fly, which is slow, and the cardinality of the union cannot be cached; a multi-key PFCOUNT may therefore take on the order of milliseconds and should not be abused

PFMERGE

PFMERGE destkey sourcekey [sourcekey …]

Time complexity: O(N) to merge N HyperLogLogs, but with a high constant time

Function description: merges the given source HyperLogLogs into destkey. If destkey does not exist, an empty HyperLogLog is created first; if destkey already exists, it is treated as one of the source keys and included in the merge

Return value: only OK is returned

Example: PFMERGE hll3 hll1 hll2
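A redis-py sketch of merging two HyperLogLogs into a third (a local Redis server is assumed):

    import redis

    r = redis.Redis()
    r.delete("hll1", "hll2", "hll3")
    r.pfadd("hll1", "a", "b", "c")
    r.pfadd("hll2", "c", "d", "e")

    r.pfmerge("hll3", "hll1", "hll2")   # the server replies OK
    print(r.pfcount("hll3"))            # approximately 5: the cardinality of the union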
