HyperLogLog is an algorithm for cardinality estimation, i.e. counting the number of distinct elements in a dataset.

So what exactly is cardinality? Take the dataset {1, 3, 5, 7, 5, 7, 8}: its set of distinct elements is {1, 3, 5, 7, 8}, so its cardinality (the number of non-repeating elements) is 5.

Now, if you need to count the UV (unique visitors) of a web page, de-duplication is involved, and that is a good scenario for HyperLogLog.

Isn't that just a set? I could use a Set to keep only the non-repeating elements.

Yes, you can, but when the amount of data is very large, won't your set take up too much memory? This is where HyperLogLog shines: it computes the cardinality in a fixed amount of space. With only 12 KB it can estimate the cardinality of nearly 2^64 different elements.

Note, however, that at this order of magnitude there is a standard error of about 0.81%, so it depends on whether the business can accept such an error. For a UV scenario like the one above, this error is negligible.
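To see the difference in memory for yourself, here is a minimal sketch using the redis-py client against a local Redis instance (the key names and the element count are made up for illustration). It feeds the same members into an ordinary Set and into a HyperLogLog, then compares memory usage and counts:

import redis

r = redis.Redis()  # assumes a local Redis server on the default port
r.delete("uv:set", "uv:hll")

# Feed the same 100,000 distinct "visitor ids" into both structures.
# (In practice you would batch these with a pipeline; a plain loop keeps the sketch short.)
for i in range(100_000):
    member = f"user:{i}"
    r.sadd("uv:set", member)   # exact, but memory grows with every new member
    r.pfadd("uv:hll", member)  # approximate, memory stays bounded

print("set bytes:", r.memory_usage("uv:set"))  # grows with the number of members
print("hll bytes:", r.memory_usage("uv:hll"))  # stays around 12 KB
print("set count:", r.scard("uv:set"))         # exactly 100000
print("hll count:", r.pfcount("uv:hll"))       # close to 100000, within ~0.81%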

First, pfadd

Adds all of the element arguments to the specified HyperLogLog data structure.

pfadd mypf 1 2 3 a b c 3 4 5 c d a

Second, pfcount

Returns the cardinality estimate for the given HyperLogLog.

pfcount mypf

As you can see, 9 is returned: the command above added 12 arguments, but only 9 of them are distinct (1, 2, 3, 4, 5, a, b, c, d).
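Putting pfadd and pfcount together, the UV scenario mentioned earlier could be wired up like the following sketch (my own example; the uv:{page}:{day} key pattern and the helper names are hypothetical, not part of the commands above):

import redis

r = redis.Redis()

def record_visit(page: str, day: str, user_id: str) -> None:
    # PFADD: register one visitor; repeat visits by the same user do not change the estimate
    r.pfadd(f"uv:{page}:{day}", user_id)

def page_uv(page: str, day: str) -> int:
    # PFCOUNT: read the approximate number of distinct visitors
    return r.pfcount(f"uv:{page}:{day}")

record_visit("home", "2024-01-01", "user:1")
record_visit("home", "2024-01-01", "user:2")
record_visit("home", "2024-01-01", "user:1")  # duplicate, does not inflate the count
print(page_uv("home", "2024-01-01"))          # -> 2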

Third, pfmerge

Merges multiple HyperLogLogs into one: the cardinality estimate of the destination HyperLogLog is computed over the union of all the given HyperLogLogs.

pfmerge mypftotal mypf3 mypf4

This merges mypf3 and mypf4 into mypftotal; pfcount mypftotal then gives the estimated size of their union.
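For example (an illustrative sketch of my own, with made-up key names): if each day's UV lives in its own HyperLogLog, pfmerge gives you the UV across several days without double-counting users who visited on more than one day:

import redis

r = redis.Redis()

# One HyperLogLog per day (key names are hypothetical)
days = ["uv:home:2024-01-01", "uv:home:2024-01-02", "uv:home:2024-01-03"]
r.pfadd(days[0], "user:1", "user:2")
r.pfadd(days[1], "user:2", "user:3")
r.pfadd(days[2], "user:1", "user:4")

# PFMERGE: union the daily structures into one combined structure
r.pfmerge("uv:home:week1", *days)

print(r.pfcount("uv:home:week1"))  # -> 4 (user:1..user:4, duplicates collapsed)

Because the merge takes the register-wise maximum, the result is the same estimate you would get from a single HyperLogLog that had observed all the visitors directly.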