HyperLogLog

HLL is often used to de-count (with some error), such as the number of page views per day (only one visit per user per day is counted as UV). Note that this is UV.

If it is PV, it is easier to configure an independent counter for each webpage. Add the date of the day to the key suffix of this counter, so that each request is executed once by incrby instruction, and the increment can be directly. Finally, all PV can be counted.

But the UV is different, need to go to the heavy, generally to heavy may consider using the set set to do, to store all the day to visit the page a user ID, for example, when a request to come over, we will use the sadd user ID in can, through scard can take out the size of the collection, the figure is the UV data on this page, The plan seemed feasible.

But if the page view is very large, a page may have tens of millions of UV, it will need a large set to statistics, very waste of space, if there are many pages, then it will consume more storage space,

General page visited does not need to be very accurate statistics, such as 105 and 1.06 million the difference is not large, HLL is a solution that can be done to heavy count at the same time, also save a lot of space, although not precise, but the difference is not unusual, the standard error is 0.81%, the precision can meet the demand of the UV statistics commonly.

pfadd/pfcount

HLL provides two instructions: pfadd/pfcount, one is to increase the count, and the other is to get the count. Pfadd is the same as sadd in the set. The user ID is added directly to the set. Pfcount is the same as scard.

127.0.0.1:6379> pfadd HLL user1 The following are the same: (integer) 1 127.0.0.1:6379> pfadd HLL user2 (integer) 1 127.0.0.1:6379> pfadd HLL user3 (integer) 1 127.0.0.1:6379> pfadd HLL user4 (integer) 1 127.0.0.1:6379> pfadd HLL user4 Filter (integer) 0 127.0.0.1:6379> pfadd HLL user4 (integer) 0 127.0.0.1:6379> pfcount HLL Note Successfully deleted (integer) 4 127.0.0.1:6379> pfadd HLL user5 user6 user7 # Add three new user ids (integer) 1 127.0.0.1:6379> pfcount HLL # Count success (integer) 7Copy the code

As can be seen from the above results, the effect is good and the result is correct. There is no so-called error rate. This is because our data is still too small for the moment.

It can be seen that we added 10W data at one time, and the total amount obtained was 0.277% different from the expected one, which is the error, not so precise.

Run the same data again, and you’ll find the same results as above, proving that HLL can indeed be de-duplicated,

Code address: github.com/qiaomengnan…

pfmerge

In addition to pfadd/pfcount, HLL also provides an instruction called PFmerge, which adds multiple PF count values together to form a new PF value.

What does it do? For example, there are two similar content pages on the website, if one day the two pages need to merge, UV traffic also need to merge, at this time you can use PFMerge the two pages of the count, put together.

127.0.0.1:6379> pfadd hll1 a b c # hll1 (integer) 1 127.0.0.1:6379> pfadd hll2 e f g # hll2 (integer) 1 127.0.0.1:6379> pfcount Total number of hll1 # hll1 pages (integer) 3 127.0.0.1:6379> pfcount total number of hll2 # hll2 pages (integer) 3 127.0.0.1:6379> pfmerge hll3 hll1 hll2 # Merge the total number of the two pages into hll3 OK 127.0.0.1:6379> pfcount hll3 # Check the hll3 result (integer) 6Copy the code

According to the following demo at the same time, it can be found that when two UVs are merged, they will also be removed. For example, two pages have ABC, and when merged together, they will also be filtered.

127.0.0.1:6379> pfadd hll1 a b c

(integer) 1

127.0.0.1:6379> pfadd hll2 e f g

(integer) 1

127.0.0.1:6379> pfadd hll2 a b c

(integer) 1

127.0.0.1:6379> pfcount hll1

(integer) 3

127.0.0.1:6379> pfcount hll2

(integer) 6

127.0.0.1:6379> pfmerge hll3 hll1 hll2

OK

127.0.0.1:6379> pfcount hll3

(integer) 6

Copy the code

Space occupied

HLL needs to occupy 12KB of storage space, which may not have the advantage of space cost in the case of a very small number of users, but in the case of a large number of users, 12KB saves a lot of storage space. Compared with Set storage scheme, the space used by HLL is only a drop in the bucket.

Redis has optimized the storage of HLL. When the count is relatively small, the storage space is stored by sparse matrix, which occupies a small space. Only when the count gradually increases and the space occupied by sparse matrix gradually exceeds the threshold, it becomes dense matrix at one time, occupying 12KB space.