It’s Friday, and you’re happily fishing when the product manager e-mails you with a requirements document. The requirements are: the company needs to count the daily IP addresses of visitors to the website, and this count is a long-term behavior, short months, long years.

This is easy to do with Redis’s collection type: generate a collection type key per day, use SADD to store daily visitor IP addresses, and use SCARD to easily get the number of daily visitor IP addresses.

Soon you’ve typed the code and passed the test, and the feature is up and running. After going online and running for a period of time, it was found that the server where Redis is located started to alarm. The reason was that some keys occupied too much memory. You found that these keys were set keys for storing visitor IP addresses. You just pat your head and know you’ve dug yourself a big hole.

Suppose it takes up to 15 bytes to store an IPv4 ADDRESS, and the site has up to 1 million visitors per day. These collection keys will use 0.45GB of memory a month and 5.4GB of memory a year, and that’s just an estimate for IPv4 and even more for IPv6. SADD and SCARD both have O(1) time complexity, but their memory consumption is unacceptable.

Check out the Redis website and discover that Redis also provides a data type, HyperLogLog, that meets the requirements of the product while taking up less memory.

HyperLogLog algorithm

HyperLogLog is a probabilistic algorithm created specifically to calculate the cardinality of a set, which can calculate the approximate cardinality of a given set.

The approximate cardinality is not the actual cardinality of the set. It may be smaller or larger than the actual cardinality, but the error between the estimated cardinality and the actual cardinality will be within a reasonable range. For those statistics that do not require very precise statistics, HyperLogLog algorithm can be used.

The advantage of HyperLogLog is that the memory required to calculate approximate cardinality does not vary with the size of the set. No matter how many elements the set contains, the memory required for HyperLogLog’s computation is always fixed and very small.

Each HyperLogLog type of Redis uses only 12KB of memory space to count close to: 264 elements with a standard error of 0.81%.

If the HyperLogLog type is used to implement the above functions, only 360KB of memory is used in a month with 1 million visitors per day.

PFADD

The PFADD command allows you to count one or more collection elements given.

PFADD key element [element...]

Depending on whether a given element has already been counted, the PFADD command may return 0 or 1:

  • If all of the given elements have been counted, the PFADD command returns 0, indicating that the approximate cardinality calculated by HyperLogLog has not changed.
  • The PFADD command returns 1 if there is at least one element in a given element that has not been counted before, causing the approximate cardinality calculated by HyperLogLog to change.

Such as:

Redis > PFADD letters A b cinteger1 redis> PFADD letters Ainteger) 0
Copy the code

It is also ok to specify the key without specifying the element when the command is called; if the key exists, no action is done, and if not, a data structure is created (return 1).

PFCOUNT

Using the PFCOUNT command, you can obtain the approximate cardinality calculated by HyperLogLog for the collection. Returns 0 if the given key does not exist.

PFCOUNT key [key...]

Such as:

redis> PFCOUNT letters
(integer) 3
Copy the code

When multiple Hyperloglogs are passed to PFCOUNT, the PFCOUNT command will union all hyperloglogs and then return the approximate cardinality.

redis> PFADD letters1 a b c
(integer) 1
redis> PFADD letters2 c d e
(integer) 1
redis> PFCOUNT letters1 letters2
(integer5)Copy the code

PFMERGE

Using the PFMERGE command, you can perform union calculation for multiple Hyperloglogs and save the calculated HyperLogLog to a specified key.

PFMERGE destKey sourceKey [sourceKey...]

If the specified key already exists, the PFMERGE command overwrites the existing key.

redis> PFADD letters1 a b c
(integer) 1
redis> PFADD letters2 c d e
(integer) 1
redis> PFMERGE res letters1 letters2
OK
redis> PFCOUNT res
(integer5)Copy the code

The PFMERGE command is similar to the PFCOUNT command. In fact, the PFCOUNT command performs the following operations when calculating the approximate cardinality of multiple Hyperloglogs:

  1. Internally, the PFMERGE command computes the union of all the given HyperLogLog and stores the union into a temporary HyperLogLog.
  2. Execute PFCOUNT on the temporary HyperLogLog to get its approximate cardinality.
  3. Delete the temporary HyperLogLog.
  4. Returns the resulting approximate cardinality.

When a program needs to call the PFCOUNT command on multiple Hyperloglogs, and the call may be repeated several times, consider replacing this call with the corresponding PFMERGE command call: The program minimizes unnecessary union calculations by storing the results of union calculations in a specified HyperLogLog instead of recalculating the union each time.

The business scenario

HyperLogLog is suitable for the following scenarios: counting (monthly and annual statistics) and deduplication (spam detection).