Redis third shot, give you 100 million keys, how efficient statistics?

preface

Have you ever used Redis on a large scale? Or is it just a tool for caching? One of the most commonly used scenarios in Redis is the collection. For example, in the following scenario:

In the check-in system, a day corresponds to a series of user check-in records.
In the e-commerce system, a product corresponds to a series of reviews.
In a dating system, a user’s list of friends.

The characteristic of collection in Redis is nothing more than a Key corresponding to a series of data, but the function of data is usually for statistics, such as:

In the friend-making system, the daily new friends and mutual friends of both parties need to be counted.
In the e-commerce system, the latest comments in the review list need to be counted.
In the check-in system, the number of users who have checked in for a consecutive month needs to be counted.

In large Internet applications, the amount of data is huge, not to mention millions, tens of millions, or even 100 million, such as e-commerce giant Taobao, dating giants wechat, Weibo; Office giant nail nail, which one of the users is not hundreds of millions?

Statistics can be more convenient only when appropriate sets are selected for different scenarios.

Aggregate statistics

Aggregation statistics refer to the result of the aggregation of multiple elements, such as counting the intersection, union, and difference sets of multiple sets

If you need to aggregate statistics on multiple sets, the Set is a good choice. In addition to the fact that there is no duplicate data, Redis also provides an API for this

intersection

In the above example, the mutual friends of the two parties in the dating system are the intersection of the aggregation statistics.

In Redis, userID is used as key and userID is used as value.

Counting mutual friends of two users requires only the intersection of two sets.

SINTERSTORE userid:new userid:20002 userid:20003
Copy the code

After the preceding commands are executed, the intersection of userID :20002 and userID :20003 will be stored in the key userID :new.

Difference set

For example, suppose that the friend system needs to count the new friends added every day, then it needs to take the difference set of the friends set in the next two days. For example, the friend set on November 2020/11/1 is set1, and the friend set on November 2020/11/2 is set2. At this time, it only needs to make the difference set of Set1 and Set2.

How should the structure be designed? The diagram below:

Userid :20201101 This key records the friends set of userid on Date 2020/11/1.

The difference set is as simple as executing the SDIFFSTORE command as follows:

SDIFFSTORE  user:new  userid:20201102 userid:20201101  
Copy the code

At this point, user:new will be the set of friends added on November 2, 2020/11/2.

A better example of this is the people you might know feature on Twitter, which uses difference set, which is the friends of your friends minus your mutual friends to be the people you might know.

And set

In the same example of difference set, suppose you want to count the total number of new friends added in 2020/11/01 and 2020/11/2, you just need to make a union of the new friends added in these two days. The command is as follows:

SUNIONSTORE  userid:new userid:20201102 userid:20201101
Copy the code

In this case, the new set userid:new is the new friends added in two days.

conclusion

The computation complexity of the cross union of sets is very high. If the data amount is large, it may cause the blocking of Redis.

So how do you avoid blocking? The suggestions are as follows:

inRedisSelect a slave from the cluster that is responsible for aggregation statistics so that the primary and other slaves are not blocked
The data is handed over to the client for aggregation statistics.

Order statistics

In some e-commerce websites, you can see the latest commodity comments are always on them. How does this work?

The latest comment list contains all the comments, which requires the collection to store the elements in order. In other words, the elements of a set must be stored in order. This is called an ordered set.

Among the four types of collections in Redis, the List and Sorted sets are ordered.

But what’s the difference between a List and a Sorted Set? Which one to use?

A List is Sorted by the order in which the elements are Sorted, while a Sorted Set can be Sorted by element weight. For example, the weight can be determined according to the time when the elements are inserted into the set. The weight of the elements inserted first is small, and the weight of the elements inserted later is significant.

For this example, it’s clear that both will work, the List paging query command LRANGE and the Sorted Set paging query command ZRANGEBYSCORE.

In most cases, though, it’s not just Sorted by chronological order, but by some specific criteria. In this case, just Sorted by a unique algorithm that generates the corresponding weights.

Binary state statistics

Binary states are either zero or one; In a sign-in (1) and non-sign-in (0) scenario, only two states need to be recorded, which is a typical binary state statistic.

Binary state statistics can be obtained using Redis’ extended data type Bitmap. The underlying implementation is a String type, which can be viewed as an array of bits. For details, see………

In check-in statistics, 0 and 1 only account for one bit, even though there are only 365 bits of check-in data per year. Greatly reduced storage space.

A Bitmap provides the GETBIT/SETBIT operation, which uses an offset value to read or write a bit in a bit array. Note, however, that the Bitmap offset starts at 0, which means that the minimum offset is 0. When using SETBIT to write to a bit, the bit is set to 1. Bitmap also provides a BITCOUNT operation to count the number of 1s in the bit array.

How are the keys designed? The key can be userID :yyyyMM, that is, the unique ID plus the month. Assume that the employee ID is 10001, and you need to count the check-in and punching records of 2020/November.

First, run the following command to set the value, assuming that November 2 punched in:

SETBIT userid:10001:202011 1 1 
Copy the code

The BitMap starts with index 0, so number 2 is index 1, and a value of 1 indicates successful punching.

Step 2: Check whether the user punched in on November 2.

GETBIT userid:10001:202011 1 
Copy the code

Step 3: Count the clocking times in November. The command is as follows:

BITCOUNT userid:10001:202011
Copy the code

So the problem comes, need to count the total number of users in your check-in system for 20 consecutive days, how to deal with? Let’s say there are 100 million users.

For example, how to count the number of people who punch in consecutive days from 2020/11/01 to 2020/11/20?

The Bitmap also supports bitwise and, or, xOR operations on multiple bitmaps at the same time.

The idea is that we can use the date of each day as a key, and the corresponding BitMap can store 100 million users that day. The diagram below:

At this time, we only need to perform bitwise and operation on the Bitmap between 2020/11/1 and 2020/11/20, and the value corresponding to each bit position in a Bitmap obtained finally represents the case of clocking in for 20 consecutive days. Only when clocking in for all consecutive 20 days, the value of the bit is 1. The diagram below:

Finally, you can use the BITCOUNT command to do the statistics.

You can try to calculate the memory overhead by using one 100 million bit Bitmap every day, which accounts for about 12MB of memory (10^8/8/1024/1024). The memory overhead of 20 days of Bitmap is about 240MB, which is not too much memory pressure. In practice, however, it’s a good idea to set an expiration time for Bitmap and have Redis automatically delete unnecessary checkins to save memory.

If the binary state is involved, such as whether the user exists, whether the check-in card exists, whether the goods exist, etc., Bitmap can be used to effectively save memory space.

Base statistics

Cardinality statistics count the number of non-repeating elements in a set.

For example: e-commerce sites usually need to statistics of each page UV to determine the weight, UV page must be heavy, in Redis type Set support to heavy, the first time certainly think of is Set.

However, there is a problem here, the underlying use of Set hash table and integer array, if the UV of a web page reaches tens of millions (an e-commerce site more than one page), then for the memory consumption is very large.

Redis provides an extended type, HyperLogLog, for cardinality statistics. It takes about 12KB to compute 2^64 elements

Is it exciting? However, there is an error in HyperLogLog, about 0.81%. If you need accurate statistics, you still need to use Set. For this type of web page UV, enough.

When counting web page UV, only the unique ID of the user needs to be stored in HyperLogLog, as follows:

PFADD p1:uv 10001 10002 10003 10004
Copy the code

If duplicate elements exist, they are automatically de-duplicated.

Statistics are also simple, using the PFCOUNT command, as follows:

PFCOUNT p1:uv
Copy the code

conclusion

This article introduces several types of statistics and which collections should be used to store them. To facilitate understanding, the author has summarized the support and advantages and disadvantages in a table, as shown in the following figure:

Set and Sorted sets support aggregation of intersection and union operations, but Sorted sets do not have differential operations.

A Bitmap can also perform and, XOR, or aggregation operations on multiple bitmaps.

Both List and SortedSet support sort statistics, but the List is Sorted by insertion order, and the SortedSet supports weight, which is more flexible than List sorting.

For binary status statistics and to determine whether an element exists, Bitmap is recommended to save memory space.

For cardinality statistics, HyperLogLog is recommended to save memory space when there is a large amount of data and precision is not required. For accurate cardinality statistics, it is best to use the Set.

In addition, the author has completed two columns of the article Mybatis advanced, Spring Boot advanced, the column has been organized into a book, there is a need for the public number back code monkey technology column complex keywords Mybatis advanced, Spring Boot advanced free access.

Redis third shot, give you 100 million keys, how efficient statistics?

preface

Aggregate statistics

intersection

Difference set

And set

conclusion

Order statistics

Binary state statistics

Base statistics

conclusion

Related Posts

Spring Boot 2.4 New Features reduces memory usage by 95%

Using SAP ABAP transaction code ST12 to study the performance of SAP CRM product search – product search performance

“Spring Boot 2.4 New Features” one-click build Docker image