In the business scenario of mobile applications, we need to keep information that a key is associated with a data set and that the data in the set is statistically sorted.

Common scenarios are as follows:

  • Given a userId, determine the user login status;
  • Check in of 200 million users in the last 7 days: count the total number of users who check in continuously in the last 7 days;
  • Count the number of new users each day and the number of retained users the next day.
  • Count the number of Unique visitors (UV) to the site
  • List of latest comments
  • Music charts by number of streams

Usually, we are dealing with huge numbers of users and visits, such as millions, tens of millions of users, or tens of millions, or even hundreds of millions of access information.

Therefore, we have to choose collection types that are very efficient at counting large amounts of data (e.g., billions).

How to choose the appropriate data set, we first need to understand the commonly used statistical model, and use reasonable data to solve practical problems.

Four types of statistics:

  1. Binary state statistics;
  2. Aggregation statistics;
  3. Sort statistics;
  4. Cardinality statistics.

In this article, we will use the extended data types Bitmap and HyperLogLog in addition to String, Set, Zset, List, and hash.

The instructions in this article can be run through the online Redis client at try.redis. IO /.

remarks

The more you share and the more you give, the more you create value upfront and the more you don’t get back, the more you pay back in the long run.

Especially at the beginning of the cooperation with others, do not care about the short-term return, there is no great significance, more to exercise their vision, perspective and problem-solving ability.

Binary state statistics

What is binary state statistics?

That is, the values of the elements in the set are only 0 and 1. In the case of the check-in card and whether the user is logged in, only the check-in (1) or not (0), logged in (1) or not logged in (0) should be recorded.

If we use Redis String type implementation (key -> userId, value -> 0 indicates offline, 1 – login) in the scenario of determining whether a user is logged in, if we store the login status of 1 million users, if we store it in the form of String, I need to store a million strings, which is a lot of memory.

For binary state scenarios, we can use Bitmap to achieve this. For example, we use a bit to indicate the login status, 100 million users only occupy 100 million bit memory ≈ (100000000/8/1024/1024) 12 MB.

The approximate space usage is :($offset/8/1024/1024) MBCopy the code

What is a Bitmap?

The underlying data structure of a Bitmap is a String SDS data structure to hold an array of bits. Redis uses the eight bits of each byte array. Each bit represents the binary state of an element (either 0 or 1).

You can think of a Bitmap as an array of bits, each of which can only store zeros or ones. The subscript of an array is called an offset in a Bitmap.

For intuitive display, we can understand that each byte in the BUF array is represented by a row, each row has 8 bits, and the 8 cells respectively represent the 8 bits in this byte, as shown in the figure below:

A Byte consists of eight bits. Therefore, a Bitmap greatly saves storage space. That’s the advantage of Bitmap.

Determine the user login status

How do you use a Bitmap to tell if a user is on the line?

The Bitmap provides GETBIT and SETBIT operations. The Bitmap reads and writes bits at the offset position of the bit array through an offset value. Note that the offset value starts from 0.

Only a key = login_status is required to store the login status collection data of the user. The user ID is set as offset. If the user is online, it is set to 1, and if the user is offline, it is set to 0. Use GETBIT to check whether the user is online. Half a million users only need 6 MEgabytes of space.

SETBIT command

SETBIT <key> <offset> <value>
Copy the code

Set or clear the bit value (0 or 1) of the key offset.

GETBIT command

GETBIT <key> <offset>
Copy the code

Get the bit value of the key’s value at offset. If the key does not exist, return 0.

If we want to determine the login status of the user whose ID is 10086:

Step 1: Execute the following command to indicate that the user is logged in.

SETBIT login_status 10086 1
Copy the code

The second step is to check whether the user is logged in. The return value of 1 indicates that the user is logged in.

GETBIT login_status 10086
Copy the code

Step three, log out and set the value corresponding to the offset to 0.

SETBIT login_status 10086 0
Copy the code

Monthly check-ins of users

In the check-in statistics, each user’s daily check-in is represented by 1 bit bit, and only 365 bit bits are needed for a yearly check-in. A maximum of 31 days a month requires only 31 bits.

For example, how to calculate the clocking situation of the user whose number is 89757 in May 2021?

The key can be designed as uid:sign:{userId}:{yyyyMM}, and the value -1 for each day of the month can be used as offset (because offset starts at 0, so offset = date -1).

In the first step, execute the following command to record the user punching in on May 16, 2021.

SETBIT uid:sign:89757:202105 15 1
Copy the code

Step 2: Determine whether user 89757 punched in on May 16, 2021.

GETBIT uid:sign:89757:202105 15
Copy the code

The third step is to count the number of times the user punched in May, using the BITCOUNT command. This instruction counts the number of bits with a value of 1 in a given bit array.

BITCOUNT uid:sign:89757:202105
Copy the code

So that we can achieve the user every month of the clock situation, isn’t it great.

How do you count the first time you punch in this month?

Redis provides the BITPOS key bitValue [start] [end] instruction, which returns data representing the offset position in the Bitmap where the first value is bitValue.

By default, the command checks the entire bitmap, and the user can specify the range to check with the optional start and end arguments.

So we can obtain the first clocking date of userID = 89757 in May 2021 by executing the following command:

BITPOS uid:sign:89757:202105 1
Copy the code

Note that we need to return value + 1 to represent the day of the first punch, since offset starts at 0.

Total number of consecutive checked-in users

How to calculate the total number of users who clocked in for 7 consecutive days after recording the data of 100 million users?

We use the date of each day as the key of the Bitmap and the userId as the offset. If we punch a card, we set the bit of offset to 1.

Each bit of data in the set corresponding to the key is the record of a user’s punching in that date.

There are seven of these bitmaps, and if we could do and for each of these seven bitmaps.

The same UserID offset is the same. If the bit of a UserID offset in the seven bitmaps is 1, the user has punched in the card for seven consecutive days.

The result is saved to a new Bitmap, and we get the total number of users who clocked in for 7 consecutive days by counting the number of bits = 1 by BITCOUNT.

Redis BITOP operation destKey [key…] This instruction is used to perform bit operations on one or more key = key bitmaps.

Opration can be “and”, “OR” NOT “OR” XOR “. When BITOP processes strings of different lengths, the missing part of the shorter string is treated as a 0.

An empty key is also seen as a sequence of strings containing zeros.

Easy to understand, as shown in the figure below:

Three bitmaps. Perform the and operation on the corresponding Bitmap and save the result to the new Bitmap.

The operation instruction performs AND operations on the three bitmaps AND saves the results to the destmap. Next, perform BITCOUNT statistics on the destMap.

Bitmap :01 bitmap:02 Bitmap :03 // Collect statistics on the number of bits = 1. BITCOUNT DestmapCopy the code

The memory overhead of the next 100 million Bitmap is about 12 MB (10^8/8/1024/1024). The memory overhead of the 7-day Bitmap is about 84 MB. At the same time, we’d better set the expiration time for Bitmap and let Redis delete expired card data to save memory.

summary

The idea is the most important, when we encounter statistical scenarios that only need the binary status of the statistical data, such as whether the user exists, whether the IP is blacklist, and check-in statistics can be considered to use Bitmap.

Only one bit is needed to represent zeros and ones, greatly reducing memory usage when collecting large amounts of data.

Base statistics

Cardinality statistics: Counting the number of non-repeating elements in a set, often seen in counting independent users (UV).

The most direct way to realize cardinality statistics is to adopt the data structure of Set. When an element never appears, an element is added to the Set. If it does, then the set remains the same.

When you have a lot of page views, you need a huge Set to count them, which will waste a lot of space.

Besides, the data don’t need to be accurate. Is there a better way?

That’s a good question. Redis provides HyperLogLog data structures that are used to solve statistical problems in various scenarios.

HyperLogLog is an imprecise deweighting cardinality scheme, its statistical rules are based on probability, the standard error is 0.81%, such accuracy is enough to meet the needs of UV statistics.

The principle of HyperLogLog is too complex, if you want to understand the step:

  • www.zhihu.com/question/53…
  • En.wikipedia.org/wiki/HyperL…

Website of UV

Implemented by Set

A user who visits a site multiple times a day only counts once, so it’s easy to think of Redis’ Set.

When user number 89757 accesses “Why is Redis so fast”, we put this information into the Set.

SADD Redis Why so fast: UV 89757Copy the code

When user 89757 visits the “Why Redis So fast” page multiple times, the Set deduplication feature ensures that the same user ID will not be recorded repeatedly.

Through SCARD command, statistic “Why is Redis so fast” page UV. The directive returns the number of elements in a collection (that is, the user ID).

SCARD Redis Why so fast: UVCopy the code

Implemented by hashing

The user ID is used as the key of the Hash set. When accessing the page, the HSET command is executed to set the value to 1.

The value of this userId will only be set to “1” even if the user repeats the access and repeats the command.

Finally, using the HLEN command to count the number of elements in the Hash set is UV.

As follows:

HSET Redis cluster: UV userId:89757 1 // Statistics UV HLEN Redis clusterCopy the code

HyperLogLog king scheme

Code old wet, Set is good, if the article is very popular to reach ten million level, a Set saved tens of thousands of user ID, page more memory consumption is too large. The same is true for Hash data types. Do how?

Take advantage of the HyperLogLog advanced data structures provided by Redis (beyond the five basic data types of Redis). This is a type of data set used for cardinality statistics, where the amount of space required to calculate cardinality is fixed even if the amount of data is large.

Each HyperLogLog takes up to 12KB of memory to calculate the cardinality of 2 to the power of 64 elements.

Redis optimizes the storage of HyperLogLog. When the count is relatively small, the storage space adopts coefficient matrix and occupies very little space.

Only when the count is large and the space occupied by the sparse matrix exceeds the threshold will it be transformed into a dense matrix, occupying 12KB space.

PFADD

Add each user ID that accesses the page to HyperLogLog.

PFADD Redis Principle of primary/secondary synchronization: UV userID1 userID 2 useID3Copy the code

PFCOUNT

Use PFCOUNT to obtain the UV value of the Redis master/slave synchronization principle page.

PFCOUNT Redis master/slave synchronization principle: UVCopy the code

PFMERGE usage scenarios

In addition to the PFADD and PFCOIUNT above, HyperLogLog also provides PFMERGE, which combines multiple HyperLogLog together to form a new HyperLogLog value.

grammar

PFMERGE destkey sourcekey [sourcekey ...]
Copy the code

Usage scenarios

For example, in the website, we have two pages with similar content, and the operation says that the data of the two pages needs to be merged.

This is where the PFMERGE comes in, where the same user visits both pages only once.

As shown below: Redis and MySQL Bitmap sets respectively save the user access data of two pages.

PFADD Redis data user1 user2 user3 PFADD MySQL data user1 user2 user4 PFMERGE database Redis data MySQL data PFCOUNT database // Return value = 4Copy the code

Multiple HyperLogLog were merged into one HyperLogLog, and the cardinality of the merged HyperLogLog was close to the union of all input HyperLogLog visible sets.

User1 and user2 both accessed Redis and MySQL, only once.

Order statistics

Among the four Redis collection types (List, Set, Hash, and Sorted Set), List and Sorted Set are Sorted.

  • List: Sorted by the order in which the elements were inserted into the List. Usage scenarios are usually message queues, latest lists, and leaderboards.
  • Sorted Set: Sorted according to the score weight of each element. We can decide what the weight of each element is. Usage scenarios (leaderboards by streams, likes).

List of latest comments

Code old wet, I can use the List insert order sort to achieve a List of comments

For example, wechat public number of the background reply List (do not bar, for example), each public number corresponds to a List, this List saves the public number of all user comments.

Each time a user reviews, LPUSH key value [value… Insert into the List header.

LPUSH code elder byte 1 2 3 4 5 6Copy the code

Then we use the LRANGE key star stop to get the elements in the specified range of the list.

Brother > LRANGE code byte 0 1) 4 "6" 2) "5" 3) "3" "4" 4) 5) "2"Copy the code

Note that not all up-to-date lists can be implemented with a List, because for lists that are updated frequently, List type paging can cause List elements to duplicate or miss.

For example, the List of current comments = {A, B, C, D}, left is the latest comment, D is the earliest comment.

LPUSH code byte D C B ACopy the code

Show the latest 2 comments on the first page, and get A and B:

0 1 1 1) "A" 2) "B"Copy the code

According to the logic we want, the second page can get C, D by LRANGE code byte 2, 3.

If A new comment E is generated before the second page is displayed, the comment E is inserted into the List header by LPUSH code byte E, List = {E, A, B, C, D}.

Now execute LRANGE code brother byte 2 3 to get the second page comment found that B appeared again.

1) "B" 2) "C"Copy the code

The reason for this is that lists are sorted by the position of the elements, so once A new element is inserted, List = {E, A, B, C, D}.

The original data is moved back one bit in the List, causing all the old elements to be read.

summary

Only lists that do not require paging (such as fetching only the first five elements of the List each time) or that are updated infrequently (such as statistically once a day in the early hours of the morning) are suitable for implementation with the List type.

For lists that require paging and are frequently updated, use the Sorted Set type.

In addition, if you need to search for the latest List by time range, the List type cannot be implemented. Instead, you need to implement the Sorted Set type. For example, you need to query the order List by time range.

list

In the case of the latest List, both the List and the Sorted Set can be implemented. Instead of just using the Sorted Set, it also has the ability to Set the score weight to make sorting more flexible.

The reason is that the Sorted Set type takes up a number of times more memory than the List type. In the case of a small number of lists, you can use the Sorted Set type.

For example, if we want a music chart for the week, we need to update the number of streams in real time, and we need to display it in pages.

In addition, sorting is based on the number of plays, and the List is not sufficient.

We can save the music ID to the Sorted Set, Set score to the number of streams per song, and Set Score = Score +1 for each play of the music.

ZADD

For example, we added blue and White Porcelain and The Wrong Field to musicTop:

ZADD musicTop 100000000 Blue and White Porcelain 8999999 Flower field wrongCopy the code

ZINCRBY

Blue and White Porcelain score + 1 with the ZINCRBY command every time it plays.

> < p style = "max-width: 100%Copy the code

ZRANGEBYSCORE

Finally, we need to obtain musicTop ten music charts, the current maximum musicTop is N, which can be obtained by following the command:

ZRANGEBYSCORE musicTop N-9 N WITHSCORES
Copy the code

Brother 65: But how do we get this N?

ZREVRANGE

You can use the ZREVRANGE key start stop [WITHSCORES] command.

Elements are sorted in descending order of score value (from largest to smallest).

Members with the same score are arranged in a reverse lexicographical order.

> ZREVRANGE musicTop 00 WITHSCORES 1) "0" 2) 10000000Copy the code

summary

The Sorted Set gets exactly Sorted data using the ZRANGEBYSCORE command, even though the elements in the collection are updated frequently.

In scenarios where the latest list or ranking list needs to be displayed, if data is frequently updated or Sorted, you are advised to use Sorted Set.

Aggregate statistics

This refers to counting the aggregated results of multiple set elements, such as:

  • Count the shared data of multiple elements (intersection);
  • Counting the unique elements of one of two sets (difference set statistics);
  • Counts all elements of multiple sets (union statistics).

Code old wet, what kind of scene will use intersection, difference set, union?

The Set type of Redis supports adding, deleting, modifying and querying in the Set. The underlying Hash data structure is used. Both add and remove are O(1) time complexity.

And support multiple sets of intersection, union, difference set operations, using these set operations, to solve the above mentioned statistical problems.

Intersection – Mutual friends

For example, mutual friends in QQ are the intersection of aggregation statistics. We use the account as the Key and the friends of that account as the value of the Set.

Simulate the friends set of two users:

SADD user: The SADD user: the SADD user: the SADD user: the SADD user: the SADD user: the SADD userCopy the code

Counting mutual friends of two users requires only the intersection of the two sets, as shown in the following command:

SINTERSTORE user: common friends User: code and byte user: big guyCopy the code

After the command is executed, the intersection data of “user: codebrother byte” and “user: big guy” sets is stored in the set of “user: mutual friends”.

Difference set – Number of new friends per day

For example, to calculate the daily number of new registered users of an App, you only need to take the difference set from the total number of registered users in the last two days.

For example, the total registered users of 2021-06-01 are stored in the set key = user:20210601, and the total users of 2021-06-02 are stored in the set key = user:20210602.

Perform the difference set calculation and store the result in the user:new set.

SDIFFSTORE  user:new  user:20210602 user:20210601
Copy the code

At this point, the user: New collection will be the number of new users added on 2021/06/02.

In addition, there is a function on QQ that may know people, which can also be realized by using difference set, that is, subtract your mutual friends from the friends set of your friends, that is, the people you may know.

Union – Total new friends

In the example of difference set, the total number of new users added in the two days 2021/06/01 and 2021/06/02 is counted, only the two sets need to perform the union.

SUNIONSTORE  userid:new user:20210602 user:20210601
Copy the code

In this case, the new set userid:new is the new friends added in two days.

summary

The calculation complexity of the difference Set, union Set and intersection of Set is high. In the case of large data volume, if these calculations are directly performed, the Redis instance will be blocked.

Therefore, you can deploy a cluster for statistics and have it do the aggregation calculations, or you can read the data to the client and do the aggregation statistics on the client side, thus avoiding the blocking that prevents other services from responding.

Phase to recommend

Redis core: The only secret that can not be broken

Redis Log: A killer for fast recovery in case of downtime

Redis High Availability: Do you call this master-slave data synchronization?

Redis high Availability: You call this the Sentinel cluster principle

How much data can a Cluster support?