# Redis combat: skillfully use Bitmap to achieve hundreds of millions of massive data statistics

Posted on Jan. 31, 2023, 1:25 p.m. by 金郁婷
Category: The back-end

In a mobile application business scenario, we need to store information that a key is associated with a data set.

Common scenarios are as follows:

• Give a userId to determine user login status;
• Display the check-in times and first check-in time of the user in a certain month;
• Check in status of 200 million users in the last 7 days, and collect the total number of users who check in continuously in the last 7 days.

Usually, the number of users and visits we are facing is huge, such as millions or tens of millions of users, or tens or even hundreds of millions of people accessing information.

Therefore, we have to choose a collection type that is very efficient at counting large amounts of data, such as billions.

How to choose the right data set, we should first understand the commonly used statistical model, and use reasonable data to solve practical problems.

Four types of statistics:

1. Binary state statistics;
2. Aggregate statistics;
3. Rank statistics;
4. Cardinality statistics.

This article will start with the binary state statistics type as the practical part of the series. This article will use the extended data type Bitmap other than String, Set, Zset, List and Hash to achieve.

The instructions mentioned in this article can be debugged via an online Redis client at try.redis. IO /.

## remarks

Share more, give more, create more value for others upfront without expecting anything in return, and in the long run your efforts will return you many times over.

Especially at the beginning of cooperation with others, don't worry about short-term returns, it doesn't have much significance, more is to exercise their vision, perspective and problem solving ability.

# Binary state statistics

What is binary state statistics?

In other words, there are only 0 and 1 values of elements in the set. In the scenario of check-in and login, only check-in (1) or non-check-in (0), login (1) or non-login (0) are recorded.

If we use the String type of Redis (key - userId, value - 0 indicates offline, 1 - login) in the scene of determining whether the user logs in, if we store the login status of 1 million users, if it is stored in the form of String, I need to store a million strings, which is a lot of memory.

Why is String memory expensive?

In addition to recording the actual data, the String type requires additional memory to record data length, space usage, and so on.

When the saved data contains strings, the String type is stored using a simple dynamic String (SDS) structure, as shown below:

• Len: 4 bytes, indicating the used length of buF.
• Alloc: represents the actual length allocated by buF in 4 bytes, usually len.
• Buf: byte array that holds the actual data. Redis automatically adds a "\0" to the end of the array, incurs an extra byte of overhead.

So len and alloc are additional overhead in SDS in addition to buF holding the actual data.

There is also an overhead of the RedisObject structure, because Redis has many data types, and different data types all have the same metadata to record (such as last accessed time, number of references, etc.).

Therefore, Redis uses a RedisObject structure to uniformly record this metadata while pointing to the actual data.

For binary state scenarios, we can use bitmaps to achieve this. For example, we use a bit to represent the login state, and 100 million users only occupy 100 million bit memory ≈ (100000000/8/1024/1024) 12 MB.

``\$offset/8/1024/1024) MBCopy the code``

What is a Bitmap?

The underlying data structure of Bitmap uses SDS data structure of String type to store the bit array. Redis uses 8 bits of each byte array, and each bit represents the binary state of an element (either 0 or 1).

A Bitmap can be thought of as an array of bits. Each unit of the array can store only 0 or 1. The subscript of the array is called offset in the Bitmap.

For intuitive demonstration, we can understand that each byte of the BUF array is represented by a line, each line has 8 bit bits, and 8 cells respectively represent 8 bit bits in the byte, as shown in the figure below:

Eight bits make up a Byte, so bitmaps save a lot of storage. This is the advantage of Bitmap.

How do you use a Bitmap to determine if a user is online in a mass of users?

Bitmap provides GETBIT and SETBIT operations, which use an offset value to read and write bits in the offset position of the bit array. Note that offset starts from 0.

Only one key = login_status is required to store the data set of user login status. Set the user ID as offset. Set it to 1 if online and 0 if offline. Use GETBIT to check whether the user is online. Half a million users only need 6 MB of space.

SETBIT command

``````SETBIT key offset value
Copy the code``````

Sets or clears the key's value at the offset bit (0 or 1 only).

GETBIT command

``````GETBIT key offset
Copy the code``````

Gets the value of the key's bit at offset, or 0 if the key does not exist.

User ID = 10086;

First, execute the following command to indicate that the user has logged in.

``````SETBIT login_status 10086 1
Copy the code``````

In the second step, check whether the user has logged in. The return value 1 indicates that the user has logged in.

``````GETBIT login_status 10086
Copy the code``````

Step 3: Log out and set the value of offset to 0.

``````SETBIT login_status 10086 0
Copy the code``````

## Monthly check-ins of users

In check-in statistics, each user's daily check-in is represented by 1 bit, and only 365 bits are needed for a year's check-in. There are only 31 days in a month, and you only need 31 bits.

For example, how will the user with statistical number 89757 clock in May 2021?

The key can be designed as uid:sign:{userId}:{yyyyMM}, and the value -1 for each day of the month can be used as offset (offset = date -1 because offset starts from 0).

In the first step, execute the following command to record the user's clocking on May 16, 2021.

``````SETBIT uid:sign:89757:202105 15 1
Copy the code``````

The second step is to determine whether user number 89757 clocked on May 16, 2021.

``````GETBIT uid:sign:89757:202105 15
Copy the code``````

The third step is to count the number of times the user clocked in May, using the BITCOUNT command. This instruction counts the number of bits in a given array of bits whose value is 1.

``````BITCOUNT uid:sign:89757:202105
Copy the code``````

In this way, we can realize the user's monthly clock situation, isn't it great?

How to count the first time you clocked in this month?

Redis provides the BITPOS key bitValue [start] [end] directive, which returns data representing the position of offset in the Bitmap where the first value is bitValue.

By default, the command detects the entire bitmap, and the user can specify the range to detect with the optional start and end arguments.

So we can get the date when userID = 89757 was first clocked in May 2021 by executing the following command:

``````BITPOS uid:sign:89757:202105 1
Copy the code``````

Note that we need to return value + 1 because offset starts at 0.

## Total number of consecutive check-in users

In the record of 100 million users for 7 consecutive days of clocking data, how to figure out the total number of users for 7 consecutive days?

We take the date of each day as the key of the Bitmap, the userId as the offset, and set the bit of the offset position to 1 if the card is punched.

The data of each bit of the set corresponding to key is the record of a user's punching in the date.

There are seven such bitmaps, if we can "and" the corresponding bits of the seven bitmaps.

The same UserID offset is the same. If the bit of a UserID in the corresponding offset position of the seven bitmaps is 1, the user has punched the card continuously for seven days.

The result is saved to a new Bitmap, and we count the number of bits = 1 by BITCOUNT to get the total number of users who punched in for 7 consecutive days.

Redis provides BITOP operation destkey key [key... This directive is used to perform bit operations on one or more bitmaps with key = key.

Opration can be and, OR, NOT, OR XOR. When BITOP processes strings of different lengths, the missing part of the shorter string is treated as a 0. An empty key is also considered a sequence of strings containing zeros.

Easy to understand, as shown below:

Three bitmaps. Perform the "and" operation on the corresponding Bitmap, and save the result to the new Bitmap.

The operation instruction is to perform AND operation on the three bitmaps AND save the result to destMap. BITCOUNT statistics are then performed on destMap.

``BITOP AND destmap Bitmap :01 bitmap:02 bitmap:03 // Count the number of bits = 1 BITCOUNT destmapCopy the code``

Simply calculate the memory overhead for the next 100 million bits of Bitmap, which is about 12 MB of memory (10^8/8/1024/1024), and the memory overhead for a 7-day Bitmap is about 84 MB. At the same time, we'd better set the expiration time for Bitmap and let Redis delete the expired clocking data to save memory.

## summary

Thinking is the most important. When we encounter statistical scenarios that only need the binary state of statistical data, such as whether the user exists, whether the IP is blacklisted, and the statistics of check-in and punch cards, we can consider using Bitmap.

You only need one bit to represent 0 and 1. The memory footprint is greatly reduced when collecting massive amounts of data.

# conclusion

Phase to recommend

Redis core: The only unbreakable secret

Redis: A killer app for fast recovery without downtime

Redis High Availability: You call this master-slave data synchronization principle?

Redis HIGH Availability: You call this the Sentinel cluster principle

How much data can a Cluster support?

Search