As the most popular NoSQL in-memory database, Redis has become the preferred caching tool in most scenarios thanks to its excellent performance and rich set of data structures.

Since Redis is a pure in-memory database, memory usage can be significant when storing large amounts of data. In some scenarios, choosing the right data structure for storage can reduce memory usage dramatically, by anywhere from 80% to 99%.

Use zipList to replace a large number of key-values

Let's take a look at the scenario. In DSP advertising systems and large-scale user systems, we often need to quickly find a user's ID from some unique identifier of that user. For example: query the user ID based on a MAC address, a UUID, or the MD5 of a mobile phone number.

The key is a long string, such as a 32-character MD5 hash or a UUID.

If we store it directly in key-value form without any processing, we can run a simple test: insert 10 million entries into Redis for the numbers 1550000000 to 1559999999, in the form key(MD5(1550000000)) → value(1550000000).
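As a sketch of how that test data is generated (the class and method names here are mine, and the actual SET calls into Redis are omitted), each number is turned into a 32-character hex MD5 key:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Test-data layout: for each number in 1550000000..1559999999 the Redis key
// is the 32-char hex MD5 of the number and the value is the number itself.
public class KeyValueLoader {
    static String md5Hex(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(s.getBytes("UTF-8"));
        // Left-pad to 32 hex chars so every key has the same length.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        long start = 1550000000L;
        for (long i = start; i < start + 3; i++) {  // 3 of the 10 million rows
            // In the real test this line would be a Redis SET command.
            System.out.println(md5Hex(Long.toString(i)) + " -> " + i);
        }
    }
}
```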

Then use the INFO memory command to check the memory footprint.

These 10 million entries occupy a total of 1.17 GB of Redis memory. With 100 million entries, the measured footprint is about 8 GB.

For the same batch of data, let’s store it in a different way. Let’s look at the result first:

After switching to zipList, the memory footprint is 123 MB, a reduction of about 85%. How does this work? Let's take a look at Redis's underlying storage.

1 Redis data structures and encodings

2 How does Redis store strings

String is the most commonly used data structure in Redis. Strings in Redis differ from C strings: Redis uses an abstract type called the simple dynamic string (SDS).

For the underlying storage of strings, Redis uses three encodings: int, embstr, and raw.

For example, SET k1 abc is stored as embstr and SET k2 123 as int. When the value is longer than 44 bytes (39 in versions before 3.2), raw is used.

An int is a fixed-length eight-byte structure (equivalent to a Long in Java) and can only store integers that fit in a 64-bit signed long.

An SDS string grows dynamically: below 1 MB it doubles its capacity on each expansion, and above 1 MB it grows by 1 MB at a time (an embstr value is converted to raw before any modification).

Raw is used to store strings larger than 44 bytes.

Specifically, in our case the key is a 32-character string (embstr) and the value is a long integer (int), so if we can turn the 32-character MD5 into an int, we directly cut the key's memory usage by three quarters.

This is the first optimization point.

3 How does Redis store Hash

As the encoding diagram in section 1 shows, the Hash data structure has two encodings: hashTable and zipList.

A hashTable is an array of buckets with linked lists for collisions. Java's HashMap keeps a load factor of 0.75 to reduce hash collisions, and Redis's hash has a similar expansion load factor. Without going into detail, the takeaway is that hashTable encoding takes at least 25% more space than the data it stores. It looks something like this:

ZipList, a compressed linked list, looks something like this:

As you can see, zipList's defining feature is that it is not a hash structure at all. Instead, it is one long contiguous byte string that stores the key-values sequentially. Looking up a key means scanning through that string.

So zipList obviously takes far less space than hashTable, but queries cost more CPU.

So when is hashTable used, and when zipList? The answer is in the redis.conf file:

ZipList is used when the number of internal field-values of the hash structure does not exceed 512 and the number of bytes of the value does not exceed 64.
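For reference, these are the two redis.conf directives that control the switch (Redis 7.0 renamed them to hash-max-listpack-entries and hash-max-listpack-value):

```
# a hash stays zipList-encoded only while BOTH limits hold;
# crossing either one permanently converts it to hashTable
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
```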

In practice, with 512 values the query performance of zipList is almost identical to hashTable. Below 1024 values the slowdown is only slight and can be ignored in most cases.

The memory footprint of zipList is significantly lower than that of hashTable.

This is the second optimization point.

4 Use zipList instead of key-value

From the above, we can draw two conclusions: using an int as a key saves far more space than a string, and a zipList-encoded hash saves far more space than plain key-values.

So let’s change the original 10 million key-values.

The first step:

We want to spread the 10 million key-value pairs across N buckets, each bucket being a Redis hash, with no more than 512 elements per bucket (the default; 1024 if we change the configuration file), so that the hash never switches encoding from zipList to hashTable.

10 million divided by 512 is 19,531 and change, so at least 19,532 buckets are needed. All keys will be hashed into the buckets as evenly as possible, but since a hash function never distributes perfectly evenly, we leave some headroom: say, 25,000 or 30,000 buckets.
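The bucket-count arithmetic above can be sketched as a small helper (the class and method names are mine; the headroom factor is a tunable assumption):

```java
public class BucketCount {
    // Ceiling-divide the row count by the per-bucket cap, then add headroom
    // so imperfect hash distribution doesn't push any bucket past the cap.
    static long bucketsFor(long rows, int capPerBucket, double headroom) {
        long minBuckets = (rows + capPerBucket - 1) / capPerBucket; // ceiling division
        return (long) Math.ceil(minBuckets * headroom);
    }

    public static void main(String[] args) {
        // 10,000,000 rows / 512 per bucket -> 19,532 buckets minimum;
        // ~28% headroom lands near the 25,000 used in the article.
        System.out.println(bucketsFor(10_000_000L, 512, 1.28));
    }
}
```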

The second step:

A hash algorithm determines which bucket each key goes into. Here we use CRC32, a well-known algorithm that is fast and well distributed, and that turns a string into a long integer. Taking the CRC32 of the MD5 key modulo the bucket count tells us which bucket the key belongs in.

Step 3:

Step two determines the outer key of the Redis hash that will store the entry. For the inner field we choose a different hash algorithm: if we reused the same one, two completely different keys that collide on crc32(key) % COUNT would also collide on the field, and one value would overwrite the other. For the field we use the BKDR hash algorithm (or simply Java's hashCode), which also yields a long integer. The value is stored unchanged.
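A minimal sketch of the BKDR hash used for the field (the seed choice is conventional, not prescribed by the article):

```java
public class FieldHasher {
    // BKDR string hash: a classic polynomial rolling hash.
    static long bkdrHash(String s) {
        long seed = 131; // 31, 131, 1313 ... all work; 131 is the common choice
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * seed + s.charAt(i); // silent overflow wrap is fine for hashing
        }
        return hash & 0x7FFFFFFFFFFFFFFFL;    // clear the sign bit to keep it non-negative
    }

    public static void main(String[] args) {
        System.out.println(bkdrHash("0eac261f1c2d21e0bfdbd567bb270a68"));
    }
}
```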

Step 4:

Load the data. The original data structure was key-value: 0eac261f1c2d21e0bfdbd567bb270a68 → 1550000000.

The new data structure is a hash: the key is 14523, the field is 1927144074, and the value is 1550000000.
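Putting the two hashes together, the end-to-end mapping from one original pair to an HSET triple looks roughly like this (class names are mine; the specific numbers 14523 and 1927144074 come from the author's run, so no particular output is claimed here):

```java
import java.util.zip.CRC32;

// End-to-end sketch: turn one original (md5Key, value) pair into the
// (bucket, field, value) triple that an HSET command would store.
public class HashMapper {
    static final int BUCKETS = 25000;

    static long bucket(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes());
        return crc.getValue() % BUCKETS;           // outer hash key
    }

    static long field(String key) {
        long h = 0;
        for (int i = 0; i < key.length(); i++) h = h * 131 + key.charAt(i); // BKDR
        return h & 0x7FFFFFFFFFFFFFFFL;            // inner field, kept non-negative
    }

    public static void main(String[] args) {
        String md5Key = "0eac261f1c2d21e0bfdbd567bb270a68";
        long value = 1550000000L;
        // Equivalent Redis command: HSET <bucket> <field> <value>
        System.out.println("HSET " + bucket(md5Key) + " " + field(md5Key) + " " + value);
    }
}
```

Lookup reverses the same mapping: recompute bucket and field from the query key, then HGET.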

Based on actual measurement, after the 10 million entries are stored in 25,000 buckets the hash is well balanced, with each bucket holding a few hundred field-value pairs. In theory, as long as the two hash algorithms never produce the same pair for different keys, the original value can be recovered from key-field alone, which can be confirmed by checking the total count. In practice, when the bucket count is large and each bucket holds few values, the probability of a double collision is extremely low: in a test storing 5 billion mobile phone numbers, no collisions were observed.

Test query speed:

After storing the 10 million data, we conducted a query test, using key-value type and hash type to query 1 million data respectively to see the impact on the query speed.

Key-value queries took 10653, 10790, 11318, 9900, 11270, and 11029 milliseconds across runs.

Hash-field queries took 12042, 11349, 11126, 11355, and 11168 milliseconds.

As you can see, hash storage adds less than 500 milliseconds to the 1 million queries overall: a minimal performance impact. But the memory footprint went from 1.17 GB to 123 MB, nearly 90% memory savings.

Conclusion

A large number of top-level key-values means a large number of keys, and to handle hash collisions Redis needs extra space to store each of them.

If the keys vary in length, say some 40 bytes and some 10, alignment issues cause heavy memory fragmentation, which makes things even worse. Keeping key lengths uniform (such as an int with a fixed length of 8 bytes) also helps memory usage.

The MD5 of a string takes 32 bytes; hashing it down to an 8-byte long integer significantly reduces the key's space footprint.

ZipList has a significantly lower memory footprint than hashTable, is very compact and has little impact on query efficiency. Use zipList to avoid storing more than 512 field-value elements in hash structures.

If a value is a string or an object, storing it as a compressed byte[] can greatly reduce memory usage. For example, Google's Snappy compression algorithm can convert strings into byte[] very efficiently with a good compression ratio.
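As a self-contained sketch of that idea (Snappy is an external dependency, so the JDK's built-in Deflater stands in here; swap in Snappy.compress/uncompress in practice, and note Deflater trades speed for ratio compared to Snappy):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Store string values as compressed byte[] and restore them on read.
public class ValueCompressor {
    static byte[] compress(String s) {
        Deflater d = new Deflater();
        d.setInput(s.getBytes());
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static String decompress(byte[] data) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
        inf.end();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) sb.append("some repetitive json payload ");
        String value = sb.toString();
        byte[] packed = compress(value);
        System.out.println(value.length() + " chars -> " + packed.length + " bytes");
    }
}
```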

APPEND, SETRANGE, and similar commands should be avoided, because Redis preallocates and expands strings (doubling each time), which causes memory fragmentation; use SET instead.

Disadvantages of the scheme:

The hash structure does not support a timeout on an individual field. However, deletion can be controlled in code, and for long-lived data that never needs to expire this is not a concern.

This compression scheme is not suitable for scenarios that require absolute precision, because there is a small but nonzero probability of hash collision.

Based on this scheme, I reworked Spring Boot's RedisTemplate to provide a CompressRedisTemplate class that can be used like a regular RedisTemplate and automatically converts key-values into hash storage, achieving the results above.

In a later article, we'll look at how Redis's less common data structures can reduce the memory footprint from 20 GB to 5 MB in more extreme scenarios, such as counting unique visitors.
