From today on, we will enter the “practice”. Next, we will spend 5 lessons on “data structures”. I’ll save memory overhead and save huge amounts of data and statistics data types and the underlying data structure, also around the typical application scenarios (e.g., address location query, time series and message queue access database, speaking, reading and writing), to share with you use Redis data types, and the module extension function to meet the demand.

Today, we’ll take a look at the memory consumption problem of String types and the solution to choosing data types that save memory overhead.

Let me share with you a need THAT I once encountered.

At that time, we want to develop a picture storage system, which requires that the system can quickly record the picture ID and the ID when the picture is saved in the storage system (can be directly called the picture storage object ID). At the same time, the image storage object ID can be quickly found according to the image ID.

Because the number of images is huge, we use 10 digits to represent the image ID and the image storage object ID. For example, the image ID is 1101000051, and its corresponding ID in the storage system is 3301000051.

photo_id: 1101000051
photo_obj_id: 3301000051
Copy the code

It can be seen that the picture ID corresponds to the image storage object ID exactly one by one, which is a typical “key-single value” mode. A single value means that the value in a key-value pair is a single value, rather than a set. This is exactly the same as the “key-value data” format provided by String.

Moreover, strings can hold binary streams, like “balm,” by converting the data into binary byte arrays.

So, our first solution is to store the data in strings. We save the picture ID and the picture storage object ID as the key and value of the key-value pair respectively. The picture storage object ID uses the String type.

To start, we saved 100 million images and used about 6.4GB of memory. However, as the amount of image data increased, so did our Redis memory usage, and as a result, we encountered problems with large memory Redis instances that were slow to respond due to RDB generation. Clearly, String is not a good choice, and we need to look further for a data-type solution that saves memory.

In this process, I deeply studied the underlying structure of String, found the reason for its high memory overhead, and gained a new understanding of the String type of “all-purpose” : String type is not suitable for all occasions, it has an obvious disadvantage, that is, it consumes a lot of memory space when storing data.

I also took a closer look at data structures for collection types. I have found that collection types have an underlying implementation structure that is very memory efficient, but the data schema that collection types hold, which is a series of values for a key, is not suitable for storing single-valued key-value pairs directly. Therefore, we use the method of second-level coding to achieve the collection type to save single-valued key-value pairs, and the memory space consumption of Redis instances is significantly reduced.

In this lesson, I’ll share with you some of the lessons I’ve learned in solving this problem, including where String memory is consumed, what data structures can save memory, and how to use collection types to store single-valued key-value pairs. If you’re having trouble with strings, try today’s solution.

Next, let’s take a look at where String memory is being consumed.

Why is String memory expensive?

In our example, we saved 100 million image information using about 6.4GB of memory, and a record of the image ID and image storage object ID took an average of 64 bytes.

The problem is that a set of picture ids and the record of the object ids they store actually require only 16 bytes.

So let’s analyze it. The picture ID and the picture store object ID are both 10 digits, and we can represent them with two 8-byte longs. Since an 8-byte Long can represent up to 2 to the power of 64, it can definitely represent 10 digits. But why is String 64 bytes?

In fact, in addition to recording the actual data, strings also require additional memory space to record data length, space usage, and other information, also known as metadata. When the actual data storage is small, the space overhead of metadata is relatively large, which is a bit “distracting”.

So how exactly does a String hold data? Let me explain.

When you save a 64-bit signed integer, String stores it as an 8-byte Long integer, which is often called the int encoding.

However, when you save data that contains characters, the String type is stored in a Simple Dynamic String (SDS) structure, as shown below:

  • Buf: byte array that holds the actual data. To indicate the end of the byte array, Redis automatically adds a “\0” to the end of the array, which incurs an extra byte of overhead.
  • Len: 4 bytes, indicating the used length of buF.
  • Alloc: also 4 bytes, representing the actual allocated length of buF, which is generally greater than len.

As you can see, in SDS, BUF holds the actual data, and len and alloc themselves are the additional overhead of the SDS structure.

In addition to the SDS overhead, there is also an overhead from the RedisObject structure for strings.

Because there are many data types in Redis, and different data types have the same metadata to record (such as the last access time, the number of references, etc.), Redis uses a RedisObject structure to uniformly record this metadata and point to the actual data.

A RedisObject contains 8-byte metadata and an 8-byte pointer that further points to the actual location of the specific data type, such as the memory location of the SDS structure of type String, as shown in the diagram below. The exact structure details of a RedisObject will be covered later in the course, but for now you just need to understand its basic structure and metadata overhead.

Redis also designed the memory layout for Long integers and SDS to save memory space.

On the one hand, when holding an integer of type Long, the pointer in a RedisObject is assigned to the integer data directly. In this way, no additional Pointers are needed to point to the integer, saving the space overhead of Pointers.

On the other hand, when string data is stored and the string is less than or equal to 44 bytes, the metadata, Pointers, and SDS in a RedisObject are a contiguous area of memory, thus avoiding memory fragmentation. This layout is also known as EMBSTR encoding.

Of course, when the string is larger than 44 bytes, the amount of data in SDS begins to grow. Redis no longer arranges SDS and RedisObject together, but allocates independent space for SDS and points to SDS structures. This layout is called raw encoding.

To help you understand the three encoding modes of INT, EMbstr, and RAW, I drew a diagram as follows:

Now that we know the additional metadata overhead that RedisObject involves, we can now calculate the memory usage of the String type.

Because the 10-digit picture ID and picture storage object ID are integers of type Long, they can be stored directly with int-encoded RedisObject. The RedisObject metadata portion of each int is 8 bytes, and the pointer portion is directly assigned to an 8-byte integer. At this point, each ID uses 16 bytes, making a total of 32 bytes. But what happened to the other 32 bytes?

As I said in Lecture 2, Redis uses a global hash table to hold all key-value pairs, and each entry in the hash table is a dictEntry structure that points to a key-value pair. There are three 8-byte Pointers in the dictEntry structure, pointing to key, value and the next dictEntry respectively, with a total of 24 bytes, as shown in the figure below:

However, these three Pointers are only 24 bytes, so why do they take up 32 bytes? This brings us to Jemalloc, the memory allocation library used by Redis.

Jemalloc allocates memory to a power of 2 that is larger than N but closest to N, according to the number of bytes we request. This reduces the number of frequent allocations.

Let me give you an example. If you ask for 6 bytes of space, Jemalloc actually allocates 8 bytes of space; If you request 24 bytes of space, Jemalloc will allocate 32 bytes. So, in the scenario we just described, the dictEntry structure takes up 32 bytes.

Well, at this point, you should be able to understand why it takes 64 bytes to hold the picture ID in String and the picture store object ID.

You see, a String saves only 16 bytes of valid information, but 64 bytes of memory is required, and 48 bytes are not used to save the actual data. If you want to save 100 million images, 100 million image ID records would require 6.4GB of memory, of which 4.8GB is used to store metadata. The extra memory is expensive. So, is there a way to save more memory?

What data structures can save memory?

Redis has an underlying data structure called ziplist, which is a very memory saving structure.

Let’s review the composition of a compressed list. The table header has three fields zlBytes, zLTail, and zllen, representing the length of the list, the offset at the end of the list, and the number of entries in the list, respectively. There is also a Zlend at the end of the compressed list, indicating the end of the list.

Compressed lists save memory because they hold data in a series of consecutive entries. The metadata of each entry consists of the following parts.

  • Prev_len: indicates the length of the previous entry. Prev_len has two values: 1 byte or 5 bytes. If the value is 1 byte, it indicates that the length of the last entry is less than 254 bytes. Although a 1-byte value can represent values ranging from 0 to 255, the default value of zlend in the compressed list is 255. Therefore, 255 is used to indicate the end of the compressed list by default, and 255 is not used anywhere else to indicate length. Therefore, the prev_len value is 1 byte if the length of the previous entry is less than 254 bytes; otherwise, the prev_len value is 5 bytes.
  • Len: indicates its length, which is 4 bytes.
  • Encoding: Encoding mode, 1 byte.
  • Content: Saves actual data.

These entries are placed side by side in memory, eliminating the need for additional Pointers to concatenate, thus saving space occupied by Pointers.

Let’s take saving image storage object IDS as an example to analyze how compressed lists save memory space.

Each entry holds an image storage object ID (8 bytes). In this case, only 1 byte is required for prev_len of each entry because the length of the preceding entry is only 8 bytes, less than 254 bytes. As a result, the memory size of the image storage object ID is 14 bytes (1+4+1+8=14), and the actual allocation is 16 bytes.

Redis implements collection types like List, Hash, and Sorted Set based on compressed lists, which saves on dictEntry overhead. When you use String, a key-value pair has a dictEntry, and it uses 32 bytes of space. However, when using the set type, a key corresponds to a set of data, which can save a lot of data, but also only one dictEntry, which saves memory.

This scheme sounds good, but there is a problem: in the case of a set type for storing key-value pairs, a key corresponds to a set of data, but in our scenario, a picture ID corresponds to a picture’s storage object ID. How do we use the set type? In other words, in the case of a key corresponding to a value (i.e., a single-valued key-value pair), how do we store such a single-valued key-value pair using a collection type?

How do I store single-valued key-value pairs with collection types?

When saving single-valued key-value pairs, you can use a Hash based second-level encoding method. We can store single-valued data in the Hash set by splitting it into two parts, the first part as the key and the second part as the value.

Take the image ID 1101000060 and the image storage object ID 3302000080 as examples. The first 7 bits of the image ID (1101000) can be used as the Hash key, and the last 3 bits of the image ID (060) and the image storage object ID can be used as the key and value of the Hash value respectively.

Following this design, I inserted a list of image ids and their stored object IDS into Redis, and used the info command to check the memory overhead. I found that adding a record increased the memory footprint by only 16 bytes, as shown below:

127.0.0.1:6379> info memory # memory usED_memory :1039120 127.0.0.1:6379> hset 1101000 060 3302000080 (integer) 1 127.0.0.1:6379> info memory # memory used_memory:1039136Copy the code

In the case of strings, each record consumes 64 bytes. In this case, it uses only 16 bytes and uses a quarter of the memory we need to save.

However, you might wonder: “Must the first 7 bits of the image ID be the Hash key and the last 3 bits be the key in the Hash value?” In fact, the ID length used in the secondary coding method is particular.

In Lecture 2, I introduced two underlying implementations of the Redis Hash type, compressed lists and Hash tables.

So when do you use compressed lists and when do you use Hash tables? In fact, the Hash type sets two thresholds for storing data in a compressed list. Once these thresholds are exceeded, the Hash type will store data in a Hash table.

The thresholds correspond to the following configuration items:

  • Hash-max-ziplist-entries: Indicates the maximum number of elements in a hash set saved in a compressed list.
  • Hash-max-ziplist-value: indicates the maximum length of a single element in the hash set when saved in a compressed list.

If we write more elements to the Hash set than hash-max-ziplist-entries, or if the size of a single element exceeds hash-max-ziplist-value, Redis automatically converts the Hash implementation from a compressed list to a Hash table.

Once converted from a compressed list to a Hash table, the Hash type is always stored in the Hash table and never reverted back to the compressed list. In terms of saving memory space, hash tables are not as efficient as compressed lists.

To make full use of the compact memory layout of compressed lists, we typically control the number of elements stored in the Hash set. Therefore, in the previous second-level encoding, we only used the last 3 bits of the image ID as the key of the Hash set, ensuring that the number of Hash elements does not exceed 1000, and we set hash-max-ziplist-entries to 1000. In this way, Hash collections can always use compressed lists to save memory.

summary

In this class, we broke the misperceptions of String, in the past, we believe that the String is “tiger balm”, what the occasion, however, in the preservation of key/value pair itself at a less memory space (such as mentioned in this lesson the ID and the pictures of storage object ID), the String types of metadata overhead is dominant, This includes the memory overhead of RedisObject structure, SDS structure and dictEntry structure.

In this case, we can use a compressed list to save the data. Of course, when using Hash as a collection type to store single-valued key-value pairs, we need to split the single-valued data into two parts, respectively, as the key and value of the Hash set, just as we used the secondary encoding to represent the image ID in the previous example. Hopefully you can use this method in your own scenarios.

Finally, I want to give you one more tip: if you want to know how much memory is being spent on different types of key-value pairs, you can enter the length of your key-value pair and the data type used in this url to get an idea of the actual memory consumption. I suggest you use this little tool, it can help you save a lot of memory.

Each lesson asking

As usual, here’s a quick question for you: Besides String and Hash, what other types do you think might be appropriate for the example of saving images in this lesson?

Feel free to post your thoughts and answers in the comments section, so we can discuss and share today’s content with your friends.