Interviewer: Tell me about the underlying structure of Redis for storing strings

preface

This article introduces the underlying data structure of the Redis string type.

The traditional string representation of C language is called C string, but Redis does not use this structure as the storage structure of the default string, but builds an abstract type of simple Dynamic string (SDS) by itself, and uses SDS as the default string representation of Redis.

In Redis, C strings are only used as string literals for things that do not require modification of string values, such as printing logs.

SDS

Redis uses SDS to identify string values when it needs more than just a string literal, but a string value that can be modified. For example, in Redis’s database, key-value pairs containing string values are implemented by SDS at the bottom.

If you run the following command on the client:

set msg "hello"
Copy the code

Then Redis will create a new key-value pair in the database, where:

The key of a key-value pair is a string object, and the underlying object is an SDS that holds the string “MSG”
The value of a key-value pair is also a string object, and the underlying implementation is an SDS that holds “Hello”

In addition to holding string values in the database, SDS is also used as a buffer: the AOF buffer in the AOF module and the input buffer in the client state are implemented by SDS.

Next explain why Redis uses SDS instead of C strings for storage.

define

The structure of an SDS is as follows:

Struct SDSHDR {// Record the number of bytes used in the buF array, equivalent to the length of the string held by SDS int len; // Record the number of unused bytes in the buF array. Char buf[]; }Copy the code

An example of AN SDS is shown above:

The value of the free attribute is 0, indicating that the SDS has not allocated unused space
The len attribute value of 5 means that the SDS holds a 5-byte string
The buf property is an array of type char. The first five bytes of the array hold five characters. The last byte holds the null character ‘\0’.

SDS follow C string to an empty string at the end of the practice, save the null character 1 byte space do not calculate in the len attribute of SDS, and allocate an additional 1 byte to null character space, and operations such as adding null character to the end of the string is a function of SDS, so the null character is completely transparent for users of SDS.

The following is another EXAMPLE of an SDS, with the same structure as the previous SDS. They both hold the string value “Redis”. The difference is that this SDS allocates 5 bytes of unused space to the BUF array, so its free property is 5

SDS is different from C string

Traditionally, C uses character arrays of length N+1 to represent strings of length N, and the last element of the array is always the null character ‘\0’. The following figure shows a C string with the value “Redis”.

The simple string representation method used by C language cannot meet the requirements of Redis on the security, efficiency and function of strings. Next, the differences between C string and SDS will be compared in detail, and the reasons why SDS is more suitable for Redis than C string will be explained.

Get the length of the string

Because a C string does not record its own length, to obtain the length of a C string, the program must traverse the entire string, counting every character it encounters until it encounters a null character representing the end of the string, an operation of O(N).

Unlike C strings, obtaining an SDS length is O(1) because SDS records the length of SDS itself in the LEN attribute.

By using SDS instead of C strings, Redis reduces the complexity required to get string lengths from O(N) to O(1), which ensures that getting string lengths does not become a performance bottleneck for Redis. For example, because string keys are implemented using SDS underneath, even if we repeatedly execute STRLEN on a very long string key, there is no impact on system performance because STRLEN is only O(1) in complexity.

Prevent buffer overflow

In addition to the complexity of getting the length of a string, another problem with C strings not recording their length is that they tend to cause buffer overflow.

For example, suppose you have two C strings s1 and s2 that are next to each other in memory. S1 holds the string “Redis” and S2 holds the string “MongoDB”.

If we change the contents of S1 to “Redis Cluster” and forget to allocate enough space for S1 before append, s1 will overflow into S2 after append, causing S2 to be unexpectedly modified.

Unlike C strings, SDS’s space allocation strategy completely eliminates the possibility of buffer overflows: When SDSAPI needs to modify SDS, THE API will first check whether the SDS space meets the requirements for modification. If not, THE API will automatically expand the SDS space to the size required for modification, and then perform the actual modification operation. Therefore, using SDS does not need to manually modify the SDS space. There will also be no buffer overflow problems described earlier.

Reduced memory reallocation times

Because a C string does not record its own length, the underlying implementation of a C string containing N characters is always an array of N+1 characters long (an extra character space is used to hold null characters). Because of this correlation between the length of the C string and the length of the underlying array, every time a C string is increased or shortened, the program always reallocates the array that holds the C string:

If a program performs an operation that grows a string, such as append, it needs to increase the size of the underlying array by reallocating memory before performing this operation — forgetting this step can result in a buffer overflow.
If the program is performing operations that shorten the string, such as truncation (trim), then after this operation, the program needs to reallocate memory to free up space that is no longer used by the string — forgetting this step can cause a memory leak.

To avoid this defect of C strings, SDS disassociates the length of the string from the length of the underlying array by using unused space: in SDS, the length of the BUF array is not necessarily the number of characters plus one. The array can contain unused bytes, and the number of bytes is recorded by the FREE attribute of SDS. SDS realizes two optimization strategies of space pre-allocation and inert space release through unused space.

Space preallocation

Space preallocation is used to optimize string growth operations for SDS: When the SDS API modifies an SDS and requires space expansion for the SDS, the program allocates not only the space required for the modification, but also additional unused space for the SDS.

The amount of extra allocated unused space is determined by the following formula:

If the SDS is less than 1MB after modification, the program allocates the same amount of unused space as len, and the value of SDS Len is the same as the value of free. For example, if len of SDS becomes 13 bytes after modification, then the program allocates 13 bytes of unused space, and the actual length of the BUF array of SDS becomes 13+13+1=27 bytes (an extra byte is used to hold null characters).
If the SDS is greater than or equal to 1MB in length after modification, the program allocates 1MB of unused space. For example, if len of SDS changes to 30MB, the program allocates 1MB of unused space, and the actual length of the BUF array of SDS is 30MB+1MB+1byte

With a preallocated space strategy, Redis can reduce the number of memory reallocations required to perform consecutive string growth operations.

Inert space release

Lazy space free is used to optimize SDS string shortening operations: When the SDS API needs to shorten the string saved by SDS, the program does not immediately use memory reallocation to reclaim the shortened extra bytes, but instead uses the free property to record the number of bytes and wait for future use.

Through a lazy space free strategy, SDS avoids the memory reallocation required to shorten strings and optimizes future growth operations.

SDS also provides apis that allow us to actually free up SDS unused space when we need it, so we don’t have to worry about wasted memory due to lazy space free strategies.

Binary security

The characters in a C string must conform to some encoding (such as ASCII), and the string must contain no null characters except the end of the string. Otherwise, the first null character read by the program will be mistaken for the end of the string. These limitations allow C strings to hold only text data. It cannot save binary data such as pictures, audio, video, and compressed files.

If a data format has a null character to split multiple words, such as “Redis Cluster”, then the format cannot be saved with a C string, because the C string function only recognizes “Redis” and ignores “Cluster”.

Although databases are generally used to store text data, binary data is also commonly used in databases, so SDS apis are binary-safe to ensure that Redis can be used ina variety of different scenarios. All SDSapis treat SDS data stored in the BUF array as binary, and the program does not restrict, filter, or make assumptions about how data is read as it is written.

That’s why we call the BUF property of SDS a byte array — Redis doesn’t use this array to hold characters, it uses it to hold a series of binary data.

By using binary-safe SDS instead of C strings, Redis can store binary data in arbitrary formats as well as text data.

Compatible with some C string functions

Although SDS apis are binary-safe, they follow the convention of null-ending C strings: these apis always set the end of data stored by SDS to a null-character, and always allocate an extra byte to accommodate the null-character when allocating space for the BUF array. This is so that SDS that hold textual data can reuse some of the functions defined by the <string.h> library.

If we want to compare an SDS string with a C string, Redis doesn’t have to write a function to compare two string values, and reuses C’s library to avoid unnecessary code duplication.

summary

The differences between C strings and SDS are as follows:

If you’re interested in Redis, check out the rest of this column.