Chapter 2: Simple dynamic strings

2.1 Definition of SDS

Each SDSHDR structure represents an SDS value:

struct sdshdr{
	// Records the number of subsections used in the BUF array
	// is equal to the length of the string saved by SDS
	int len;
	// Records the number of unused subsections in buF
	int free;
	// Subsection array, used to hold strings
	char[] buf;
}
Copy the code

Example:

freeProperties for0: indicates that this SDS has not allocated any unused space
lenProperties for5: indicates that this SDS holds a string of five bytes, i.e., length 5
bufProperty is an array of the char class, the first five bytes of which are stored separately'R', 'e', 'D', 'I', 's'Five characters, and the last byte holds the null character'\ 0'.

SDS follows the convention that C strings end with empty strings, reserving 1 byte space for empty characters from the LEN attribute of SDS, and allocating an extra 1 byte space for empty characters.

Let’s look at another example:The biggest difference from SDS before: herefreeProperty has a value of 5, and the figure uses five Spaces to represent five bytes of unused space

What is the role of this free property in SDS?

2.2 Differences between SDS and C strings

C uses an array of characters of length N+1 to represent strings of length N, with the last value of the array being \0.

Such as:The simple representation of string in C language cannot meet the requirements of Redis on the security, efficiency and function of string.

Let’s take a look at the differences between C strings and SDS, and explain why SDS is better for Redis than C strings.

2.2.1 Constant Complexity Obtain the length of the string

As we all know, in the case of a C string, if we want to get its length, we have to iterate over it until we encounter a null character representing the end of the string. This operation takes O(N) time.

Examples are as follows:Unlike the C string, our SDS len attribute represents the length of the SDS itself, so fetching an SDS is 5 bytes long.Setting and updating the SDS length is done automatically by the SDS API at runtime without any manual length changes.

By using SDS instead of strings, Redis reduces the complexity of the time required to get string length from O(N) to O(1), which ensures that getting string length does not become a performance bottleneck for Redis.

For example, if we use STRLEN on a very long string, there is no performance impact because STRLEN is O(1).

2.2.2 Preventing buffer overflow

In addition to the complexity of retrieving the length of a string, another problem with C strings not recording their length is that they tend to cause Buffer overflow.

For example: We now want to concatenate the dest and SRC strings using char *strcat(char *dest, const char * SRC)

Since C strings do not record their length, strcat assumes that the user has allocated enough memory for dest to hold all the contents of the SRC string when executing this function. If this assumption is not true, a buffer overflow will occur.

As another example, suppose we have two C strings s1 and s2 that are adjacent to each other in memory, where S1 holds the string “Redis” and S2 holds the string “MongoDB”, as shown in the following figure:If a programmer executes:Strcat (s1, "Cluster");

Change the contents of S1 to “Redis Cluster”, but he forgot to allocate enough space for S1 before performing strcat, otherwise s1’s data will overflow into S2’s space, resulting in s2’s saved contents being accidentally modified. As shown in figure:Unlike C strings, SDS’s space allocation strategy completely eliminates buffer overflows.

When the SDS API needs to modify the SDS, the API will check whether the SDS space meets the requirements for modification. If not, the API will automatically expand the SDS space to the size required for modification, and then perform the actual modification operation. So using SDS does not require you to manually modify the size of SDS, nor does it cause buffer overflows.

Here’s an example: In our SDS API, there is also a sdSCat function that performs concatenation. It can concatenate a C string to the end of a given SDS string, but before concatenating, SDSCat will check whether the given SDS has enough space. If not, SDSCat will expand the SDS space. Then perform the splice operation.

For example, we execute:Sdscat (s, "Cluster")SDS values are shown in the figure:Before the implementation of SDSCat, it will check whether the length of S before the operation is enough. If it is found that the length of S is not enough, SDSCAT will expand the capacity first and then perform the operation of splicing, as shown in the figure:Note that sdSCat not only splices the SDS as shown in Figure 2-10, but also allocates 13 bytes of unused space to the SDS, and the string after splicing is exactly 13 bytes long. This is not a coincidence, it is related to the space allocation strategy of SDS.

2.2.3 Reduce the number of memory reallocation caused by string modification

Because our C string does not record its own length, for an N-character C string, the underlying implementation of the C string is always an N+1 character array (the extra one is used to hold null characters). If we want to perform an operation on the C string, either growing or shortening it, the program will perform a memory reallocation on the C string:

If the program performs an operation to grow a string, such as concatenation (append), then the program needs to expand the size of the underlying array through memory reallocation —– if forgotten, a buffer overflow will occur.
If the program performs an operation to shorten the string, such as trimming (trim), then after this operation, the program needs to reallocate memory to free up space that is no longer used in the string —– if forgotten, a memory leak will occur.

Because memory reallocation involves complex algorithms and may require system calls, it is usually a time-consuming operation:

In general programs, if changing the length of a string is infrequent, it is acceptable to perform a memory reallocation for each change.
However, Redis, as a database, is often used in situations where speed requirements are strict and data is frequently modified. If memory reallocation is required every time the string length is changed, time complexity will increase and performance will be greatly lost.

To avoid this defect of C strings, SDS disassociates the string length from the underlying array length by using no space:

In SDS, the length of the BUF array is not necessarily the number of characters plus one. The array can contain unused bytes, and the number of bytes is recorded by the FREE property of SDS.

SDS realizes two optimization strategies of space pre-allocation and inert space release through unused space.

2.2.3.1 Space preallocation

Space preallocation is used to optimize string growth operations for SDS: When the SDS API modifies an SDS and requires space expansion for the SDS, the program allocates not only the necessary space for the modification, but also additional unused space for the SDS.

The formula for the amount of extra space allocated is as follows:

If the length of SDS (len) is less than 1MB after modification of SDS, the program allocates the same amount of unused space as len, which is SDSlenProperty values andfreeThe value of the property is equal.

If the SDS will be greater than or equal to 1MB after modification, 1MB of unused space will be allocated. For example, if len of SDS is changed to 30MB, the program allocates 1MB of unused space, and the actual length of the BUF array of SDS is 10MB + 1MB + 1byte.

With a space preallocation strategy, Redis can reduce the number of memory reallocations required to perform consecutive string growth operations.

SDS turns what used to be a certain N memory reallocation into a maximum of N memory reallocation.

2.2.3.2 Releasing lazy space

Lazy space free is used to optimize SDS string shortening: When the SDS API needs to shorten the string saved by SDS, the program does not immediately use memory reallocation to reclaim the shortened extra bytes, but instead uses the free property to record the number of bytes for later use.

For example, we remove all characters from SDS that appear in the C string

sdstrim(s, "XY"); // Remove all 'x' and 'y' from SDS stringCopy the code

Our SDS will be modified as shown in Figure 2-15:As you can see, the FREE property of SDS has changed to 8, indicating that there are currently 8 free character Spaces in buF.

Why don’t we just delete the free memory and mark it with a free property?

As we know, when the SDS string grows, we first check whether the SDS string has free space to prevent buffer overflow. If there is space, the concatenation is done directly, and if there is no space, a new space is created to put a new string.

The reason we use the free attribute is if we do a concatenation after truncation. If we delete the free space after the truncation, we will have to re-allocate memory during the splicing, resulting in a waste of time and performance. Instead of deleting free space, we can simply reduce our free property.

2.2.4 Binary Security

C Characters in a string must conform to the ASCII code specification, and the string must not contain null characters except at the end of the string, otherwise the program will consider this as the end of the string.

These limitations allow our C strings to hold only text data, not binary data such as pictures, audio, and video.

Such as:In this case, our program can only identify “Redis”, not “Cluster”.

For SDS, however, there is no problem storing any data, because when SDS reads, len is used to determine whether the string ends.

By using binary-safe SDS instead of C strings, our Redis can store binary data in arbitrary formats as well as text data.

2.2.5 Compatible with some C string functions

For SDS, it is compatible with some C functions to a certain extent, such as:

2.2.6 summary

C string and SDS string difference:

C string	SDS string
Get the complexity of the length of the string O(N)	Get the complexity of the length of the string O(1)
The API is not secure and may cause buffer overflows	The API is secure and does not cause buffer overflows
Changing the length of a string N times requires N memory reallocations	Changing the length of a string for N times requires a maximum of N memory reallocations
Only text data can be saved	Can save text or binary data
You can use all of the <string.h> library functions	You can use some of the <string.h> library functions