preface


SDS is one of the basic data structures of Redis. Different from C string (ending in /0, which cannot guarantee binary security), SDS is an abstract type defined by Redis. It has many advantages such as

  • Get string length time complexity O(1)
  • Efficient capacity expansion mechanism can also prevent buffer overflow
  • Lazy free space to improve performance by reducing the number of memory reallocations
  • Binary security
  • Compatible with some C string functions

The body of the


1. Obtain the source code

Obtained from our official website, source code of SDS string is mainly in SDs.c as shown below

2. SDS data structure

The following code is a definition of a data structure, and you can see that there are five definitions for strings. Only the first structure is different. The next four are similar

// SDS data structure
struct __attribute__(((packed__)) sdshdr5 {
    unsigned char flags; /* The lower three bits indicate the type, and the higher five bits indicate the string length */
    char buf[];	// An array of strings allocated dynamically by the malloc function
};
struct __attribute__(((packed__)) sdshdr8 {
    uint8_t len; // Record the used length of buF
    uint8_t alloc; // Record the total buF length
    unsigned char flags; /* The lower three bits indicate the type; the higher five bits are not used */
    char buf[];	// An array of strings allocated dynamically by the malloc function
};
struct __attribute__(((packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
Copy the code

The above parameters, len, alloc, and flags, are string descriptors and can be considered headers. The real string store is the buf[] array, which is dynamically allocated by the malloc function

  • Since the structure defines len, the length of the string can be obtained at constant time complexity, and the end of the string can be determined by len rather than /0, ensuring binary safety

The reason to define multiple data structures is because of the memory waste problem, when we store strings, whether they are keys or values, they are allocated as SDS strings. If we assign a small string that takes up a large header, it’s not appropriate

So the question is, how do you tell which data type to use for a string?

The answer is flags, which is used to distinguish between different data types. The flags bit char type is 1 byte, but there are five data types, so three bits are enough to represent eight, so use the lower three bits. In the first data structure, the high five digits of flags are also used to indicate string length

The following is an analysis of the differences between the first and second data structures. Three, four and five have the same structure as two

sdshdr5

For the first structure shown below, sdSHDR5 type data structure, the data is directly represented by the high five digits of flags

sdshdr8

For the second data structure. First of all, uint8_t is an abstract data type encapsulated by SDS, where 8 represents 8 bits, so uint8_t len; This declaration means that len is 8 bits or 1 byte. Sdshdr16, sdshdr32, sdshdr64 are the same as 8, except that len and alloc are preceded by different abstract data types

For the definition of a data structure, notice that there is a key __attribute__ ((Packed)), which GCC uses to align the structure with 1 byte. Leaving the keyword out is usually aligned with the least common multiple bytes of the variable.

The following code demonstrates the packed keyword effect

#include <stdio.h>
#include <string.h>

int main(a){
      struct __attribute__(((packed__)) sdshdr32{
          long len;
          long alloc;
          unsigned char flags;
          char buf[];
      };
 
 	  	struct sdshdr32 s;   
    	printf("sizeof sds32 is %lu".sizeof(s));
      return 1;
}
Copy the code

The code emulates the SDS32 abstract type, with the long type accounting for 8 bytes (in my case, Mac OS, unix-like system long accounting for 8 bytes; in Windows, long accounting for 4 bytes, which will be aligned with 4 bytes) and char accounting for 1 byte. The following shows the result of not adding packet and adding the packet keyword. It can be seen that there are 24 bits in total without packed. The packed modifier takes up 17 bytes


3. Create a string

Here is the code to create the string, specifying the parameters

  • *init: pointer to a string. If “SDS_NOINIT” is passed, the space to which the sh pointer points is not initialized. If NULL is passed, the space to which the sh pointer points is initialized to 0
  • Initlen: indicates the length of the string
sds sdsnewlen(const void *init, size_t initlen) {
    void *sh;		
    sds s;
    char type = sdsReqType(initlen);	// Select the type based on the length of the string
    
   	if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
    int hdrlen = sdsHdrSize(type);	// Calculate the length required for different headers
    unsigned char *fp; /* flags pointer. */

  	//sh pointer to the entire structure, using malloc to allocate memory, +1 because of the '/0' terminator
    sh = s_malloc(hdrlen+initlen+1);
  
  	// SDS_NOINIT is the macro definition of "SDS_NOINIT"
    if (init==SDS_NOINIT)
        init = NULL;
    else if(! init)// This function initializes the space pointed to by sh to 0
        memset(sh, 0, hdrlen+initlen+1);
    if (sh == NULL) return NULL;
  
    // the s pointer is a pointer to buf, where the sh pointer is used to offset the length of the head to calculate the position of the pointer
    s = (char*)sh+hdrlen;
  
    // Get the flags pointer by offsetting the S pointer by 1, since memory is in bytes
    fp = ((unsigned char*)s)- 1;
  
  	// Assign structure members by type
    switch(type) {
        case SDS_TYPE_5: {
            *fp = type | (initlen << SDS_TYPE_BITS);
            break;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
        case SDS_TYPE_16: {
            ...
        }
        case SDS_TYPE_32: {
            ...
        }
        caseSDS_TYPE_64: { ... }}if (initlen && init)
        memcpy(s, init, initlen);
    s[initlen] = '\ 0';// Return s pointer to buf, so it is compatible with some c functions
    return s;
}
Copy the code

This is the process of creating a string. The main points to note are:

  • Sh points to an SDS structure, while the actual returned S pointer points to buF
  • Creating a string selects a different structure depending on the length of the string
  • The address of a pointer to a structure member is usually computed by an offset

There is also an interesting function above, calculate the header function, the source code is as follows. For those of you who know the principle of netmask, you can quickly understand that we can use IP and netmask to perform logic and operations to calculate the subnet where the IP resides. SDS calculates the header type in a generic way, where the macro defines the SDS mask as 7, which is binary 111, corresponding to the lower three digit type. Sds_type_5-64 macros are defined as 0,1,2,3,4 respectively. The same logic and operations are performed to obtain the corresponding types

#define SDS_TYPE_MASK 7
static inline int sdsHdrSize(char type) {
    switch(type&SDS_TYPE_MASK) {
        case SDS_TYPE_5:
            return sizeof(struct sdshdr5);
        case SDS_TYPE_8:
            return sizeof(struct sdshdr8);
        case SDS_TYPE_16:
            return sizeof(struct sdshdr16);
        case SDS_TYPE_32:
            return sizeof(struct sdshdr32);
        case SDS_TYPE_64:
            return sizeof(struct sdshdr64);
    }
    return 0;
}
Copy the code

4. Release memory

Two ways to free memory

// Call free to free memory by offsetting the s pointer to the header
void sdsfree(sds s) {
    if (s == NULL) return;
    s_free((char*)s-sdsHdrSize(s[- 1]));
}

// Set len to 0. The old value is still in memory, but the data can be overwritten without reallocating memory
void sdsclear(sds s) {
    sdssetlen(s, 0);
    s[0] = '\ 0';
}
Copy the code

5. SDS expansion policy

How do I prevent buffer overflows when the string length changes? , how to ensure high performance? This is the problem solved by SDS expansion strategy

There are many functions that can be used to expand capacity. Here is a concatenation string function to illustrate

First, the function s is the original string structure. The t pointer points to the second string, len being its length. What it does is

  • Computes the current string length
  • With the new string length, call sdsMakeRoomFor to get the expanded S pointer
  • String concatenation is binary safe because the curlen length does not contain “/0”
  • Finally, the s pointer is returned
sds sdscatlen(sds s, const void *t, size_t len) {
    size_t curlen = sdslen(s);

    s = sdsMakeRoomFor(s,len);
    if (s == NULL) return NULL;
    memcpy(s+curlen, t, len);
    sdssetlen(s, curlen+len);
    s[curlen+len] = '\ 0';
    return s;
}
Copy the code

The capacity expansion strategy is realized by the sdsMakeRoomFor function, and the source code is as follows

sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    size_t avail = sdsavail(s);	// Calculate the remaining space
    size_t len, newlen;
    char type, oldtype = s[- 1] & SDS_TYPE_MASK;
    int hdrlen;

    // If the remaining space is sufficient, the system does not need to expand the space
    if (avail >= addlen) return s;

    len = sdslen(s);
    sh = (char*)s-sdsHdrSize(oldtype);
    newlen = (len+addlen);		// Add the new string length to the original string length
    if (newlen < SDS_MAX_PREALLOC)		//SDS_MAX_PREALLOC defines the macro as 1M in size
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen); // Calculate the type required according to the length

    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);
    if (oldtype==type) {		// If the type does not change, directly s_realloc can be expanded
        newsh = s_realloc(sh, hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        // If the type changes, malloc will need to reallocate memory
        newsh = s_malloc(hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);	// Move the BUF content to the new location
        s_free(sh);
        s = (char*)newsh+hdrlen;
        s[- 1] = type;// Assign to flags
        sdssetlen(s, len);	// Assign a value to len
    }
    sdssetalloc(s, newlen);// Assign to alloc
    return s;
}
Copy the code

The code above illustrates several cases of the capacity expansion mechanism

  • If the remaining space is sufficient, do not expand the capacity
  • If the new length is larger than the remaining space, the capacity is doubled if the new length is smaller than 1M, and the capacity is increased if the new length is larger than 1M
  • If the type is unchanged, recalloc will be used to expand the size. If the type is changed, malloc will need to reallocate the memory

The obvious advantage of this expansion mechanism is that it reduces the number of memory reallocations required to modify strings, which greatly improves performance

conclusion


An analysis of the SDS source code shows why Redis is so efficient. In general, SDS dynamic strings control the details well. For example, the definition of data structure, as well as compiler level optimization, the selection of data structure strategy, are greatly saving memory space. The allocation, release, and expansion of memory are considered in all cases