instructions

When it comes to Redis data structures, it’s easy to think of five common Redis data structures: strings, lists, hashes, sets, Sorted sets, and how they work. But they are exposed data structures that Redis uses for API operations, and what are the underlying data structures that make them up

  • Simple Dynamic String (SDS)
  • The list
  • The dictionary
  • Jump table
  • The integer set
  • List of compression

Redis is available on GitHub github.com/antirez/red…

Simple Dynamic String (SDS)

Redis is written in C, but instead of using a string representation of C (C is a string array terminated by the null character \0), Redis builds an abstract type of simple Dynamic String (SDS) as the default string representation of Redis

In Redis, key-value pairs containing string values are implemented by SDS

The definition of SDS

The structure of SDS is defined in sds.h file. After Redis 3.2 version, there are some changes in the definition of SDS, from one data structure to five data structures. Different structures will be selected according to the length of the content stored in SDS to achieve the effect of memory saving

/ / 3.0
struct sdshdr {
    // Record the number of bytes used in the BUf array, that is, the length of the string held by SDS
    unsigned int len;
    // Record the number of unused bytes in the BUF data
    unsigned int free;
    // An array of bytes to hold strings
    char buf[];
};

/ / 3.2
/* Note: sdshdr5 is never used, we just access the flags byte directly. * However is here to document the layout of type 5 SDS strings. */
struct __attribute__(((packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__(((packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
Copy the code

After version 3.2, the data structure is selected based on the length of the string

static inline char sdsReqType(size_t string_size) {
    if (string_size < 1<<5)  / / 32
        return SDS_TYPE_5;
    if (string_size < 1<<8)  / / 256
        return SDS_TYPE_8;
    if (string_size < 1<<16)   // 65536 64k
        return SDS_TYPE_16;
    if (string_size < 1ll<<32)  // 4294967296 4G
        return SDS_TYPE_32;
    return SDS_TYPE_64;
}
Copy the code

Take a look at sdSHDR8 version 3.2 for an example

  • len: Records the number of bytes in use'\ 0'), the complexity of obtaining the length of SDS is O(1)
  • alloc: Records the total number of bytes allocated in the current byte array (excluding'\ 0')
  • flags: Indicates the property of the current byte array, yessdshdr8orsdshdr16The definition of flags value can be seen in the following code
  • buf: byte array used to hold strings, including closing whitespace characters'\ 0'
// Flags value definition
#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4
Copy the code

The blank space of the byte array above represents the unused space, which is the space optimization strategy of Redis to leave room for the operation of string to ensure security and improve efficiency

SDS and C strings

C language uses a character array of length N+1 to represent a string of length N, and the last element of the character array is the null character ‘\0’. However, this simple string representation method cannot meet the requirements of Redis for string security, efficiency and function, so what are the benefits of using SDS

Refer to Redis Design and Implementation

Constant complexity gets the length of the string

C string does not record the length of the string, the length must be traversed through the entire string, the complexity is O(N); The SDS structure itself has an len attribute that records the length of the string, all of which is O(1). Redis reduced the complexity required to get the length of a string from O(N) to O(1), ensuring that getting the length of a string does not become a performance bottleneck for Redis

Prevent buffer overflows and reduce the number of memory reallocations caused by string changes

C strings do not record their own length. Each time a string grows or shortens, a memory reallocation operation is performed on the underlying character array. If the append operation is not reallocated to expand the underlying data before concatenation, a cache overflow occurs. If the truncation-trim operation is not followed by memory reallocation to free up unused space, a memory leak occurs

SDS removes the association between the string length and the underlying data length through unused space. In version 3.0, free is used to record the unused space, while in version 3.2, alloc is used to record the total number of allocated bytes. By using unused space, SDS realizes two optimal space allocation strategies, space pre-allocation and lazy space release, and solves the space problems of string concatenation and interception

Binary security

The characters in a C string must conform to a certain encoding. A string cannot contain null characters except at the end of the string, otherwise it is considered to be the end of the string. This limits C strings to holding only text data, not binary data such as images

The SDS API handles the data stored in the BUF array in a binary way, without any restrictions on the data inside. SDS uses the value of the len attribute to determine whether the string ends, rather than the null character

Compatible with some C string functions

While the SDS API is binary safe, it is null-terminated like a C string, so that an SDS that holds text data can reuse some of the C string functions

C String comparison with SDS

C string SDS
Get string length complexity O(N) Get string length complexity O(1)
The API is not secure and can cause buffer overflows The API is secure and does not cause buffer overflows
Modifying the length of the string necessarily requires a memory reallocation Changing the string length N times will require a maximum of N memory reallocations
Only text data can be saved You can save text or binary data
You can use all<string.h>Functions in the library You can use some of it<string.h>Functions in the library

The list

Linked list is a common data structure, which is easy to insert and delete, high memory utilization, and can adjust the length of the list flexibly, but difficult to access randomly. Many high-level programming languages have built-in implementations of linked lists, but C does not, so Redis implements its own linked list data structure

Linked List is widely used in Redis. The underlying implementation of List is linked List. Redis also uses linked lists for publish and subscribe, slow query, monitor and other functions

List node and list definition

A node on a linked list is defined as follows, adlist.h/listNode

typedef struct listNode {
    // Front node
    struct listNode *prev;
    // Rear node
    struct listNode *next;
    / / the node values
    void *value;
} listNode;
Copy the code

A linked list is defined as follows, adlist.h/list

typedef struct list {
    // Link list header node
    listNode *head;
    // End node of the list
    listNode *tail;
    // Copy the value of the node
    void *(*dup)(void *ptr);
    // The node value release function
    void (*free) (void *ptr);
    // Compare the values of nodes
    int (*match)(void *ptr, void *key);
    // The number of nodes in the list
    unsigned long len;
} list;
Copy the code

Each listNode can use the prev and next pointer distribution to point to the previous node and the next node to form a double-ended list. At the same time, each list will have a list structure that provides the head pointer, the tail pointer, and the length counter len for the list. There are also three type-specific functions for implementing polymorphic linked lists

  • dup: is used to replicate the value held by the linked list node
  • free: is used to release the value held by a linked list node
  • match: is used to compare the value stored by a linked list node to another input value

Linked list structure diagram

List features

  • Double-ended list: with Pointers to front and back nodes, which have complexity O(1)
  • Acyclic: of a table head nodeprevAnd the end of the tablenextAll points to NULL, and the access to the list ends with NULL
  • List length counter: withlenProperty to get the list length with complexity O(1)
  • Polymorphism: Use of linked list nodesvoid*Pointers hold node values and can hold different types of values

The dictionary

A dictionary, also known as a symbol table, associative array, or map, is an abstract data structure used to hold key-value pairs. Each key in the dictionary is unique, and the value associated with it can be found by the key, and it can be modified or deleted

Redis’s key-value pair storage is implemented with a dictionary, as is one of the underlying implementations of hashing

Shall we take a look directly at how dictionaries are defined and implemented

Dictionary definition implementation

Redis dictionary is implemented using a hash table. A hash table can have multiple hash table nodes, and each hash table node holds a key and value pair in the dictionary

Hash table structure definition, dict.h/dictht

typedef struct dictht {
    // Hash table array
    dictEntry **table;
    // Hash table size
    unsigned long size;
    // Hash table size mask, used to calculate the index value, equal to size-1
    unsigned long sizemask;
    // Number of existing nodes in the hash table
    unsigned long used;
} dictht;
Copy the code

A hash table consists of the array table, where each element is a pointer to the dict.h/dictEntry structure. The hash table nodes are defined as follows

typedef struct dictEntry {
    / / key
    void *key;
    / / value
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    // Point to the next hash table node to form a linked list
    struct dictEntry *next;
} dictEntry;
Copy the code

Where the key is our key; V is the key, which can be a pointer, an integer, or a floating point number; The next attribute is a pointer to the next hash table node, which can make multiple key-value pairs with the same hash value form a linked list and resolve the problem of key conflicts

And finally, our dictionary structure, dict.h/dict

typedef struct dict {
    // Type related handlers
    dictType *type;
    // Private data
    void *privdata;
    / / a hash table
    dictht ht[2];
    // Rehash index, which has a value of -1 when rehash is no longer performed
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    // Number of iterators
    unsigned long iterators; /* number of iterators currently running */
} dict;
Copy the code

The type and privData properties are key/value pairs for different types and are used to create multi-type dictionaries. Type is a pointer to a dictType structure, and privData holds optional arguments that need to be passed to a type-specific function. See the code below for a dictType structure and a type-specific function

typedef struct dictType {
    // Calculate the number of rows for the hash value
    uint64_t (*hashFunction)(const void *key);
    // The function that copies the key
    void *(*keyDup)(void *privdata, const void *key);
    // Copy the value of the function
    void *(*valDup)(void *privdata, const void *obj);
    // Compare the key function
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    // Destroy key function
    void (*keyDestructor)(void *privdata, void *key);
    // Destroy the value of the function
    void (*valDestructor)(void *privdata, void *obj);
} dictType;
Copy the code

The HT attribute of a dict is a two-element array that contains two dictht hash tables. Dictionaries only use the HT [0] hash table. The HT [1] hash table is used when rehashing the HT [0] hash table, that is, when the number of key-value pairs in the hash table exceeds the number of loads. The key-value pairs will be migrated to HT [1]

Rehashidx is also related to rehash. The operation of rehash is not instantaneous. Rehashidx records the progress of the rehash and has a value of -1 if no rehash is currently being performed

Combining the above structures, let’s take a look at the structure of the dictionary (without rehash)

The hashing algorithm and the operation of rehash will not be explained in detail here, but will be covered separately at a later date

When a new key-value pair is added to the dictionary, the hash and index values are calculated based on the keys of the key-value pair, and the corresponding hash table is placed based on the index value, i.e., ht[0] if the index value is 0. A key conflict occurs when two or more keys are assigned to the same index on a hash table array. Hash tables use the chaining method to solve this problem, which uses the next pointer of the hash table node to join multiple nodes on the same index. When a hash table has too many or too few key-value pairs, the hash table needs to be expanded and contracted, which is performed by rehashing

Jump table

An ordinary singly linked list queries an element with a time complexity of O(N), even if the list is ordered. The use of a SkipList is to solve the lookup problem. It is an ordered data structure that does not belong to a balanced tree structure or a Hash structure. By maintaining multiple Pointers to other nodes in each node, it achieves the purpose of fast access to nodes

A skip list is one of the low-level implementations of an Sorted Set. Redis uses a skip list if the Sorted Set contains a large number of elements or if the elements are long strings

Skip list definition

Skip lists can actually be thought of as multilayer linked lists, and they have the following properties

  • The structure consists of multiple layers, each of which is an ordered linked list
  • The list at level 1 contains all the elements
  • The search times of skip list is approximately the number of layers, the time complexity is O(logn), and the insertion and deletion are also O(logn).
  • A skip list is a randomized data structure (determined by flipping a coin)

So how do we understand skip lists, we start with the lowest list of all the elements, given the following list

Then we every other element, put it in a layer of the linked list, here I call it rise (note that the scientific method is the way to flip a coin, to determine whether elements floating on a layer of list, here I just simple surface of every other element onto a layer list, facilitating understanding), the structure of the operation is completed as follows

Find the element method is like this, start from the upper level search, large number to the right to find the head, decimals to the left to find the head, for example, I want to find 17, query order is: 13 -> 46 -> 22 -> 17; 13 -> 46 -> 22 -> 46 -> 35; If it’s 54, it’s 13 -> 46 -> 54

If you add an element, you flip a coin to determine how many layers the element will be in. In other words, there is a 1/2 chance that the element will be in layer 2 and a 1/4 chance that the element will be in layer 3……

The deletion and addition of skip table nodes are unpredictable, and it is difficult to ensure that the index of skip table is always uniform. The method of coin toss can make it more or less uniform

So let’s say we have a skip list for the example above, and now I’m going to add 18 to it, and I’m going to flip a coin to determine how many levels it’s going to have, so heads, tails, tails, let’s say I flip the coin twice, heads, tails

Deleting a skip list is as simple as finding the node you want to delete and then following the steps to delete the same nodes at each level

The cost of maintaining the structure balance of skip list is relatively low, and it depends on random. Compared to binary search tree, after many inserts and deletes, it needs Rebalance to Rebalance the structure

Skip list implementation

H /zskiplistNode and redis.h/zskiplist (redis.h changed to server.h after version 3.2). ZskiplistNode defines the node of the jump list. Zskiplist Stores information about skiplist nodes

/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    // member object (robj *obj;)
    sds ele;
    / / score
    double score;
    // Back pointer
    struct zskiplistNode *backward;
    / / layer
    struct zskiplistLevel {
        // Forward pointer
        struct zskiplistNode *forward;
        / / span
        // The span is actually used to calculate the element rank. In the process of searching for a node, the span of all the layers visited along the way is accumulated. The result is the rank of the target node in the skip list
        unsigned long span;
    } level[];
} zskiplistNode;

typedef struct zskiplist {
    // The header node and the tail node
    struct zskiplistNode *header, *tail;
    // Number of nodes in the table
    unsigned long length;
    // The number of layers of the node with the largest number of layers in the table
    int level;
} zskiplist;
Copy the code

ZskiplistNode structure

  • levelArray (layer) : Each time a new hop node is created, the size of the level array is calculated according to the power law, which is the height of the next layer. Each layer has two properties –Forward Pointersandspan, the forward pointer is used to access other Pointers in the tail direction of the table; The span is used to record the distance between the current node and the node indicated by the forward pointer (point NULL, width 0).
  • backward(Back pointer) : Points to the node before the current one
  • score(score) : Used for sorting, if score is the same look at the member variables in lexicographical order size sort
  • objorele: a member object is a pointer to a string object that holds an SDS; The member object of each node in the hop table must be unique, and the score value can be the same

Zskiplist structure

  • header,tailTable head nodes and table tail nodes
  • lengthThe number of nodes in the table
  • levelThe number of layers of the node with the largest number of layers in the table

Suppose we now present a skip list with four nodes of height 2, 1, 4, and 3

The head node of a zskiplist is not a valid node, it has a ZSKIPLIST_MAXLEVEL layer (32 layers), and the forward of each layer refers to the first node of the jump list of that layer, otherwise NULL. In Redis, the structure of the top skiplist is as follows

  • Each skip list node has layers between 1 and 32
  • In a skip list, nodes are sorted according to the size of their score. Multiple nodes can have the same score. When the nodes are the same, they are sorted according to the size of their member objects
  • Member variables must be unique for each node

The integer set

An integer set (intset) is an abstract data structure used by Redis to store integer values of type int16_t, int32_T, and INT64_T. It is guaranteed that there will be no duplicate elements in the set

The Set of integers is one of the underlying implementations of a Set. If a Set contains only integer numeric elements and a small number of elements, the Set of integers is used as the underlying implementation

Implementation of the definition of the set of integers

The set of integers is defined as inset.h/inset

typedef struct intset {
    // Encoding mode
    uint32_t encoding;
    // The number of elements the collection contains
    uint32_t length;
    // Hold an array of elements
    int8_t contents[];
} intset;
Copy the code
  • contentsArray: Each element of a collection of integers is sorted in an array from smallest to largest in size and does not contain duplicates
  • lengthRecords the number of elements in the integer collection, that is, the length of contents array
  • encodingDetermines the true type of contents array, such as INTSET_ENC_INT16, INTSET_ENC_INT32, INTSET_ENC_INT64

The upgrade of the integer set

When you want to add a new element to an integer set whose type is longer than that of all existing elements in the integer set, the integer set needs to be upgraded before the new element can be added to the integer set. Every time a new element is added to an integer collection, it is possible to cause an upgrade, and each upgrade requires all existing elements in the underlying array to be cast

Upgrade to add new elements:

  • Based on the new element type, expand the size of the underlying array of the integer collection and allocate space for the new elements
  • Convert the existing elements of the array to the type of the new element, and place the converted elements in the correct locations, while maintaining the order of the array
  • Adds new elements to the underlying array

The upgrade strategy of the integer set can improve the flexibility of the integer set and save memory as much as possible

In addition, integer sets do not support degradation, and once upgraded, the encoding remains in the upgraded state

List of compression

A ziplist is designed to save memory. It is a sequential data structure consisting of a sequence of specially encoded contiguous memory blocks. A ziplist can contain multiple nodes, each of which can hold a byte array or an integer value

A compressed List is one of the underlying implementations of lists and hashes. A List containing only a small number of List items, each of which is a small integer value or a short string, is implemented using a compressed List (after version 3.2, using QuickList).

The composition of a compressed list

A compressed list can contain multiple entries, each of which can hold a byte array or an integer value

The components are described as follows

  • zlbytes: Records the number of bytes of memory occupied by the entire compressed list, and realallocates memory in the compressed list, or calculateszlendIs used in the position of
  • zltail: Records the number of bytes from the start address of the compressed list to the end node of the compressed list. This offset can be used to determine the address of the end node of the compressed list without traversing the entire compressed list
  • zllenIf the value of this attribute is less than UINT16_MAX (65535), it is the number of nodes in the compressed list. Otherwise, you would have to traverse the whole compressed list to figure out the true number of nodes
  • entryX: Compresses list nodes
  • zlend: Special value 0xFF (decimal 255), used to mark the end of the compressed list

Compressing list node composition

Each compressed list node can hold either a byte number or an integer value with the following structure

  • previous_entry_ength: Records the length of the preceding byte of the compressed list
  • encodingThe encoding of the: node holds the content type of the node’s content
  • contentThe: Content area is used to store the content of the node. The content type and length of the node are determined by encoding

object

The main underlying data structures of Redis are described above, including simple dynamic strings (SDS), linked lists, dictionaries, skip lists, integer sets, and compressed lists. But instead of using these data structures directly to build a database of key-value pairs, Redis created a system of objects based on these data structures, known as The API-operable Redis data types. For example, String, List, Hash, Set, Sorted Set

According to the type of an object, you can determine whether an object can execute a given command. In addition, different data structures can be set for objects in different scenarios to optimize the usage efficiency of objects in different scenarios

type coding BOJECT ENCODING Command output object
REDIS_STRING REDIS_ENCODING_INT “int” A string object implemented using integer values
REDIS_STRING REDIS_ENCODING_EMBSTR “embstr” A string object implemented using a simple dynamic string encoded with embstr
REDIS_STRING REDIS_ENCODING_RAW “raw” A string object implemented using a simple dynamic string
REDIS_LIST REDIS_ENCODING_ZIPLIST “ziplist” List object implemented using a compressed list
REDIS_LIST REDIS_ENCODING_LINKEDLIST ‘”linkedlist’ A list object implemented using a double-ended linked list
REDIS_HASH REDIS_ENCODING_ZIPLIST “ziplist” Hash object implemented using a compressed list
REDIS_HASH REDIS_ENCODING_HT “hashtable” A hash object implemented using a dictionary
REDIS_SET REDIS_ENCODING_INTSET “intset” A collection object implemented using a collection of integers
REDIS_SET REDIS_ENCODING_HT “hashtable” A collection object implemented using a dictionary
REDIS_ZSET REDIS_ENCODING_ZIPLIST “ziplist” An ordered collection object implemented using a compressed list
REDIS_ZSET REDIS_ENCODING_SKIPLIST “skiplist” An ordered collection object implemented using skip list tables

Reference: Redis Design and Implementation