Preface

Redis, a powerful tool used by many large tech companies to handle concurrency and deliver fast responses, is favored for its high performance, and that performance is inseparable from the design and implementation of its underlying data structures. Anyone who has used Redis knows its five basic data types: String, List, Hash, Set, and Zset. But these are only the first layer of data structures that Redis exposes to clients; internally there is a second layer of implementations, and Redis builds each first-layer type out of one or more second-layer data structures. If you share my curiosity about the Redis internals, let's study the design and implementation of the underlying data structures behind Redis's high performance.

Here we focus on the second-layer implementations. The five basic data types in Redis are built on the following data structures; let's look at them one by one:

  • sds
  • ziplist
  • quicklist
  • dict
  • skiplist

1. Dynamic String (SDS)

The String type is the most common and frequently used data type in any programming language. Redis is written in C, but it does not use C's native string type; instead it defines its own Simple Dynamic String (SDS for short) as the underlying implementation of the String type. Compared with C strings, SDS has the following advantages:

  • Dynamically expandable memory. A string represented by SDS can grow dynamically. A C string does not record its own length, so changing a string requires manually reallocating memory for the new content, and failing to do so risks a buffer overflow. SDS needs no manual resizing and has no buffer overflow problem, because it records both the size of the stored data and the maximum capacity, and expands automatically when the capacity would be exceeded.
  • Binary safe. SDS can store arbitrary binary data, not only text strings but also audio, pictures, compressed files, and other binary content. The SDS API treats the data stored in the buf array as opaque bytes, imposing no restrictions on it.
  • In addition, SDS remains compatible with a subset of the C string functions, thanks to the trailing null terminator.
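To make the first two advantages concrete, below is a minimal usage sketch based on the public SDS API declared in sds.h (sdsnew, sdscat, sdslen, and sdsfree are the real API functions; the main() scaffolding here is only for illustration):

#include <stdio.h>
#include "sds.h" /* from the Redis source tree */

int main(void) {
    sds s = sdsnew("hello");  /* allocates a header plus "hello" and a trailing '\0' */
    s = sdscat(s, " redis");  /* grows the buffer automatically; no manual realloc, no overflow */
    printf("%s (%zu bytes)\n", s, sdslen(s)); /* length is read from the header in O(1) */
    sdsfree(s);
    return 0;
}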

Below is part of the Redis SDS source code, from sds.h:

typedef char *sds;

/* Note: sdshdr5 is never used, we just access the flags byte directly.
 * However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

From the source we can see the internal composition of SDS, and notice that sds itself is defined as char *. Does that mean the Redis String type is just a char *? To stay type-compatible with traditional C strings, SDS uses the same type definition, char *, but an sds is not the same thing as a plain char *.

The real data is stored in the buf field of an sdshdr header, a structure that can hold binary data such as pictures and videos as well as plain strings. To remain compatible with C strings, SDS follows the convention that strings end with a null character, so in buf the user data is always followed by a \0; in other words, buf is the "data" + "\0" shown in the figure. Note also that there are five sdshdr types, of which sdshdr5 is unused, leaving four in practice. So many header types are defined so that strings of different lengths can use headers of different sizes: shorter strings get smaller headers, which saves memory.

An overview of SDS is shown below:

(Figure: overview of the SDS layout: header fields len, alloc, and flags, followed by buf holding the data and a trailing \0.)

Except for sdshdr5, each of the other four headers contains three fields:

  1. len: the actual length of the string (not counting the \0 terminator).
  2. alloc: the maximum capacity of the whole SDS (not counting the \0 byte).
  3. flags: always occupies one byte. Its lowest three bits encode the header type; five types are defined but only four are used. The constants are defined in sds.h.
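As a sketch of how these fields are used: because the header sits immediately before the buf pointer that callers hold, the length can be read in O(1) by inspecting the flags byte at s[-1] and casting back to the matching header. The following is simplified from sdslen() in sds.h (only two of the four used cases are shown):

#define SDS_TYPE_MASK 7 /* the low 3 bits of flags select the header type */

static inline size_t sdslen_sketch(const sds s) {
    unsigned char flags = s[-1];  /* the flags byte directly precedes buf */
    switch (flags & SDS_TYPE_MASK) {
    case 1: /* SDS_TYPE_8 */
        return ((struct sdshdr8 *)(s - sizeof(struct sdshdr8)))->len;
    case 2: /* SDS_TYPE_16 */
        return ((struct sdshdr16 *)(s - sizeof(struct sdshdr16)))->len;
    /* ... sdshdr32 and sdshdr64 are handled the same way ... */
    }
    return 0;
}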

2. List

2.1 Underlying data structure

Redis exposes the list data type, whose underlying implementation relies on several internal data structures. Prior to Redis 3.2, lists were implemented with linkedlist and ziplist; from version 3.2 onward both were largely superseded by quicklist as the list's underlying implementation. Although quicklist replaced ziplist for lists, ziplist remains one of the underlying implementations of Hash and Zset.

2.2 From ziplist to the doubly linked list

Here, using Redis 2.8, we can observe that after inserting 110 short items into key k5 the list is encoded as a ziplist, and after inserting 10,000 items into k5 the encoding becomes linkedlist.

(Figure: OBJECT ENCODING output on Redis 2.8 showing key k5 encoded as ziplist with few items, then as linkedlist after 10,000 inserts.)

Prior to Redis 3.2, ziplist was the default underlying data structure for lists, converted to a linkedlist under certain conditions. Redis is designed this way because a doubly linked list occupies more memory than a compressed list, so a newly created list key uses a compressed list first and only switches to the doubly linked list implementation when necessary. A ziplist is converted to a linkedlist when either of the following conditions is met:

  • The length of any string element exceeds server.list_max_ziplist_value (default 64).
  • The ziplist contains more nodes than server.list_max_ziplist_entries (default 512).

These two conditions can be modified in redis.conf:

list-max-ziplist-value 64 
list-max-ziplist-entries 512 

Note: these list options exist only in configurations prior to Redis 3.2; they were removed from Redis 3.2 onward because the underlying implementation no longer uses ziplist directly, using quicklist as the default implementation instead.

2.3 Doubly linked list (linkedlist)

When a list holds more than 512 entries, or any single value is longer than 64 bytes, the underlying encoding is converted from ziplist to linkedlist. The linkedlist is a standard doubly linked list: each node contains prev and next pointers, allowing traversal in both directions, and head and tail pointers are also kept. Inserting at the head or tail of the list is therefore O(1), which is the key to the efficient implementation of commands such as LPUSH, RPOP, and RPOPLPUSH.
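For reference, this doubly linked list is defined in adlist.h in the Redis source; each node carries prev and next pointers, and the list tracks head, tail, and its length, which is exactly what makes head and tail operations O(1) (comments added here for explanation):

typedef struct listNode {
    struct listNode *prev;  /* previous node */
    struct listNode *next;  /* next node */
    void *value;            /* payload */
} listNode;

typedef struct list {
    listNode *head;         /* head pointer: O(1) LPUSH/LPOP */
    listNode *tail;         /* tail pointer: O(1) RPUSH/RPOP */
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;      /* kept up to date: O(1) LLEN */
} list;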

2.4 Compressed list (ziplist)

Although Redis 3.2 no longer uses ziplist directly to implement lists, the underlying implementation still uses it indirectly.

A ziplist is a compressed list designed by Redis to save memory. The official definition (from the comments in src/ziplist.c) is:

The ziplist is a specially encoded dually linked list that is designed to be very memory efficient. It stores both strings and integer values, where integers are encoded as actual integers instead of a series of characters. It allows push and pop operations on either side of the list in O(1) time. However, because every operation requires a reallocation of the memory used by the ziplist, the actual complexity is related to the amount of memory used by the ziplist.

In translation: a ziplist is a specially encoded doubly linked list designed for memory efficiency. It can store strings or integers, with integers encoded as actual binary integers rather than as character sequences, and it supports push and pop operations at either end of the list in O(1) time.

A ziplist stores all items in a contiguous address space. Each item occupies a different amount of space and uses variable-length encoding. Redis uses a ziplist to store data when the number of elements is small; once the number of elements exceeds a threshold, the ziplist is converted to a linkedlist for list keys and to a hashtable for hash keys. Because the memory is allocated contiguously, traversal is fast.

2.4.1 The compressed list data structure

A ziplist is a special doubly linked list that does not maintain prev and next pointers. Instead, each entry stores the length of the previous entry and the length of the current entry, and the position of the next element is computed from those lengths. In this way, storage space is saved at the expense of some read performance: a typical "time for space" trade-off.

A ziplist uses a contiguous block of memory in which the entries are stored back to back. The ziplist layout is as follows:

(Figure: ziplist layout: zlbytes | zltail | zllen | entry | entry | ... | entry | zlend.)

The meaning of each field:

  • zlbytes (32 bits): the total number of bytes occupied by the ziplist, including the four bytes of the zlbytes field itself.
  • zltail (32 bits): the byte offset of the last entry in the ziplist. zltail makes it easy to find the last entry without traversing the whole ziplist, so a push or pop at the tail can be performed quickly.
  • zllen (16 bits): the number of entries in the ziplist. Since zllen has only 16 bits, the largest value it can express is 2^16 - 1, yet a ziplist with more entries can still be represented. The convention is: if the count is at most 2^16 - 2, zllen stores the number of entries directly; otherwise zllen is set to 2^16 - 1 (all ones), in which case it no longer records the count, and the only way to know the total number of entries is to walk the ziplist from start to end and count them.
  • entry: a data item. Entries vary in length and have their own internal structure.
  • zlend: the last byte of the ziplist, a terminator with the fixed value 255.
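Because these header fields sit at fixed offsets at the front of the buffer, they are read with plain pointer casts. The macros below are simplified from ziplist.c (the real versions also apply a byte-order conversion, omitted here):

#define ZIPLIST_BYTES(zl)       (*((uint32_t *)(zl)))                          /* zlbytes */
#define ZIPLIST_TAIL_OFFSET(zl) (*((uint32_t *)((zl) + sizeof(uint32_t))))     /* zltail */
#define ZIPLIST_LENGTH(zl)      (*((uint16_t *)((zl) + sizeof(uint32_t) * 2))) /* zllen */
#define ZIPLIST_ENTRY_TAIL(zl)  ((zl) + ZIPLIST_TAIL_OFFSET(zl))               /* last entry */
#define ZIPLIST_ENTRY_END(zl)   ((zl) + ZIPLIST_BYTES(zl) - 1)                 /* the 255 zlend byte */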

2.4.2 The ziplist entry (zlentry) structure

Each storage node in a ziplist is a zlentry. The zlentry source is in ziplist.c (around line 268):

/* We use this function to receive information about a ziplist entry.
 * Note that this is not how the data is actually encoded, is just what we
 * get filled by a function in order to operate more easily. */
typedef struct zlentry {
    unsigned int prevrawlensize; /* number of bytes used to encode prevrawlen: 1 or 5 */
    unsigned int prevrawlen;     /* length of the previous entry */
    unsigned int lensize;        /* number of bytes used to encode len */
    unsigned int len;            /* length of the current entry */
    unsigned int headersize;     /* prevrawlensize + lensize: header size of this entry */
    unsigned char encoding;      /* encoding of the entry: ZIP_STR_* or ZIP_INT_* */
    unsigned char *p;            /* pointer to the start of the entry */
} zlentry;
  • prevrawlen: the number of bytes occupied by the previous entry. It is used to compute the address of the previous entry from the current one, enabling backward traversal. prevrawlen uses a variable-length encoding with two forms:
    • if the previous entry is shorter than 254 bytes, prevrawlen is stored in a single byte (uint8_t);
    • if the previous entry is 254 bytes or longer, the first byte is set to 254 and the actual length is stored in the following four bytes.
  • len/encoding: records the number of bytes occupied by the current content and its storage type, which is used to parse the content.
  • content: holds the value of the current entry. The crucial metadata is prevrawlen and len/encoding; content is simply the bytes that actually store the value.
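Here is a short sketch of how the prevrawlen field is decoded, following the two forms just described (simplified from the ZIP_DECODE_PREVLEN logic in ziplist.c):

#include <stdint.h>
#include <string.h>

#define ZIP_BIG_PREVLEN 254 /* marker byte for the 5-byte form */

static void decode_prevlen(const unsigned char *p,
                           unsigned int *prevrawlensize, uint32_t *prevrawlen) {
    if (p[0] < ZIP_BIG_PREVLEN) {
        *prevrawlensize = 1;  /* 1-byte form: the byte itself is the length */
        *prevrawlen = p[0];
    } else {
        *prevrawlensize = 5;  /* 5-byte form: 254 marker followed by a 4-byte length */
        memcpy(prevrawlen, p + 1, sizeof(uint32_t));
    }
}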

2.4.3 Why ziplist can compress data

Because a ziplist uses one contiguous chunk of memory to store data, it reduces memory fragmentation and the footprint of pointers. Each item sits in the contiguous address space and uses variable-length encoding sized to the space it actually needs, and when there are few nodes, the ziplist is more easily loaded into the CPU cache. This is why a ziplist can compress memory.

2.4.4 Why ziplist was abandoned

From the ziplist data structure described above, we know that each zlentry stores the number of bytes occupied by the previous entry, and that this field is itself variable-length. Such a structure can cause cascading (chain) updates. Suppose a compressed list holds consecutive entries entry1, entry2, entry3, ..., each 253 bytes long, so every prevrawlen fits in one byte. If a new entry of 254 bytes is inserted between entry1 and entry2, then entry2's prevrawlen must expand from 1 byte to 5 bytes; that expansion enlarges entry2 itself, which may in turn force entry3's prevrawlen to expand, and so on. The updates cascade until they reach the tail, or a node whose prevrawlen is already large enough to hold the new length of its predecessor. The same applies to deleting a node: a cascading update occurs whenever an operation changes the prevrawlen requirements of the entries that follow it, as modeled below.
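The cost can be illustrated with a toy model (an illustration only, not Redis source): assume every entry is initially smaller than 254 bytes, and then the entry at index pos grows to 254 bytes or more. The loop counts how many later entries must themselves be resized because their 1-byte prevrawlen field no longer fits:

/* lens[i] holds each entry's total size in bytes; lens[pos] has already
 * been updated to its new size (>= 254). Returns the number of cascaded
 * resizes. Assumes all entries were initially < 254 bytes. */
int count_cascades(unsigned int *lens, int n, int pos) {
    int cascades = 0;
    for (int i = pos + 1; i < n && lens[i - 1] >= 254; i++) {
        lens[i] += 4;       /* prevrawlen grows from 1 byte to 5 bytes */
        cascades++;
        if (lens[i] < 254)  /* the next entry's prevrawlen still fits in 1 byte */
            break;
    }
    return cascades;
}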

The chain-update problem makes ziplist's strengths and weaknesses stark: it is designed to save memory, but the structure handles modification poorly. Any change to the data can force a memory reallocation, possibly leading to a memory copy. Redis later compromised by replacing ziplist with quicklist.

2.5 Quicklist

Given the ziplist flaws described above, from Redis 3.2 onward the default underlying implementation of lists became quicklist, replacing ziplist and linkedlist. Let's look at what the quicklist data structure is, why it serves as the underlying implementation of Redis lists, and what advantages it has over ziplist. The following was produced on Redis 3.2, where we can see that OBJECT ENCODING reports quicklist as the list's default underlying implementation.

(Figure: OBJECT ENCODING output on Redis 3.2 showing a list key encoded as quicklist.)

2.5.1 Quicklist data structure

The overall quicklist structure, defined in redis/src/quicklist.h, is as follows:

typedef struct quicklist {
    quicklistNode *head;                  /* head node */
    quicklistNode *tail;                  /* tail node */
    unsigned long count;                  /* total number of items across all ziplists */
    unsigned long len;                    /* number of quicklistNode nodes */
    int fill : QL_FILL_BITS;              /* ziplist size per node, set by list-max-ziplist-size */
    unsigned int compress : QL_COMP_BITS; /* node compression depth, set by list-compress-depth */
    unsigned int bookmark_count : QL_BM_BITS;
    quicklistBookmark bookmarks[];
} quicklist;

In fact, even with quicklist replacing ziplist, some drawbacks remain, because each quicklist node still uses a ziplist underneath, and a ziplist occupies contiguous memory. If each ziplist is too small, the list fragments into many tiny allocations, producing lots of small memory fragments and lowering storage efficiency. If each ziplist is too large, allocating large contiguous chunks of memory becomes harder, which also lowers storage efficiency. How to balance the ziplist size depends on the usage scenario, so Redis provides the configuration parameter list-max-ziplist-size to tune it.

A positive value limits the ziplist on each quicklist node by number of entries; for example, when this parameter is set to 5, the ziplist of each quicklist node contains at most five items. A negative value limits the ziplist on each quicklist node by the number of bytes it occupies, and may only take the five values -1 through -5, with the following meanings:

  • -5: the ziplist of each quicklist node may not exceed 64 KB (1 KB = 1024 bytes).
  • -4: the ziplist of each quicklist node may not exceed 32 KB.
  • -3: the ziplist of each quicklist node may not exceed 16 KB.
  • -2: the ziplist of each quicklist node may not exceed 8 KB (the Redis default).
  • -1: the ziplist of each quicklist node may not exceed 4 KB.

When the data volume is very large, the most frequently accessed data is usually at the head and tail of the queue (reachable in O(1) time), while data in the middle is accessed much less often (and with poor performance: O(N)). For scenarios that fit this access pattern, Redis provides the list-compress-depth option to compress the intermediate nodes and save further memory. For node compression, quicklist uses LZF, a lossless compression algorithm.

This parameter specifies the number of nodes at each end of the quicklist that are left uncompressed; a combined configuration example follows the list. Its values mean:

  • 0: a special value meaning no nodes are compressed. This is the Redis default.
  • 1: one node at each end of the quicklist is left uncompressed; the middle nodes are compressed.
  • 2: two nodes at each end are left uncompressed; the middle nodes are compressed.
  • 3: three nodes at each end are left uncompressed; the middle nodes are compressed.
  • And so on.
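Putting the two knobs together, an illustrative redis.conf fragment might look like the following (the values are examples, not recommendations):

# cap the ziplist of each quicklist node at 8 KB
list-max-ziplist-size -2
# leave one node uncompressed at each end of the list
list-compress-depth 1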

2.5.2 quicklistNode structure

A quicklist is a doubly linked list of quicklistNode nodes.

typedef struct quicklistNode {
    struct quicklistNode *prev;
    struct quicklistNode *next;
    unsigned char *zl;
    unsigned int sz;             /* ziplist size in bytes */
    unsigned int count : 16;     /* count of items in ziplist */
    unsigned int encoding : 2;   /* RAW==1 or LZF==2 */
    unsigned int container : 2;  /* NONE==1 or ZIPLIST==2 */
    unsigned int recompress : 1; /* was this node previously compressed? */
    unsigned int attempted_compress : 1; /* node can't compress; too small */
    unsigned int extra : 10;     /* more bits to steal for future usage */
} quicklistNode;

typedef struct quicklistLZF {
    unsigned int sz;             /* LZF size in bytes */
    char compressed[];
} quicklistLZF;

Each node in a quicklist is a quicklistNode, whose fields have the following meanings:

  • prev: pointer to the previous node in the list.
  • next: pointer to the next node in the list.
  • zl: data pointer. If the node's data is not compressed, it points to a ziplist; otherwise it points to a quicklistLZF structure.
  • sz: the total size in bytes of the ziplist pointed to by zl. Note that if the ziplist is compressed, sz is still its uncompressed size.
  • count: the number of data items in the ziplist.
  • encoding: whether the ziplist is compressed, and with which algorithm. Currently there are only two values: 2 means compressed with LZF, and 1 means uncompressed.
  • container: a fixed value of 2 in the current implementation, indicating that a ziplist is used as the data container.
  • recompress: when a command such as LINDEX needs to temporarily decompress a node's data, recompress is set to 1 so the node can be compressed again afterwards.
  • attempted_compress: per the source comment, marks a node that could not be compressed because it is too small.
  • extra: reserved bits for future use.

The quicklist structure combines the advantages of ziplist and linkedlist: it balances time and space costs and optimizes performance to a large extent. Because operations at the head and tail of the queue remain O(1), Redis lists can also be used as queues.

3. Dictionary (dict)

(Figure: OBJECT ENCODING output showing a hash key encoded as ziplist, then as hashtable after it grows.)

As the figure above shows, the default underlying data structure for hash keys is a ziplist; as the hash key grows, the structure becomes a hashtable. Although the object encoding we see here is reported as hashtable, Redis actually implements the hash key's underlying data structure with a dict, and the dict in turn is implemented with hash tables. So, to clients, Redis exposes the hash type, and its underlying data structure is implemented in two ways: ziplist and dict. We already covered ziplist in the list section, so we will not repeat it here; instead we focus on the dict implementation.

Again, when does a hash switch from ziplist to hashtable? redis.conf provides two parameters:

hash-max-ziplist-entries 512
hash-max-ziplist-value 64
  • When the number of field-value pairs in the hash key exceeds 512, it is converted to a dict.
  • When any value stored in the hash key is longer than 64 bytes, it is converted to a dict.

3.1 Implementation of dict

The dictionary is an important Redis data structure; the Redis database itself can be regarded as one large dictionary. Redis's very high query efficiency is closely tied to the data structures its types use underneath. A dictionary implementation typically uses a hash table as the underlying store, and the Redis dictionary is likewise based on hash tables, with O(1) average-case lookups.

The Redis source defines the dict structures as follows (from dict.h):

typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

The following figure shows dict’s data structure more clearly.

(Figure: dict structure: a dict pointing to two dictht hash tables, each bucket chaining dictEntry nodes.)

From the figure above and the source code, we can clearly see that a dict consists of the following items:

  1. type: a pointer to a dictType structure whose function pointers let the dict store any type of key and value in a customized way.
  2. privdata: a private pointer passed in by the caller when the dict is created.
  3. ht[2]: an array of two hash tables. Both ht[0] and ht[1] are in use only during rehashing; normally only ht[0] is valid and ht[1] holds no data.
  4. rehashidx: the current rehash index. If rehashidx == -1, no rehash is in progress; otherwise a rehash is underway and the value records its progress.
  5. iterators: the number of iterators currently running.

The most important structure here is dictht, which defines a hash table with the following fields:

  1. table: an array of dictEntry pointers. A key's hash ultimately maps to one slot of this array (a bucket). If several keys map to the same slot, a conflict occurs and a linked list of dictEntry nodes is chained from that bucket. If you are familiar with Java, this may remind you of HashMap; the implementation is indeed quite similar.
  2. size: the length of the dictEntry pointer array. It is always a power of 2.
  3. sizemask: the mask used to map hash values to positions in table. Its value is size - 1, and hash & sizemask gives the array index, the same trick HashMap uses to compute an index.
  4. used: the number of entries stored in the dict. The ratio used/size is the load factor; the larger it is, the higher the probability of hash collisions.
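To tie these fields together, here is a simplified lookup sketch (an illustration of the idea, not the actual dictFind() from dict.c): hash the key, mask the hash into a bucket with sizemask, then walk that bucket's chain; during a rehash both tables must be consulted:

dictEntry *dict_lookup_sketch(dict *d, const void *key) {
    uint64_t h = d->type->hashFunction(key);
    for (int table = 0; table <= 1; table++) {
        unsigned long idx = h & d->ht[table].sizemask; /* bucket index */
        dictEntry *he = d->ht[table].table[idx];
        while (he) { /* walk the collision chain */
            if (key == he->key || d->type->keyCompare(d->privdata, key, he->key))
                return he;
            he = he->next;
        }
        if (d->rehashidx == -1) break; /* not rehashing: ht[1] is empty */
    }
    return NULL; /* not found */
}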

3.2 How dict rehashes in Redis

Overall, this is similar to the HashMap implementation in Java: the handling of hash collisions and the power-of-two sizing of the array are the same as in Java's HashMap. One difference lies in the expansion mechanism. Like Java's HashMap, the Redis dictionary must resize its array appropriately as the amount of data grows in order to keep queries efficient; this resizing is the rehash, commonly called expansion. Setting Java's HashMap aside, let's focus on how the dictionary in Redis expands.

So when does a rehash happen? Either of the following conditions triggers it:

  1. The server is not currently running a BGSAVE or BGREWRITEAOF command, and the hash table's load factor is greater than or equal to 1.
  2. The server is currently running a BGSAVE or BGREWRITEAOF command, and the hash table's load factor is greater than or equal to 5.

How is the rehash performed? From the source code and data structures above, we can see that the dictionary defines an array of two hash tables, and as noted earlier, before any expansion all the data is stored in the first hash table; the second hash table is used only during expansion. When a rehash is needed, the hash table ht[1] is sized to the target capacity, and all the data in ht[0] is rehashed into ht[1]. In addition, Redis adopts progressive rehashing to ensure that a rehash does not consume too much server performance when the data volume is large. When there is little data, rehashing everything into the expanded hash table in one step has negligible impact on the Redis service. But when a Redis hash holds a very large number of keys, hundreds of thousands or even millions of entries, a one-shot rehash would severely impact Redis, possibly even stopping the service for a while, which is unacceptable.

Therefore, when the Redis service needs a rehash, it does not rehash all the data in ht[0] into ht[1] at once; instead it rehashes the data in batches into ht[1]. This is a divide-and-conquer approach that avoids the concentrated computation of a one-shot rehash even when the data set is large. During the rehash, additions, deletions, updates, and queries operate on both hash tables, because while rehashing both tables contain data: if a key is not found in one hash table, the other is searched as well. Data added during a rehash is not inserted into the first hash table but saved directly into the second, which guarantees that the first table only shrinks until it is empty. A sketch of a single rehash step follows below.
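Below is a sketch of one progressive rehash step (an illustration only; the real logic, including bounds checks and a limit on empty buckets visited, lives in dictRehash() in dict.c). It moves one bucket's chain from ht[0] to ht[1] and advances rehashidx:

void rehash_one_step_sketch(dict *d) {
    if (d->rehashidx == -1) return; /* no rehash in progress */
    while ((unsigned long)d->rehashidx < d->ht[0].size &&
           d->ht[0].table[d->rehashidx] == NULL)
        d->rehashidx++; /* skip empty buckets */
    if ((unsigned long)d->rehashidx >= d->ht[0].size) return;

    dictEntry *he = d->ht[0].table[d->rehashidx];
    while (he) { /* move the whole chain */
        dictEntry *next = he->next;
        unsigned long idx = d->type->hashFunction(he->key) & d->ht[1].sizemask;
        he->next = d->ht[1].table[idx]; /* push onto the new table's bucket */
        d->ht[1].table[idx] = he;
        d->ht[0].used--;
        d->ht[1].used++;
        he = next;
    }
    d->ht[0].table[d->rehashidx++] = NULL;
    if (d->ht[0].used == 0) /* migration finished; in Redis, ht[1] then replaces ht[0] */
        d->rehashidx = -1;
}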

If you are familiar with Java, you may be reminded of the HashMap resizing algorithm; indeed, the capacity design and internal structure have much in common. If you are interested, you can explore them, or refer to my article "HashMap Operation in Java 1.8". Compared with the dictionary rehash in Redis, I personally prefer the elegant rehash in Java's HashMap, whose ideas are well worth learning from.