Food truck

This article analyzes the Zset structure of Redis from the perspective of code, and hopes to master the following contents through this article:

  1. In Redis, zset is not a single structure, but a hop table and a hash table

  2. The realization principle of the hop table, hop table dimension depends on random

  3. Skip table lookup, insert, delete three tips

  4. Usage scenarios (simple delay queue, leaderboard, simple traffic limiting)

If you still don’t know, read on.

Case scenario

Suppose we have a class of all students Chinese scores, to statistics, query range, query single student scores, meet the needs of high-performance reading, Redis zSET structure is undoubtedly the best choice. Redis provides a rich API. Example:

ZADD yuwen 90 s01 89 s03 99 s02 74 s04 97 s05

Yuwen is used as key to store the scores of 6 students from S01 to S06. We can query the scores of any student

ZSCORE yuwen s03

All elements in a specified range can be returned in order

ZRANGE yuwen 1 2 withscores

Can access all elements within the specified score range

ZRANGEBYSCORE yuwen 90 100 withscores

The number of items in a specified range can be counted

ZCOUNT yuwen 80 90

implementation

Zset structure, both support by single element query, and support range query, how to achieve? T_zset.c Redis t_zset.c Redis t_zset.c

/* ZSETs are ordered sets using two data structures to hold the same elements
 * in order to get O(log(N)) INSERT and REMOVE operations into a sorted
 * data structure.
 *
 * The elements are added to a hash table mapping Redis objects to scores.
 * At the same time the elements are added to a skip list mapping scores
 * to Redis objects (so objects are sorted by scores in this "view").Copy the code

There are two data structures in Redis to support zset. One is a Hash table and the other is a Skip list. Let’s look at the zset definition in the code:

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;Copy the code

Dict is a hash table, implemented in a variety of programming languages. I can guarantee the time complexity of O(1) without too much explanation. Let’s move on to the definition of Zskiplist:

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;Copy the code

Zskiplist is a Redis variant of Skiplist.

Jump table

Skip Lists by William Pugh in his 1990 paper Skip Lists: A probabilistic alternative to balanced trees was proposed for the first time, and the search time complexity was O(logN) on average and O(N) at worst. In most cases, the efficiency was comparable to that of balanced trees, but the realization was much simpler than that of balanced trees. Skip table is a typical space – for – time data structure.

The jumper has the following characteristics:

  • It consists of many layers.

  • Each layer is an ordered linked list.

  • The lowest linked list (Level 1) contains all elements.

  • If an element appears in a Level I list, it also appears in all lists below Level I.

  • Each node contains two Pointers, one to the next element in the same list and one to the element at the next level down.

The lookup of the hop list starts with the head element of the top-level list, then traverses the list until it finds a node whose element is greater than or equal to the target element, and returns it directly if the current element is exactly equal to the target element. If the current element is smaller than the target element, the search continues vertically down to the next level. If the current element is larger than the target or reaches the end of the list, it moves to the previous node and then vertically down to the next level. Skiplist’s search process is called a Skiplist because it jumps from one layer to the next.

As an example, let’s say the link contains 1 to 10 elements. So if we want to find the ninth one, we have to go through the header nine times to find it




Only one element can be compared at a time, and the worst case is O(n). It would be nice if we could compare two elements at a time:




Two at a time, we found them in five searches. So we have something like the following structure, which adds a layer to the list and reduces the number of elements:




If you add two layers of “linked list”, you can find it in only 3 lookups:



Even if we find element 8, we only need to traverse 1->4->7->8, a total of 4 queries.

A skip table is such a data structure that the nodes skip part of it to speed up the query. As we mentioned in HashMap earlier, in Java 8 when the number of hash conflicts is greater than 7, it is converted to a red-black tree. Why Redis uses hoppers instead of red-black trees? The code is simple compared to the red black tree. If we want to query a range of values, using a balanced tree can be cumbersome. When deleting an interval from a balanced tree, it can be quite difficult because of the balance of the tree.

The whole query process can be simplified as if (next is greater than the result) next else next level



Enhanced skip table

Some changes to Skiplist in Redis:

  • Added backward-driven Pointers (* BACKWARD)

  • Both value and score are recorded, and score can be repeated

  • The first layer maintains bidirectional linked lists

The whole class diagram of zset structure is as follows:



Zskiplist contains the zskiplistNode node definition:

typedef struct zskiplistNode { sds ele; double score; struct zskiplistNode *backward; Struct zskiplistNode {struct zskiplistNode *forward; // Point to the next node unsigned long span; } level[]; } zskiplistNode;Copy the code

ZskiplistNode defines an array of ZSkiplistlevels to hold Pointers to the node at each level. The query is similar to our simulated example and is not described in detail. Focus on the insert operation:

/* Insert a new node in the skiplist. Assumes the element does not already
 * exist (up to the caller to enforce that). The skiplist takes ownership
 * of the passed SDS string 'ele'. */ zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) { zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x; unsigned int rank[ZSKIPLIST_MAXLEVEL]; int i, level; serverAssert(! isnan(score)); x = zsl->header;for(i = zsl->level-1; i >= 0; I --) {/* store rank that is crossed to reach the insert position */ / Rank [I] = I == (ZSL ->level-1) 0 : rank[i+1];while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated
     * scores, reinserting the same element should never happen since the
     * caller of zslInsert() should test in the hash table ifThe element is already inside or not. */ / level = zslRandomLevel(); // Record the position if you need to ascendif (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }
        zsl->level = level;
    }
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}Copy the code

When inserting, the query is performed first, and then the element to be inserted is inserted starting at the bottom level. And then let’s see if we want to do it layer by layer, from the bottom up. But should I insert the next layer or not? If we want to be efficient with each layer of jumps, the balance is as good as possible (layer 1, layer 2, layer 3, layer 4, and layer 4). But the algorithm is really very complex, and strictly in accordance with the exponential power of 2, we have to adjust the original structure. So the idea is to flip a coin, leave it to chance, and generate a random number. Redis has a 25% probability and then goes up. In this case, the probability that each element has an X layer is 0.25 to the X minus 1. Level is defined in Redis at the time of initialization at 32 levels. So, the probability of how many elements are in the 32nd layer you can calculate.

The whole insertion process can be simplified as: insert the lowest if (random probability) first and extend the next layer

Random function:

/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}Copy the code



The zskiplist deletion operation can be divided into three steps:

  • Find the node location according to member(obj) and score (variable X in the code is the node, update the last node of each layer x)

  • Move zslDeleteNode to logically remove the X node from skiplist

  • Release the memory of node X

/* Delete an element with matching score/element from the skiplist.
 * The function returns 1 if the node was found and deleted, otherwise
 * 0 is returned.
 *
 * If 'node' is NULL the deleted node is freed by zslFreeNode(), otherwise
 * it is not freed (but just unlinked) and *node is set to the node pointer,
 * so that it is possible for the caller to reuse the node (including the
 * referenced SDS string at node->ele). */
int zslDelete(zskiplist *zsl, double score, sds ele, zskiplistNode **node) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                     sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* We may have multiple elements with the same score, what we need
     * is to find the element with both the right score and object. */
    x = x->level[0].forward;
    if (x && score == x->score && sdscmp(x->ele,ele) == 0) {
        zslDeleteNode(zsl, x, update);
        if(! node) zslFreeNode(x);else
            *node = x;
        return 1;
    }
    return 0; /* not found */
}Copy the code

The whole deletion process can be simplified as: first find, disassociate, delete memory

Hide this logical branch in the creation of zset (zaddGenericCommand method) :

if (server.zset_max_ziplist_entries == 0 ||
    server.zset_max_ziplist_value < sdslen(c->argv[scoreidx+1]->ptr))
{
    zobj = createZsetObject();
} else {
    zobj = createZsetZiplistObject();
}Copy the code

When Redis is initialized, it only determines whether the stored element length is greater than 64 bytes (server.zset_max_ziplist_entries default 128). Larger than 64 bytes select Zkiplist, otherwise Ziplist. When implementing the add, delete, change and review method, choose different implementations depending on whether ziplist or Zkiplist is used. Zkiplist is not covered in detail in this article. Just remember that Zset uses Ziplist in two situations:

  1. The number of stored elements is less than 128

  2. The size of a single element exceeds 64 bytes

The method of implementing add, delete, change, and check varies depending on whether ziplist or Zkiplist is implemented. Zkiplist will not be covered in detail in this article.

conclusion

So far, we have introduced the most complex part of zset in Redis. Considering the code and understanding, think about what data structures are behind these four commands.

ZSCORE YUwen S03 is based on the complexity of hash table O(1)

ZRANGE yuwen 1 2 withscores based on skiplist and span lookup

ZRANGEBYSCORE yuwen 90 100 withscores based on skiplist and score lookup

ZREVRANGE yuwen 1 2 withscores is based on skiplist and score and * backward lookup

Usage scenarios

1. Delay queue

Zset sorts by score if score represents the timestamp of the desired execution time. Insert it into the zset collection at some point, and it will be sorted by timestamp size, that is, before and after the execution time.

If the current timestamp is greater than or equal to the socre of the key value, it will be taken out for consumption deletion, which can achieve the purpose of delayed execution.

2. List

If you spend a lot of time in the tech community, you’re probably familiar with lists like “Most Popular in an hour.” How do you do that? If recorded in a database, real-time statistics are not easily distinguishable. We take the timestamp of the current hour as the key of Zset, the post ID as member, the number of clicks and comments as score, etc., and update score when the score changes. Use ZREVRANGE or ZRANGE to find the corresponding number of records.

3. The current limit

Sliding Windows are a common strategy for traffic limiting. If we define a zset with a user ID as the key, member or score will be the timestamp of the access. We only need to count the number of keys in the specified timestamp interval to obtain the access frequency in the user sliding window, and compare it with the maximum pass times to decide whether to allow the pass.

Sample code for the above three scenarios is presented in the next article. You’re also welcome to think about other applications.


Pay attention to my

If you read on wechat, please click the link to follow me. If you read on PC, please scan the code to follow me. Welcome to communicate with me and point out mistakes at any time