
In Redis, the hash type means that the value of a key is itself a collection of field-value pairs, e.g. value = {{field1, value1}, … {fieldN, valueN}}. Each hash in Redis can store up to 2^32 − 1 field-value pairs.

Hash object encoding

There are two internal encodings of hash objects: Ziplist and Hashtable.

  • Ziplist (compressed list): When the number of elements in a hash object is smaller than the hash-max-ziplist-entries configuration (512 by default) and the length of every value is smaller than the hash-max-ziplist-value configuration (64 bytes by default), Redis uses the ziplist encoding internally to store the hash object. Each time a new key-value pair is added to the hash object, the program first pushes a node holding the key onto the end of the compressed list, and then pushes a node holding the value onto the end of the compressed list. As a result:

    • Two nodes that store the same key-value pair are always next to each other, with the key-saving node coming first and the value-saving node coming second.
    • Key-value pairs added to the hash object first are placed at the head of the compressed list, while key-value pairs added to the hash object later are placed at the tail of the compressed list.
  • Hashtable (dictionary): When a hash no longer meets the ziplist criteria, Redis uses hashtable as the internal hash implementation instead, because at that size the ziplist's read and write efficiency degrades, whereas the hashtable's read and write time complexity remains O(1). The encoding in use can be checked with the OBJECT ENCODING command, as shown below.
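A minimal illustration of the switch (the key name and values are made up for the example; note that on Redis 7.0+ the compact encoding is reported as listpack, which replaced ziplist):

redis> HSET profile name "Tom"
(integer) 1
redis> OBJECT ENCODING profile
"ziplist"
redis> HSET profile bio "<any value longer than 64 bytes>"
(integer) 1
redis> OBJECT ENCODING profile
"hashtable"

The conversion is one-way: once a hash has been converted to hashtable, it is not converted back to ziplist even if its fields later shrink.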

A dictionary (dict)

A Redis dictionary uses hash tables as its underlying implementation. A hash table can contain multiple hash table nodes, and each node stores one of the dictionary's key-value pairs.

Code implementation

typedef struct dict {
    // Type specific functions
    dictType *type;
    // Private data
    void *privdata;
    // Two hash tables. In general, the dictionary only uses ht[0]; ht[1] is used only when rehashing ht[0].
    dictht ht[2];
    // Rehash index. When rehash is not in progress, the value is -1
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    // Number of iterators currently running on this dictionary
    unsigned long iterators; /* number of iterators currently running */
} dict;

typedef struct dictType {
    // Function for computing hash values
    uint64_t (*hashFunction)(const void *key);
    // Function for duplicating keys
    void *(*keyDup)(void *privdata, const void *key);
    // Function for duplicating values
    void *(*valDup)(void *privdata, const void *obj);
    // Function for comparing keys
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    // Function for destroying keys
    void (*keyDestructor)(void *privdata, void *key);
    // Function for destroying values
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of these,
 * as we implement incremental rehashing, from the old to the new table. */
typedef struct dictht {
    // Hash table array
    dictEntry **table;
    // Hash table size, default is 4
    unsigned long size;
    // Hash table size mask, used to calculate index value, always size-1
    unsigned long sizemask;
    // Number of existing nodes in the hash table
    unsigned long used;
} dictht;

typedef struct dictEntry {
    // key
    void *key;
    // value
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    // Point to the next hash table node to form a linked list
    struct dictEntry *next;
} dictEntry;
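To make dictType concrete, here is a minimal, hypothetical sketch of wiring a dict up for C-string keys and values. The names strHash, strCmp, and stringDictType are invented for this example; dictCreate, dictAdd, and dictGenHashFunction are part of the dict API inside Redis:

#include <string.h>

// Hypothetical hash callback: hash the C string with Redis's generic helper
static uint64_t strHash(const void *key) {
    return dictGenHashFunction(key, strlen(key));
}

// Hypothetical compare callback: a non-zero return means "keys are equal"
static int strCmp(void *privdata, const void *key1, const void *key2) {
    (void)privdata; // private data is unused here
    return strcmp(key1, key2) == 0;
}

// Field order matches the dictType definition above:
// hash, keyDup, valDup, keyCompare, keyDestructor, valDestructor
static dictType stringDictType = { strHash, NULL, NULL, strCmp, NULL, NULL };

void example(void) {
    dict *d = dictCreate(&stringDictType, NULL); // both tables start empty
    dictAdd(d, (void *)"name", (void *)"Tom");   // key is hashed, then chained into a bucket
}

With no keyDup/valDup callbacks the dict stores the raw pointers as-is, which is fine for string literals but means the caller owns the memory of anything it adds.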

The hash algorithm

Redis calculates hash and index values as follows:

// Use the dictionary's hash function to compute the hash value of key
hash = dict->type->hashFunction(key);
// Use the hash table's sizemask attribute and the hash value to compute the index value
// ht[x] can be either ht[0] or ht[1], depending on the situation
index = hash & dict->ht[x].sizemask;

When a dictionary is used as the underlying implementation of a database or of a hash key, Redis uses the MurmurHash2 algorithm to compute the hash values of keys.

The MurmurHash algorithm was originally invented by Austin Appleby in 2008. Its advantage is that even when the input keys are regular, the algorithm still gives a good random distribution, and it is also very fast.

The most recent version of the MurmurHash algorithm is MurmurHash3, while Redis uses MurmurHash2. For more information on the MurmurHash algorithm, see its homepage: code.google.com/p/smhasher/.
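The core of MurmurHash2 is short enough to sketch in full. The version below follows Austin Appleby's public-domain reference implementation (Redis's generic hash function in this era was essentially this, seeded from a global); the memcpy read is a small tweak to avoid unaligned access, so treat it as an illustration rather than a byte-for-byte copy of Redis's source:

#include <stdint.h>
#include <string.h>

uint32_t murmurhash2(const void *key, int len, uint32_t seed) {
    const uint32_t m = 0x5bd1e995; // mixing constant
    const int r = 24;              // shift constant
    uint32_t h = seed ^ len;       // initialize from the seed and the length
    const unsigned char *data = (const unsigned char *)key;

    // Mix 4 bytes at a time into the hash
    while (len >= 4) {
        uint32_t k;
        memcpy(&k, data, sizeof(k));
        k *= m; k ^= k >> r; k *= m;
        h *= m; h ^= k;
        data += 4; len -= 4;
    }

    // Handle the last few bytes of the input
    switch (len) {
    case 3: h ^= data[2] << 16; /* fall through */
    case 2: h ^= data[1] << 8;  /* fall through */
    case 1: h ^= data[0]; h *= m;
    }

    // Final avalanche: make the last few bytes propagate through the hash
    h ^= h >> 13; h *= m; h ^= h >> 15;
    return h;
}

Even for sequential keys such as key:1, key:2, key:3, the multiply-and-shift mixing spreads the results uniformly across buckets, which is exactly the property the dictionary needs to keep its chains short.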

Resolving key conflicts

When two or more keys are assigned to the same index of a hash table array, we say the keys are in conflict.

  • Four ways to resolve hash conflicts
    • Open addressing
      • Linear probing: add one to the conflicting index repeatedly until there is no conflict.
      • Quadratic probing: add 1^2 to the conflicting index; if it still conflicts, subtract 1^2; then try +2^2, -2^2, +3^2, and so on until there is no conflict.
      • Pseudo-random probing: add a random number to the conflicting index until there is no conflict.
    • Separate chaining: conflicting values are stored in a linked list hanging off the bucket.
      • The handling is simple, there is no clustering, and the average search length is short.
      • List nodes can be allocated on demand, which suits cases where the number of entries cannot be determined in advance.
      • Compared with open addressing, chaining saves more space.
    • Public overflow area: set up a common overflow area to store all conflicting values.
    • Rehashing: hash the conflicting key with another hash function, repeating until no conflict occurs.

Redis uses chaining: each hash table node has a next pointer, and multiple nodes assigned to the same index can be connected into a singly linked list through this pointer, which solves the key conflict problem.

Because the dictEntry list has no pointer to the tail of the list, the program always adds new nodes to the head of the list for speed (insertion is O(1)).

int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d,key,NULL);

    if (!entry) return DICT_ERR;
    dictSetVal(d, entry, val);
    return DICT_OK;
}

dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    dictht *ht;

    if (dictIsRehashing(d)) _dictRehashStep(d);

    /* Get the index of the new element, or -1 if
     * the element already exists. */
    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)
        return NULL;

    /* Allocate the memory and store the new entry.
     * Insert the element at the top, with the assumption that in a database
     * system it is more likely that recently added entries are accessed
     * more frequently. */
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    entry = zmalloc(sizeof(*entry));
    entry->next = ht->table[index];
    ht->table[index] = entry;
    ht->used++;

    /* Set the hash entry fields. */
    dictSetKey(d, entry, key);
    return entry;
}

rehash

As operations are performed over time, the number of key-value pairs held in a hash table gradually grows or shrinks. To keep the hash table's load factor within a reasonable range, the program needs to expand or contract the hash table accordingly when it holds too many or too few key-value pairs.

The work of expanding and contracting hash tables can be done by performing a rehash operation. Redis rehashes the dictionary’s hash table as follows:

  1. Allocate space for the dictionary's ht[1] hash table. Its size depends on the operation to be performed and on the number of key-value pairs ht[0] currently contains (i.e. the value of ht[0].used):
    • If an expansion is being performed, the size of ht[1] is the first 2^n greater than or equal to ht[0].used * 2 (see the sketch after this list);
    • If a contraction is being performed, the size of ht[1] is the first 2^n greater than or equal to ht[0].used.
  2. Rehash all key-value pairs saved in ht[0] onto ht[1]: "rehash" means recomputing each key's hash value and index value, and then placing the key-value pair into the corresponding position in ht[1].
  3. Once all the key-value pairs contained in ht[0] have been migrated to ht[1] (ht[0] has become empty), release ht[0], make ht[1] the new ht[0], and create a new blank hash table at ht[1] in preparation for the next rehash.
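The "first 2^n greater than or equal to the target" rule translates directly into code. The sketch below is modeled on the helper Redis uses for this (_dictNextPower in dict.c), starting from the default initial table size of 4:

// Return the first power of two >= size, starting from the initial size 4
static unsigned long nextPower(unsigned long size) {
    unsigned long i = 4; // DICT_HT_INITIAL_SIZE
    while (i < size)
        i *= 2;
    return i;
}

// nextPower(4 * 2) == 8, nextPower(30) == 32, nextPower(100) == 128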

For example, suppose the program wants to expand the dictionary's ht[0]. It would perform the following steps:

  1. ht[0].used is currently 4, and 4 * 2 = 8; since 8 (2^3) is the first power of 2 greater than or equal to 8, the program sets the size of ht[1] to 8.
  2. Rehash all four key-value pairs contained in ht[0] into ht[1].
  3. Release ht[0], make ht[1] the new ht[0], and allocate a blank hash table for ht[1].

At this point, the expansion of the hash table is complete, and the program has successfully changed the size of the hash table from 4 to 8.

  • Expansion conditions for a hash table

    • The server is not currently running the BGSAVE or BGREWRITEAOF command, and the hash table's load factor is greater than or equal to 1; or
    • The server is currently running the BGSAVE or BGREWRITEAOF command, and the hash table's load factor is greater than or equal to 5.

    The load factor of a hash table is calculated by the formula: load_factor = ht[0].used / ht[0].size

    The load factor required to trigger an expansion differs depending on whether BGSAVE or BGREWRITEAOF is executing. While either command runs, Redis needs to create a child of the current server process, and most operating systems use copy-on-write to optimize the use of child processes. The server therefore raises the load factor required for an expansion while a child process exists, avoiding hash table expansions during that period as much as possible; this avoids unnecessary memory writes and maximizes memory savings. (A sketch of this check appears after this list.)

  • Contraction condition for a hash table

    • When the load factor of the hash table falls below 0.1, the program automatically begins to shrink the hash table.
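A condensed sketch of how the expansion decision looks in code, modeled on the _dictExpandIfNeeded helper in Redis's dict.c (there, dict_can_resize is cleared while a BGSAVE/BGREWRITEAOF child exists, and dict_force_resize_ratio is 5):

static int dict_can_resize = 1;               // cleared while a child process exists
static const int dict_force_resize_ratio = 5;

static int dictExpandIfNeeded(dict *d) {
    if (dictIsRehashing(d)) return DICT_OK;   // already rehashing: nothing to do

    if (d->ht[0].size == 0)                   // empty table: create it at the initial size
        return dictExpand(d, 4);

    // Expand when load factor >= 1 (resizing allowed) or >= 5 (forced)
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used / d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used * 2);
    }
    return DICT_OK;
}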

Progressive rehash

Expanding or contracting a hash table requires rehashing all the key-value pairs in ht[0] into ht[1]. This rehash action, however, is not done in a single, centralized pass, but incrementally, over multiple steps.

The reason is this: if ht[0] holds only four key-value pairs, the server can rehash all of them into ht[1] in an instant; but if the hash table holds not four but four million, forty million, or even four hundred million key-value pairs, rehashing them all into ht[1] at once could cause the server to stop serving requests for a period of time.

Therefore, to avoid the impact of rehashing on server performance, the server does not rehash all the key-value pairs in ht[0] to ht[1] in one go. Instead, it migrates them from ht[0] to ht[1] slowly, across many operations.

Here are the detailed steps for progressive rehash of a hash table:

  1. Allocate space for ht[1], so that the dictionary holds both the ht[0] and ht[1] hash tables.
  2. Maintain an index counter variable rehashidx in the dictionary and set its value to 0, indicating that rehash work has officially begun.
  3. During the rehash, each time the dictionary is added to, deleted from, looked up, or updated, the program, in addition to performing the requested operation, also rehashes into ht[1] all the key-value pairs in the bucket at index rehashidx of ht[0] (sketched below). When that bucket has been rehashed, the program increments rehashidx by one.
  4. As dictionary operations continue, at some point all the key-value pairs of ht[0] will have been rehashed to ht[1]; the program then sets rehashidx back to -1, indicating that the rehash operation is complete.
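Steps 3 and 4 correspond to the per-bucket migration Redis performs. The sketch below is simplified from the dictRehash function in dict.c, where n is the number of buckets to migrate per call (the per-operation helper _dictRehashStep simply calls it with n = 1; zfree and _dictReset are the internal helpers that free the old table and blank out a dictht):

int dictRehash(dict *d, int n) {
    if (!dictIsRehashing(d)) return 0;

    while (n--) {
        dictEntry *de, *nextde;

        // If ht[0] is empty, the rehash is finished: swap the tables (step 4)
        if (d->ht[0].used == 0) {
            zfree(d->ht[0].table);
            d->ht[0] = d->ht[1];
            _dictReset(&d->ht[1]);
            d->rehashidx = -1;
            return 0;
        }

        // Skip empty buckets until we reach the next one to migrate
        while (d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;
        de = d->ht[0].table[d->rehashidx];

        // Move every entry in this bucket from ht[0] to ht[1] (step 3)
        while (de) {
            uint64_t h;
            nextde = de->next;
            // Recompute the index under ht[1]'s sizemask
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }
    return 1; // more buckets remain in ht[0]
}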

The benefit of progressive rehash is that it takes a divide-and-conquer approach, spreading the computation required to rehash the key-value pairs across every add, delete, find, and update operation on the dictionary, thereby avoiding the massive one-off computation of a centralized rehash.

Hash table operations during progressive rehash execution

Because the dictionary uses both the ht[0] and ht[1] hash tables during a progressive rehash, operations such as delete, find, and update are carried out on both tables during this period. For example, to look up a key, the program searches ht[0] first and, if it finds nothing there, continues the search in ht[1], and so on.

In addition, during a progressive rehash, new key-value pairs added to the dictionary are always stored in ht[1]; ht[0] accepts no more additions. This guarantees that the number of key-value pairs in ht[0] only ever decreases, and that ht[0] eventually becomes empty as the rehash proceeds.
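Both rules are visible in the lookup path. The sketch below stays close to the dictFind function in dict.c: it probes ht[0] first, falls through to ht[1] only while a rehash is in progress, and piggybacks one rehash step on the way (just as dictAddRaw above inserts into ht[1] whenever dictIsRehashing(d) is true):

dictEntry *dictFind(dict *d, const void *key) {
    dictEntry *he;
    uint64_t h, idx, table;

    if (d->ht[0].used + d->ht[1].used == 0) return NULL; // empty dictionary

    // Piggyback one incremental rehash step on this lookup
    if (dictIsRehashing(d)) _dictRehashStep(d);

    h = dictHashKey(d, key);
    for (table = 0; table <= 1; table++) {
        idx = h & d->ht[table].sizemask;
        he = d->ht[table].table[idx];
        // Walk the bucket's chain looking for the key
        while (he) {
            if (key == he->key || dictCompareKeys(d, key, he->key))
                return he;
            he = he->next;
        }
        // If no rehash is in progress, ht[1] is empty: stop after ht[0]
        if (!dictIsRehashing(d)) return NULL;
    }
    return NULL;
}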

Business scenarios

The shopping cart

The user ID is the key, the commodity ID is the field, and the commodity quantity is the value; together they constitute exactly the three elements of a shopping cart.
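A minimal sketch of the commands involved (the key and field names are made up for the example):

redis> HSET cart:user:1001 product:2001 1
(integer) 1
redis> HINCRBY cart:user:1001 product:2001 2
(integer) 3
redis> HGETALL cart:user:1001
1) "product:2001"
2) "3"
redis> HDEL cart:user:1001 product:2001
(integer) 1

HINCRBY makes "change the quantity" atomic on the server side, so concurrent cart updates need no read-modify-write cycle on the client.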

Storing objects

Hash (key, field, value) has a structure similar to that of an object (object ID, attribute, value) and can also be used to store objects.

An earlier article on string application scenarios showed that string + JSON is also a way to store objects. So should you use string + JSON or hash to store objects?

The following table compares the two storage modes.

                string + JSON    hash
efficiency      high             high
capacity        low              low
flexibility     low              high
serialization   simple           complex

When an object's attributes need to be changed frequently, string + JSON is unsuitable because it is not flexible enough: every change requires the entire object to be serialized and written back. With the hash type, an individual attribute can be modified on its own, with no serialization and no need to touch the rest of the object. Attributes that may change frequently, such as price, sales volume, view count, and comment count, are therefore well suited to the hash type.
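A sketch of such a partial update (the key and field names are invented for the example):

redis> HSET item:3001 price 99.9 sales 0
(integer) 2
redis> HINCRBY item:3001 sales 1
(integer) 1
redis> HSET item:3001 price 89.9
(integer) 0
redis> HGET item:3001 price
"89.9"

Only the touched fields travel over the wire; the rest of the object is never read, deserialized, or rewritten.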

Of course, it is also fine to store infrequently changing attributes in the hash type, such as the item name, item description, and shipping date. However, when an attribute's value is not a basic type or a string, the hash type forces you to do the complex serialization by hand. For instance, an item's tags are a list of tag objects, and the coupons an item can use are a list of coupon objects; even with coupons as the field, storing the list of coupon objects as the value still requires JSON serialization. In that case the serialization work is so tedious that it is better to store the item information directly as string + JSON.