Consistent hashing

In a distributed setup we spread a service across several nodes so that together they can serve its purpose better than a single machine. But for a client, which node should serve it? Or, put another way, which tasks are assigned to which node?

Strong hash

Since a single server cannot carry the whole load, a distributed architecture is used. The initial algorithm is hash() mod n, where hash() usually takes the user ID and n is the number of nodes. This approach is easy to implement and meets basic operational requirements. Its disadvantages: when a single node fails, the system cannot recover automatically, and nodes cannot be added dynamically, because changing n remaps almost every key.
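A minimal sketch in Python (md5 standing in for hash(), and the user ID as the key, are assumptions for illustration):

```python
import hashlib

def strong_hash(user_id: str, node_count: int) -> int:
    """Map a user to one of node_count nodes with hash() mod n."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return digest % node_count

# The same user always lands on the same node...
print(strong_hash("user-42", 4))
# ...but changing node_count remaps almost every user.
print(strong_hash("user-42", 5))
```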

Weak hash

To address the single point of failure, use hash() mod (n/m) instead.

In this way, each user maps to a group of m servers, and the client picks one of the m at random.
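A sketch of the grouping (Python; md5 and the contiguous group-to-server layout are assumptions, since the text does not specify how the m servers of a group are laid out):

```python
import hashlib
import random

def server_group(user_id: str, n: int, m: int) -> list[int]:
    """Return the m candidate servers for a user (assumes m divides n)."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    group = digest % (n // m)               # which group of m servers
    return [group * m + i for i in range(m)]

candidates = server_group("user-42", n=8, m=2)
chosen = random.choice(candidates)          # client picks one of the m at random
```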

Because users on different servers need to interact with each other, every server needs to know exactly which server each user is on.

So the user's location is stored in memcached. When a user's server fails, the client can automatically switch to a backup server; after the switch, the client must log in to the backup server again.
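As a sketch of that location registry (a plain dict stands in for memcached here; in the real setup these reads and writes would go to a shared memcached cluster):

```python
user_location: dict[str, int] = {}     # user_id -> server currently holding the session

def login(user_id: str, server: int) -> None:
    user_location[user_id] = server    # record where the user logged in

def locate(user_id: str) -> int | None:
    return user_location.get(user_id)  # servers query this to route messages

login("user-42", 3)
assert locate("user-42") == 3
# If server 3 fails, the client logs in to a backup and the entry is overwritten.
login("user-42", 5)
```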

  • benefits

It has one advantage over strong hashing: it solves the single point problem.

  • disadvantages

However, there are the following problems: load is unbalanced, and especially after a single machine fails, the remaining servers in its group come under too much pressure; nodes still cannot be added or deleted dynamically; and if a node fails, the client needs to log in to another node again.

Consistent hash algorithm

The consistent hash algorithm proposes four criteria for judging the quality of a hash algorithm in a dynamically changing cache environment:

Balance

Balance means that the results of hashing are distributed across all buffers as evenly as possible, so that all buffer space is utilized. Many hash algorithms can satisfy this condition.

Monotonicity

Monotonicity means that if content has already been hashed to a buffer and a new buffer is then added to the system, the hash result should map the previously allocated content either to its original buffer or to the new buffer, never to another buffer in the old buffer set.

Spread

In a distributed environment, a terminal may not see all buffers, only some of them.

When a terminal maps content to a buffer through hashing, different terminals may see different ranges of buffers and therefore produce inconsistent hash results; the end result is that the same content is mapped to different buffers by different terminals.

This situation should clearly be avoided, because it causes the same content to be stored in multiple buffers, reducing the storage efficiency of the system. Spread is defined as the severity of these occurrences. A good hash algorithm should minimize such inconsistency, that is, keep spread as low as possible.

Load

Load is really another way of looking at the spread problem. Since different terminals may map the same content to different buffers, a particular buffer may likewise have different content mapped to it by different users.

Like spread, this should be avoided, so a good hash algorithm should keep the load on buffers as low as possible.

Ordinary hashing algorithms (also known as hard hashing) assign content to machines in a simple way, which gives satisfactory results while the cache environment stays constant. But when the cache environment changes dynamically, this static scheme clearly fails the monotonicity requirement: when a machine is added or removed, almost all stored content is rehashed into other buffers.
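A quick demonstration of that rehash churn (Python; md5 stands in for the hash, and the key names are made up):

```python
import hashlib

def node_for(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"key-{i}" for i in range(10_000)]
# Going from 4 nodes to 5 remaps every key whose hash changes residue.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80% on average
```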

Code implementation

Implementation logic

Consistent hashing has a variety of concrete implementations, including the Chord algorithm, the KAD (Kademlia) algorithm, and so on; those implementations are complicated.

Here is an introduction to the basic implementation principle of a widely circulated consistent hash algorithm; interested readers can follow the linked implementation at the end of this article or search online for more detailed information.

The basic idea of the consistent hash algorithm is to map both machine nodes and keys onto a 0 ~ 2^32 ring using the same hash algorithm.

When a request writes to the cache, Hash(k) is computed for the key k. If that value coincides exactly with the hash of some machine node, the content is written directly to that node. Otherwise, the ring is searched clockwise and the content is written to the first node found; if the search passes 2^32 without finding a node, it wraps around to 0 (because of the ring structure).
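A minimal ring in Python (md5 truncated to 32 bits stands in for the hash that places both nodes and keys on the 0 ~ 2^32 ring):

```python
import bisect
import hashlib

def h32(s: str) -> int:
    """Hash a string onto the 0 ~ 2^32 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes: list[str]) -> None:
        self._points = sorted(h32(n) for n in nodes)   # node positions, sorted
        self._node_at = {h32(n): n for n in nodes}

    def node_for(self, key: str) -> str:
        # First node clockwise from Hash(k); the modulo wraps past 2^32 back to 0.
        i = bisect.bisect_left(self._points, h32(key)) % len(self._points)
        return self._node_at[self._points[i]]

ring = Ring(["A", "B", "C", "D"])
print(ring.node_for("user-42"))
```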

As shown in Figure 1:

In Figure 1, the hash value of key K falls between A and B, so K is handled by node B.

In addition, a physical node can be mapped to multiple virtual nodes according to its processing capacity.
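A sketch of virtual nodes (the weights and the "node#i" naming are assumptions for illustration):

```python
import hashlib

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def virtual_points(node: str, weight: int) -> dict[int, str]:
    """Place `weight` virtual nodes on the ring, all pointing back to `node`."""
    return {h32(f"{node}#{i}"): node for i in range(weight)}

points: dict[int, str] = {}
points.update(virtual_points("A", weight=100))  # ordinary machine
points.update(virtual_points("B", weight=300))  # 3x the capacity, 3x the points
```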

With consistent hashing, when a new machine is added, only one existing machine's stored content is affected.

For example, if the hash of a newly added node H falls between B and C, then some of the data previously handled by C moves to H, while the assignments of all other nodes remain unchanged; this is good monotonicity.

Likewise, if a machine is deleted, for example node C, the data originally handled by C moves to node D, while the assignments of other nodes remain unchanged.
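A check of this claim, reusing the Ring class from the sketch above (node and key names are made up):

```python
# Reuses Ring from the earlier sketch.
keys = [f"key-{i}" for i in range(10_000)]

old = Ring(["A", "B", "C", "D"])
new = Ring(["A", "B", "C", "D", "H"])

moved = [k for k in keys if old.node_for(k) != new.node_for(k)]
# Every key that moved now lives on H; no key moved between the old nodes.
assert all(new.node_for(k) == "H" for k in moved)
print(f"{len(moved) / len(keys):.0%} of keys moved")  # roughly 1/5 on average
```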

Since the same hashing algorithm is used for both the machine nodes and the cached content, spread and load are also greatly reduced.

By introducing virtual nodes, the balance is also greatly improved.

The implementation code

consitent-hashing
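The linked code is not reproduced here; as a stand-in, below is a compact version combining the sketches above into one class (the names are mine, not the linked repository's):

```python
import bisect
import hashlib

class ConsistentHash:
    """Hash ring with virtual nodes; a stand-in sketch, not the linked code."""

    def __init__(self, replicas: int = 100) -> None:
        self.replicas = replicas          # virtual nodes per physical node
        self._points: list[int] = []      # sorted virtual-node positions
        self._nodes: dict[int, str] = {}  # position -> physical node

    @staticmethod
    def _h32(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            p = self._h32(f"{node}#{i}")
            bisect.insort(self._points, p)
            self._nodes[p] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.replicas):
            p = self._h32(f"{node}#{i}")
            self._points.remove(p)
            del self._nodes[p]

    def node_for(self, key: str) -> str:
        i = bisect.bisect_left(self._points, self._h32(key)) % len(self._points)
        return self._nodes[self._points[i]]

ch = ConsistentHash()
for n in ("A", "B", "C", "D"):
    ch.add_node(n)
print(ch.node_for("user-42"))
```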

The original address

consitent-hashing