Consistent hashing was proposed at MIT in 1997, in the paper "Consistent Hashing and Random Trees", as a special hashing technique for distributed caching, and it is now widely used in distributed systems. It is a hashing scheme that, when a server is added or removed, changes the existing mapping between requests and the servers that handle them as little as possible, so as to satisfy the monotonicity requirement. This article briefly introduces the principle of consistent hashing from the perspective of data distribution in a distributed database.

Background

A distributed database must first solve the problem of mapping the whole data set onto multiple nodes according to some partition rule, that is, the data sharding problem. The most common approach is modulo hashing: a key is placed on node hash(key) mod N, where N is the number of nodes. This distributes data reasonably evenly, but it scales very poorly: adding or removing a node is problematic.
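The modulo-hashing rule above can be sketched as follows. This is a minimal illustration, not a production scheme; the node names and the choice of `crc32` as the hash function are my own assumptions.

```python
import zlib

NODES = ["A", "B", "C"]  # illustrative node names

def node_for(key: str) -> str:
    # hash(key) mod N picks one of the N nodes.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", node_for(key))
```

The mapping is deterministic, so every reader and writer that agrees on the node list and the hash function will route a given key to the same node.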

For example, with N = 3 the data is distributed among nodes A, B, and C:

When a new node D is added, the rule becomes hash(key) mod 4, so almost all previously stored data must be rehashed and migrated. Otherwise a lookup computed with mod 4 will land on the wrong node and miss the data (in a Redis cache deployment, those requests would fall through to the database and could cause a cache avalanche).
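We can measure how disruptive this is. The sketch below (a self-contained assumption-laden example, using `crc32` as the hash) counts how many keys change nodes when the node count grows from 3 to 4; in expectation about three quarters of all keys move.

```python
import zlib

def node_for(key: str, n: int) -> int:
    # Modulo partitioning: key goes to node hash(key) mod n.
    return zlib.crc32(key.encode()) % n

keys = [f"key:{i}" for i in range(10_000)]
before = {k: node_for(k, 3) for k in keys}
after = {k: node_for(k, 4) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed nodes")  # roughly 3/4
```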

The principle of consistent hashing

Imagine a ring with 2^32 positions, numbered 0 to 2^32 - 1:

This ring is what we call a hash ring. We first hash all the server nodes onto the ring, placing each node of {A, B, C} at position hash(node) % 2^32:

With the server nodes on the ring, we then distribute the data around them. Data items such as {a, b, c, d, …} are hashed onto the same ring, and each item is placed by walking clockwise from its hash value to the first node whose position is greater than or equal to it.

So item a is saved on node A, item b on node B, and so on.
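The clockwise-lookup rule can be sketched as a small hash ring. This is a minimal illustration under my own assumptions: node names are arbitrary, and the first four bytes of an md5 digest stand in for a hash into [0, 2^32).

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # Position on the ring: a 32-bit value derived from md5.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs represent the ring.
        self._points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        # Walk clockwise: first node whose position is >= the key's hash,
        # wrapping back to the start of the ring if none is found.
        i = bisect.bisect_left(self._points, (h, ""))
        if i == len(self._points):
            i = 0
        return self._points[i][1]

ring = HashRing(["A", "B", "C"])
print(ring.node_for("a"), ring.node_for("b"))
```

The binary search makes each lookup O(log n) in the number of nodes, and the wrap-around at the end of the sorted list is exactly the "ring" part of the structure.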

What’s the good of that?

At this point we add a new node D:

As the red line shows, only the keys on that arc now map to D instead of A; most of the other data is still mapped to the same nodes as before. **So we only need to migrate the small portion of data marked in red.** Compared with modulo hashing, the biggest advantage of this method is that adding or deleting a node only affects its neighbor on the hash ring and has no impact on the other nodes.
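This locality can be checked directly. The self-contained sketch below (same illustrative md5-based ring positions as assumed earlier) adds node D and verifies that every key that moves ends up on D, i.e. only D's new arc is affected.

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def owner(key: str, nodes) -> str:
    # Clockwise lookup on a ring built from the given nodes.
    points = sorted((ring_hash(n), n) for n in nodes)
    i = bisect.bisect_left(points, (ring_hash(key), ""))
    return points[i % len(points)][1]

keys = [f"key:{i}" for i in range(10_000)]
before = {k: owner(k, ["A", "B", "C"]) for k in keys}
after = {k: owner(k, ["A", "B", "C", "D"]) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")
# Every moved key now lives on D: the other arcs are untouched.
assert all(after[k] == "D" for k in moved)
```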

What are the other disadvantages?

The nodes we hash onto the ring are not necessarily evenly spaced, which leads to the skew problem. Once the ring is skewed, adding a node may invalidate a large share of the data, and the data itself ends up unevenly distributed (as the figure below shows, most of the data will probabilistically land on node A).

The solution

Adding Virtual Nodes

Virtual Slot Partition

Virtual slot partitioning makes clever use of the hash space: a well-distributed hash function maps all data into a fixed range of integers, defined as slots. For example, Redis Cluster uses slots numbered 0 to 16,383. A slot is the basic unit of data management and migration in a cluster.
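As a concrete sketch of slot mapping: Redis Cluster computes the slot as CRC16(key) mod 16384, using the XMODEM variant of CRC16. Python's standard `binascii.crc_hqx` with an initial value of 0 computes that same CRC, so we can illustrate the mapping without any cluster running (the key name below is arbitrary).

```python
from binascii import crc_hqx

SLOTS = 16384  # Redis Cluster's fixed slot count

def key_slot(key: str) -> int:
    # CRC16 (XMODEM) of the key, folded into the slot range.
    return crc_hqx(key.encode(), 0) % SLOTS

print(key_slot("user:1000"))  # some slot in [0, 16383]
```

Each node then owns a subset of the 16,384 slots, and rebalancing means handing whole slots (and the keys in them) from one node to another.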

The basic idea is to map several virtual nodes to each physical node and then map those virtual nodes onto the hash ring. For example, physical nodes A, B, and C might be represented by virtual nodes A1 to A5, B1 to B10, and C1 to C10.

Each physical node is then responsible for multiple positions on the hash ring, which guarantees a uniform distribution with high probability.
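The virtual-node idea can be sketched by hashing derived names like "A#0", "A#1", … onto the ring. The vnode count, node names, and md5-based positions below are illustrative assumptions; the point is that with many points per physical node, each node's total share of the ring evens out.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

VNODES = 100  # virtual nodes per physical node (illustrative)

def build_ring(nodes):
    # Each physical node appears VNODES times under derived names.
    return sorted((ring_hash(f"{n}#{i}"), n)
                  for n in nodes for i in range(VNODES))

def owner(points, key: str) -> str:
    i = bisect.bisect_left(points, (ring_hash(key), ""))
    return points[i % len(points)][1]

points = build_ring(["A", "B", "C"])
counts = Counter(owner(points, f"key:{i}") for i in range(30_000))
print(counts)  # roughly 10,000 keys per node
```

Averaging over many small arcs is what smooths out the skew: one unlucky node placement no longer dominates the ring.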

Conclusion

Consistent hashing reduces the cost of redistributing data when nodes are added to or removed from a distributed system. The principle itself is relatively simple.