A friend recently came over to ask me what the consistent Hash algorithm is. They had been asked about it in an interview and, not understanding it, had no answer, so they asked whether I could recommend some learning material. I was on my way to work and had no time to reply, and by the evening I had forgotten about it. I suddenly saw the message again today, so I worked overtime to write up what the consistent Hash algorithm is. I hope it helps everyone!

Those who often read my articles should be familiar with my routine. The first question is always: why? So why do we need a consistent Hash algorithm? Just as in the earlier article on why there is a Spring, we will start from the perspective of history or project development. Today's sharing follows the same routine: we will analyze the problem step by step and work out what the consistent Hash algorithm really is!

1. Using a Redis cluster

When we use Redis, the simplest way to ensure high availability and improve read and write performance is master/slave replication, forming a master-master or master-slave setup, or setting up a Redis cluster with read/write separation. This is similar to master/slave replication and read/write separation in databases. As follows:

Assume we have a social networking site and need to use Redis to store image resources. The storage format is key-value pairs: the key is the name of the image, and the value is the path on the file server where the image is located. We need to find the file server path from the file name, and the data volume is about 20 million records (2000W). We can split the data according to an agreed rule, which for now is random allocation. We can deploy 8 cache servers as 4 master/slave groups, each group holding about 5 million records (500W), as shown in the diagram below:

Because the rule is random, any piece of data may be stored in any group of Redis servers. For example, in the figure above, a user searches for a picture named "a.png". Since the rule is random, we cannot be sure which Redis server it is on, so we need to query servers 1, 2, 3, and 4 in turn. In the worst case it takes 4 queries (i.e. traversing all the Redis server groups) to get the result, which is obviously not what we want. If you have looked into this before, you may realize that a random rule is not good enough and that rules similar to the sharding (sub-database, sub-table) rules used in databases can be applied: Hash, modulo, category, field, and other common rules. OK, so for our theme, we are going to use Hash.

2. Using Hash for the Redis cluster

As you can imagine, if we use Hash, each image can be mapped to a specific server when it is stored, as shown in the diagram below:

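Suppose we look up "a.png". Since there are 4 server groups (not counting the slaves), the formula is hash(a.png) % 4 = 2, so the image is located on server 2 and we do not need to query the other servers.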
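The modulo rule can be sketched in a few lines of Python. This is only an illustration: the function names hash32 and pick_server are made up for this example, and MD5 is used as a stand-in hash, so the index it prints for "a.png" depends on that choice and will not necessarily be the 2 used in the text.

```python
import hashlib


def hash32(key: str) -> int:
    """Map a string to a 32-bit unsigned integer (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


def pick_server(key: str, server_count: int) -> int:
    """Return the index of the server group that owns this key."""
    return hash32(key) % server_count


# The same file name always maps to the same server group,
# so a lookup needs one query instead of four.
index = pick_server("a.png", 4)
print(f"a.png -> server group {index}")
```

Because the result depends only on the key and the server count, reads and writes agree on the location without any shared lookup table.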

3. Hash problems

The above approach improves performance, because we no longer need to traverse all the Redis servers! However, using this Hash algorithm for caching has a drawback, which shows up when the number of servers changes: the cache positions change as well!

Imagine that four cache servers are no longer sufficient for our cache needs. What should we do? Very simple: just add a few more cache servers! Suppose we add one cache server, so the number of cache servers goes from 4 to 5. Then hash(a.png) % 4 = 2 becomes hash(a.png) % 5 = ?, and the result is in general no longer 2. When the number of servers changes, most of the cache locations change! In other words, the cache is largely invalidated for a certain amount of time, and when the application cannot retrieve data from the cache, it requests the data from the back-end database (remember the cache avalanche from the previous article?).

Similarly, if one of the four cache servers suddenly fails and can no longer serve the cache, we need to remove the faulty machine. Removing one cache server changes the number of cache servers from 4 to 3, which causes the same problem!
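To get a feel for how many keys are affected, here is a small Python experiment. It is a sketch under assumptions: the key names are invented, and MD5 again stands in for the hash function. For a uniformly distributed hash, a key keeps its index when going from 4 to 5 servers only if the two remainders happen to coincide, which is the case for roughly one key in five.

```python
import hashlib


def hash32(key: str) -> int:
    """Map a string to a 32-bit unsigned integer (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


# Hypothetical image names, used only for this experiment.
keys = [f"img-{i}.png" for i in range(10_000)]


def moved_fraction(old_count: int, new_count: int) -> float:
    """Fraction of keys whose server index changes when the server count changes."""
    moved = sum(1 for k in keys if hash32(k) % old_count != hash32(k) % new_count)
    return moved / len(keys)


grow = moved_fraction(4, 5)    # one server added
shrink = moved_fraction(4, 3)  # one server removed
print(f"4 -> 5 servers: {grow:.0%} of keys change location")
print(f"4 -> 3 servers: {shrink:.0%} of keys change location")
```

Either way, the large majority of cached entries end up on a different server, which is exactly the situation that sends requests through to the back-end database.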

Therefore, we should try to prevent this from happening. However, because of the way the Hash algorithm above works, this situation is unavoidable when caching by modulo of the server count. To solve these problems, the consistent Hash algorithm (consistent hashing) was born!

4. The mystery of the consistent Hash algorithm

The consistent Hash algorithm also uses a modulo operation, except that the method described above takes the modulo by the number of servers, whereas the consistent Hash algorithm takes it modulo 2^32. What does that mean? In simple terms, the consistent Hash algorithm organizes the entire Hash space into a virtual ring. For example, if the value space of a Hash function H is 0 to 2^32-1 (i.e. the Hash value is a 32-bit unsigned integer), the entire Hash ring is as follows:

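The entire space is organized clockwise, and the value 2^32-1 is followed by 0 again, which closes the circle. We call this circle the Hash ring.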

The next step is to hash each server with the function H. Specifically, the IP address or host name of the server can be used as the key to hash, so that each machine gets a fixed position on the Hash ring. Here, assume that the positions of the four servers in the ring space after hashing their IP addresses are as follows:

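Data is then located as follows: hash the data key with the same function H to determine its position on the ring, and walk clockwise from that position; the first server encountered is the server where the data belongs. For example, we have four data objects, Object A, Object B, Object C and Object D. After hashing, their positions in the ring space are as follows: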
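A minimal Python sketch of the ring follows. It assumes that a key is assigned to the first server found when walking clockwise from the key's position, that MD5 reduced modulo 2^32 plays the role of H, and that the IP addresses are placeholders; build_ring, locate, and lookup are names invented for this example.

```python
import bisect
import hashlib

RING_SIZE = 2**32  # positions run from 0 to 2^32 - 1


def hash32(key: str) -> int:
    """Position of a key on the ring (MD5 is only a stand-in for H)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE


def build_ring(servers):
    """Return the ring as a list of (position, server) pairs sorted by position."""
    return sorted((hash32(s), s) for s in servers)


def locate(ring, position):
    """Walk clockwise from `position` to the first server at or after it."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, position)
    return ring[i % len(ring)][1]  # past the last server, wrap around to the first


def lookup(ring, key):
    """Find the server responsible for a data key."""
    return locate(ring, hash32(key))


# Hypothetical server addresses used as hash keys.
servers = ["192.168.0.1", "192.168.0.2", "192.168.0.3", "192.168.0.4"]
ring = build_ring(servers)
print("a.png ->", lookup(ring, "a.png"))
```

Keeping the positions sorted lets the clockwise walk be done with a binary search instead of a scan over all servers.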

5. Fault tolerance and scalability of consistent Hash algorithms

Suppose Node C fails. Objects A, B, and D are not affected; only object C is relocated to Node D. In general, in the consistent Hash algorithm, if a server becomes unavailable, the only data affected is the data between this server and the previous server in its ring space (i.e. the first server encountered by walking counterclockwise), and other data is not affected, as shown below:

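Now consider the other case, where a server Node X is added to the system, as shown in the following figure. In general, when a server is added, the only data affected is the data between the new server and the previous server in its ring space (i.e. the first server encountered by walking counterclockwise). That data moves to the new server, and other data is not affected.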

To sum up, when nodes are added or removed, the consistent Hash algorithm only needs to relocate a small part of the data in the ring space, which gives it good fault tolerance and scalability.
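Both cases can be checked with a short Python experiment. This is a sketch, not a production implementation: the key names are invented, MD5 stands in for the hash function, and because the node names are hashed to arbitrary positions, the node that inherits Node C's data here is simply whichever node comes next clockwise, not necessarily Node D as in the figure.

```python
import bisect
import hashlib


def hash32(key: str) -> int:
    """Position of a key on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


def build_ring(servers):
    """Return the ring as a list of (position, server) pairs sorted by position."""
    return sorted((hash32(s), s) for s in servers)


def lookup(ring, key):
    """First server clockwise from the key's position, wrapping past the end."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, hash32(key))
    return ring[i % len(ring)][1]


nodes = ["Node A", "Node B", "Node C", "Node D"]
keys = [f"img-{i}.png" for i in range(10_000)]  # hypothetical image names

ring = build_ring(nodes)
before = {k: lookup(ring, k) for k in keys}

# Failure: Node C is removed. Only keys that lived on Node C change owner.
ring_without_c = build_ring([n for n in nodes if n != "Node C"])
after_failure = {k: lookup(ring_without_c, k) for k in keys}
moved_on_failure = [k for k in keys if before[k] != after_failure[k]]

# Growth: Node X is added. Every key that moves ends up on Node X.
ring_with_x = build_ring(nodes + ["Node X"])
after_growth = {k: lookup(ring_with_x, k) for k in keys}
moved_on_growth = [k for k in keys if before[k] != after_growth[k]]

print(f"Node C removed: {len(moved_on_failure)} of {len(keys)} keys moved")
print(f"Node X added:   {len(moved_on_growth)} of {len(keys)} keys moved")
```

With only four nodes the arcs can be quite uneven, so the exact number of moved keys varies a lot with the node positions; what stays constant is that keys on the untouched nodes never move.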

6. Data skew of the Hash ring

When there are too few service nodes, the consistent Hash algorithm easily suffers from data skew caused by uneven node distribution (most cached objects end up concentrated on one server). For example, there are only two servers in the system, and the ring distribution is as follows:

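To solve the data skew problem, the consistent Hash algorithm introduces the virtual node mechanism: several hash values are computed for each server, and a node is placed at each resulting position. These nodes are called virtual nodes.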

For example, we can compute three virtual nodes for each server by calculating the hash values of "Node A#1", "Node A#2", "Node A#3", "Node B#1", "Node B#2", and "Node B#3", which forms six virtual nodes:

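The data location algorithm stays the same; there is only one more step of mapping virtual nodes to real nodes. For example, data located on "Node A#1", "Node A#2", or "Node A#3" is stored on Node A.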
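The virtual node mechanism can be sketched in Python as follows. The "#1", "#2", "#3" suffixes follow the naming in the text; everything else (the function names, the replicas parameter, MD5 as the hash, the key names) is an assumption made for this example.

```python
import bisect
import hashlib
from collections import Counter


def hash32(key: str) -> int:
    """Position of a key on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


def build_ring(servers, replicas=3):
    """Place `replicas` virtual nodes per server; each entry maps back to the real server."""
    ring = []
    for server in servers:
        for i in range(1, replicas + 1):
            ring.append((hash32(f"{server}#{i}"), server))
    return sorted(ring)


def lookup(ring, key):
    """Find the first virtual node clockwise, then return the real server behind it."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, hash32(key))
    return ring[i % len(ring)][1]


def load(ring, keys):
    """Count how many keys each real server receives."""
    return Counter(lookup(ring, k) for k in keys)


servers = ["Node A", "Node B"]
keys = [f"img-{i}.png" for i in range(10_000)]  # hypothetical image names

ring = build_ring(servers)  # 2 servers x 3 virtual nodes = 6 ring entries
print(len(ring), "virtual nodes")
print("3 virtual nodes per server:  ", dict(load(ring, keys)))
print("100 virtual nodes per server:", dict(load(build_ring(servers, 100), keys)))
```

Storing the real server next to each virtual position makes the extra mapping step free at lookup time. In practice the number of virtual nodes per server is usually set much higher than three so that the load evens out.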

7. Summary

In this article we analyzed, step by step, what the consistent Hash algorithm is. In a distributed system, any node may fail and new nodes may be added dynamically. How to make sure the system can still provide good service when the number of nodes changes is well worth considering.


