Concern public number: XY’s technical circle

The Hash algorithm

Whenever a distributed system is involved, there will be problems of load balancing and data distribution. In order to distribute connections (or data) more evenly, Hash algorithms are often used.

So what is a Hash algorithm?

The input of any length is transformed into a fixed-length output using the Hash algorithm. The output is the Hash value. The space for hashes is much smaller than the space for inputs, so a “hash collision” can occur, where two different inputs produce the same output.

The Hash algorithm is only a definition and does not specify an implementation.

What are the specific application scenarios?

Hash algorithms are commonly used in message digest scenarios, and developers have more or less heard of MD5 and SHA as implementations of Hash algorithms.

In the Java language, every Object has a hashCode method, which by default evaluates the Hash value based on the Object’s memory address. Of course, we can override this method and use our own method to calculate a Hash value.

Hash modulus

Hash and then mod is a common scenario in load balancing.

hash(request) % n

Suppose we now have three servers and want to perform load balancing, we can use the Hash function on the IP address or user ID of the request, and then modulo the Hash value of 3 with the remainder to allocate the request to the corresponding server. As shown in figure:

Here, for demonstration purposes, we directly simulate the hash value used.

Disadvantages?

So what’s the downside?

Horizontal scaling is a normal requirement in distributed scenarios. Imagine that when the three servers mentioned above need to scale to four servers, the vast majority of requests basically need to be remapped to another node. Because the Hash value doesn’t change, the remainder must change if the divisor changes. As shown in figure:

Such changes are sometimes unacceptable. For example, in a Web load balancing scenario, sessions are stored on each node. Of course, if you are a “stateless” service, this won’t be a problem. However, in a data persistence scenario, such as the MySQL database table, such a large change is clearly not acceptable. If a node is added or removed, almost all data needs to be relocated.

Stateless service means that the application does not store any state. For example, no session is used, or the session is stored in a third-party cache node.

Consistency of the Hash

Using consistent hashes can resolve large data changes due to horizontal scaling. How does it work out?

We talked about using the number of nodes as a divisor to do the remainder. The divisor of a consistent Hash is 2^32^. From zero to two to the thirty-two to the minus one, end to end forms a ring. We Hash the IP of the server node and divide by 2^32^ to get the position of the server node in the Hash ring:

Now I have a request coming in, and I Hash it again and I mod it at 2^32^. If you land on the Hash ring and then go clockwise to the first node, that node is responsible for processing the request. As shown in the figure:

Inconsistent Hash distribution

Careful readers may notice that a consistent Hash algorithm may have an uneven distribution when the number of nodes is small. As shown in the figure:

At this point, the vast majority of requests are allocated to node A, while nodes B and C are allocated very few requests. The solution to this problem is to add virtual nodes to the Hash ring. There are many ways to implement this, such as:

Hash (” SERVER_IP_A # 1 “) % 2 ^ ^ 32

Hash (” SERVER_IP_A # 2 “) % 2 ^ ^ 32

“SERVER_IP_A hash (# 3) % 2 ^ ^ 32

What are the application scenarios of consistent Hash?

So what’s the point of consistency hashing anyway?

In addition to load balancing, Web sessions, and database tables, consistent Hash is most commonly used in distributed caching.

Distributed cache systems generally Hash keys. Imagine that if the number of nodes is modulated directly, any change in the number of nodes, such as adding a node or deleting a node, will invalidate the cache of almost all nodes. If a consistent Hash is used, only the cache of adjacent nodes on the Hash ring is affected.

As you can see from the Hash ring and clockwise lookup principle, after adding a node, some requests that were originally allocated to the next node are allocated to the new node. After a node is reduced, some requests that were originally allocated to this node are allocated to the next node.

Therefore, whether a node is added or removed, only the next adjacent node will be affected.

Write articles carefully and share with your heart.

Personal website: Yasinshaw.com

Public number: XY’s technical circle