By: no dishes studio – Marklux

Source: Marklux’s Pub

The copyright belongs to the author; please indicate the source when reprinting.

Introduction

Consistent Hash is a special Hash algorithm; thanks to its balanced and stable mapping properties it is widely used in load balancing. For example, Nginx and Memcached both use consistent Hash as a cluster load-balancing solution.

This article introduces the basic idea of consistent Hash and discusses its application to load balancing in distributed cache clusters. We will also run code tests to verify the algorithm's characteristics and compare it with other load-balancing schemes.

Introduction to consistent Hash algorithms

Before we look at consistent Hash, let's recall the characteristics of Hash itself. The most important job of an ordinary Hash function is to scatter, that is, to map a series of similar inputs into random-looking, evenly distributed outputs.

For example, if the strings ABC and abcd are run through MD5 separately, the results are as follows:

As you can see, two very similar inputs are hashed into completely different, random-looking strings. Load balancing exploits exactly this property: a large number of requests or invocations are hashed in some way so that they spread out evenly and the pressure is averaged. (Of course, using a Hash does not always lead to a uniform distribution, as we will see later.)
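As a quick check (not from the original article), a minimal Java sketch using java.security.MessageDigest reproduces the experiment:

import java.security.MessageDigest;

public class Md5Demo {
    // Hex-encode the MD5 digest of a string
    static String md5Hex(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(s.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two very similar inputs produce completely unrelated digests
        System.out.println("ABC  -> " + md5Hex("ABC"));
        System.out.println("abcd -> " + md5Hex("abcd"));
    }
}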

For example, if we generate a Key for each request, then a very simple rule such as Group = Key % N is enough to load-balance the requests, as follows:

(If the Key is used as a cache key and the corresponding Group stores that key's Value, this becomes a distributed cache system. The concrete examples below are based on this scenario.)

It is not hard to see that as soon as the cluster count N changes, nearly all of the previous Hash mappings become invalid. If every machine in the cluster provides the same stateless service, this does not matter much; but for a system like a distributed cache, a wholesale mapping failure means that virtually all previously cached data misses at once, which can be disastrous.
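A small sketch (not from the original article) makes the scale of the problem concrete: it counts how many of 100,000 keys map to a different group when the cluster grows from N = 4 to N = 5.

public class ModuloRemapDemo {
    public static void main(String[] args) {
        int moved = 0;
        int total = 100000;
        for (int key = 0; key < total; key++) {
            int before = key % 4;   // group before scaling
            int after = key % 5;    // group after adding one machine
            if (before != after) {
                moved++;
            }
        }
        // Roughly 80% of the keys land on a different group
        System.out.println("remapped: " + moved + " / " + total);
    }
}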

Consistent Hash solves this problem by building a circular Hash space instead of a linear Hash space, as shown below:

The entire Hash space is constructed as an end-to-end ring, and using consistent Hash then requires two mappings.

First, each node (cluster) is hashed and its Hash value recorded; that value is the node's position on the ring.

Second, each Key is hashed, and we then walk clockwise from that position to the first node we meet on the ring; that node is the cluster that stores the Key.

Next, let's analyze the impact of adding and removing nodes on load balancing, as shown in the following figure:

As you can see, when a node is deleted, the mapping of the remaining nodes on the ring does not change; only the Keys that were placed on the deleted node move to the next node in the clockwise direction. The same holds when a node is added: only a small number of Keys are remapped. However, after such a change the pressure across the system is no longer balanced; the method described below solves this problem.

Problems and Optimization

Applying the most basic consistent Hash algorithm directly to a load-balancing system is still not ideal and leaves several problems. We analyze them one by one below and look for better solutions.

Data skew

If the number of nodes is small while the Hash ring space is huge (usually 0 to 2^32), then in most cases the nodes will be distributed very unevenly on the ring, crowded into a small region. For a distributed cache this means the amount of data stored on each instance of the cluster differs widely, that is, severe data skew.

Cache avalanche

If each cluster occupies only one position on the ring, then when a cluster drops off the ring, all the work it was responsible for is handed to the next cluster in the clockwise direction. For example, when Group0 exits, all of the cache it was responsible for goes to Group1, so the access pressure on Group1 suddenly increases. If Group1 then collapses under the extra pressure, even more pressure lands on Group2, and the result is an avalanche of service pressure.

Introducing virtual nodes

The best way to solve both problems is to increase the number of points on the ring, so we introduce the concept of virtual nodes: each real node is mapped to many virtual nodes, which makes the partitioning of the Hash ring much more even.

Virtual nodes also randomize the order of nodes around the ring, so when a real node fails and exits, the load it was carrying is spread evenly across the remaining nodes.

As shown in the diagram below:

The test code

Now let’s try to write some test code to see if consistent hash actually works as we expect.

First we need a Hash function that spreads its input evenly. There are many options; Memcached officially uses the MD5-based Ketama algorithm, but for efficiency we use the FNV1_32_HASH algorithm here:

public class HashUtil {
    /**
     * Compute the Hash value using the FNV1_32_HASH algorithm
     * @param str
     * @return
     */
    public static int getHash(String str) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash ^ str.charAt(i)) * p;
        }
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        if (hash < 0) {
            hash = Math.abs(hash);
        }
        return hash;
    }
}

In practice it can be adjusted to fit your needs.
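As a quick illustration (not from the original article), the following snippet, which assumes the HashUtil class above, shows how similar-looking keys are scattered across the hash space:

public class HashUtilDemo {
    public static void main(String[] args) {
        // Similar-looking keys end up far apart in the 0..2^31-1 space
        for (String key : new String[]{"widget1", "widget2", "widget3"}) {
            System.out.println(key + " -> " + HashUtil.getHash(key));
        }
    }
}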

Next we need a data structure to hold the Hash ring. There are many ways to do this; the simplest is an array or a linked list, but then every lookup requires sorting or scanning, which becomes slow as the number of nodes grows.

Since ring lookup is essentially an ordered search, a balanced binary search tree is the natural choice; in fact Java's TreeMap (implemented internally as a red-black tree) works well as the storage structure for the Hash ring.

Let's first write the simplest Hash ring test, without virtual nodes:

import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashingWithoutVirtualNode {

    /**
     * Cluster address list
     */
    private static String[] groups = {
        "192.168.0.0:111", "192.168.0.1:111", "192.168.0.2:111",
        "192.168.0.3:111", "192.168.0.4:111"
    };

    /**
     * Holds the nodes on the Hash ring
     */
    private static SortedMap<Integer, String> sortedMap = new TreeMap<>();

    /**
     * Initialization: add all servers to the Hash ring
     */
    static {
        // Backed by a red-black tree: insertion is relatively slow, but lookup is very fast
        for (String group : groups) {
            int hash = HashUtil.getHash(group);
            System.out.println("[" + group + "] launched @ " + hash);
            sortedMap.put(hash, group);
        }
    }

    /**
     * Calculates which group a widget Key is routed to
     *
     * @param widgetKey
     * @return
     */
    private static String getServer(String widgetKey) {
        int hash = HashUtil.getHash(widgetKey);
        // Only take the part of the tree greater than the hash value; no need to traverse the whole tree
        SortedMap<Integer, String> subMap = sortedMap.tailMap(hash);
        if (subMap == null || subMap.isEmpty()) {
            // The hash value is past the last node; wrap around to the first group
            return sortedMap.get(sortedMap.firstKey());
        }
        return subMap.get(subMap.firstKey());
    }

    public static void main(String[] args) {
        // Generate random requests for testing
        Map<String, Integer> resMap = new HashMap<>();

        for (int i = 0; i < 100000; i++) {
            Integer widgetId = (int) (Math.random() * 10000);
            String server = getServer(widgetId.toString());
            if (resMap.containsKey(server)) {
                resMap.put(server, resMap.get(server) + 1);
            } else {
                resMap.put(server, 1);
            }
        }

        resMap.forEach(
            (k, v) -> {
                // v / 1000.0 converts a count out of 100,000 into a percentage
                System.out.println("group " + k + ": " + v + " (" + v / 1000.0D + "%)");
            });
    }
}

Generate 100,000 random requests (with IDs in the range 0 to 10,000) for the test; the resulting pressure distribution is as follows:

[192.168.0.1:111] launched @ 8518713
[192.168.0.2:111] launched @ 1361847097
[192.168.0.3:111] launched @ 1171828661
[192.168.0.4:111] launched @ 1764547046
group 192.168.0.2:111: 8572 (8.572%)
group 192.168.0.1:111: 18693 (18.693%)
group 192.168.0.4:111: 17764 (17.764%)
group 192.168.0.3:111: 27870 (27.87%)
group 192.168.0.0:111: 27101 (27.101%)

As you can see, the pressure is still uneven, so let’s go ahead and introduce virtual nodes:

import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashingWithVirtualNode {

    /**
     * Cluster address list
     */
    private static String[] groups = {
        "192.168.0.0:111", "192.168.0.1:111", "192.168.0.2:111",
        "192.168.0.3:111", "192.168.0.4:111"
    };

    /**
     * Real cluster list
     */
    private static List<String> realGroups = new LinkedList<>();

    /**
     * Virtual node mapping (the Hash ring)
     */
    private static SortedMap<Integer, String> virtualNodes = new TreeMap<>();

    private static final int VIRTUAL_NODE_NUM = 1000;

    static {
        // Add the real node list first
        realGroups.addAll(Arrays.asList(groups));

        // Map the virtual nodes onto the Hash ring
        for (String realGroup : realGroups) {
            for (int i = 0; i < VIRTUAL_NODE_NUM; i++) {
                String virtualNodeName = getVirtualNodeName(realGroup, i);
                int hash = HashUtil.getHash(virtualNodeName);
                System.out.println("[" + virtualNodeName + "] launched @ " + hash);
                virtualNodes.put(hash, virtualNodeName);
            }
        }
    }

    private static String getVirtualNodeName(String realName, int num) {
        return realName + "&&VN" + String.valueOf(num);
    }

    private static String getRealNodeName(String virtualName) {
        return virtualName.split("&")[0];
    }

    private static String getServer(String widgetKey) {
        int hash = HashUtil.getHash(widgetKey);
        // Only take the part of the tree greater than the hash value; no need to traverse the whole tree
        SortedMap<Integer, String> subMap = virtualNodes.tailMap(hash);
        String virtualNodeName;
        if (subMap == null || subMap.isEmpty()) {
            // The hash value is past the last node; wrap around to the first group
            virtualNodeName = virtualNodes.get(virtualNodes.firstKey());
        } else {
            virtualNodeName = subMap.get(subMap.firstKey());
        }
        return getRealNodeName(virtualNodeName);
    }

    public static void main(String[] args) {
        // Generate test requests
        Map<String, Integer> resMap = new HashMap<>();

        for (int i = 0; i < 100000; i++) {
            Integer widgetId = i;
            String group = getServer(widgetId.toString());
            if (resMap.containsKey(group)) {
                resMap.put(group, resMap.get(group) + 1);
            } else {
                resMap.put(group, 1);
            }
        }

        resMap.forEach(
            (k, v) -> {
                // v / 1000.0 converts a count out of 100,000 into a percentage
                System.out.println("group " + k + ": " + v + " (" + v / 1000.0D + "%)");
            });
    }
}

The mapping from real to virtual nodes uses simple string concatenation, which is effective and is also how the official Memcached client does it. With the number of virtual nodes set to 1000, we test the pressure distribution again; the result is as follows:

group 192.168.0.1:111: 18354 (18.354%)
group 192.168.0.1:111: 20062 (20.062%)
group 192.168.0.4:111:
group 192.168.0.3:111: 20116 (20.116%)
group 192.168.0.0:111: 20719 (20.719%)

We can see that the distribution is now almost even. Next, we test the effect of removing and adding nodes on the service as a whole, with the following test code:

private static void refreshHashCircle() {
    // When the cluster membership changes, rebuild the Hash ring;
    // the positions of the remaining clusters on the ring stay unchanged
    virtualNodes.clear();
    for (String realGroup : realGroups) {
        for (int i = 0; i < VIRTUAL_NODE_NUM; i++) {
            String virtualNodeName = getVirtualNodeName(realGroup, i);
            int hash = HashUtil.getHash(virtualNodeName);
            System.out.println("[" + virtualNodeName + "] launched @ " + hash);
            virtualNodes.put(hash, virtualNodeName);
        }
    }
}

private static void addGroup(String identifier) {
    realGroups.add(identifier);
    refreshHashCircle();
}

private static void removeGroup(String identifier) {
    // Remove by value to avoid a ConcurrentModificationException
    // caused by mutating the list while iterating over it
    realGroups.remove(identifier);
    refreshHashCircle();
}

The pressure distribution before and after deleting a cluster is as follows:

Running the normal test.
group 192.168.0.2:111: 19144 (19.144%)
group 192.168.0.1:111: 20244 (20.244%)
group 192.168.0.4:111: 20923 (20.923%)
group 192.168.0.3:111: 19811 (19.811%)
group 192.168.0.0:111: 19878 (19.878%)
Removed a group, run test again.
group 192.168.0.2:111: 23409 (23.409%)
group 192.168.0.1:111: 25628 (25.628%)
25583 (25.583%)
group 192.168.0.0:111: 25380 (25.38%)

We also compute how the Keys from the removed cluster end up being transferred to the other clusters:

[192.168.0.1:111-192.168.0.4:111]: 5255
[192.168.0.1:111-192.168.0.3:111]: 5090
[192.168.0.1:111-192.168.0.2:111]: 5069
[192.168.0.1:111-192.168.0.0:111]: 4938

As you can see, after the cluster is removed, its pressure is distributed evenly across the other clusters, and the cluster as a whole remains load balanced, just as we expected. Finally, let's look at adding a cluster.

Pressure distribution:

Running the normal test.
group 192.168.0.2:111: 18890 (18.89%)
group 192.168.0.1:111: 20293 (20.293%)
group 192.168.0.4:111: 21000 (21.0%)
group 192.168.0.3:111: 19816 (19.816%)
group 192.168.0.0:111: 20001 (20.001%)
Add a group, run test again.
group 192.168.0.2:111: 16928 (16.928%)
group 192.168.0.1:111: 16888 (16.888%)
group 192.168.0.4:111: 16965 (16.965%)
group 192.168.0.3:111: 16768 (16.768%)
group 192.168.0.0:111: 16927 (16.927%)

Pressure transfer:

[192.168.0.0:111-192.168.0.7:111]: 3102
[192.168.0.4:111-192.168.0.7:111]: 4060
[192.168.0.2:111-192.168.0.7:111]: 3313
[192.168.0.1:111-192.168.0.7:111]: 3292
[192.168.0.3:111-192.168.0.7:111]: 3261

To sum up, once enough virtual nodes are introduced, consistent Hash satisfies the requirements of load balancing very well.

Graceful scaling up and down

Cache servers have demanding performance requirements, so when expanding capacity we want a newly added cluster to be filled with data and serving traffic quickly. However, there is a considerable delay between starting a cluster and it actually joining the ring and providing service. To scale up more gracefully, there are two approaches:

  1. Preheating high-frequency Keys

    As the routing layer, the load balancer can collect statistics on how frequently each cache Key is accessed. If it maintains a list of the most frequently accessed Keys, a new cluster can be preheated by pulling the cached values of those Keys in advance, which greatly reduces the Key misses caused by adding the cluster.

    One possible concrete design is shown below:

    However, this scheme has a big limitation in practice: the cache expiry time of high-frequency Keys may be very short, so the Value stored during preheating may already have been updated or invalidated by the time it is actually accessed. Handled carelessly, this produces dirty data, so the scheme is fairly hard to implement well.

  2. Retaining the historical Hash ring

    Looking back at how consistent Hash behaves during expansion, it is easy to see that after a new node joins, the Keys now assigned to it are still present on their original node for a while. So during the expansion delay, if a Key's cache has not yet been loaded onto the new node, it can still be read from the original node.

    For example, suppose we have 3 clusters and expand to 6; roughly 50% of the original Keys will miss (their mapping moves to the new nodes). If we keep both the pre-expansion and post-expansion Hash rings, then whenever a Key cannot be found on the node given by the new ring, we first look it up on the old ring. If it is found there, we return the value and write it into the new node's cache; only if it is missing there too do we fall through the cache, as shown below (a code sketch follows this list):

    The drawback is an extra hop on cache reads, but that is still far better than letting requests penetrate straight through the cache. The advantage is that several machines can be added at once without a massive wave of cache misses.
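Below is a minimal sketch of the dual-ring read path described above (not from the original article). The names oldRing, newRing, loadFromCache, loadFromBackingStore and writeToCache are hypothetical placeholders for a real system's routing structures and cache client; it reuses the HashUtil class from earlier.

import java.util.SortedMap;
import java.util.TreeMap;

public class DualRingRouter {
    // Hash rings before and after expansion (node hash -> node address), assumed non-empty
    private final TreeMap<Integer, String> oldRing = new TreeMap<>();
    private final TreeMap<Integer, String> newRing = new TreeMap<>();

    /** Clockwise lookup on one ring. */
    private static String locate(TreeMap<Integer, String> ring, int hash) {
        SortedMap<Integer, String> tail = ring.tailMap(hash);
        Integer nodeHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(nodeHash);
    }

    /** Read path: try the new ring first, fall back to the old ring, then to the store. */
    public String get(String key) {
        int hash = HashUtil.getHash(key);
        String newNode = locate(newRing, hash);
        String value = loadFromCache(newNode, key);
        if (value != null) {
            return value;
        }
        // Miss on the new node: the key may still live on its pre-expansion node
        String oldNode = locate(oldRing, hash);
        value = loadFromCache(oldNode, key);
        if (value == null) {
            // Not cached anywhere: fall through to the backing store
            value = loadFromBackingStore(key);
        }
        // Warm the new node so later reads hit it directly
        writeToCache(newNode, key, value);
        return value;
    }

    // Placeholders for a real cache client / data access layer
    private String loadFromCache(String node, String key) { return null; }
    private String loadFromBackingStore(String key) { return "value-of-" + key; }
    private void writeToCache(String node, String key, String value) { }
}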

Having covered expansion, let's talk about shrinking.

  1. Circuit breakers

    After a scale-down, the access pressure on the remaining nodes increases. If one node goes down because of the extra pressure, a chain reaction can follow. Therefore a circuit breaker should be configured for each cluster to protect the stability of the service.

  2. Update delay across multiple LB instances

    This problem is especially serious when shrinking. If load balancing is itself provided by a cluster of LB instances, and the cluster state used to build the Hash ring is pushed from a configuration service such as a ConfigServer, then when one cache cluster exits, that state change will not necessarily reach every LB instance at the same time. This can lead to temporarily inconsistent scheduling, as shown below:

    If an LB instance mistakenly routes a request to a cluster that has already exited, the result is a cache breakdown. There are several main ways to deal with this:

    • Shrink the Hash ring slowly, only removing the cluster once the state has fully synchronized. This can be done by monitoring the QPS still reaching the exiting cluster and taking it down only when that traffic has almost dropped to zero.
    • Remove the node manually on failure: when an LB instance finds a node unreachable, it deletes that node from its own Hash ring and reschedules the request. This carries some risk, since it cannot distinguish a real exit from transient problems such as network jitter.
    • Actively pull and retry: when a node on the Hash ring turns out to be unreachable, the LB instance proactively pulls the cluster state (for example from ZooKeeper) to rebuild the Hash ring, retrying up to a bounded number of times (see the sketch after this list).
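A minimal sketch of the "actively pull and retry" idea (not from the original article). ClusterStateSource, fetchLiveGroups and isReachable are hypothetical placeholders; in a real system they would wrap a ZooKeeper or ConfigServer client and a health check. It reuses HashUtil and the virtual-node naming from earlier.

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class RetryingRouter {
    /** Hypothetical source of truth for cluster membership (e.g. backed by ZooKeeper). */
    interface ClusterStateSource {
        List<String> fetchLiveGroups();
    }

    private static final int MAX_RETRIES = 3;
    private static final int VIRTUAL_NODE_NUM = 1000;

    private final ClusterStateSource stateSource;
    private TreeMap<Integer, String> ring = new TreeMap<>();

    RetryingRouter(ClusterStateSource stateSource) {
        this.stateSource = stateSource;
        rebuildRing();
    }

    /** Rebuild the local Hash ring from freshly pulled cluster state. */
    private void rebuildRing() {
        TreeMap<Integer, String> fresh = new TreeMap<>();
        for (String group : stateSource.fetchLiveGroups()) {
            for (int i = 0; i < VIRTUAL_NODE_NUM; i++) {
                fresh.put(HashUtil.getHash(group + "&&VN" + i), group);
            }
        }
        ring = fresh;
    }

    /** Route a key; on an unreachable node, refresh the ring and retry a few times. */
    public String route(String key) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            int hash = HashUtil.getHash(key);
            SortedMap<Integer, String> tail = ring.tailMap(hash);
            String node = ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
            if (isReachable(node)) {
                return node;
            }
            // The node looks dead: pull the latest state and rebuild before retrying
            rebuildRing();
        }
        throw new IllegalStateException("no reachable node for key " + key);
    }

    /** Placeholder health check; a real system would ping the node or track failures. */
    private boolean isReachable(String node) {
        return true;
    }
}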

Comparison: HashSlot

After understanding the characteristics of the consistent Hash algorithm, it is not difficult to find some unsatisfactory aspects:

  • The entire distributed cache depends on a routing service for load balancing, and that routing service is a single point of failure (if it goes down, the whole cache goes cold).
  • Lookup on the Hash ring performs poorly when there are many nodes or membership changes frequently.

To address these problems, Redis took a different approach in its distributed cluster design: a HashSlot scheme built on a P2P structure. A brief introduction follows:

  1. Using HashSlot

    Similar to the Hash ring, Redis Cluster uses HashSlots to distribute Keys evenly and to manage instances.

    First, 16384 Slots are allocated by default (the Slot bitmap fits in exactly 2KB), and each Slot plays the same role as a point on a consistent Hash ring. All instances that join the cluster divide these Slots among themselves. Finally, when we Set a Key, we compute CRC16(Key) % 16384 to decide which Slot the Key belongs to, and thus which instance it maps to (a small sketch of this computation follows this list).

    How do Slot and instance mappings change when adding or removing instances?

    For example, if there are three nodes A, B, and C, the Slots initially cover the cluster like this:

    Node A: 0 - 5460
    Node B: 5461 - 10922
    Node C: 10923 - 16383

    Now suppose we add a node D. Redis Cluster handles this by moving some Slots from each existing node to D (note that this also means the corresponding KV data is migrated to node D). After D joins successfully, the Slot coverage looks like this:

    Node A: 1365 - 5460
    Node B: 6827 - 10922
    Node C: 12288 - 16383
    Node D: 0 - 1364, 5461 - 6826, 10923 - 12287

    Similarly, deleting a node means evenly handing its Slots, together with the corresponding KV data, back to the other nodes.

  2. P2P node search

    Now consider how to implement decentralized access, that is, being able to get the data you want no matter which node in the cluster you ask. This is similar to a router's routing table:

    • Each node has a complete Hashslot-node mapping, which means that each node knows which slots it owns and which node a given Slot corresponds to.

    • Whenever a node receives a request for a Key, it computes CRC16(Key) % 16384 to determine which Slot the Key resides in and forwards the request to the node that owns that Slot.

    To summarize, there are two main ingredients: the mapping table and internal forwarding, with the mapping table kept in sync through the well-known Gossip protocol.
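As an illustration (not from the original article), here is a minimal sketch of the Key-to-Slot computation. The Redis Cluster specification defines the hash as CRC16 (XMODEM variant: polynomial 0x1021, initial value 0); the bitwise loop below is a straightforward, unoptimized version of it, whereas the real Redis implementation uses a lookup table and also honors hash tags.

public class HashSlotDemo {
    /** Bitwise CRC16 (XMODEM variant: poly 0x1021, init 0x0000, no reflection). */
    static int crc16(byte[] data) {
        int crc = 0x0000;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                if ((crc & 0x8000) != 0) {
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF;
                } else {
                    crc = (crc << 1) & 0xFFFF;
                }
            }
        }
        return crc;
    }

    /** Map a key to one of the 16384 slots. */
    static int slotOf(String key) {
        return crc16(key.getBytes(java.nio.charset.StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"user:1", "user:2", "order:42"}) {
            System.out.println(key + " -> slot " + slotOf(key));
        }
    }
}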

Finally, here is the overall structure of a Redis Cluster, which is clearly different from the consistent Hash ring:

In contrast, the HashSlot + P2P solution solves the problem of decentralization while also providing better dynamic scalability. But compared to consistent Hash, its structure is more complex and its implementation more difficult.

As the earlier analysis shows, the consistent Hash scheme performs well overall, so in a real system the most suitable scheme can be chosen based on development cost and performance requirements. In short, both are excellent; which one to use, and how, depends on the specific scenario.

Further reading

  • Server architecture caching part 2: Distributed caching
  • Ketama algorithm of consistent hashing algorithm
  • How to gracefully scale up a consistent hash model
  • An in-depth study of consistent Hash algorithms and Java code implementation