# Distributed cluster architecture scenario-based solution

  • Consistent Hash algorithm

  • Cluster clock synchronization is faulty

  • Distributed ID solution

    Globally unique ID scheme in distributed cluster

  • Distributed scheduling problem (distribution of scheduled tasks)

  • Session Sharing Problem

Consistent Hash algorithm

Hash algorithms, such as MD5 and SHA encryption algorithms, are used to store data and search Hash tables.

Why Hash?

** Direct addressing: ** binds data directly to the subscript of the array, and retrieves data directly from the array.

Advantages: fast speed, a search results.

Disadvantages:

1) Space is wasted, such as 1, 5, 7, 8, 12306. In the above way, we need to define an array of length 12307.

2) data such as: 1,5,7,1,5,7,1,2,1,1,1,3,2,12 maximum 12, open up 13 Spaces, can not store so much data.

Now, in another design, if we have data 3,5,7,12306, we have four data points, and we create any space, let’s say five, where do we store the locations? We can calculate the modulus according to the data (for the number of spatial positions 5), determine the subscript of the storage location according to the remainder of the modulus. If 3%5=3, we can put the data 3 in the position with the subscript 3, and 12306%5=1 is stored in the position with the subscript 1.

It’s a hash algorithm, but it’s not a common and simple hash algorithm, and the way you construct a hash algorithm is called a divisor remainder algorithm

If the data is 1,6,7,8 store those four data points in the array above

On this basis, open addressing method is adopted

** open addressing: **1 goes in, 6 comes in, forward or backward to find a free place to store, bad place, if the array length is defined like 10, the length can’t be extended, 11 data, Hash or not, definitely can’t store that much data

** Zip method: ** The length of the data is defined, how to store more contents, calculate the Hash value, put a linked list in the array element store location

If the Hash algorithm is well designed, the query efficiency will be closer to O(1). If the Hash algorithm is poorly designed, the query efficiency will be very low.

Divisor remainder method

Linearly constructed Hash algorithm

Direct addressing is also a simpler way to construct a Hash: H(key)=key

H of key =a key + b(a,b is constant)

1. Application scenarios of the Hash algorithm

Application scenarios of the Hash algorithm in distributed cluster architecture

Hash algorithm has been applied in many distributed clusters, such as distributed cluster architecture Redis, Hadoop, ElasticSearch, Mysql sub-library sub-table, Nginx load balancing, etc.

The main application scenarios are summarized as follows:

  • Load balancing of requests (such as Nginx’s IP_hash policy)

    The IP_hash policy of Nginx can always route requests sent by clients to the same destination server when the CLIENT IP address remains unchanged, thus realizing sticky sessions and avoiding session sharing problems.

    • How can SESSION stickiness be achieved without the IP_hash policy?

      You can maintain a mapping table to store the mapping between client IP or sessionId and specific target server < IP,tomcat>

    • Disadvantages:

      • In the case of many clients, the mapping table is very large and wastes memory space
      • When the client and target server go online or offline, the mapping table will be maintained again, and the maintenance cost of the mapping table is high
    • If using the hash algorithm, we can to the IP address or sessionId to compute the hash value, hash value modulo arithmetic with the number of servers, the resulting value is the current request should be routed to the server code, so, the same client IP forwards the request can be routed to the same target server, realize the session viscous.

  • Distributed storage

    Hash (key)%3=index to determine which redis to store

2. Problems with the Hash algorithm

The common hash algorithm has a problem. Take ip_hash as an example. Assuming that the IP address of the download user is fixed and has not changed, tomcat3 has a problem.

In real situations, if there are many background servers and many clients, the impact will be great. This problem will exist in both capacity reduction and expansion. A large number of user requests will be routed to other target servers for processing, and user sessions on the original server will be lost.

3. Consistent hash algorithm

First, there is a line, which starts and ends with 1 and 2 to the 32 minus 1 respectively, which is equivalent to an address. For such a line, it bends to form a loop to form a closed loop. Such a loop is called a hash ring. If we hash the IP or host name of the server to the hash ring, then for the client user, we hash the IP to a bit on the ring, and then how do we determine which server to route a client to? Locate the nearest server node in the clockwise direction

If server 3 is taken offline, the client originally routed to server 3 is re-routed to server 4 after server 3 is taken offline. There is no impact on other clients except for this small part (the migration of requests is minimized. This algorithm is very suitable for distributed cluster, avoiding large number of requests to migrate).

After the addition of server 5, some clients originally routed to server 3 will be routed to server 5. There is no impact on other clients, only this small part will be affected. (The migration of requests is minimized, so this algorithm is very suitable for distributed cluster, avoiding large number of requests migration.)

  • To sum up, each server is responsible for a segment, and the consistent hash algorithm only needs to relocate a small part of the data in the ring space for the increase or decrease of nodes, which has good fault tolerance and expansibility.
    • The consistency hashing algorithm causes data skew when there are too few service nodes.
  • In order to solve the problem of data skews, the consistent hash algorithm introduces the virtual node mechanism, computes multiple hashes for each service node, and places one service node in each computed result position, which is calledVirtual node.
    • You can add a number after the server IP address or host name. For example, three virtual nodes can be calculated for each server, which are ip#1 of node 1, ip#2 of node 1, and ip#3 of node 1… Then, 6 virtual nodes are formed. When the client is routed to the virtual node, it is actually routed to the real node corresponding to the virtual node.