Stating the requirement

Given the data set public int[] datas = new int[]{1, 5, 6, 3, 8, 2, 7}, determine whether a particular number exists in it.

Common methods

This requirement sounds so simple that you can type out the code without thinking:

/**
 * Sequential search
 */
@Test
public void sequentialSearch() {
    int n1 = 2; // A number that exists in the data set
    int n2 = 9; // A number that does not exist in the data set
    for (int i = 0; i < datas.length; i++) {
        if (datas[i] == n1) {
            System.out.println(n1 + " exists");
        }
        if (datas[i] == n2) {
            System.out.println(n2 + " exists");
        }
    }
}

This method, called sequential search, meets the requirement, but its efficiency is low: the time complexity is O(n), so the time cost grows in proportion to the size of the data set.

Try binary search for better efficiency:

/**
 * Binary search
 */
@Test
public void binarySearch() {
    Arrays.sort(datas); // Binary search only works on a sorted data set
    int n1 = 2; // A number that exists in the data set
    int n2 = 9; // A number that does not exist in the data set
    boolean result1 = doBinarySearch(datas, n1);
    boolean result2 = doBinarySearch(datas, n2);
    if (result1) {
        System.out.println(n1 + " exists");
    }
    if (result2) {
        System.out.println(n2 + " exists");
    }
}

public boolean doBinarySearch(int[] datas, int num) {
    int left = 0;
    int right = datas.length - 1;
    while (left <= right) {
        int mid = (left + right) / 2;
        if (datas[mid] == num) {
            return true;
        }
        if (datas[mid] < num) {
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }
    return false;
}

The time complexity of binary search is log₂n, that is, O(log n): more efficient than sequential search, but still not optimal.

Can we meet the requirement without looping and without binary search? Imagine locating the answer in a single step. Wouldn't that be the fastest possible query?

Ordinary Hash algorithms

Enter the hash algorithm. First, one thing to recognize: like "sorting algorithm," "hash algorithm" is a collective name for a family of algorithms, not one specific algorithm, just as sorting covers concrete algorithms such as insertion sort and bucket sort.

Given the usual trade-off between time and space complexity, it is natural to expect that high efficiency comes at the cost of more storage. That is fine: space is cheap these days, and time is what we are after.

We can define an array long enough to use every value in the data set as an index. For this example, the maximum of the data set 1, 5, 6, 3, 8, 2, 7 is 8, so we define an array of length at least 9 (so that index 8 exists), then place each value at the index equal to itself (1 at index 1, 5 at index 5, and so on).

In this way we can find any value in one step: to check for the number 2, just look at datas[2]. This is the simplest hash technique, the direct addressing method.
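The direct addressing idea can be sketched in a few lines (the class name DirectAddressing and the helper build are illustrative names, not from the article): values become indices, so a membership check is a single array access.

```java
public class DirectAddressing {
    // Build a presence table where each value is stored at the index equal to itself
    static boolean[] build(int[] datas, int max) {
        boolean[] exists = new boolean[max + 1];
        for (int d : datas) {
            exists[d] = true;
        }
        return exists;
    }

    public static void main(String[] args) {
        boolean[] exists = build(new int[]{1, 5, 6, 3, 8, 2, 7}, 8);
        System.out.println(exists[2]); // One array access answers "is 2 in the data set?"
    }
}
```

No loop and no comparison at query time; the cost is the array sized to the maximum value, which is exactly the weakness discussed next.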

But what’s the downside?

  1. Waste of space. Space may be cheap, but wasting it for nothing is still not worth it. The example above is very concentrated, so the waste is not obvious; but with a data set of 1, 5, 6, 3, 8, 2, 7, 10000 we would need an array of length at least 10001 to store only 8 elements, which is extremely wasteful.
  2. Data duplication. With a data set of 1, 5, 6, 3, 8, 2, 7, 1, 1, 1, 2, 2, the maximum value is still 8, but an array of length 8 (or 8 + 1) is no longer sufficient, since a single index cannot record multiple copies of the same value.
  3. Data range. With a data set of 1, 5, 6, 3, 8, 2, 7, -1, -2, array subscripts cannot be negative; and problem 2 can apply here as well.

If you see a problem, you solve it, and we can improve things with the division-remainder method (a common hash technique). Its principle is to take each data item modulo the container capacity. What does that mean? Take the data set 1, 5, 2, 9, 8 as the original data set: it has 5 elements, so we only need a container of length 5 to store it.

  1. Create a container (array) of length 5

  2. For each data item x, compute x % 5 (x modulo the container capacity of 5) to determine its location in the array

    1 % 5 leaves remainder 1, so 1 is stored at index 1; 5 % 5 leaves remainder 0, so 5 is stored at index 0; and so on.

In this way, all the data can be stored in the new container according to one fixed rule. When querying, for example whether 9 exists in the data set (and even where it is), we only need to compute 9 % 5 (the capacity) = 4 to know the location of 9.
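A minimal sketch of the division-remainder placement just described (DivisionRemainder and build are illustrative names): each value lands at index value % capacity, and a lookup recomputes that same index.

```java
public class DivisionRemainder {
    // Place each value at index (value % capacity)
    static int[] build(int[] datas) {
        int capacity = datas.length;
        int[] table = new int[capacity];
        java.util.Arrays.fill(table, -1); // -1 marks an empty slot
        for (int d : datas) {
            table[d % capacity] = d;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = build(new int[]{1, 5, 2, 9, 8});
        // Lookup: 9 % 5 = 4, so 9 sits at index 4
        System.out.println(table[9 % table.length] == 9);
    }
}
```

Note that this data set happens to be collision-free: 1, 5, 2, 9, 8 map to indices 1, 0, 2, 4, 3 respectively. The next section shows what happens when two values map to the same index.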

However, suppose the original data set is 1, 5, 2, 9, 6, and again we create an array of length 5. Then 1 % 5 = 1 and 6 % 5 = 1: both values claim index 1.

This is called a Hash collision (or Hash conflict), and the algorithm must be improved to handle it. Hence the open addressing method: once a collision occurs, probe for the next empty hash address. As long as the hash table is large enough, an empty address can always be found and the record stored.

So how well does open addressing work in practice? Not all that well. Once the amount of data is large enough, hash collisions become frequent, and the strategy of endlessly enlarging the hash table to resolve them is unrealistic and inefficient. So a second scheme arose: the chain address method, also called the zipper method. Where open addressing says "you took my slot, so I will go find another empty one," the zipper method argues that the latecomer need not move elsewhere at all; it can stay right where the collision happened.

The zipper method resolves a conflict by linking all the data nodes with the same hash into a single linked list; in other words, "although you occupy my house, I can build another floor above yours and we can live together."

But the zipper method still has a weakness. For the data set 1, 6, 11, 16, 21, every value lands at index 1, so all the data is concentrated in one bucket while the rest of the space is wasted.
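A minimal sketch of the zipper (chaining) method on that skewed data set (ZipperHash and build are illustrative names): each bucket is a linked list, and colliding values simply append to the same list.

```java
import java.util.LinkedList;
import java.util.List;

public class ZipperHash {
    // Chain all values with the same (value % capacity) into one bucket
    static List<Integer>[] build(int[] datas, int capacity) {
        @SuppressWarnings("unchecked")
        List<Integer>[] table = new List[capacity];
        for (int i = 0; i < capacity; i++) {
            table[i] = new LinkedList<>();
        }
        for (int d : datas) {
            table[d % capacity].add(d); // Collisions stack up in the same bucket
        }
        return table;
    }

    public static void main(String[] args) {
        List<Integer>[] table = build(new int[]{1, 6, 11, 16, 21}, 5);
        System.out.println(table[1]); // All five values pile into bucket 1
    }
}
```

Running this shows the skew the text describes: bucket 1 holds all five values, and buckets 0, 2, 3, 4 sit empty.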

Now let's look at where Hash algorithms are applied in distributed cluster architectures, and why the simple hash algorithms above are hard to use in that setting. Hash algorithms appear in many distributed products, such as distributed Redis, Hadoop, ElasticSearch, MySQL sharding (splitting databases and tables), and Nginx load balancing. These application scenarios can be summarized in two categories:

  • Load balancing of requests
  • Distributed storage

Here we take request load balancing as the scenario for comparing the different Hash algorithms.

Simplify each client's hash input (for example, its IP address) to the numbers 5, 4, 7, 3, and 2, and suppose the server side is a Tomcat cluster of three servers numbered 0, 1, and 2. To implement load balancing with the division-remainder method, take each client number modulo the number of servers; the result is the number of the server that actually handles that client's request.

If one server goes down, a new modulus must be applied, and most client sessions are remapped.

If one server is added, a new modulus is likewise required, and again most client sessions are affected.

This analysis exposes the drawback of the division-remainder method: the impact of a change in the number of servers on the clients is plain to see.
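That impact can be checked with a quick count (ModuloRehash is an illustrative name), using the simplified client hashes 5, 4, 7, 3, 2 from the text and a cluster shrinking from 3 servers to 2:

```java
public class ModuloRehash {
    // Count clients whose server assignment changes when the server count changes
    static int countRemapped(int[] clientHashes, int before, int after) {
        int moved = 0;
        for (int h : clientHashes) {
            if (h % before != h % after) { // Assignment before vs. after the change
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countRemapped(new int[]{5, 4, 7, 3, 2}, 3, 2);
        System.out.println(moved + " of 5 clients are remapped"); // 4 of 5 clients are remapped
    }
}
```

Four out of five clients land on a different server, which is exactly the "most client sessions are affected" problem the consistent Hash algorithm sets out to fix.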

Consistent Hash algorithm

Here comes our hero: the consistent Hash algorithm.

The consistent Hash algorithm: take a number line that ranges from 0 to 2³² − 1 (the full range of 32-bit hash values) and bend it into a circle, end to end, forming a ring called the Hash ring.

The hash values computed for the server nodes and client nodes are distributed around the hash ring. In the figure below there are four servers; each client request is handled by the first server the client reaches when searching clockwise around the ring.

So how well does the consistent Hash algorithm deal with the increase and decrease of servers?

If one server (say server 2) goes down, only the clients between server 1 and server 2 on the hash ring are affected; they simply roll over to the next server clockwise.

If one server (server 5) is added and its hash value falls between server 1 and server 2, only the clients between server 1 and server 5 are affected.

This ensures that the impact is minimized. Of course, if a consistent Hash algorithm is not well designed, it can still cause data skew. What is data skew?

When the hash values of the server nodes land too close together on the ring (in other words, the servers are not evenly distributed around it), the next server clockwise past the crowded stretch comes under enormous pressure. To solve the data skew caused by this uneven distribution, we can add another scheme, virtual nodes, to assist the consistency algorithm.

Consistent Hash algorithm + virtual nodes: create virtual nodes based on each real server, compute their hash values, and distribute them around the Hash ring, so that client requests are spread more evenly across the servers.

Handwritten Hash algorithm

In practice, we rarely need to write a Hash algorithm ourselves. Here we just simulate some simple ones to get a feel for the effect.

/**
 * @Description Ordinary hash algorithm
 */
public class CommonHashAlogrithm {
    public static void main(String[] args) {
        // 5 clients
        String[] clientIps = new String[]{"101.18.19.14", "83.12.84.200", "101.28.34.30", "88.45.121.45", "124.214.77.5"};
        // Number of servers: 5
        int serverCount = 5;
        for (String clientIp : clientIps) {
            int index = Math.abs(clientIp.hashCode()) % serverCount; // Client hash modulo the number of servers
            System.out.println("Client " + clientIp + " is handled by server " + index);
        }
    }
}
/**
 * @Description Consistent Hash algorithm - no virtual nodes
 */
public class ConsistenceHashAlgorithmWithoutVirtual {
    public static void main(String[] args) {
        // 3 servers
        String[] serverIps = new String[]{"159.35.7.25", "95.1.75.30", "22.44.66.88"};
        TreeMap<Integer, String> serverMapping = new TreeMap<>(); // TreeMap keeps its keys sorted
        for (String serverIp : serverIps) {
            int index = Math.abs(serverIp.hashCode());
            serverMapping.put(index, serverIp); // Store the <hash, server IP> mapping
        }

        // 10 clients
        String[] clientIps = new String[]{"123.23.90.89", "124.34.78.67", "134.45.66.45", "156.56.34.60", "167.67.88.45", "123.44.90.11", "124.34.78.22", "12.45.166.33", "156.69.34.44", "58.67.188.55"};

        // For each client hash, find the first server clockwise on the Hash ring
        for (String clientIp : clientIps) {
            int clientHash = Math.abs(clientIp.hashCode()); // Hash value of the client
            SortedMap<Integer, String> sortedMap = serverMapping.tailMap(clientHash); // Servers whose hash is >= the client's
            String handlerServerIp;
            if (sortedMap.isEmpty()) { // No server hash is greater than the client's
                handlerServerIp = serverMapping.get(serverMapping.firstKey()); // It is a ring, so wrap around to the first server clockwise
            } else {
                handlerServerIp = serverMapping.get(sortedMap.firstKey()); // Otherwise take the first server clockwise
            }
            System.out.println("Client " + clientIp + " is handled by server " + handlerServerIp);
        }
    }
}
/**
 * @Description Consistent Hash algorithm - virtual node edition
 */
public class ConsistenceHashAlgorithmWithVirtual {
    public static void main(String[] args) {
        // 2 servers
        String[] serverIps = new String[]{"159.110.222.33", "159.101.222.34"};
        TreeMap<Integer, String> serverMapping = new TreeMap<>(); // TreeMap keeps its keys sorted
        int virtualNodeCount = 3; // Three virtual nodes per real server
        for (String serverIp : serverIps) {
            int index = Math.abs(serverIp.hashCode());
            serverMapping.put(index, serverIp); // Store the <hash, server IP> mapping

            // Create the virtual node mappings
            for (int i = 0; i < virtualNodeCount; i++) {
                String vServerIp = i + "#" + serverIp;
                int vIndex = Math.abs(vServerIp.hashCode());
                serverMapping.put(vIndex, "Virtual node " + vServerIp); // Tag the value so virtual nodes are easy to spot
            }
        }

        // 20 clients
        String[] clientIps = new String[]{
                "123.23.90.89", "124.34.78.67", "134.55.66.55", "156.56.124.60", "167.67.88.45", "123.44.90.11", "124.34.78.22", "12.45.166.33", "156.69.34.44", "58.67.188.55", "34.23.90.45", "124.255.78.67", "134.12.66.55", "99.56.124.60", "167.67.58.45", "123.44.90.11", "88.34.72.22", "43.45.65.49", "156.69.68.167", "126.67.44.55"
        };

        // For each client hash, find the first server clockwise on the Hash ring
        for (String clientIp : clientIps) {
            int clientHash = Math.abs(clientIp.hashCode()); // Hash value of the client
            SortedMap<Integer, String> sortedMap = serverMapping.tailMap(clientHash); // Servers whose hash is >= the client's
            String handlerServerIp;
            if (sortedMap.isEmpty()) { // No server hash is greater than the client's
                handlerServerIp = serverMapping.get(serverMapping.firstKey()); // It is a ring, so wrap around to the first server clockwise
            } else {
                handlerServerIp = serverMapping.get(sortedMap.firstKey()); // Otherwise take the first server clockwise
            }
            System.out.println("Client " + clientIp + " is handled by server " + handlerServerIp);
        }
    }
}