This is the third day of my participation in the August Text Challenge.More challenges in August

The search for the last article is not the feeling of wanting more? Because we’re really getting into time complexity optimization. From O(n) of linear search to O(logN) of split search, it is definitely a qualitative leap. But what is one of the core requirements of our split search? It has to be that the raw data is in order. This can be a pain in the ass, since sorting can be tricky if there’s a lot of data. But don’t worry, today we are going to look at a hash table lookup which is another form of lookup. How far can it go?

O(1), yes, you read that right, hash table lookup at its best can achieve this constant level of lookup efficiency, isn’t it amazing?

Hashing (division and remainder method)

Let’s look at a very simple hash algorithm through a practical example. In the case of a large amount of data, we often need to perform table operations on the data table. The simplest scheme is to mold the data table according to a certain field, such as ID. In other words, if we want to divide 20 tables, we divide the ID of the data by 20 and get its remainder. This data is then added to the table corresponding to the remainder. We simulate this operation in code.

or($i=0;$i<100;$i{+ +)$arr[] = $i+1;
}

$hashKey = 7;
$hashTable = [];
for($i=0;$i<100;$i{+ +)$hashTable[$arr[$i] %$hashKey=] []$arr[$i];
}

print_r($hashTable);
Copy the code

In this case, we assume that we are putting 100 entries into seven tables, just use the modulo operator % to get the remainder, and then place the entries into the corresponding array subscripts. The 100 values are placed in indices from 0 to 6 in the array. In this way, we have realized the simplest idea of a data table. Of course, real business development is much more complicated than this. Because we’re looking at different scenarios to figure out exactly how to split tables, how many tables to split, and how to expand, which means it’s a lot more complicated than what we’ve written here.

As a demonstration code, this hash form is actually the most classic and most used division remainder method in hash table lookup. In fact, there are other methods, such as square center method, folding method, number analysis and so on. Their core idea is to act as a hash algorithm, allowing the original data to correspond to a new value (position).

In fact, the most typical of similar ideas is the hash operation of MD5 (), which produces different values for different contents. Other Key and value caching databases, such as Redis and Memcached, hash the Key and store it in memory for fast lookup.

Hashing problem (Linear detection method)

The problem with the above example is that if the value of the hash algorithm is very small, there will be a lot of duplicate conflicting data. If it is a real hash table storing data, such a store does not help us find the data we need quickly and accurately. Search search, its core ability is actually in the search. So if we’re given some random data, how do we keep it at the same length and avoid collisions? That’s what we’re going to be looking at in hashing.

$arr = [];
$hashTable = [];
for($i=0;$i<$hashKey;$i{+ +)$r = rand(1.20);
    if(! in_array($r.$arr)) {$arr[] = $r;
    }else{
        $i-; } } print_r($arr);
for($i=0;$i<$hashKey;$i{+ +)if(!$hashTable[$arr[$i] %$hashKey]) {$hashTable[$arr[$i] %$hashKey] = $arr[$i];
    }else{
        $c = 0;
        echo 'Position of conflict:'.$arr[$i] %$hashKey.', value: '.$arr[$i], PHP_EOL;
        $j=$arr[$i] %$hashKey+1;
        while(1) {if($j> =$hashKey) {$j = 0;
            }
            if(!$hashTable[$j]) {$hashTable[$j] = $arr[$i];
                break;
            }
            $c+ +;$j+ +;if($c> =$hashKey) {break;
            }
        }
    }
}
print_r($hashTable);
Copy the code

This time we just generate seven random numbers, and let them divide and mod 7 again. We also need to store their hashed results in another array, which we can think of as memory space. If you have the same hash, then of course you can’t put it in the same space, because if you have two pieces of data in the same space, you don’t know which one you’re really taking.

In this code, we use the linear probing method of the open address method. This is the easiest way to handle hash conflicts. Let’s take a look at the output and then analyze what happens when the conflict occurs.

// Array
/ / (
// [0] => 17 // 3
// [1] => 13 // 6
// [2] => 9 // 2
// [3] => 19 // 5
// [4] => 2 // 2 -> 3 -> 4
// [5] => 20 // 6 -> 0
// [6] => 12 // 5 -> 6 -> 0 -> 1
// )
// Conflict position: 2, value: 2
// Conflict position: 6, value: 20
// Conflict position: 5, value: 12
// Array
/ / (
/ / [3] = > 17
/ / [6] = > 13
/ / [2] = > 9
/ / [5] = > 19
/ / [4] = > 2
/ / [0] = > 20
/ / [1] = > 12
// )
Copy the code
  • First, the numbers we generate are 17, 13, 9, 19, 2, 20, 12.

  • 17%7=3, 17 saved in subscript 3.

  • 13%7=6, 13 is saved in subscript 6.

  • 9 percent. 7=2, 9 is saved in subscript 2.

  • 19%7=5, 19 is saved in subscript 5.

  • 2%7=2, ok, so there’s a conflict, 2%7 is also 2, but the index of 2 is already occupied, so let’s start at 2 and see if the index of 3 is occupied, so we get to 4, which is empty, so we save the 2 in index 4.

  • 20%, 7 is 6, and just like above, 6 is already occupied, so we go back to the 0 we started with, 0 is not occupied yet, so 20 is stored in index 0.

  • So the last 12%, 7 is 5, and it’s going to go through subscript 5, 6, 0, 1 and then it’s going to go to subscript 1.

The resulting output is the result of our final array output. As you can see, linear detection is basically looking down one by one if you find that the position is occupied. So the time complexity is actually not very good, and of course, the best case is that the total length of the data matches the length of the hash key, which is O(1).

Of course, in addition to linear detection, there are quadratic detection (square), pseudo-random detection and other algorithms. In addition, linked lists can be used to solve the problem of hash conflict. These contents you can refer to the relevant documents or books.

conclusion

The lookup function at the end of a hash hash is actually the same as the process we generated the hash table above, and the resolution of conflicts is also the same, so I won’t talk about it here. For hash this, whether it is teaching materials or all kinds of learning materials, in fact, the content of the introduction is not particularly much, so, we are also in a state of mind to simply understand the hash hash this knowledge, more content we can study more share ha!

Test code:

Github.com/zhangyue050…

Reference Documents:

Data Structures, 2nd edition, Weimin Yan

Data Structures, 2nd edition, Chen yue

Follow the public account: [Hardcore project manager] to get the latest articles

Add WeChat/QQ friends: free xiaoyuezigonggong / 149844827 】 【 PHP, project management, learning materials

Zhihu, Public Account, Douyin, Toutiao Search [Hardcore Project Manager]

ID of station B: 482780532