Hash tables and Hash algorithms

An overview,

If a keyword is k, its value is stored in f(k). F (k) is a Hash function, and the table created based on this idea is a Hash table.

However, during the calculation of f(k), different k will calculate the same value, which is also called Hash conflict. To solve this problem, we also need a method to deal with conflicts.

To sum up, based on the Hash function f (k) and the methods of dealing with conflict to map a set of keywords to a limited set of contiguous address (interval), and take the keyword in the address of “like” as recorded in the storage location in the table, the table is called a Hash table, this mapping process is referred to as the Hash tabulation or Hash, the storage location of Ha Sh address.

Can’t read the text? See the following figure for the Hash calculation

Two, common methods

‘Direct addressing’ : take a keyword or a linear function of the keyword value as a hash address.

That is, H(key)=key or H(key)= A ·key + B. Where a and b are constants (such Hash functions are called self functions). If H(key) already has a value, go to the next one until H(key) has no value, and put it in. For example, there is a statistical table of population numbers from 1 to 100 years old, in which age is the keyword and the hash function takes the keyword itself. As shown in the following table:

address	01	02	03	.	20	21	.	200
age	1	2	3	.	20	21	.	200
The number of	200	199	198	.	180	179	.	1

So if you want to ask how many 25-year-olds there are, you just look up item 25 in the table. The size of the address set and keyword set obtained by direct addressing are the same. Therefore, there is no conflict for different keywords, but in practice you can use this hash function in very few cases.

‘number analysis’ : assuming that the keyword is a number based on r (e.g., a decimal number based on 10), and that the possible keywords in the hash table are known in advance, then take several digits of the keyword to form the hash address.

Such as a set of the employee’s date of birth, we found the same date of birth of the first few digits, in this case, the chances of conflict will be very big, but we found that date back several month and date number difference is very big, if use the back of the Numbers to form a hash address, will significantly reduce their risk of conflict. Therefore, numerical analysis method is to find out the rule of numbers and use these data to construct the hash address with low probability of conflict as far as possible.

‘square take middle method’ : when can not determine the keyword in which a few more evenly distributed, you can first ‘find the keyword square value’, and then according to the need to take ‘square value of the middle several bits as hash address’. This is because the middle bits after the square are related to each bit in the keyword, so different keywords will produce different hash addresses with a high probability.

We use the alphabetic position of the alphabet as the internal code of the English letter. For example, the internal code of K is 11,E is 05,Y is 25,A is 01, and B is 02. Thus, the internal code of the keyword “KEYA” is 11052501. Similarly, we can get the internal code of the keywords “KYAB”, “AKEY” and “BKEY”. After the keyword is squared, bits 7 to 9 are taken out as the hash address of the keyword, as shown in the figure below:

‘fold’ : divide the keyword into parts with the same number of digits, the last part of the number can be different, and then take the sum of these parts (remove the carry) as the hash address. There are two methods of digital superposition: shift superposition and interboundary superposition. Shift stacking is to align the lowest order of each part after segmentation, and then add; Interboundary stacking is when you fold back and forth along the partition boundary from one end to the other, and then align and add.

For example: every Chinese and Western book has an International standard book number (ISBN), which is a 10-digit decimal number. If you want to build a hash table with it as the key word, when the collection of books is less than 10000, you can use the folding method to construct a four-digit hash function. In the folding method, there are two methods of digit superposition: shift superposition and interboundary superposition. Shift stacking is to align the lowest order of each part after segmentation, and then add; Interboundary stacking is when you fold back and forth along the partition boundary from one end to the other, and then align and add. For example, the hash address of international standard book no. 0-442-20586-4 is shown in Figure (a) and (b) respectively.

Random number method: Selects a random function and takes the random value of the keyword as the hash address, that is, H(key)=random(key) where random is a random function and is usually used when the keyword length is different.
Division remainder method: the hash address is the remainder of the key word divided by a number p that is not larger than the length of the hash table m. That is, H(key) = key MOD p,p<=m. Not only can the key directly modulus, but also after folding, square medium operation modulus. The choice of P is very important, generally take prime number or m, if the choice of P is not good, it is easy to produce synonyms.

Dealing with conflicts

‘open addressing’ : Hi=(H(key) + di) MOD m, I =1,2… ,k(k<=m-1), where H(key) is the hash function,m is the length of the hash table,di is the incremental sequence, there can be the following three methods:
- Di = 1, 2, 3,… + di=1^2,-1^2,2^2,-2^2,⑶^2,… ,± (k)^2,(k<=m/2) is called quadratic detection hashing, + di= pseudo-random number sequence is called pseudo-random detection hashing.
Hi=RHi(key), I =1,2… ,k, RHi are all different hash functions, that is, when the address conflict is generated by synonyms, the address of another hash function is calculated until the conflict no longer occurs. This method is not easy to produce “aggregation”, but increases the calculation time.
A HashMap is used to store all records whose keywords are synonyms in the same linear linked list. Suppose that the hash address generated by a hash function is in the interval [0, m-1], then set up a pointer vector Chain ChainHash[m] whose initial state of each component is a null pointer. Records with van hash address I are inserted into the list with ChainHash[I] as the head pointer. The insertion position in the linked list can be at the top or bottom of the table; Can also be in the middle to keep synonyms ordered by keyword in the same linear linked list.
It is assumed that the range of hash function is [0, m-1], then let the vector HashTable[0.. m-1] be the basic table, each component stores a record, and let the vector OverTable[0..v] be the overflow table. All records in the base table where the keyword is a synonym for the keyword, regardless of their hash address derived from the hash function, are filled into the overflow table in the event of a conflict.

Pay attention to the subscription number program ape small ha, contact the author, join the learning exchange group and partners to learn progress together

An overview,

Two, common methods

Dealing with conflicts

Related Posts

The development journey from monolithic applications to microservices

Go Language introduction | Go Theme month

[Auxiliary tools] to achieve QQ message bombing