Introduction to the

Because the physical address of the sequential table is continuous, the efficiency of querying the data of the corresponding index is very high. That is, the actual data is accessed at one time by using the offset

Hash table (hash table) principle: create a sequential table to save data, through a series of hash functions (also known as the hash algorithm), to achieve the keyword into the sequential table index

Pay attention to

Hash function no matter how to design, there will always be the same element mapping to a position, so will produce conflict, need conflict solution sequence table the size is bigger than the actual data, the actual data takes up the interval between the default is 0 ~ 1 (generally take up 0.6 ~ 0.9 better, too little space waste, more easy to conflict to reduce query speed)Copy the code

Therefore, it can be seen that the hash function consists of order table, hash function, and conflict resolution

The hash function

The hash store implements the transformation of keywords to addresses, and the linear table of the hash becomes a hash table or hash table

Common design methods of hash function: numerical analysis method, divide the remainder method (take the remainder method), square take the middle method, folding method

You can also design your own better hash function

Numerical analysis

According to the keyword characteristics, to design the hash function, the flexibility is very strong, more common

Case 1:

Given keywords: 12121201, 12121203, 12121243, 12121244, 12121246, 12121263, 12121290

You can see that the first six digits of a keyword are the same, and the last two digits are the key to determine the difference, so you can take the last two digits as the index

Since the array base collection ranges from 0 to 100, the sequence table array length is set to 100

The hash function takes the last two digits

long hashNum(long num) {
    return num % 1000000;
}
Copy the code

Mapping index: 01, 03, 43, 44, 46, 63, 90

Value setting process:

If the value is 12121243, use the hash function hashNum to generate index 43 and compare whether the value of index 43 is the current value. If it is, the search succeeds

Case 2:

Given keywords: 1212120112, 1212123232, 1212124345, 1212124497, 1212124635, 1212126379, 1212129045

As you can see, the first 6 bits of the keyword are the same, and the last 4 bits are different, but the number is relatively small. Creating 10,000 bits of space is a waste of space, so you can map it to 2 bits by using the hash function

Because the index is two digits long, the basic array collection is between 0 and 100, and the sequence table array length is still set to 100

The hash function and design can be added by adding the last two digits to generate a new number

long hashNum(long num) {
     num %= 1000000;
     return (num / 100 + num % 100) % 100;
}
Copy the code

Mapping index: 13, 64, 88, 41, 81, 22, 35

Hence the essence of digital analysis: design and optimize the hash function according to the keyword position and length of the data

Take more method

It is more important to choose the appropriate divisor p by taking the remainder to make the distribution even. Generally, the prime number is taken (n < p < m, where n is the number of elements to be stored and m is the hash table length), which is common

Case study:

Given keywords: 13838, 4847474, 1343535, 233555, 233425, 663443, 6654545, 74465745

With the naked eye can not distinguish its digital features, relatively messy, use the mod method to deal with

Set the length of the order table as 20(m), a total of 8(n) numbers, divisor p(20), and take mod as the index

The hash function selects the divisor and takes the mod

int hashNum4(int num) { return num % 20; // the divisor is between n and m;Copy the code

Middle square

Square the value, take the middle of the method, suitable for a single digit value is not uniform (digit repetition rate is high, such as: 11010101), more common

Case study:

Given the keywords: 444, 323, 333, 111, 666, 555, 533, 222

Set the sequence length to 10(m), there are 8(n) numbers, then the square result takes the middle bit as the index

The hash function is pinched, just take the middle number

int hashNum5(int num) { return num * num % 100 / 100; // The current number is three digits, and the range is 0 to 1000000}Copy the code

Folding method

The keyword is divided into several groups according to a certain length from left to right, and then added to take the keyword with the same length. It is suitable for storing long numbers.

For example, 1234343535. If the length of the hash table is 1000, it is divided into 123, 434, 353, 5, and then add the three digits. That is, the stored keyword is 915

The hash function looks like this:

int hashNum6(long num) {
    int res = 0;
    while (num > 0) {
        res += num % 1000;
        num /= 1000;
    }
    return res % 1000;
}
Copy the code

Conflict resolution

From the process of generating the hash table above, it can be seen, or guessed, that whether or not there are duplicate numbers, there must be a case where the keywords map to the same index at some point. This case is called a conflict, and thus leads to a conflict resolution

There are two solutions: open address method and chain address method

Open address method

Open address method is aimed at only one order table array to store the content of the case, all elements (including conflict elements) will be stored in the order of the table, the elements conflict, then parallel backward, find the space to store

In order to resolve conflicts, it is possible that the index position corresponding to the hash value is not the corresponding hash value element

The open address method is divided into two conflict resolution schemes: linear detection method and chain address method

Below through the linear detection method, chain address method introduced development address method

1. Linear detection

When the address corresponding to the hash value has a value, then the conflict, each time backward 1 bit, until the space is found, the order table as a logical ring, if you find a circle back to the current node, the storage is full, the end

The following figure illustrates the conflict handling steps of the linear detection method

Int crashLineHandle(int *hashList, int index, int length, int data) {int lastIndex = index; do { index++; index = index % length; int tem = hashList[index]; if (tem) continue; else return index; } while (lastIndex == index); return -1; }Copy the code

2. Double hash function detection method

Similar to the linear detection method, when the address corresponding to the hash value has a value, the conflict will occur. Each time, the sequence table will be moved back n bits until the vacancy is found. Consider the order table as a logical ring.

If the figure is shown, demonstrate the processing steps of the double hash function probe method, where the probe interval is 3

Int crashDoubleLineHandle(int *hashList, int index, int length, int data) {int lastIndex = index; do { index += 11; // In contrast to the current detection method, set a certain interval, and the total length of length is mutually prime, can reduce the detection interval, for example, 11 index = index % length; int tem = hashList[index]; if (tem) continue; else return index; } while (lastIndex == index); return -1; }Copy the code

Chain address method

Unlike the open address method, the chained address method stores data only at the index corresponding to the hash value in the sequential table. Once the hash value is repeated, the position is transformed into a vertical list array, lengthwise expanding the data of the repeated hash value

That is, at the index corresponding to the hash value of the chain address method, only the data corresponding to the hash value of the index will exist, and its content is stored in the vertical list

The following figure shows the chain address method, which is more commonly used