Hash table principle

Hash table is one of the most common data structures, for its use, we are very familiar with, here is a detailed discussion of its principle. The bottom layer of a Hash table is actually stored based on an array. When inserting a key-value pair, it is not directly inserted into the array. Instead, the Hash value is obtained by the Hash operation on the key and then modulo with the array capacity to obtain the position in the array before inserting. If the specified key value matches the stored key, the key pair is returned. If the specified key value does not match, no corresponding key pair exists in the Hash table. The advantage of this is that it can be done in search, insert, delete, and so onWorst-case scenarioOf course, this is the most extreme case and rarely encountered.

Implement a Hash function
Resolve Hash conflicts properly
Implement the operation method of HashMap

The Hash function

The Hash function is very important. A good Hash function not only performs well, but also distributes the values stored in the underlying array more evenly, reducing the occurrence of conflicts. The reason for this is to reduce collisions is because the process of taking the Hash is actually mapping the input key (the domain) into a very small space, so collisions are unavoidable and all you can do is reduce the probability of Hash collisions. In Rust and many other languages, the default hash algorithm is SipHash.

By default, Rust’s HashMap uses the SipHash hash algorithm, which is designed to prevent hash table collision attacks while providing reasonable performance on a variety of workloads. While SipHash shows a competitive advantage in many cases, one of the cases where it is slower than other hashing algorithms is the use of short keys, such as integers. This is why Rust programmers often observe that HashMap underperforms. FNV hashing is often recommended in these cases, but note that it does not have the same crashproof properties as SipHash.

In addition to the Hash function itself, the size of the underlying array is also an important reason for Hash collisions. Obviously, in the extreme case if the size of the array is one, there’s bound to be a collision, and if the size of the array is infinite, there’s very little chance of a collision. So, hash collisions also depend on the load factor. The load factor is the ratio of the number of key-value pairs stored to the size of the array. For example, if the array is 100, 90 key-value pairs are currently stored and the load factor is 0.9. The load factor determines when to expand the hash table. If the value of the load factor is too large, it indicates that the stored key-value pairs are close to capacity, increasing the risk of collisions. If the value is too small, it wastes space.

Therefore, since conflicts cannot be avoided, there must be a mechanism for resolving Hash conflicts.

Several ways to handle conflicts

There are four main types of conflict handling methods:

External zipper method (common)
Open addressing (common)
Public overflow area (not commonly used)
Rehash (uncommon)

External zipper method

The main idea is to resolve conflicts based on the combination of arrays and linked lists. Instead of storing key-value pairs directly in buckets, each Bucket is linked to a linked list. When conflicts occur, the conflicting key-value pairs are inserted into the linked list. Exterior zip method is a bit method is simple, also won’t produce gathered phenomenon between the synonyms (compared to open addressing method), and its spatial structure is a dynamic application, so more suitable for uncertain table long: the disadvantage is that list pointer need extra space, when they encounter collisions denial of service degradation for singly linked lists.

Synonyms: Two elements that have the same index address using the Hash function are called synonyms.

The following are two implementations of the external zipper method. The main difference is whether the Bucket stores data.

Open addressing

The main idea is to go straight to the next empty address in case of a conflict, as long as the underlying table is large enough, it will always find an empty address. The act of finding the next address is called probing.`,Is the hash function,Is the length of the hash table,Is the incremental sequence,Is the number of conflicts that have occurred. There are a variety of detection methods based on the delta sequence method:

Called Linear Probing; namelyOr some other linear function. It is equivalent to probing the address table one by one until it finds an empty cell and stores the hash address in that empty cell.
It is called a Quadratic Probing. Relatively linear detection, equivalent to conflict detection intervalIs the location of the unit empty? If it is empty, store the address in it.
, called pseudo-random detection.

The following figure shows linear detection:

Public overflow area

The main idea is to create a separate public area where conflicting key-value pairs can be placed. Not commonly used, not detailed here.

Then the Hash method

The main idea is that when there is a conflict, use another Hash function to compute the Hash value. Not commonly used, not detailed here.

Implement hash table operation method

Main is:

insert
remove
get
contains_key
……等等……

The most important operations are insert, find, and delete.

See the Hash table document

Pay attention to wechat public number, push common data structure, TCP/IP, distributed, back-end development and other technology sharing, learn and progress together!

The Hash function

Several ways to handle conflicts

External zipper method

Open addressing

Public overflow area

Then the Hash method

Implement hash table operation method

Related Posts

Escape the Big Maze with Leetcode 1036

Efficient development: Annotating the Controller layer

Technical Summary: Chapter 10: Dynamic selection of profiles by Maven Profiles