In this article we are going to look at one of the most useful data structures — a Hash Table, or Hash Table.

Hash tables use arrays to randomly access data by subscript, so the hash table is actually an extension of the array, evolved from the array. If there is no array, so to speak, there is no hash table.

A hash table stores data composed of keys and values. For example, we store the gender of each person as data, where the key is the name of the person and the value is the corresponding gender, where M stands for male and F for female.

Why do I need a hash table?

To compare this to a hash table, let’s store this data in an array.

Here we have six bins (arrays of length 6) to store data. Suppose we need to query Ally for her gender, and since we don’t know which bins Ally’s data is stored in, we have to start from scratch, which is called linear lookup. In general, we can think of keys as data identifiers and values as data contents.

Starting with box 0, we find that the key stored in box 0 is Joe instead of Ally, so we look for box 1.

Oh dear, it’s not Ally in box 1 either, so we have to keep looking down.

It’s a little bad, and boxes two and three are not Ally.

When we found box no. 4, we found that the key of the data was Ally. By taking out the corresponding value of the key, we knew that Ally’s gender was female (F).

Through the above search process, we found that the more data, the longer the linear search time. As you can see, arrays are not suitable for storing data because the query of the data is time-consuming.

But we can solve this problem by using a hash table. First we have an array. This time we use an array of five bins to store the data.

Try to store Joe and use the Hash function to compute the key of Joe, which is the Hash of the string Joe, for example, to get 4928.

Divide the hash value by the length of the array, 5, and find the rest. This is called a mod operation, where the result of the mod operation is 3.

So, we store Joe’s data in box 3 of the array and repeat the previous steps to store the rest of the data in the array.

Sue key hash value is 7291, mod 5 result is 1, store Sue data in box 1.

The hash value of Dan key is 1539, the result of mod 5 is 4, and the Dan data is stored in box 4.

The Nell key has a hash value of 6276 and mod 5 has a result of 1. It should be stored in box 1 of the array, but box 1 already stores Sue’s data, so duplicate locations are called conflicts.

In this case, you can use linked lists to store new data after existing data (linked list method).

The Ally key has a hash value of 9143 and mod 5 results in 3, which should be stored in box 3 of the array, but box 3 already has Joe’s data, so use a linked list and store Ally’s data after it.

The Bob key has a hash value of 5278 and the result of mod 5 is 3. It should be stored in box 3 of the array, but box 3 already has Joe and Ally’s data, so use the linked list to continue storing Bob’s data after Ally.

Once you’ve stored all your data like this, you’re done with the hash table.

Let’s say we want to query the gender of Dan.

To figure out which bin Dan is stored in, we first need to hash the Dan key, then mod it, and get a result of 4, so we know it’s stored in bin 4.

Checking box 4, we can see that the key of data in box 4 is consistent with Dan, so we take out the corresponding value, so we know Dan’s gender is male (M).

So what do you do when you want to query Ally’s gender? To find out where to store it, you need to hash out the Ally key and mod it, which gives you a value of 3.

However, the key of the data in box 3 is Joe, not Ally, so we need to do a linear lookup of Joe’s linked list.

So we find the data with the key Ally, pull out the corresponding value, and we know Ally’s gender is female (F).

Hash conflict

In a hash table, we can use hash functions to quickly access the target data in an array. If a hash conflict occurs, we use linked lists for storage, so that we have flexibility regardless of the amount of data.

If the array space is too small, the hash table is prone to collisions, and linear lookups will be used more frequently. On the other hand, if the array space is too large, there will be many empty boxes, resulting in a waste of memory. Therefore, it is important to set the appropriate space for the array.

In the process of data storage, if there is a conflict, you can use the linked list to insert new data after the existing data to solve the conflict, which is called the linked list method, also known as the linked address method.

One of the methods to resolve conflicts in Java collection class HashMap is to use the linked list method, recommended reading HashMap source.

In addition to the chain address method, there are several ways to resolve conflicts. Among them, the open address method, or open addressing method, is widely used. This method means that when a conflict occurs, an alternate address (the location in the array) is immediately calculated and stored in it. If there is still a conflict, the calculation of the next alternate address continues until there is no alternate address, which can be calculated by using hash functions or linear detection methods several times.

In Java, ThreadLocal uses the open address method.

The design of hash function determines the probability of hash conflict and the performance of hash table.

conclusion

This article mainly talked about some basic hash table knowledge, including the origin of hash table, hash conflict resolution method.

A hash table, also called a hash table, is derived from an array. It extends the data structure of an array with the help of hash functions. It takes advantage of the feature that an array supports random access to elements according to subscripts and is a collection of key-value mappings.

The two core problems of hash table are hash function design and hash conflict resolution. For a given Key, the hash table can read and write in approximately O(1) time. Hash table realizes the conversion of Key and array subscript by hash function, and resolves hash conflicts by open addressing and linked list method. The design of hash function determines the probability of hash conflict and the performance of hash table.

In JDK 8, for example, we introduced a red-black tree. If the list length is too long (default is more than 8), the list will be converted to a red-black tree, so we can use the red-black tree to quickly add, delete, change, and search. Improve the performance of HashMap.

reference

My First Algorithm Book

https://github.com/wupeixuan/JDKSourceCode1.8