Before we talk about hash tables, let’s look at the process of inserting data into an array.

  1. Confirm the subscript of the inserted data;
  2. Put the data into an array.

Take the everyday example of lining people up by height: we want a list of names ordered from shortest to tallest. We repeat this process:

  1. Find the shortest person in the remaining group;
  2. Append that person's name to the end of the array;
  3. Repeat until the group is empty.

Arrays are sufficient here because we always know which index a name should go to: the next free slot. But that is also a big limitation — in many other cases, we have no direct way to map a piece of data to an array index.

Now look at a scenario where we need to store each person's name together with the corresponding height. What do we do with arrays then? We can change the element stored in the array to the pair (name, height); to look up a person's height, we iterate through the array until we find the name and return the height. The problem with this approach is that lookups are too slow: O(n) per search. Can we find elements in O(1) time?
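The O(n) lookup above can be sketched in Java like this (the names and heights are made-up sample data):

```java
// Storing (name, height) pairs in a plain array forces a linear scan:
// every lookup walks the array until the name matches, so it costs O(n).
public class LinearScan {
    static String[] names = {"xiaoming", "xiaohong", "xiaogang"};
    static int[] heights = {175, 160, 180};

    static int heightOf(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return heights[i]; // found after scanning up to n entries
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        System.out.println(heightOf("xiaohong")); // prints 160
    }
}
```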

This is where the HashTable is needed.

What is a hash table

A hash table (HashTable) stores key-value pairs: each key maps to a value.

In other words, a hash table can directly store "name": height, which solves the problem above.

There are three key points to hash tables:

  • The hash function
  • Hash conflict
  • Load factor

The hash function

Now let’s explain the hash function in terms of the figure above.

Now we’re going to add (“xiaoming”, 175) to our hash table. What’s going on?

First, the hash function of the table converts the key "xiaoming" into an array index i. We then store the value 175 at array[i].

Here the hash function satisfies several properties:

  1. Hashing the same key always gives the same output;
  2. Ideally, different keys give different outputs (in practice this cannot always hold, which causes the conflicts discussed below);
  3. The output falls within the valid index range of the array.

When we query the height of “xiaoming” in the hash table, the hash function first calculates the array index corresponding to the key, and then the data is fetched from the array.
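A minimal sketch of this put/get path (the hash function here simply folds String.hashCode() into the array range; conflicts are deliberately ignored for now and covered in the next section):

```java
// Minimal hash-table sketch: hash function + array.
// Conflicts are ignored here on purpose; they are the next topic.
public class NaiveHashTable {
    static int[] slots = new int[16];

    // Map a key to a valid array index: hash, then fold into [0, length).
    static int indexFor(String key) {
        return Math.abs(key.hashCode() % slots.length);
    }

    static void put(String key, int value) {
        slots[indexFor(key)] = value;
    }

    static int get(String key) {
        return slots[indexFor(key)];
    }

    public static void main(String[] args) {
        put("xiaoming", 175);
        System.out.println(get("xiaoming")); // prints 175
    }
}
```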

Here we can see that hash table = hash function + array: the hash table is an extension of the array, evolved from it to solve problems the array alone cannot. It is also worth emphasizing that there is no universally optimal data structure or algorithm, only ones better suited to a particular scenario and problem.

As we will see later, a hash table can also take another form: hash table = hash function + array + linked list.

Hash conflict

You might think: during storage, even though the hash function gives us an integer for each key, the array is finite. Say the integer is k and the array length is n; we get the actual array index by computing k % n. Two different integers can then map to the same index — for example, 3 % 2 == 5 % 2. What do we do if that slot of the array is already occupied?

Since it is the output of the hash function that causes the clash, this situation is called a hash conflict (also known as a hash collision).
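The 3 % 2 == 5 % 2 example above, made concrete:

```java
// Two different hash values land on the same array index after the modulo —
// exactly the conflict described above.
public class CollisionDemo {
    public static void main(String[] args) {
        int n = 2; // array length
        int indexA = 3 % n;
        int indexB = 5 % n;
        System.out.println(indexA == indexB); // prints true: both are index 1
    }
}
```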

There are two corresponding solutions:

  1. Open addressing
  2. The linked list method

Open addressing

One open addressing technique is linear probing: when a conflict occurs, we move forward through the array until we find a free slot. With this approach, as more and more data is inserted, the likelihood of collisions increases and probing takes longer. Insertion and search in the hash table then degrade, with a worst case of O(n). ThreadLocalMap in Java uses this method to resolve hash conflicts.
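A sketch of linear probing (to keep it short, the table here is fixed-size and assumed never to fill up; a real implementation would resize, as discussed next):

```java
// Linear probing: on a conflict, step forward through the array
// (wrapping around) until a free slot is found.
public class LinearProbing {
    static String[] keys = new String[8];
    static int[] vals = new int[8];

    static int indexFor(String key) {
        return Math.abs(key.hashCode() % keys.length);
    }

    static void put(String key, int value) {
        int i = indexFor(key);
        // Assumes the table never fills up; a real table would resize.
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length; // probe the next slot
        }
        keys[i] = key;
        vals[i] = value;
    }

    static Integer get(String key) {
        int i = indexFor(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return vals[i];
            i = (i + 1) % keys.length; // keep probing past other keys
        }
        return null; // hit an empty slot: the key is absent
    }

    public static void main(String[] args) {
        put("xiaoming", 175);
        put("xiaohong", 160);
        System.out.println(get("xiaoming")); // prints 175
    }
}
```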

How do we mitigate these problems when we use linear probing?

Here we introduce the load factor, which equals the number of stored elements divided by the total capacity of the hash table. We first pick a threshold; whenever the current load factor exceeds it, we expand the array so that a certain amount of free space always remains, which keeps linear probing fast. The threshold must balance probing efficiency against the cost of expansion, depending on the situation — in essence, a trade-off between time and space. When we want the hash table to run faster, we set a smaller threshold, leaving more free slots for probing. When memory is tight, we set a larger one.
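The expansion trigger can be sketched like this (0.75 is an assumed threshold, which also happens to be java.util.HashMap's default; the actual rehashing of entries into the bigger array is omitted):

```java
// Load factor = stored elements / table capacity. Once it crosses a
// threshold, the table doubles its capacity to keep probing fast.
public class LoadFactorDemo {
    static final double THRESHOLD = 0.75; // assumed; same default as java.util.HashMap
    static int capacity = 4;
    static int size = 0;

    // Called after each insertion; real code would also rehash every entry.
    static void maybeResize() {
        if ((double) size / capacity > THRESHOLD) {
            capacity *= 2;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            size++;
            maybeResize();
        }
        System.out.println(capacity); // prints 8: the 4th insert exceeded 0.75
    }
}
```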

Another open addressing method is double hashing: when the index produced by one hash function conflicts, a second hash function is used to recalculate the index, repeating until a free position is found.
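A sketch of the double-hashing probe sequence (the two hash functions and the prime capacity 11 are illustrative choices, not a fixed recipe):

```java
// Double hashing: the probe step size comes from a second hash function,
// so keys that collide on the first hash still follow different sequences.
public class DoubleHashing {
    static final int CAPACITY = 11; // a prime capacity keeps probes well spread

    static int h1(String key) {
        return Math.abs(key.hashCode() % CAPACITY);
    }

    // The second hash must never be 0, or probing would never advance.
    static int h2(String key) {
        return 1 + Math.abs(key.hashCode() % (CAPACITY - 1));
    }

    // Array index tried on the i-th probe for this key.
    static int probe(String key, int i) {
        return (h1(key) + i * h2(key)) % CAPACITY;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(probe("xiaoming", i)); // three distinct slots
        }
    }
}
```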

The linked list method

The linked list method takes a different approach: it changes each array slot into a linked list. When a conflict occurs, instead of searching for another free slot in the array, it stores the data in the linked list at the same index. The figure above shows the structure of the linked list method. At this point we find that hash table = hash function + array + linked list. LinkedHashMap in Java uses the linked list approach to resolve conflicts.

When we insert data, it costs O(1) — the time to insert an element at the head of a linked list. When we query, we must walk the linked list at the key's index, so with chain length m the time complexity is O(m).
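A sketch of the linked list method (separate chaining): each bucket holds the head of a chain, insertion is at the head in O(1), and a lookup walks one chain in O(m):

```java
// Separate chaining: each array slot holds the head of a linked list of
// (key, value) nodes that hash to that slot.
public class ChainedHashTable {
    static class Node {
        String key; int value; Node next;
        Node(String key, int value, Node next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    static Node[] buckets = new Node[8];

    static int indexFor(String key) {
        return Math.abs(key.hashCode() % buckets.length);
    }

    static void put(String key, int value) {
        int i = indexFor(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // update in place
        }
        buckets[i] = new Node(key, value, buckets[i]); // O(1) insert at head
    }

    static Integer get(String key) {
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value; // O(m) walk of one chain
        }
        return null;
    }

    public static void main(String[] args) {
        put("xiaoming", 175);
        System.out.println(get("xiaoming")); // prints 175
    }
}
```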

But this approach has a weakness that attackers can exploit. An attacker can keep inserting keys that all hash to the same index, producing one extremely long linked list at that position in the array. Queries then become very slow, effectively a denial of service against your program. We call this a hash collision attack.

So a hash function that distributes indexes uniformly is very important. Besides using a better hash function, we can also optimize the linked lists themselves — for example, converting long chains into red-black trees to solve the problem of slow lookups. This is exactly what Java 8 does.

From the analysis above, open addressing is better suited to scenarios with small data volumes and a small load factor, which is why ThreadLocalMap in Java uses it. The linked list method is better suited to storing large amounts of data and large objects.

What is a good hash table

Combined with the above, we know that a good hash table has the following characteristics:

  • Fast query, insertion, and deletion
  • Stable performance that does not degrade too severely
  • Reasonable memory usage

To ensure the above features, you need to:

  • A hash function whose output is fairly uniform
  • A suitable load factor and efficient dynamic expansion
  • A suitable method for resolving hash conflicts

Conclusion

In addition to understanding that a hash table is a data structure that stores key-value pairs, we also need to understand how inserting, deleting, and searching for a piece of data actually happens. We must also master the three key elements of a hash table: the hash function, the conflict resolution method, and the load factor. Finally, depending on our actual scenario, we can choose a suitable hash table — or implement our own by tuning these three elements.

Practice

Finally, I recommend doing these problems on LeetCode. Write your own solution first, then write another one using a hash table, and compare the time and space complexity of the two.

  • Valid Anagram
  • Two Sum
  • 3Sum
