A simple case

How to store 50,000 words?

Scheme 1: arrays

  • The drawback is that when you need a particular word (such as 'java'), it is hard to know where in the array it is stored, so lookup is very inefficient.

Scheme 2: linked lists

  • Lookup is already difficult in an array, and it is even worse in a linked list.

For more information about arrays and linked lists, see Data Structures and Algorithms.

Scheme 3: convert words to array subscripts

Convert each word to an array subscript, so that the information for a word can be found by accessing the element at that subscript directly.

Let's say we have the words java, python, go, and swift, and we convert each of them to an array subscript, for example:

java -> 1000

python -> 2000

go -> 3333

swift -> 4000

This way, when you need to look up the word python, you immediately know that its subscript is 2000 and can access the element directly.
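A minimal sketch of this idea (the subscripts and stored values below are the made-up ones from the example):

// Scheme 3 sketch: each word's subscript is assumed to be known in advance,
// so its data can be read directly by index without searching.
var words = []
words[1000] = 'data for java'
words[2000] = 'data for python'
words[3333] = 'data for go'
words[4000] = 'data for swift'

console.log(words[2000]) // 'data for python' -- direct access by subscript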

Hash table

A hash table is the concrete implementation of scheme 3 above. Here is a definition:

A hash table (also called a hash map) is a data structure that accesses a storage location directly based on a key: a function of the key is computed, and the result maps the data being queried to a location in the table.

The underlying structure of a hash table is an array, because it relies on the array's ability to access data randomly by subscript. The difference is that the subscript is produced by a hash function: the process of converting a key into a subscript is called hashing, and the array that records the results is the hash table.

Hash tables are therefore usually implemented on top of arrays, which gives them major advantages:

  1. They provide very fast insert, delete, and lookup operations.
  2. The time complexity of lookup, insert, and delete is O(1). (If the probability of hash conflicts increases, the time complexity increases as well.)

The hash function

A hash function (also called a hash algorithm) is, simply put, the method used to compute the hash value for a given key.

A basic hash function needs to satisfy the following requirements:

1. The computed hash value is a non-negative integer. (Array subscripts start at 0, so the hash function must produce a non-negative integer.)

2. Under the same hash function, if two input keys are the same, the two generated hash values are the same.

3. Under the same hash function, if two input keys are different, the two generated hash values should also be different (ideally, even keys that differ only slightly should produce very different hash values).

Of course, a good hash function should also have the following properties:

1. Fast to compute

The main advantage of a hash table is efficiency, so the hash value of an element must be quick to compute. One way to keep it fast is to use multiplication and division as little as possible.

2. Distributes keys evenly

Map elements to different locations as much as possible, so that hash conflicts are reduced and elements are evenly distributed across the hash table. Generally speaking, the longer the hash value, the lower the probability of a hash conflict.

3. The original data is hard to recover from the hash

This matters especially for hash algorithms used in security and encryption, where it prevents the original data from being leaked.

The following figure illustrates how scheme 3 is implemented and how the hash table in this example is generated.

For easy copying, here is the hashFun function:

function hashFun (str, size) {
  var hashCode = 0
  // Horner's rule: treat the character codes as digits of a base-37 number
  for (var i = 0; i < str.length; i++) {
    hashCode = 37 * hashCode + str.charCodeAt(i)
  }
  // Compress the hash code into the range of valid array subscripts
  var index = hashCode % size
  return index
}

Hash conflict

A hash table uses a hash function to convert a key into the index of the corresponding array slot. However, even a good hash function cannot avoid the situation where two different keys produce the same hash value; this is called a hash conflict (or collision).

For example, with the hashFun function above, both hashFun('react', 7) and hashFun('go', 7) return 2, causing a hash conflict.
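As a quick check with the hashFun defined above:

console.log(hashFun('go', 7))    // 2
console.log(hashFun('react', 7)) // 2 -- the same subscript, so the two keys conflict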

The common methods for resolving hash conflicts are open addressing, the chain address method, and double hashing.

Open addressing

Idea: when a hash conflict occurs, probe for a free location and insert the element there.

One of the simpler probing methods is linear probing: when the location computed for a key is already occupied, we start from the current location and check the following positions one by one until a free one is found.
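A minimal sketch of insertion with linear probing, using the hashFun defined earlier (a simplified, stand-alone example that stores raw keys and assumes the table is never completely full):

// Insert a key into an array-backed table using linear probing.
function linearInsert (table, key) {
  var index = hashFun(key, table.length)
  // Probe forward until a free slot is found
  while (table[index] != null) {
    index = (index + 1) % table.length
  }
  table[index] = key
  return index
}

var table = new Array(7)
linearInsert(table, 'go')    // lands at index 2
linearInsert(table, 'react') // index 2 is taken, so it lands at index 3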

Besides insertion, lookup and deletion also have their own particularities:

Lookup

Suppose the words 'go' and 'react' are stored at array subscripts 2 and 3 respectively, following the linear probing logic for choosing an insertion location. When we search for the word 'react', hashFun gives subscript 2, but 'react' is actually stored at subscript 3.

So during the actual lookup, if the word at the computed subscript is not 'react', we check the next position, and so on, until we either find 'react' or reach an empty position.

Delete

In general, deleting data could simply mean emptying the slot, but here a deletion mark (for example, a 'deleted' flag) should be placed in the slot instead, so that later lookups do not fail.

This is because, during a lookup with linear probing, finding an empty position lets us conclude that the data is not in the hash table. If a slot that was occupied is later deleted and left empty, lookups will run into problems.

For example, if the words 'go', 'react', and 'vue' occupy subscripts 2, 3, and 4 respectively, and we delete 'react' by emptying subscript 3, then a search for 'vue' will find subscript 3 empty and return that 'vue' does not exist, so the lookup fails.

So when deleting the word 'react', we add a deleted mark to position 3, so that the probe can continue past it and still find the word 'vue'.
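A minimal sketch of lookup and deletion with a deleted mark, continuing the linearInsert sketch above (the DELETED sentinel is an assumption made here for illustration):

var DELETED = { deleted: true } // sentinel marking a removed slot

// Look up a key with linear probing; stop only at a truly empty slot,
// skipping slots that hold the deleted mark.
function linearFind (table, key) {
  var index = hashFun(key, table.length)
  while (table[index] != null) {
    if (table[index] === key) return index
    index = (index + 1) % table.length
  }
  return -1 // hit an empty slot: the key is not in the table
}

// Delete by replacing the slot with the deleted mark instead of emptying it.
function linearDelete (table, key) {
  var index = linearFind(table, key)
  if (index !== -1) table[index] = DELETED
}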

Problems with linear probing

Linear probing has a serious problem: clustering.

Suppose the words 'go', 'react', and 'vue' already occupy array subscripts 2, 3, and 4, and the word 'less' also hashes to subscript 2; then 'less' can only be stored at subscript 5. As a result, when we look up 'less', we compute subscript 2 but still have to probe three more positions before finding it.

Therefore, as more and more data is inserted, hash conflicts become more and more likely, free positions become fewer, and linear probing takes longer and longer. In the extreme case, the time complexity of a lookup degrades to O(n), which hurts the performance of the hash table.

Other probing methods: besides linear probing, open addressing offers two other probing methods: quadratic probing and rehashing.

Quadratic probing

Linear probing uses a step of 1: for lookup, insert, and delete, whenever the element is not at the current position, the position at subscript +1 is checked next.

For example, starting from subscript x, linear probing checks x+1, x+2, x+3, while quadratic probing checks x+1², x+2², x+3². This lets each probe jump a relatively long distance, which optimizes linear probing to some extent.
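A minimal sketch of the quadratic probe sequence (for illustration only; a real implementation also has to choose the table size carefully):

// Quadratic probing: the i-th probe looks at (x + i*i) % size,
// where x is the original index produced by the hash function.
function quadraticProbe (x, i, size) {
  return (x + i * i) % size
}

// Probe sequence starting from index 2 in a table of size 7:
console.log(quadraticProbe(2, 1, 7)) // 3
console.log(quadraticProbe(2, 2, 7)) // 6
console.log(quadraticProbe(2, 3, 7)) // 4  (2 + 9 = 11, and 11 % 7 = 4)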

Rehashing

Rehashing is also an optimization of the probe step: another hash function is applied to the key, and the result is used as the probe step. For a given key, the step stays constant throughout the probe, although different keys use different steps.

The second hash function should have the following characteristics:

  • It is different from the first hash function

  • Its output cannot be 0 (if the output were 0, every probe would stay in place and the algorithm would enter an infinite loop)

In general, the following formula can be used to calculate the probe step:

step = constant - (key % constant), where constant is a prime number smaller than the size of the array, and key is the keyword. The advantage of this formula is that the probe step can never be zero.

In particular, rehashing also requires the capacity of the hash table to be a prime number. Why is that?

Because if the capacity of the hash table is not prime, say 10, then according to the step formula, when the key is 0 and the step is 5, the probe visits 0, 5, 0, 5, ... and only ever tries those two positions.

If the capacity of the hash table is 13, then when the key is 0 and the step is 5, the probe path is 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8, and every position in the hash table can be reached.

That is, with key 0 and step 5 the raw sequence is 0, 5, 10, 15, 20, 25, ..., but after taking each value modulo the table length it becomes 0, 5, 10, 2, 7, ..., 11, 3, 8.
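A minimal sketch of the step formula and the probe path it produces, using constant = 5 as in the example and, for simplicity, key % capacity as the starting position:

// Step formula for rehashing: step = constant - (key % constant),
// where constant is a prime smaller than the table capacity.
function rehashStep (key, constant) {
  return constant - (key % constant)
}

// Build the probe path for a given key and table capacity.
function probePath (key, capacity, constant) {
  var step = rehashStep(key, constant)
  var path = []
  var index = key % capacity
  for (var i = 0; i < capacity; i++) {
    path.push(index)
    index = (index + step) % capacity
  }
  return path.join(', ')
}

console.log(probePath(0, 10, 5)) // 0, 5, 0, 5, ... only two positions are ever tried
console.log(probePath(0, 13, 5)) // 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8 -- every position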

Load factor

The load factor is the ratio of the number of items already stored in the hash table to the total length of the hash table. The larger the load factor, the fewer free slots remain, the more conflicts occur, and the more the performance of the hash table degrades.

The formula for the load factor is: load factor = number of elements filled / length of the hash table
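For example, a hash table of length 7 that already holds 5 elements has a load factor of 5 / 7 ≈ 0.71.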

Note on open addressing: all the probing variants above define their own probe step, and both storage and lookup follow the corresponding probe method. (PS: it really is both storage and lookup. I mistakenly thought that only lookup followed the corresponding probe method while storage always used linear probing; that understanding is wrong.)

Chain address method (linked list method, zipper method)

The chain address method stores a linked list (or array) in each element of the array; once a colliding key is found, the element is inserted at the head or the tail of that list (or array).

Each item in the array can be called a bucket or slot, and each item corresponds to a linked list (or array).

Double hashing

Double hashing uses two hash functions to determine the subscript of the element to be inserted. The first hash function, hashFun1, computes the initial location in the array; if that location is already occupied, a second hash function, hashFun2, is used to compute a new location, and this repeats until a free location is found.
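A minimal sketch of the idea; hashFun1 and hashFun2 are passed in as parameters here and are not functions defined elsewhere in this article:

// Double hashing: the first function gives the starting index, the second gives the step.
// The i-th candidate position is (hashFun1(key) + i * hashFun2(key)) % size.
function doubleHashIndex (key, i, size, hashFun1, hashFun2) {
  return (hashFun1(key, size) + i * hashFun2(key, size)) % size
}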

Expand the hash table

When the load factor is too large, the performance of the hash table degrades, so the hash table needs to be expanded.

Expansion approaches:

  1. Allocate twice the original memory space, then recompute the position of all existing data with the hash function and insert it into the new hash table.

  2. Allocate the new memory space, then spread the expansion work across subsequent insert operations: whenever new data is inserted, it goes into the new hash table, and one item from the old hash table is also moved into the new one. Repeating this moves the data from the old table to the new table bit by bit, which avoids the cost of doing the whole expansion at once (see the sketch below).
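A minimal sketch of the second approach, shown with plain [key, value] arrays rather than real buckets, since only the migration idea matters here (the names createMigratingTable and putWithMigration are illustrative):

// Batched expansion: each insert into the new table also moves at most one
// entry from the old table, spreading the cost of expansion across many inserts.
function createMigratingTable (oldEntries) {
  return {
    oldEntries: oldEntries.slice(), // entries still waiting to be moved
    newEntries: []                  // the new, larger table
  }
}

function putWithMigration (table, key, value) {
  // Insert the new data into the new table
  table.newEntries.push([key, value])

  // Also move one leftover entry from the old table, if any remain
  if (table.oldEntries.length > 0) {
    table.newEntries.push(table.oldEntries.pop())
  }
}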

Implementing a hash table in JS

Key points:

  • Conflicts are resolved with the chain address method (each bucket is stored as an array)

  • Capacity expansion is performed when the load factor is greater than 0.75

  • The hash function is the hashFun defined above

function HashTable () {
  this.storage = [] // Array of related elements
  this.count = 0    // Number of elements currently stored
  this.limit = 7    // Hash table total length
  // loadFactor
  // hash function
  HashTable.prototype.hashFun = function (str, size) {
    var hashCode = 0
    for (var i = 0; i < str.length; i++) {
      hashCode = 37 * hashCode + str.charCodeAt(i)
    }
    var index = hashCode % size
    return index
  }

  // Insert and modify share the same function: if the key does not exist yet, this is an insert; if the key already exists, this is a modification.
  HashTable.prototype.put = function (key, value) {
    // Get index based on key
    var index = this.hashFun(key, this.limit)

    // Select bucket from index
    var bucket = this.storage[index]

    // Determine whether the bucket is null. If so, create a bucket
    if (bucket == null) {
      bucket = []
      this.storage[index] = bucket
    }

    // Determine whether to modify or insert data
    for (var i = 0; i < bucket.length; i++) {
      var item = bucket[i]
      // Each item has the format [key, value], so compare item[0] with the key;
      // if they match, modify the corresponding value, item[1].
      if (item[0] === key) {
        item[1] = value
        return
      }
    }

    // Add: the key was not found in the bucket, so append a new [key, value] pair
    bucket.push([key, value])
    this.count++

    // Determine whether capacity expansion is required
    if (this.count > this.limit * 0.75) {
      this.resize(this.limit * 2)
    }
  }

  // Get operation
  HashTable.prototype.get = function (key) {
    // Obtain the corresponding index based on key
    var index = this.hashFun(key, this.limit)

    // Get the bucket value based on the index value
    var bucket = this.storage[index]
  
    // Check whether the bucket exists (an unset slot is undefined). If not, the key does not exist
    if (bucket == null) return null

    // Bucket exists, and the for loop checks if the value exists
    for (var i = 0; i < bucket.length; i++) {
      var item = bucket[i]
      // the format of item is [key, value], so use item[0] to check whether there is a key, if there is, return the corresponding value
      if (item[0] === key) {
        return item[1]
      }
    }

    // If there is no corresponding key, the element does not exist and null is returned
    return null
  }

  // Delete operation
  HashTable.prototype.delete = function (key) {
    // Obtain the corresponding index based on key
    var index = this.hashFun(key, this.limit)

    // Get the bucket value based on the index value
    var bucket = this.storage[index]

    // Check whether the bucket exists (an unset slot is undefined). If not, the key does not exist
    if (bucket == null) return null

    // Bucket exists, and the for loop checks if the value exists
    for (var i = 0; i < bucket.length; i++) {
      var item = bucket[i]
      // the format of item is [key, value], so use item[0] to check whether there is a key, if there is, delete the corresponding value
      if (item[0] === key) {
        bucket.splice(i, 1)
        this.count--
        return item[1]
      }
    }

    // If there is no corresponding key, the element does not exist and null is returned
    return null
  }

  // Determine whether the hash table is empty
  HashTable.prototype.isEmpty = function () {
    return this.count === 0
  }

  // Get the number of hash table elements
  HashTable.prototype.size = function () {
    return this.count
  }

  // Hash table expansion is implemented by recalculating all data at once and inserting it into a new hash table
  HashTable.prototype.resize = function (newLimit) {
    // Save the original content
    var oldStorage = this.storage

    // Reset all attributes
    this.storage = []
    this.count = 0
    this.limit = newLimit

    // Walk through all buckets in oldStorage
    for (var i = 0; i < oldStorage.length; i++) {
      // Retrieve the corresponding bucket
      var bucket = oldStorage[i]

      // If the bucket does not exist (unset slot), skip it and continue with the next one
      if (bucket == null) continue

      // If the bucket contains data, insert it again
      for (var j = 0; j < bucket.length; j++) {
        var item = bucket[j]
        this.put(item[0], item[1])
      }
    }
  }
}

The part of the code that gets the bucket could be wrapped up a little more cleanly.
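For example, a small getBucket helper could centralize that step (the name and signature are illustrative, not part of the code above):

// Centralize the "hash the key, then fetch the slot" step.
// Pass createIfMissing = true from put so the bucket is created on demand.
HashTable.prototype.getBucket = function (key, createIfMissing) {
  var index = this.hashFun(key, this.limit)
  var bucket = this.storage[index]
  if (bucket == null && createIfMissing) {
    bucket = []
    this.storage[index] = bucket
  }
  return bucket
}

put would then call this.getBucket(key, true), while get and delete would call this.getBucket(key) and treat a missing bucket as "key not present".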

A small summary

  1. A hash table stores data in the same way as an array; the difference is that an array is accessed by sequential subscripts 0, 1, 2, ..., while a hash table needs a hash function to convert each key into an array subscript before storing.

  2. Because the array subscripts produced by the hash function can repeat, hash conflicts occur. The methods for resolving them are open addressing, the chain address method, and double hashing.

  3. Designing the hash function is a very important step in implementing a hash table; it needs to satisfy the following points:

  • The original data should be hard to derive from the hash value

  • It should be very sensitive to the input data: even if two pieces of data differ only slightly, the resulting hash values should be very different

  • Minimize the probability of hash conflicts

  • Computation should be as efficient as possible