This series records my own study of data structures and algorithms. This article introduces the hash table: its principles, parts of the relevant JDK source code, and a hands-on implementation.

I. Hash table principles

This part introduces the concepts and principles behind hash tables.

(1) Concept

1. Hash table definition

A hash table is also known as a hash map or scatter table. In essence, a hash table is a data structure that extends the array: it builds on the array's support for direct indexing of data (values) by subscript, so that insertion, deletion, update, and lookup can all be done in (as close as possible to) constant time.

This identifier is unique, and we call it the key. For example, at school our student number uniquely identifies each of us, and in society our ID number does. A hash table translates (this can also be called mapping) this unique identifier into an index subscript in the hash table: for example, key 123456 maps to subscript 3, key 345215 to subscript 4, and so on.

2. Hash functions

The means by which this unique identifier is converted (or mapped) to the corresponding position in the hash table is called a hash function, and its input parameter is the key.

3. Hash collisions

No matter the data structure, storage space is finite. When the amount of data to store exceeds the current capacity, at least two keys must map to the same position in the hash table. For example, if a hash table has ten slots and eleven items must be stored, then at least two of them land in the same slot. This is the pigeonhole principle (also called the drawer principle).

(2) The design of hash functions

1. Hash function source code

The design of the hash function determines how elements are finally arranged in the hash table. Below is the hash function source code from the JDK (HashMap's hash() and a decompiled String.hashCode()).

// Perturbation function in HashMap: mix the high 16 bits into the low 16
static final int hash(Object key) {
	int h;
	return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// String hash calculation (decompiled String.hashCode)
public int hashCode() {
	int h = this.hash;                   // cached value; 0 means not yet computed
	if (h == 0 && this.value.length > 0) {
		char[] val = this.value;
		for (int i = 0; i < val.length; i++) {
			h = 31 * h + val[i];         // polynomial accumulation over the characters
		}
		this.hash = h;                   // cache the result
	}
	return h;
}

The above is one design of a hash function, which maps keys mainly by reducing the hash value modulo the capacity. For string keys, a polynomial accumulation over the characters is computed first, and the result is then reduced modulo the capacity.

As you can see, a hash function's design need not be too complicated, so as not to waste too much time computing hash values. This is just one way of designing it; you can design hash functions in other forms as well.

2. Details of the hash function source

hashCode() returns a 32-bit int, which in principle can map to 2^32 ≈ 4×10^9 subscripts. A hash function using all of them would map very evenly, but directly allocating such a large array would exhaust memory, so the following operation is performed:

(h = key.hashCode()) ^ (h >>> 16)

This step shifts the high 16 bits of the hashCode down and XORs them with the original low 16 bits, mixing them together and further increasing the randomness of the result: if any single bit of the hashCode changes, the final hash value can change. This reduces the likelihood of collisions.

int index = hash(key)&(capacity-1);

Here we compute index, the position in the hash table, with a bitwise AND. Because capacity is a power of two, hash & (capacity - 1) is equivalent to the mod operation hash % capacity, and is more efficient.

Example (capacity = 16):

0110 1110 1101 0111 & 0000 0000 0000 1111 = 0000 0000 0000 0111

28375 % 16 = 7

index = hash(key) % capacity

Once the hash function is determined, all that remains is resolving the hash collisions that the pigeonhole principle makes inevitable. Note that the bitwise trick requires capacity to be a power of two, so that capacity - 1 is a mask of consecutive low-order 1 bits, as in the example above.
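To make this concrete, here is a small runnable check of the mixing and masking steps described above. It is a sketch: the class name HashIndexDemo and the sample key are mine, not from the JDK.

public class HashIndexDemo {
	// Same mixing step as HashMap.hash(): fold the high 16 bits into the low 16
	static int hash(Object key) {
		int h;
		return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
	}

	public static void main(String[] args) {
		int capacity = 16;                       // must be a power of two
		int h = hash("123456");
		int byMask = h & (capacity - 1);         // bitwise AND trick
		int byMod = Math.floorMod(h, capacity);  // plain non-negative modulo
		System.out.println(byMask + " == " + byMod); // identical for power-of-two capacity
	}
}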

3. Hash function design

So, when we design a hash function, keeping two points in mind makes for a good one:

(1) The hash function should not be too complex, to avoid spending too much time computing hash values.

(2) The hash values generated should be distributed as evenly and randomly as possible, to reduce the probability of hash collisions.

(3) Resolving hash collisions

Currently, there are two main approaches to resolving hash collisions: chaining (linked-list addressing) and open addressing.

1. Chaining (linked-list addressing)

The hash function maps keys to positions in the hash table, called buckets or slots. Each bucket holds a linked list, and keys with the same hash value go into the corresponding bucket's list. Linked-list insertion is cheap, and nodes are allocated only when needed, which makes better use of space.

On insertion, the hash function determines the bucket, and the new node is inserted at either the head or the tail of the list. In Java's HashMap the node type is called Entry in Java 7 and Node in Java 8. Before Java 8, head insertion was used; Java 8 switched to tail insertion, which relates to the expansion mechanism introduced later (today we focus on the principles). For lookup and deletion, the cost depends on the length of the list.
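As a tiny illustration of the two insertion styles (the demo class is mine, and it uses LinkedList for brevity; HashMap's real buckets are chains of Entry/Node objects):

import java.util.LinkedList;

public class BucketInsertDemo {
	public static void main(String[] args) {
		LinkedList<String> bucket = new LinkedList<>();
		bucket.addFirst("A"); // head insertion, as in Java 7's HashMap
		bucket.addFirst("B");
		bucket.addLast("C");  // tail insertion, as in Java 8's HashMap
		System.out.println(bucket); // prints [B, A, C]
	}
}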

The time complexity of insertion, lookup, and deletion can be further optimized by changing the structure hanging off each bucket (for example, to a red-black tree).

2. Open addressing

The idea of open addressing is simple: if key2 hashes to a position already occupied by key1 (that is, a hash collision has occurred), key2's data is placed in the first empty slot found by searching onward from that position. This process of searching for a free slot is called probing.

There are three main categories of probing:

(1). Linear Probing

Linear probing searches onward from the conflicting position one slot at a time until a free slot is found:

Hash(Key)+0, Hash(Key)+1, Hash(Key)+2, Hash(Key)+3, Hash(Key)+4, Hash(Key)+5……

Linear probing has an obvious drawback: as free slots become scarce, collisions grow more likely and probing takes longer; in the worst case an operation can reach O(n). Quadratic probing and double hashing aim to reduce probing time.

(2). Quadratic probing

When searching for a free slot, the offsets probed from the conflicting position are the squares of the step count:

Hash(Key)+0, Hash(Key)+1^2, Hash(Key)+2^2, Hash(Key)+3^2……

A property of quadratic probing: when the table size is prime and the table is at least half empty, a new element can always be inserted.

(3). Double hashing

When searching for a free slot, multiple hash functions are applied in turn to probe from the conflicting position:

Hash1(Key), Hash2(Key), Hash3(Key)……

Note:

In open addressing, when querying for a key, reaching an empty slot before finding the key means the key does not exist. This is why, when deleting a key, its slot cannot simply be emptied: doing so would cut off probe sequences that pass through it. Instead, we add a "deleted" marker to indicate that the slot's data has been removed but that the search should continue past it.
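Below is a minimal sketch of linear probing with such a deleted marker (a tombstone), assuming int keys; the class name and the DELETED sentinel are illustrative, not from any library.

public class LinearProbingTable {
	private static final Object DELETED = new Object(); // tombstone marker
	private final Object[] slots;

	public LinearProbingTable(int capacity) {
		slots = new Object[capacity];
	}

	// Linear probing sequence: Hash(key)+0, Hash(key)+1, ... (mod capacity)
	private int indexFor(int key, int i) {
		return Math.floorMod(key + i, slots.length);
	}

	public boolean insert(int key) {
		for (int i = 0; i < slots.length; i++) {
			int idx = indexFor(key, i);
			if (slots[idx] == null || slots[idx] == DELETED) {
				// First free slot wins. (A full version would also confirm the
				// key is not stored further along the probe sequence.)
				slots[idx] = key;
				return true;
			}
			if (slots[idx].equals(key)) return false; // already present
		}
		return false; // table is full
	}

	public boolean contains(int key) {
		for (int i = 0; i < slots.length; i++) {
			int idx = indexFor(key, i);
			if (slots[idx] == null) return false; // truly empty: stop searching
			// A DELETED slot must NOT stop the search; keep probing past it
			if (slots[idx] != DELETED && slots[idx].equals(key)) return true;
		}
		return false;
	}

	public void remove(int key) {
		for (int i = 0; i < slots.length; i++) {
			int idx = indexFor(key, i);
			if (slots[idx] == null) return;
			if (slots[idx] != DELETED && slots[idx].equals(key)) {
				slots[idx] = DELETED; // mark, don't empty, so probing still works
				return;
			}
		}
	}
}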

II. Capacity expansion mechanism

Hash collisions are unavoidable, and once the buckets fill up, new data cannot be added, so the table must be expanded.

(1) Load factor

We can't just wait until the table is completely full to expand it, so we need a threshold: when the number of elements reaches a certain proportion of the capacity, expansion begins. This criterion is the load factor. Of course, expansion is not a simple copy; the table must also be rehashed.

It is no longer a plain copy because the capacity change alters the result of the mod operation, so the bucket each key maps to changes as well.

Let's start with the load factor.

Load factor = number of elements in the hash table / capacity of the hash table

For example, a table with 16 buckets holding 12 elements has a load factor of 12/16 = 0.75.

Obviously, the load factor is proportional to the chance of a hash collision: the larger the load factor, the more likely collisions become.

Here are some parameters from the HashMap source code.

    /** The default initial capacity - MUST be a power of two. */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /* ... */

    /** The load factor used when none specified in constructor. */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin. Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

    /**
     * The smallest table capacity for which bins may be treeified.
     * (Otherwise the table is resized if too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
     * between resizing and treeification thresholds.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;

HashMap's default initial capacity is 16.

Treeification (and un-treeification) applies to hash tables built with chaining. In JDK 1.8, HashMap is optimized as follows:

When a bucket's list reaches 8 nodes (TREEIFY_THRESHOLD), the list is converted to a red-black tree.

When a tree shrinks to 6 nodes or fewer (UNTREEIFY_THRESHOLD), it is converted back to a linked list.

Treeification also requires the table capacity to be at least 64 (MIN_TREEIFY_CAPACITY); below that, the table is resized instead, which avoids conflicts between the resizing and treeification thresholds.

(2) How to expand capacity

As mentioned above, when the load factor reaches the threshold (0.75 by default in Java), expansion begins, and each expansion doubles the original capacity.

As mentioned above, expansion is not simple copying, so how can we implement the expansion mechanism well?

You might say the easiest way is simply to allocate a new hash table with twice the space and remap all the data into it.

That is true in theory, but when the data volume is very large it takes a long time, so migrating everything in one go is not recommended.

If migrating all the data at once is too costly, can we spread the migration over multiple operations?

The answer is yes: using the idea of amortizing the work, moving data batch by batch, makes expansion much smoother.

First, allocate new space of the corresponding (doubled) size.

Second, on each insertion, put the new data into the new hash table, and at the same time move one entry from the old hash table into the new one.

Finally, for any query, look in the new hash table first, and then fall back to the old one.

This makes expansion much smoother: with the doubling strategy, the amortized time complexity of insertion, expansion included, stays O(1). A sketch of this gradual migration follows.
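Here is a minimal sketch of the three steps above, assuming a simplified chained table of int keys; all class and method names are mine.

import java.util.LinkedList;

public class GradualHashTable {
	private LinkedList<Integer>[] table;    // the current (new) table
	private LinkedList<Integer>[] oldTable; // non-null only while migrating
	private int nextBucket;                 // next old bucket to move
	private int size;

	public GradualHashTable(int capacity) {
		table = newBuckets(capacity);
	}

	public void insert(int key) {
		migrateOneBucket(); // amortize the rehash: move one old bucket per insert
		table[indexFor(key, table.length)].add(key);
		size++;
		if (oldTable == null && size > table.length * 0.75) {
			oldTable = table;                        // start a gradual resize
			table = newBuckets(oldTable.length * 2); // double the capacity
			nextBucket = 0;
		}
	}

	public boolean contains(int key) {
		// Query the new table first, then the old one while migration is running
		if (table[indexFor(key, table.length)].contains(key)) return true;
		return oldTable != null
				&& oldTable[indexFor(key, oldTable.length)].contains(key);
	}

	private void migrateOneBucket() {
		if (oldTable == null) return;
		for (int k : oldTable[nextBucket]) {
			table[indexFor(k, table.length)].add(k);
		}
		oldTable[nextBucket].clear();
		if (++nextBucket == oldTable.length) oldTable = null; // migration done
	}

	private static int indexFor(int key, int capacity) {
		return Math.floorMod(key, capacity);
	}

	@SuppressWarnings("unchecked")
	private static LinkedList<Integer>[] newBuckets(int n) {
		LinkedList<Integer>[] b = new LinkedList[n];
		for (int i = 0; i < n; i++) b[i] = new LinkedList<>();
		return b;
	}
}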

(3) Two questions

Combining what I have learned with other people's blogs, I raise two small questions.

1. Why is the load factor set to 0.75?

According to a comment in the HashMap source, under random hashCodes the number of nodes per bucket follows a Poisson distribution; with the default resizing threshold the parameter is about 0.5, so, for example, the probability that a bucket holds zero nodes is about 0.6 (as I understand it).

A load factor of 0.75 balances the utilization of existing capacity against the frequency of expansion: expanding too early wastes space and time, while expanding too late causes too many hash collisions.
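The probabilities quoted in the HashMap source come from a Poisson distribution with parameter about 0.5. The small program below recomputes them (the class name is mine): P(0) ≈ 0.61, and P(8) ≈ 0.00000006, which is also why TREEIFY_THRESHOLD is 8.

public class PoissonBinDemo {
	public static void main(String[] args) {
		// P(k nodes in a bucket) for Poisson(lambda = 0.5):
		// P(k) = e^(-lambda) * lambda^k / k!
		double lambda = 0.5;
		double p = Math.exp(-lambda); // P(0) ≈ 0.607
		for (int k = 0; k <= 8; k++) {
			System.out.printf("k=%d: %.8f%n", k, p);
			p *= lambda / (k + 1);    // P(k+1) = P(k) * lambda / (k+1)
		}
	}
}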

2. Why double capacity?

This part compares fixed-increment growth with capacity doubling to explain why the doubling strategy reaches amortized O(1).

Insert n = m*I (n >> 2) consecutive elements into an initially empty table of capacity 0.

A) Fixed-increment strategy: append a fixed capacity I each time.

Expansion happens at insertions 1, I+1, 2I+1, 3I+1, ...

Copy costs: 0, I, 2I, ..., (m-1)*I

Total time: I*(m-1)*m/2 = O(n^2), amortized time O(n).

B) Doubling strategy: double the capacity each time.

Expansion happens at insertions 1, 2, 4, 8, ..., 2^m = n

Copy costs: 1, 2, 4, ..., 2^m = n

Total time: O(n), amortized time O(1).

Therefore, it can be seen that the amortized time complexity of the capacity doubling strategy can be reduced to O(1).
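To make the comparison concrete, this small simulation counts the element copies performed under both growth strategies (the class and method names are mine; the increment 1024 is an arbitrary example).

public class GrowthStrategyDemo {
	// Copies performed while inserting n elements, growing by a fixed increment
	static long copiesWithFixedIncrement(int n, int inc) {
		long copies = 0;
		int capacity = 0, size = 0;
		for (int i = 0; i < n; i++) {
			if (size == capacity) {
				copies += size;   // re-copy everything already stored
				capacity += inc;  // grow by a fixed amount I
			}
			size++;
		}
		return copies;
	}

	// Copies performed while inserting n elements, doubling the capacity
	static long copiesWithDoubling(int n) {
		long copies = 0;
		int capacity = 1, size = 0;
		for (int i = 0; i < n; i++) {
			if (size == capacity) {
				copies += size;   // re-copy everything already stored
				capacity *= 2;    // double
			}
			size++;
		}
		return copies;
	}

	public static void main(String[] args) {
		int n = 1 << 20; // about one million insertions
		System.out.println("fixed +1024: " + copiesWithFixedIncrement(n, 1024)); // ~ n^2 / (2 * 1024)
		System.out.println("doubling:    " + copiesWithDoubling(n));             // ≈ n
	}
}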

III. Practice implementation

Below is a Java implementation of a hash table using chaining (linked-list addressing).

import java.util.LinkedList;
import java.util.List;

public class LinkingAddressingHashTable<T> {
	private static final int DEFAULT_TABLE_SIZE = 101;

	private List<T>[] myLists;   // one linked list (bucket) per slot
	private int currentSize;     // number of elements currently stored
	private int capacity;        // number of buckets, kept prime

	public LinkingAddressingHashTable() {
		this(DEFAULT_TABLE_SIZE);
	}

	@SuppressWarnings("unchecked")
	public LinkingAddressingHashTable(int size) {
		capacity = nextPrime(size);
		myLists = new LinkedList[capacity];
		for (int i = 0; i < capacity; i++) {
			myLists[i] = new LinkedList<>();
		}
	}

	public int nextPrime(int n) {
		if (n % 2 == 0) n++;
		for (; !isPrime(n); n += 2);
		return n;
	}

	public boolean isPrime(int n) {
		if (n == 2 || n == 3) return true;
		if (n == 1 || n % 2 == 0) return false;
		for (int i = 3; i * i <= n; i += 2) {
			if (n % i == 0) return false;
		}
		return true;
	}

	public void insert(T x) {
		List<T> whichList = myLists[myHash(x)];
		if (!whichList.contains(x)) {
			whichList.add(x);
			if (++currentSize > capacity) { // load factor exceeds 1: rehash
				reHash();
			}
		}
	}

	public void remove(T x) {
		List<T> whichList = myLists[myHash(x)];
		if (whichList.contains(x)) {
			whichList.remove(x);
			currentSize--;
		}
	}

	public boolean contains(T x) {
		List<T> whichList = myLists[myHash(x)];
		return whichList.contains(x);
	}

	public void makeEmpty() {
		for (int i = 0; i < capacity; i++) {
			myLists[i].clear();
		}
		currentSize = 0;
	}

	public int myHash(T x) {
		int hashVal = x.hashCode();
		hashVal %= capacity;
		if (hashVal < 0) {      // hashCode() may be negative
			hashVal += capacity;
		}
		return hashVal;
	}

	public static int hash(String key, int tableSize) {
		int hashVal = 0;
		for (int i = 0; i < key.length(); i++) {
			// hashVal is a signed 32-bit integer and may overflow
			hashVal = 27 * hashVal + key.charAt(i);
		}
		hashVal %= tableSize;
		if (hashVal < 0) hashVal += tableSize;
		return hashVal;
	}

	@SuppressWarnings("unchecked")
	public void reHash() {
		List<T>[] oldLists = myLists;
		capacity = nextPrime(2 * capacity); // grow to the next prime past double
		myLists = new LinkedList[capacity];
		for (int j = 0; j < capacity; j++) {
			myLists[j] = new LinkedList<>();
		}
		currentSize = 0;
		for (List<T> list : oldLists) {
			for (T item : list) {
				insert(item);
			}
		}
	}

	public static void main(String[] args) {
		LinkingAddressingHashTable<String> ht = new LinkingAddressingHashTable<>();
		ht.insert("Zhang San");
		ht.insert("Li Si");
		String x = "Zhang San";
		System.out.print(ht.contains(x)); // true
	}
}