The HashMap is arguably the most frequently used data structure for handling key-value mappings. It does not guarantee insertion order and allows null keys and values. This article walks through the JDK8 source code to analyze the principles, implementation, and optimizations of HashMap.

1. Basic structure

In JDK8, when a bucket's linked list grows longer than 8 nodes, it is converted to a red-black tree. The basic structure is as follows:

HashMap has a Node<K,V>[] table field, which is an array of hash buckets. The array elements are Node objects.

static class Node<K,V> implements Map.Entry<K,V> {
  final int hash; // Used to compute the array index
  final K key;
  V value;
  Node<K,V> next; // Next node in the bucket

  Node(int hash, K key, V value, Node<K,V> next) { ... }
}

The hash bucket array is initialized on first use, defaults to a size of 16, resizes as needed, and is always a power of 2 in length. If the initial capacity passed to the constructor is not a power of 2, the following method rounds it up to the smallest power of 2 greater than or equal to it:

static final int tableSizeFor(int cap) {
  int n = cap - 1;
  n |= n >>> 1;
  n |= n >>> 2;
  n |= n >>> 4;
  n |= n >>> 8;
  n |= n >>> 16;
  return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

The principle is to set every bit to the right of the highest 1 bit to 1, then add 1: the carry clears all those bits and moves the 1 one position higher, yielding a power of 2. JDK7 instead used Integer.highestOneBit(int i), which ends with the expression n - (n >>> 1) and returns the largest power of 2 less than or equal to its argument.
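
As a quick sanity check, tableSizeFor can be copied into a standalone class (with MAXIMUM_CAPACITY inlined as 1 << 30, the JDK constant) and exercised; this demo class is ours, not part of the JDK:

public class TableSizeForDemo {
  // Copy of HashMap.tableSizeFor with MAXIMUM_CAPACITY inlined (1 << 30)
  static int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1; n |= n >>> 2; n |= n >>> 4; n |= n >>> 8; n |= n >>> 16;
    return (n < 0) ? 1 : (n >= (1 << 30)) ? (1 << 30) : n + 1;
  }

  public static void main(String[] args) {
    // Smallest power of 2 greater than or equal to the argument
    System.out.println(tableSizeFor(10)); // 16
    System.out.println(tableSizeFor(16)); // 16
    System.out.println(tableSizeFor(17)); // 32
  }
}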

Other fields inside the HashMap:

// The number of key-value pairs
transient int size;
// Counts structural modifications, for fail-fast iteration
transient int modCount;
// Load factor, default 0.75f
final float loadFactor;
// Resize threshold: the maximum number of key-value pairs (capacity * loadFactor)
int threshold;

The main parameters that affect the performance of a HashMap are the initial capacity and the load factor. When the number of entries exceeds the product of the load factor and the current capacity, the table expands to twice its size and the keys are rehashed.

  • If the initial capacity is too small, repeated expansion and rehashing will be triggered, so pre-allocating a large enough capacity is more efficient (see the sketch after this list)
  • The load factor defaults to 0.75f, a good balance between time and space costs, and generally does not need to be changed. Higher values reduce space overhead but increase the cost of lookups
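
For example, to hold n entries without any resize, the initial capacity must satisfy n <= capacity * loadFactor. A minimal pre-sizing sketch (the newPresizedMap helper is ours, not a JDK method):

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
  // Request a capacity whose threshold (capacity * 0.75) covers expectedSize;
  // HashMap rounds the argument up to a power of 2 internally
  static <K, V> Map<K, V> newPresizedMap(int expectedSize) {
    return new HashMap<>((int) (expectedSize / 0.75f) + 1);
  }

  public static void main(String[] args) {
    // For 1000 entries: request 1334 -> table of 2048, threshold 1536,
    // so inserting the 1000 entries never triggers a resize
    Map<Integer, String> map = newPresizedMap(1000);
    for (int i = 0; i < 1000; i++)
      map.put(i, "v" + i);
  }
}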

No matter how good the hash function is, an overly long linked list is inevitable in the worst case, and it hurts HashMap performance. JDK8 therefore converts a list longer than 8 nodes into a red-black tree, exploiting the tree's fast lookup, insertion, and deletion.

2. Hash functions

The most common way to hash an integer is division with remainder. To spread the hash values of keys evenly, the array size is usually chosen to be a prime (Hashtable's initial size is 11): a prime has few factors, so remainders are less likely to coincide and the probability of collision is smaller.
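
For reference, a minimal sketch of this style of indexing (Hashtable clears the sign bit with & 0x7FFFFFFF so the remainder is non-negative; the indexFor name here is ours):

public class PrimeIndexDemo {
  // Division-and-remainder indexing in the style of Hashtable:
  // clear the sign bit, then take the remainder modulo the table length
  static int indexFor(int hash, int length) {
    return (hash & 0x7FFFFFFF) % length;
  }

  public static void main(String[] args) {
    System.out.println(indexFor("key".hashCode(), 11)); // a bucket in [0, 10]
  }
}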

The capacity of a HashMap, however, is always a power of 2, a composite number; this design improves performance by turning the modulo operation into a bitwise one. The identity h % length == h & (length - 1) holds for the following reason:

2^1 = 10          2^1 - 1 = 01
2^2 = 100         2^2 - 1 = 011
2^3 = 1000        2^3 - 1 = 0111
2^n = 1 followed by n zeros        2^n - 1 = n ones

It can be seen that when length = 2^n, the result of h & (length - 1) falls exactly in [0, length - 1], which is equivalent to the modulo operation for non-negative h.
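
The equivalence is easy to verify by brute force for non-negative h; a minimal sketch:

public class MaskModuloDemo {
  public static void main(String[] args) {
    int length = 16; // a power of 2
    for (int h = 0; h < 1_000_000; h++)
      if ((h % length) != (h & (length - 1)))
        throw new AssertionError("mismatch at h = " + h);
    System.out.println("h % 16 == (h & 15) for all tested non-negative h");
  }
}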

In this AND, length - 1 acts as a low-order mask: it zeroes out the high bits of the original hash value, so the index is decided by only a small range of low bits, which clearly raises the probability of collisions. To reduce collisions, HashMap's hash function XORs the high bits into the low bits, so that the high bits of the key's hash also participate in the index computation. The code is as follows:

static final int hash(Object key) { // JDK8
  int h;
  // 1. h = key.hashCode()
  // 2. h ^ (h >>> 16): XOR the high 16 bits onto the low 16 bits,
  //    keeping information from both halves
  return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// JDK8 no longer has this method (the index is computed inline), but the principle is the same
static int indexFor(int h, int length) { // JDK7
  return h & (length - 1); // 3. Mask with length - 1 to get the bucket index
}

The shift-and-XOR of the high bits makes effective use of information from both halves of the hash while staying cheap to compute; the design is a trade-off between speed, utility, and quality of distribution.
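
The effect is easy to demonstrate: two hash codes that differ only in their high 16 bits collide without the perturbation but spread out with it. A minimal sketch (the hash logic is copied from JDK8; the sample values are ours):

public class HashSpreadDemo {
  // JDK8's perturbation: XOR the high 16 bits onto the low 16 bits
  static int hash(int h) {
    return h ^ (h >>> 16);
  }

  public static void main(String[] args) {
    int mask = 16 - 1;       // table of size 16, low-order mask 0b1111
    int h1 = (1 << 16) | 1;  // two hash codes that differ only above bit 15
    int h2 = (2 << 16) | 1;
    // Without perturbation both land in bucket 1:
    System.out.println((h1 & mask) + ", " + (h2 & mask));             // 1, 1
    // With perturbation the high bits participate and the keys spread out:
    System.out.println((hash(h1) & mask) + ", " + (hash(h2) & mask)); // 0, 3
  }
}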

3. Put operation

The put operation does the following:

  1. If the hash bucket array table is empty, initialize it with the resize() method
  2. If the key to be inserted already exists, overwrite its value directly
  3. Otherwise, insert the key-value pair into the corresponding linked list or red-black tree
  4. When inserting into a linked list, check whether it should be converted to a red-black tree
  5. Determine whether expansion is required

The core code is as follows:

public V put(K key, V value) {
  // Hash the key's hashCode
  return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
  Node<K,V>[] tab; Node<K,V> p; int n, i;
  // 1. If table is null, initialize the hash bucket array
  if ((tab = table) == null || (n = tab.length) == 0)
    n = (tab = resize()).length;
  // 2. Compute the corresponding array index: (n - 1) & hash
  if ((p = tab[i = (n - 1) & hash]) == null)
    // 3. The slot is empty, insert the new node directly
    tab[i] = newNode(hash, key, value, null);
  else {
    Node<K,V> e; K k;
    // 4. The key already exists at the head node; its value is overwritten below
    if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
      e = p;
    // 5. The bin is a red-black tree, insert into the tree
    else if (p instanceof TreeNode)
      e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
    // 6. The bin is a linked list
    else {
      for (int binCount = 0; ; ++binCount) {
        // Traverse to the tail node and insert there
        if ((e = p.next) == null) {
          p.next = newNode(hash, key, value, null);
          // List length greater than 8: convert to a red-black tree
          if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
            treeifyBin(tab, hash);
          break;
        }
        // Stop when the same key is encountered; its value is overwritten below
        if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
          break;
        p = e;
      }
    }
    if (e != null) { // existing mapping for key
      V oldValue = e.value;
      if (!onlyIfAbsent || oldValue == null)
        e.value = value;
      afterNodeAccess(e);
      return oldValue;
    }
  }
  ++modCount;
  // 7. If size exceeds the threshold, expand the table
  if (++size > threshold)
    resize();
  afterNodeInsertion(evict);
  return null;
}

JDK8 inserts new nodes at the tail of the list (preserving insertion order), while JDK7 inserted at the head (reversing it).
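
A minimal sketch on a bare singly linked list (this Node is a simplified stand-in, not the JDK class) shows why head insertion reverses the order while tail insertion preserves it:

public class InsertOrderDemo {
  static class Node {
    int val; Node next;
    Node(int val, Node next) { this.val = val; this.next = next; }
  }

  public static void main(String[] args) {
    // Head insertion (JDK7 style): each new node becomes the head -> reversed
    Node head = null;
    for (int v = 1; v <= 3; v++)
      head = new Node(v, head);
    print(head); // 3 2 1

    // Tail insertion (JDK8 style): append at the tail -> insertion order kept
    Node first = null, tail = null;
    for (int v = 1; v <= 3; v++) {
      Node n = new Node(v, null);
      if (tail == null) first = n; else tail.next = n;
      tail = n;
    }
    print(first); // 1 2 3
  }

  static void print(Node n) {
    for (; n != null; n = n.next) System.out.print(n.val + " ");
    System.out.println();
  }
}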

6. Capacity expansion mechanism

By default, the initial capacity is 16 and the load factor is 0.75f, so the threshold is 12: inserting the 13th key-value pair triggers expansion.

Expansion doubles the capacity. Because a power-of-2 extension is used, each element either stays at its original index or moves by exactly the old capacity, which is itself a power of 2.

Doubling is equivalent to shifting n one bit to the left, so n - 1 gains an extra 1 in the high position. The AND with the original hash value therefore involves one more bit, which is either 0 or 1:

  • If it is 0, the index stays the same
  • If it is 1, the index becomes old index + oldCap

So how do we tell whether this bit is 0 or 1? If hash & oldCap is 0, the bit is 0 (oldCap has exactly that one bit set). The expansion code is as follows:

final Node<K,V>[] resize() {
  Node<K,V>[] oldTab = table;
  int oldCap = (oldTab == null) ? 0 : oldTab.length;
  int oldThr = threshold;
  int newCap, newThr = 0;
  if (oldCap > 0) {
    // The capacity will not be expanded beyond the maximum value
    if (oldCap >= MAXIMUM_CAPACITY) {
      threshold = Integer.MAX_VALUE;
      return oldTab;
    }
    // Otherwise double the capacity
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY)
      newThr = oldThr << 1; // double threshold
  }
  else if (oldThr > 0) // initial capacity was placed in threshold
    // During initialization, threshold temporarily holds the initialCapacity argument
    newCap = oldThr;
  else {               // zero initial threshold signifies using defaults
    newCap = DEFAULT_INITIAL_CAPACITY;
    newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
  }
  // Compute the new resize threshold
  if (newThr == 0) {
    float ft = (float)newCap * loadFactor;
    newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
              (int)ft : Integer.MAX_VALUE);
  }
  threshold = newThr;
  @SuppressWarnings({"rawtypes","unchecked"})
  Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
  table = newTab;
  // Move the old key-value pairs to the new hash bucket array
  if (oldTab != null) {
    for (int j = 0; j < oldCap; ++j) {
      Node<K,V> e;
      if ((e = oldTab[j]) != null) {
        oldTab[j] = null;
        if (e.next == null) // single node, no chain behind it
          newTab[e.hash & (newCap - 1)] = e;
        else if (e instanceof TreeNode)
          // Split the red-black tree into two sub-lists, then treeify them as needed
          ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
        else { // preserve order
          // Split the list into two sub-lists, keeping the original order
          Node<K,V> loHead = null, loTail = null;
          Node<K,V> hiHead = null, hiTail = null;
          Node<K,V> next;
          do {
            next = e.next;
            // Sub-list that keeps the original index
            if ((e.hash & oldCap) == 0) {
              if (loTail == null)
                loHead = e;
              else
                loTail.next = e;
              loTail = e;
            }
            // Sub-list offset by oldCap from the original index
            else {
              if (hiTail == null)
                hiHead = e;
              else
                hiTail.next = e;
              hiTail = e;
            }
          } while ((e = next) != null);
          // Place the sub-lists into the new hash buckets
          if (loTail != null) {
            loTail.next = null;
            newTab[j] = loHead;
          }
          if (hiTail != null) {
            hiTail.next = null;
            newTab[j + oldCap] = hiHead;
          }
        }
      }
    }
  }
  return newTab;
}

When the positions of the elements in a linked list are recomputed, only two sub-lists can result: one whose elements keep the same index and one whose elements move by oldCap. While building each sub-list, a head and a tail pointer are kept so that the original order is preserved after the split.
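
A worked example of the split test with sample values of our choosing: for oldCap = 16, hashes 5 and 21 share bucket 5 in the old table, and hash & oldCap decides where each goes after doubling. A minimal sketch:

public class SplitBitDemo {
  public static void main(String[] args) {
    int oldCap = 16, newCap = 32;
    for (int hash : new int[] {5, 21}) {
      int oldIndex = hash & (oldCap - 1);   // both map to bucket 5
      int newIndex = hash & (newCap - 1);   // index after doubling
      boolean stays = (hash & oldCap) == 0; // the test resize() uses
      System.out.printf("hash=%d old=%d new=%d stays=%b%n",
                        hash, oldIndex, newIndex, stays);
      // hash=5:  old=5 new=5  stays=true
      // hash=21: old=5 new=21 stays=false (21 = 5 + oldCap)
    }
  }
}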

The TreeNode.split() method shows that a red-black tree is split with the same logic as a linked list; the only difference is that after the split, each sub-list is processed according to its length:

  • If the length is at most 6 (UNTREEIFY_THRESHOLD), convert it back to a plain linked list without TreeNodes
  • Otherwise, turn the sub-list into a red-black tree

A red-black tree can be split with linked-list logic because, when a list is converted to a tree, the TreeNodes keep their next (and prev) links, which also makes traversal easy.

7. Linked list to red-black tree

Converting a linked list to a red-black tree does the following:

  1. Check whether the table size meets the minimum capacity for tree-ification (MIN_TREEIFY_CAPACITY, 64); if not, resize instead
  2. Convert the original list into a doubly linked list of TreeNodes
  3. Turn the new list into a red-black tree

The code is as follows:

final void treeifyBin(Node<K,V>[] tab, int hash) {
  int n, index; Node<K,V> e;
  // If the table is smaller than the minimum treeify capacity, resize instead
  if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
    resize();
  else if ((e = tab[index = (n - 1) & hash]) != null) {
    TreeNode<K,V> hd = null, tl = null;
    do { // Turn each ordinary node into a tree node
      TreeNode<K,V> p = replacementTreeNode(e, null);
      if (tl == null)
        hd = p;
      else {
        p.prev = tl;
        tl.next = p;
      }
      tl = p;
      // The prev links turn the singly linked list into a doubly linked one
    } while ((e = e.next) != null);
    if ((tab[index] = hd) != null)
      hd.treeify(tab); // Turn the list into a red-black tree
  }
}

TreeNode<K,V> replacementTreeNode(Node<K,V> p, Node<K,V> next) {
  return new TreeNode<>(p.hash, p.key, p.value, next);
}

HashMap was presumably not originally designed with red-black trees in mind, so it neither accepts a comparator for keys nor requires keys to implement the Comparable interface. To order two keys, HashMap does the following:

  1. If the hash values of the two keys differ, compare the hash values
  2. If they are equal and the key implements the Comparable interface, use compareTo
  3. If the result is still a tie, fall back to the tieBreakOrder method, whose logic is as follows:

static int tieBreakOrder(Object a, Object b) {
  int d;
  // First compare class names
  if (a == null || b == null ||
      (d = a.getClass().getName().compareTo(b.getClass().getName())) == 0)
    // Fall back to the identity hash codes generated by the VM; they can still
    // collide, but the probability is tiny, and a tie is treated as "less than"
    d = (System.identityHashCode(a) <= System.identityHashCode(b) ? -1 : 1);
  return d;
}

8. Summary

The HashMap code in JDK8 is quite complex, and there are three main optimizations:

  • The hash function is simplified: it perturbs the hash with a single shift and XOR, where JDK7 used several rounds of shifts and XORs
  • Red-black trees reduce the time complexity of get operations from O(n) to O(log n) when collisions are severe
  • During expansion, the binary property of power-of-2 capacities avoids recomputing hashes and spreads previously colliding nodes to other buckets

In addition, HashMap is not thread-safe: the main contention arises when concurrent threads break and relink the lists during collision handling or expansion. Expansion also means copying memory, which is expensive, so pre-allocating a large enough initial capacity to reduce the number of expansions makes a HashMap perform better.
