Preface

Learning record

  • Time: Week 3
  • SMART sub-target: Java containers

These notes cover the key points about HashMap from my study of Java containers.

Knowledge overview:

  • 1. hashCode()

  • 2. Implementation of HashMap

    0. Introduction

    1. Storage structure

    2. Important attributes

    3. Add elements

      Q: Why is the default initial length of a HashMap 16, and why must the length be a power of 2 after every resize()?

    4. HashMap resizing

      Q: The HashMap dead-loop problem

    5. Compare Java 8 with Java 7

    6. HashMap traversal

    7. Why red-black trees?

  • 3. Endnotes

1. hashCode()

In the Object class, hashCode() is a native method; its JavaDoc says it returns the hash code value for that object.

So what is this hash value used for?

It mainly ensures that hash-based collections such as HashSet, HashMap and Hashtable can reject duplicate elements on insertion, and it makes inserting, deleting and, above all, finding elements efficient.

Take Set as an example.

As we all know, a Set cannot contain duplicates. If every element added to a Set had to be compared one by one with equals() against the existing elements, inserting 100,000 entries would be painfully slow.

This is where the hash table comes in. Hash algorithms are also called hashing algorithms: when a value is added, its hash is computed first, and the data is inserted at the slot that hash points to. This avoids calling equals() against every existing element.

Concretely:

  • If the slot is empty, the element is added directly
  • If the slot is not empty, equals() is called to decide whether the two elements are the same; if they are, the new one is not stored

In the second case above, if the two elements are different but their hashCode() values are the same, we have what we call a hash collision.

The probability of a hash collision depends on how hashCode() is computed and on the size of the table.

In that case, a linked list is created at that slot, and all elements that hash to it are stored in the list.

HashMap resolves such hashCode conflicts with separate chaining (the "zipper" method).
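A tiny, self-contained sketch of such a collision: "Aa" and "BB" are a well-known pair of distinct strings whose hashCode() values are equal, yet a HashSet still keeps both because equals() breaks the tie.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic String.hashCode() collision:
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println("Aa".equals("BB")); // false -> a hash collision

        // The HashSet keeps both, because equals() distinguishes them
        Set<String> set = new HashSet<>();
        set.add("Aa");
        set.add("BB");
        System.out.println(set.size()); // 2
    }
}
```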

Conclusion

hashCode is the identity of an object; in Java it is a value of type int. Using the hashCode as an array index lets you locate an object's bucket quickly and then walk the list there to find the value, ideally in O(1) time. Different objects can share the same hashCode.

2. Implementation of HashMap

0. Introduction

  1. HashMap is a hash-table-based implementation of the Map interface that stores data as key-value pairs.

  2. It is not thread-safe;

  3. Both keys and values may be null (at most one null key);

  4. The mappings in a HashMap are not ordered;

  5. In JDK 1.8, HashMap is composed of an array, linked lists and red-black trees; the red-black tree was added as an underlying data structure.

  6. If a list stored in a hash bucket grows longer than 8, it is converted into a red-black tree; when a tree shrinks to 6 or fewer nodes, it is converted back into a linked list.

  7. The source code differs substantially before and after 1.8.
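A minimal sketch of points 2-4 above, using only standard HashMap behavior (one null key allowed, null values allowed, put() on an existing key overwrites):

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapBasicsDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put(null, 1);   // one null key is allowed
        map.put("a", null); // null values are allowed
        map.put("a", 2);    // put() on an existing key overwrites the value

        System.out.println(map.get(null)); // 1
        System.out.println(map.get("a"));  // 2
        System.out.println(map.size());    // 2
    }
}
```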

1. Storage structure

In JDK1.8, HashMap is made up of arrays, linked lists, and red-black trees. Red-black trees are added as the underlying data structures.

The hash determines the slot in the array. On a hash collision the entries at that slot are stored as a linked list; if the list grows too long, the HashMap converts it to a red-black tree, with a threshold of 8.

Here is the structure of a HashMap:

2. Important attributes

2.1 table

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

In JDK1.8 we learned that a HashMap is a structure consisting of an array plus a linked list plus a red-black tree, where table is the array in a HashMap.

2.2 size

    /**
     * The number of key-value mappings contained in this map.
     */
    transient int size;

Number of key-value pairs stored in a HashMap.

2.3 loadFactor

    /**
     * The load factor for the hash table.
     *
     * @serial
     */
    final float loadFactor;

The load factor balances space usage against lookup cost. When the total number of elements exceeds array length × load factor, the table is resized.

2.4 threshold

    /**
     * The next size value at which to resize (capacity * load factor).
     *
     * @serial
     */
    // (The javadoc description is true upon serialization.
    // Additionally, if the table array has not been allocated, this
    // field holds the initial array capacity, or zero signifying
    // DEFAULT_INITIAL_CAPACITY.)
    int threshold;

The resize threshold: threshold = array length × load factor. When size exceeds it, the table is expanded.

2.5 TREEIFY_THRESHOLD/UNTREEIFY_THRESHOLD

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin. Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

Treeification thresholds: a list stored in a hash bucket is converted to a red-black tree when its length exceeds 8, and a tree is converted back to a list when it shrinks to 6 or fewer nodes.

3. Add elements

    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

3.1 hash()

We can see that the element is actually added in putVal(), and that before calling putVal() the key is passed through hash(). Let's see what that does.

    static final int hash(Object key) {
        int h;
        // key.hashCode(): the object's own hash value
        // ^ : bitwise XOR
        // >>> : unsigned right shift; the sign bit is ignored and vacated bits are filled with 0
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The key == null branch is why HashMap supports null keys.

Hashtable, by contrast, calls hashCode() on the key directly with no null check, so Hashtable does not support null keys.

Back to this hash() method. Its technical name is a perturbation function.

The perturbation is there to compensate for poorly implemented hashCode() methods — in other words, to reduce hash collisions.

The JDK 1.8 hash method is simpler than the JDK 1.7 one, but the principle is the same. Here is how JDK 1.7 does it:

        // code in JDK 1.7
        static int hash(int h) {
            // This function ensures that hashCodes that differ only by
            // constant multiples at each bit position have a bounded
            // number of collisions (approximately 8 at default load factor).
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

The JDK 1.7 hash method performs slightly worse than JDK 1.8's because it perturbs the value four times.
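A small sketch of why mixing in the high bits matters. The hash() body below is the JDK 1.8 one quoted above; the two sample hash codes (0x10000 and 0x20000) are made-up values that differ only in their high bits:

```java
public class PerturbationDemo {
    // Same spreading as JDK 8's HashMap.hash(), reproduced for illustration
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16; // table length
        // two hash codes that differ only in the high bits
        int h1 = 0x10000, h2 = 0x20000;
        // without perturbation, (n - 1) & h keeps only the low bits:
        // both keys collide in bucket 0
        System.out.println(((n - 1) & h1) + " " + ((n - 1) & h2)); // 0 0
        // with the high bits XOR-ed down, the two keys separate
        System.out.println(((n - 1) & (h1 ^ (h1 >>> 16))) + " "
                         + ((n - 1) & (h2 ^ (h2 >>> 16))));        // 1 2
    }
}
```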

3.2 putVal()

Now look at putVal(), the method that actually adds the element:

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        // When the array is null or its length is 0, initialize it
        // (resize() handles both initialization and expansion)
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        // Compute the array index: i = (n - 1) & hash
        // If there is no element at this position, create a Node directly
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            // There is already an element at this position
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                // Same hash and equal key: keep a reference to the existing node in e
                e = p;
            else if (p instanceof TreeNode)
                // The bin is currently a red-black tree; use its own put method, putTreeVal()
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                // The bin is currently a linked list
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        // Insert at the end of the list
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            // Above the threshold, convert the bin to a red-black tree
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    // Unless onlyIfAbsent is true (do not overwrite existing
                    // values), assign the new value
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        // Record the structural modification
        ++modCount;
        // Resize if the number of elements exceeds the threshold
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

3.3 Why is the default initial length of a HashMap 16, and why must the length be a power of 2 after every resize()?

This is a common interview question. What it really asks about is the design of the hash algorithm that maps a key to an array index.

As mentioned earlier, for HashMap storage to be efficient, hash collisions should be minimized; that is, elements should be distributed as evenly as possible.

Hash values range from -2147483648 to 2147483647, roughly 4 billion possible values. As long as the hash function maps them evenly and loosely, collisions are rare in typical applications. The problem is that a 4-billion-slot array cannot fit in memory, so the hash value cannot be used directly as an index.

That's why a mapping step is needed: the (n - 1) & hash expression that appeared in 3.2.

Let’s take this algorithm a step further:

  1. Suppose we have a key "book".
  2. Computing "book".hashCode() gives 3029737 in decimal, 101110001110101110 1001 in binary.
  3. The default HashMap length is 16, so length - 1 is 15 in decimal, 1111 in binary.
  4. 101110001110101110 1001 & 1111 = 1001, which is 9 in decimal, so index = 9.

This AND operation gives the same result as the modulo operation hashCode % length — in this case 3029737 % 16 = 9.

And the bitwise operation is much faster.
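The "book" example above can be checked directly; the AND/modulo equivalence holds for any non-negative hash code and any power-of-two length:

```java
public class IndexDemo {
    public static void main(String[] args) {
        int h = "book".hashCode();
        System.out.println(h);            // 3029737
        System.out.println(h & (16 - 1)); // 9 -- the low four bits, 1001
        System.out.println(h % 16);       // 9 -- same result as the AND

        // Equivalence for a few power-of-two lengths
        // (note: for negative hash codes, % can go negative while & cannot,
        // which is another reason HashMap uses the AND form)
        for (int len : new int[]{16, 32, 64}) {
            if ((h & (len - 1)) != (h % len)) throw new AssertionError();
        }
    }
}
```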

And this bit trick is exactly why the length has to be a power of two.

When the length is 16 or another power of 2, length - 1 has all bits set to 1, so the index is simply the low bits of the hash code, and the result is uniform as long as the hash codes themselves are evenly distributed. If the length were not a power of 2, some index values could never be produced, which clearly violates the expectation of a uniform distribution. That is why the source code keeps emphasizing power-of-two expansion and that the size must be a power of two.

In addition, the HashMap constructor lets the caller pass in a capacity that is not a power of two, because it automatically rounds the capacity up to the next power of two:

    /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
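A quick way to see the rounding behavior is to run tableSizeFor() standalone (copied here verbatim, with MAXIMUM_CAPACITY = 1 << 30 as in the HashMap source):

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Copied from java.util.HashMap: rounds cap up to the next power of two
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(1));   // 1
        System.out.println(tableSizeFor(16));  // 16 -- already a power of two
        System.out.println(tableSizeFor(17));  // 32
        System.out.println(tableSizeFor(100)); // 128
    }
}
```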

4. HashMap resizing

Let’s talk about HashMap expansion.

4.1 When resizing happens

The initial length of a HashMap is 16. If key-value pairs keep being added while the table array stays the same size, hash collisions become more frequent and lookups less and less efficient. So when the number of key-value pairs exceeds a certain threshold, the HashMap resizes.

So how is the expansion threshold calculated?

Threshold = array length * load factor

threshold = capacity * loadFactor

After each expansion, the threshold doubles
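The arithmetic above can be sketched with the default constants (DEFAULT_INITIAL_CAPACITY = 16, DEFAULT_LOAD_FACTOR = 0.75f, as in the HashMap source):

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f; // DEFAULT_LOAD_FACTOR
        int capacity = 16;        // DEFAULT_INITIAL_CAPACITY

        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12 -- the 13th put() triggers a resize

        // Each resize doubles the capacity, so the threshold doubles too
        capacity <<= 1; // 32
        System.out.println((int) (capacity * loadFactor)); // 24
    }
}
```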

This calculation appears in the resize() method, which is examined in detail below.

The loadFactor parameter, as mentioned earlier, balances resource utilization against allocated space. Why 0.75? It is simply a good tradeoff chosen by the authors, but you can also set the load factor manually through the constructor public HashMap(int initialCapacity, float loadFactor).

Next comes the main resize() method

    final Node<K,V>[] resize() {
        // Reference to the old array
        Node<K,V>[] oldTab = table;
        // Old array length
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        // Old threshold
        int oldThr = threshold;
        // New array length, new threshold
        int newCap, newThr = 0;
        if (oldCap > 0) {
            if (oldCap >= MAXIMUM_CAPACITY) {
                // The old array has already reached maximum capacity:
                // set the threshold to the maximum and return the old array
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // newCap is twice the old capacity
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                // Normal expansion: new threshold = old threshold * 2
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            // An initial capacity was set manually:
            // array capacity = initial threshold
            newCap = oldThr;
        else {               // zero initial threshold signifies using defaults
            // Plain initialization:
            // array capacity = default initial capacity
            newCap = DEFAULT_INITIAL_CAPACITY;
            // initial threshold = capacity * default load factor
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            // The threshold has not been set yet
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        // Update the threshold
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        // Create the new array
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        // Point table at the new array
        table = newTab;
        if (oldTab != null) {
            // Move the elements over
            for (int j = 0; j < oldCap; ++j) {
                // Iterate over the old array
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    // Take the head node at this position and clear the
                    // old slot to help garbage collection
                    oldTab[j] = null;
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        // Red-black tree handling
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        // Linked-list handling -- this part is quite clever:
                        // the list is split into two chains
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

I'm not sure the red-black tree and linked-list handling above is obvious — it made me a little dizzy the first time. But once understood, it feels very clever.

In short, while iterating over the old table array it splits each list into a "low" list and a "high" list. What does that mean exactly? Look at the example below.

  1. Say we have a HashMap of size 16 with six elements A/B/C/D/E/F, where A/B/C have hash 5 and D/E/F have hash 21. All six are stored at index = 5.
  2. If they were inserted in order, index 5 would hold the list A->B->C->D->E->F.
  3. When the HashMap resizes, we have the old array oldTable[] with capacity 16 and the new array newTable[] with capacity 32.
  4. When the iteration reaches index = 5, we enter the list-handling branch above and compute hash & oldCapacity for each element: A/B/C with hash 5 evaluate to 0 and go to the low list, while D/E/F with hash 21 go to the high list.
  5. The low list is then placed at index = 5 and the high list at index = 5 + 16 = 21 in the new array.
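The split in the steps above comes down to one bit test, which can be sketched directly:

```java
public class SplitDemo {
    public static void main(String[] args) {
        int oldCap = 16;
        int[] hashes = {5, 21}; // A/B/C hash to 5, D/E/F hash to 21
        for (int h : hashes) {
            int oldIndex = h & (oldCap - 1);  // both land at 5 in the old table
            boolean high = (h & oldCap) != 0; // the bit that decides lo vs hi
            int newIndex = high ? oldIndex + oldCap : oldIndex;
            System.out.println(h + " -> old index " + oldIndex
                             + ", new index " + newIndex);
        }
        // prints: 5 -> old index 5, new index 5
        //         21 -> old index 5, new index 21
    }
}
```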

The red-black tree code is different, but it does the same thing: it separates elements at the same position whose hashes differ in the oldCap bit into different slots of the new table. Hopefully this makes sense now.

4.2 The HashMap dead loop

In Java 7, resizing reverses the order of each linked list while transferring it. If another thread modifies the node references in the original list during the transfer, a circular linked list can form in a hash bucket. After that, calling get(key) for a key that is not in the map but whose hash lands on the bucket with the circular list sends the program into an infinite loop.

Java 8 does not loop under the same conditions, because resizing in Java 8 preserves the order of each list and the relative references between its nodes.

See this comic for a graphic demonstration: Comics: HashMap with high concurrency.

5. Compare Java 8 with Java 7

  1. When a hash collision occurs, Java 7 inserts at the head of the list while Java 8 inserts at the tail.

  2. When data is transferred during a resize, Java 7 reverses the order of each list, while Java 8 keeps the order unchanged.

  3. Java 8 introduces red-black trees, which greatly improves HashMap performance under heavy collisions.

  4. When a put() reaches the threshold, Java 7 resizes before adding the element, while Java 8 adds the element first and then resizes.

6. HashMap traversal

There are two common ways to traverse a HashMap:

        // traversal mode 1: entrySet iterator
        Iterator<Map.Entry<String, Integer>> entryIterator = map.entrySet().iterator();
        while (entryIterator.hasNext()) {
            Map.Entry<String, Integer> next = entryIterator.next();
            System.out.println("key=" + next.getKey() + " value=" + next.getValue());
        }

        // traversal mode 2: keySet iterator
        Iterator<String> iterator = map.keySet().iterator();
        while (iterator.hasNext()) {
            String key = iterator.next();
            System.out.println("key=" + key + " value=" + map.get(key));
        }

The first method is recommended: it fetches the key and value together, while the second has to do an extra lookup by key for each value, which is less efficient.
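A runnable sketch of the recommended entrySet style, plus the forEach method available since Java 8, which is equally efficient and more concise (not covered above, added here as a side note):

```java
import java.util.HashMap;
import java.util.Map;

public class TraversalDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // entrySet() gives key and value in one pass -- no second lookup
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            System.out.println("key=" + e.getKey() + " value=" + e.getValue());
        }

        // Java 8+: forEach with a lambda
        map.forEach((k, v) -> System.out.println("key=" + k + " value=" + v));
    }
}
```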

7. Why red-black trees?

Many people would say "for better lookup performance", but more precisely, red-black trees improve the performance of a HashMap under extreme hash collisions.

Here is some benchmark code to test the performance of HashMap:


import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class MapBenchmark extends SimpleBenchmark {
    private HashMap<Key, Integer> map;

    @Param
    private int mapSize;

    @Override
    protected void setUp() throws Exception {
        map = new HashMap<>(mapSize);
        for (int i = 0; i < mapSize; ++i) {
            map.put(Keys.of(i), i);
        }
    }

    public void timeMapGet(int reps) {
        for (int i = 0; i < reps; i++) {
            map.get(Keys.of(i % mapSize));
        }
    }
}

class Key implements Comparable<Key> {
    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Key key = (Key) o;
        return value == key.value;
    }

    // @Override
    // public int hashCode() {
    //     return value;
    // }

    @Override
    public int hashCode() {
        // Every Key returns the same hash value
        return 0;
    }
}

public class Keys {
    public static final int MAX_KEY = 10_000_000;
    private static final Key[] KEYS_CACHE = new Key[MAX_KEY];

    static {
        for (int i = 0; i < MAX_KEY; ++i) {
            KEYS_CACHE[i] = new Key(i);
        }
    }

    public static Key of(int value) {
        return KEYS_CACHE[value];
    }
}

As you can see, every Key object has been modified to return the same hash value — what we call extreme hash collision — in order to measure the difference in query performance between the Java 7 and Java 8 versions of HashMap.

The Java 7 results are predictable: the cost of HashMap.get() grows in proportion to the size of the HashMap itself. Since all key-value pairs sit in one huge linked list in the same bucket, finding an entry requires traversing half the list (size n) on average, which shows up as O(n) complexity on the graph.

Java 8 fares much better: the same benchmark on JDK 8 shows O(log n) worst-case performance even under catastrophic hash collisions.

This algorithmic optimization is described in JEP 180.

Also note that if the Key class is not Comparable, insertion and removal efficiency drops sharply under heavy collisions, because the underlying red-black tree needs compareTo() to order the nodes.

You might ask: where would such extreme hash collisions come from?

This is actually a security consideration. Under normal circumstances many collisions are unlikely, but imagine keys coming from an untrusted source, such as HTTP header names received from a client. Since the hash algorithm is well known, it is not hard to forge keys that all share the same hash. If someone does, your HashMap ends up in exactly this extreme collision state; queries against it then become slow and CPU-intensive, to the point of denial of service.
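Such colliding keys are easy to mass-produce: since "Aa" and "BB" have the same length and the same hashCode(), every concatenation of these two blocks also shares one hash code. A minimal sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollisionFloodDemo {
    public static void main(String[] args) {
        // Build 2^12 = 4096 strings from "Aa"/"BB" blocks;
        // all of them share the same String.hashCode()
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int round = 0; round < 12; round++) {
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        int h = keys.get(0).hashCode();
        for (String k : keys) {
            if (k.hashCode() != h) throw new AssertionError();
        }
        System.out.println(keys.size()); // 4096 keys, one shared hash code

        // All entries pile into a single bin; since JDK 8 the bin
        // treeifies (String is Comparable), keeping get() at O(log n)
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) map.put(keys.get(i), i);
        System.out.println(map.get(keys.get(123))); // 123
    }
}
```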

3. Endnotes

That wraps up HashMap. This turned out very long, and it took me a long while to write.

For example, ConcurrentHashMap shares HashMap's implementation principles but adds thread safety (via segment locking in Java 7). Hashtable is also similar to HashMap at bottom, except that it uses synchronized for every operation and is essentially legacy code that no longer receives improvements. Both HashMap and TreeMap involve red-black trees under the hood. When you study them side by side, you will find that understanding one makes the others much easier, because many of the principles overlap.

The next chapter covers the other Map types and compares them.

References

  1. Code Efficient
  2. Dzone.com/articles/ha…
  3. Stackoverflow.com/questions/3…
  4. Comics: HashMap with high concurrency
  5. Mp.weixin.qq.com/s/RIBRB2xKR…