HashMap source code analysis

HashMap source code analysis based on JDK7

Introduction to the class

The following class description is translated from the source code

HashMap is a hash-table-based implementation of the Map interface. This implementation provides all of the map operations and permits null keys and null values. (HashMap is roughly equivalent to Hashtable, except that it is not synchronized and it permits nulls.) In addition, HashMap makes no guarantees about the order of its elements.

The basic operations of a HashMap, such as put and get, run in constant time (O(1)), assuming that the hash function distributes elements reasonably among the hash buckets. The time a HashMap takes to iterate over all elements is proportional to the sum of its capacity (number of hash buckets) and its size (number of key-value pairs). Therefore, if you care about the iteration performance of a HashMap, you should not set the initial capacity too high or the load factor too low.

An instance of HashMap has two parameters that affect its performance: the initial capacity and the load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, the internal data structure is rebuilt) and the number of hash buckets roughly doubles.

In general, the default load factor of 0.75 offers a good trade-off between time and space costs. A higher value reduces the space overhead but increases the cost of lookups (reflected in most operations of the HashMap class, including get and put). Therefore, when setting the initial capacity, the expected number of entries and the load factor should both be taken into account so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor (initial capacity > max entries / load factor), no rehash operation will ever occur.

If an instance of a HashMap needs to store many elements (key-value pairs), specifying a sufficiently large capacity when creating the HashMap is considerably more efficient than letting it grow through automatic expansion.
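
For example, applying the sizing rule above (the map variable and contents here are only an illustration, not from the JDK source): to hold about 1,000 entries with the default load factor of 0.75 without ever rehashing, the initial capacity should be at least 1000 / 0.75 ≈ 1334.

    // 1000 / 0.75 ≈ 1334, so an initial capacity of at least 1334 avoids rehashing;
    // HashMap rounds it up internally to the next power of two (2048), giving a threshold of 1536.
    Map<String, String> m = new HashMap<String, String>((int) (1000 / 0.75f) + 1);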

Note that hash table performance can degrade if many keys share the same hashCode() value. To ameliorate the impact, HashMap may use the comparison order of keys, when they are Comparable, to break ties and improve efficiency.

Note that HashMap is not synchronized. If multiple threads access a HashMap concurrently, and at least one of them modifies it structurally, it must be synchronized externally. (A structural modification is any operation that adds or removes key-value pairs; in the source code, these are the operations that change the modCount field. Merely changing the value associated with an existing key is not a structural modification.) External synchronization is typically achieved by synchronizing on some object that encapsulates the map.

If no such object exists, the map should be wrapped using Collections.synchronizedMap. This is best done at creation time, to prevent accidental unsynchronized access to the map:

    Map m = Collections.synchronizedMap(new HashMap(...));

The iterators returned by all of HashMap's collection view methods are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way other than through the iterator's own remove method, the iterator throws a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly rather than risking arbitrary behavior.

Note that the fail-fast behavior of an iterator is not a hard guarantee against unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. It would therefore be wrong to write a program whose correctness depends on this exception: the fail-fast behavior of iterators should be used only to detect bugs.
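
A minimal sketch of the fail-fast behavior (keys and values are illustrative); modifying the map directly while iterating typically triggers the exception on the next call to Iterator.next():

    Map<String, Integer> map = new HashMap<String, Integer>();
    map.put("a", 1);
    map.put("b", 2);
    map.put("c", 3);

    for (String key : map.keySet()) {
        // Structural modification outside the iterator: the next call to
        // Iterator.next() detects the modCount change and (on a best-effort
        // basis) throws ConcurrentModificationException.
        map.remove(key);
    }

    // The safe way to remove entries during iteration is the iterator's own remove():
    // Iterator<String> it = map.keySet().iterator();
    // while (it.hasNext()) { it.next(); it.remove(); }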

The constructors

There are four types of HashMap constructors:

  • The default initial capacity is 16 and the default load factor is 0.75
  • Specify the initial capacity. The load factor defaults to 0.75
  • Specify the initial capacity and load factor
  • Construct from a passed-in Map

Constructors 1, 2, and 4 all delegate to the third constructor, and the fourth is simply a convenient way to construct a HashMap from an existing Map, so here we focus on the implementation of constructors 3 and 4.

public class HashMap<K,V>
        extends AbstractMap<K,V>
        implements Map<K,V>, Cloneable, Serializable {

    // ...

    // The empty table
    static final Entry<?,?>[] EMPTY_TABLE = {};

    // The hash table
    transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

    // Expansion threshold. When the number of entries reaches this value, the table is resized.
    // threshold = capacity * load factor
    // If table == EMPTY_TABLE, this value is used as the initial capacity when the hash table is created
    int threshold;

    // Load factor
    final float loadFactor;

    // Constructor 3: specifies the initial capacity and load factor
    public HashMap(int initialCapacity, float loadFactor) {
        // Validate the parameters
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
        // Set the load factor
        this.loadFactor = loadFactor;
        // Initially the threshold equals the initial capacity
        threshold = initialCapacity;
        init();
    }

    // Constructor 4: constructs a new HashMap from the passed-in map
    public HashMap(Map<? extends K, ? extends V> m) {
        this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
                      DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
        // Allocate the hash table
        inflateTable(threshold);
        putAllForCreate(m);
    }

    // ...

}

In the above source code, there are several points to note:

  1. The default expansion threshold is 16. When the hash table is empty, HashMap uses this threshold as the initial capacity to build the hash table (the hash table is essentially an array of Entry objects).
  2. The inflateTable method creates the hash table and allocates its memory (the name translates as "inflate"; more on this later). However, the constructor that specifies the initial capacity and load factor does not call inflateTable right away. The places in the source code that call inflateTable are:
    graph LR
    HashMap(Map) constructor --> inflateTable
    put --> inflateTable
    putAll --> inflateTable
    clone --> inflateTable
    readObject --> inflateTable

At first glance, only the constructor that takes a Map parameter calls inflateTable. However, the HashMap(Map map) constructor simply calls the HashMap(int initialCapacity, float loadFactor) constructor to initialize the capacity and load factor, and then calls inflateTable itself. So to summarize: a HashMap does not create its hash table immediately during initialization (unless it is constructed from an existing Map).
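
A quick way to observe this lazy allocation (a throwaway sketch that uses reflection to peek at the private table field discussed above; intended to be run against JDK 7):

    import java.lang.reflect.Field;
    import java.util.HashMap;

    public class LazyTableDemo {
        public static void main(String[] args) throws Exception {
            HashMap<String, String> map = new HashMap<String, String>(1000);

            Field tableField = HashMap.class.getDeclaredField("table");
            tableField.setAccessible(true);

            // Before the first put, table is still the shared empty array
            System.out.println(((Object[]) tableField.get(map)).length); // 0

            map.put("k", "v"); // triggers inflateTable(threshold)

            // roundUpToPowerOf2(1000) = 1024
            System.out.println(((Object[]) tableField.get(map)).length); // 1024
        }
    }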

Call logic

To better understand how these methods fit together, the following diagram shows the invocation relationships between some of them:

Internal data structure

The internal data structure maintained by HashMap is an array plus linked lists. Each key-value pair is stored in an instance of HashMap's static inner class Entry. The structure is shown below:
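
For reference, the Entry node in the JDK 7 source looks roughly like this (fields and constructor only; the accessor methods are omitted):

    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;   // the next node in this bucket's linked list
        int hash;          // the cached hash of the key

        Entry(int h, K k, V v, Entry<K,V> n) {
            value = v;
            next = n;
            key = k;
            hash = h;
        }
        // getKey(), getValue(), setValue(), equals(), hashCode(), ... omitted
    }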

The implementation of put

    public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

    void addEntry(int hash, K key, V value, int bucketIndex) {
        // If the container size is greater than or equal to the threshold and the target bucket is not null
        if ((size >= threshold) && (null != table[bucketIndex])) {
            // Expand the container: the new length is twice the original hash table length
            resize(2 * table.length);
            // Recalculate the hash of the key
            hash = (null != key) ? hash(key) : 0;
            // Recalculate the bucket index from the new hash and the new table length
            bucketIndex = indexFor(hash, table.length);
        }
        createEntry(hash, key, value, bucketIndex);
    }
  1. Inside the put method, the hash table is first checked to see whether it is empty. If it is, the hash table (the array in the internal data structure mentioned above) is created. Once the table exists, there is space to store key-value pairs.

  2. To store a key-value pair, a hash (an int value) is computed from the key, and that hash is then used to compute the index into the fixed-length array.

  3. Once the index is known, the storage location in the hash table has to be found. HashMap first loads the Entry stored at that index. If the Entry is not null, it compares the hash and the key of the Entry (keys are compared with == and equals). If both match, the value on that Entry is overwritten and the old value is returned; otherwise the search moves on to the next Entry in the chain, until the last one.

  4. If the Entry at the computed index is null, or no Entry in the chain matches the hash and key, addEntry is called to create a new Entry.

  5. The addEntry method does some pre-processing. HashMap checks whether the number of key-value pairs currently stored has reached the preset threshold; if so, it doubles the capacity. After expansion, the hash is recomputed and the storage index is recalculated from the new hash and the new array length. Once this pre-processing is done, createEntry is called to create the Entry.

  6. The createEntry method does not need to worry about expansion, because the pre-processing has already been done. It wraps the key and value you put in into an Entry, stores it at the given index, and points the new Entry at the Entry that was previously there (head insertion); a sketch of it follows below.
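
For reference, createEntry in the JDK 7 source looks roughly like this (head insertion into the bucket's chain, followed by incrementing the size counter):

    void createEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        // The new Entry becomes the head of the chain and points at the old head
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        size++;
    }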

Creating the hash table with inflateTable

Creating a hash table is implemented in the inflateTable method:

    /**
     * Rounds a number up to a power of 2
     * @param number
     * @return
     */
    private static int roundUpToPowerOf2(int number) {
        // assert number >= 0 : "number must be non-negative";
        return number >= MAXIMUM_CAPACITY
                ? MAXIMUM_CAPACITY
                : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
        // Integer.highestOneBit((number - 1) << 1):
        // For example, number = 23, 23 - 1 = 22, binary: 10110
        // Shift 22 one bit to the left (append a 0 on the right), giving: 101100
        // Integer.highestOneBit() keeps only the highest (leftmost) 1 bit and zeroes the rest,
        // that is, 101100 -> 100000, which is 32 in decimal
    }

    /**
     * This is where the name "inflate" comes from: initialize the hash table and allocate its memory
     */
    private void inflateTable(int toSize) {
        // Find a power of 2 >= toSize,
        // i.e. the capacity is the smallest power of 2 greater than or equal to toSize
        int capacity = roundUpToPowerOf2(toSize);
        // Calculate the new expansion threshold: capacity * load factor
        threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
        // Build the hash table with the computed capacity
        table = new Entry[capacity];
        // Decide, based on the capacity, whether hashSeed needs to be initialized
        initHashSeedAsNeeded(capacity);
    }

roundUpToPowerOf2

    Partial results of roundUpToPowerOf2:

    roundUpToPowerOf2(0)  = 1
    roundUpToPowerOf2(1)  = 1
    roundUpToPowerOf2(2)  = 2
    roundUpToPowerOf2(3)  = 4
    roundUpToPowerOf2(4)  = 4
    roundUpToPowerOf2(5)  = 8
    roundUpToPowerOf2(6)  = 8
    roundUpToPowerOf2(7)  = 8
    roundUpToPowerOf2(8)  = 8
    roundUpToPowerOf2(9)  = 16
    roundUpToPowerOf2(10) = 16
    roundUpToPowerOf2(11) = 16
    roundUpToPowerOf2(12) = 16
    roundUpToPowerOf2(13) = 16
    roundUpToPowerOf2(14) = 16
    roundUpToPowerOf2(15) = 16
    roundUpToPowerOf2(16) = 16
    roundUpToPowerOf2(17) = 32

    Example: roundUpToPowerOf2(6)
    Formula: Integer.highestOneBit((6 - 1) << 1)
    Calculate 5 << 1:
        00000101 << 1
        -------------
        00001010
    1010 is 10 in decimal.
    Integer.highestOneBit(10) keeps only the highest 1 bit of the value and zeroes the rest,
    so Integer.highestOneBit(10) equals binary 1000, which is 8.

Note that the initHashSeedAsNeeded(capacity) method decides, based on the capacity, whether hashSeed needs to be initialized. The default hashSeed is 0; if alternative hashing is enabled, it is set to a random value.

Alternative hashing and hashSeed

There is a constant ALTERNATIVE_HASHING_THRESHOLD_DEFAULT in the source code, whose comments provide some noteworthy information:

    /**
     * The default threshold of map capacity above which alternative hashing is
     * used for String keys. Alternative hashing reduces the incidence of
     * collisions due to weak hash code calculation for String keys.
     * <p/>
     * This value may be overridden by defining the system property
     * {@code jdk.map.althashing.threshold}. A property value of {@code 1}
     * forces alternative hashing to be used at all times whereas
     * {@code -1} value ensures that alternative hashing is never used.
     */
    static final int ALTERNATIVE_HASHING_THRESHOLD_DEFAULT = Integer.MAX_VALUE;

ALTERNATIVE_HASHING_THRESHOLD_DEFAULT is the default capacity threshold above which alternative hashing is enabled for String keys. Alternative hashing reduces the incidence of hash collisions caused by the (relatively weak) hash code calculation for String keys. The value can be overridden via the system property jdk.map.althashing.threshold: a value of 1 forces alternative hashing to always be used, while a value of -1 disables it.

HashMap has a static inner class Holder, which, once the virtual machine has started, initializes ALTERNATIVE_HASHING_THRESHOLD from jdk.map.althashing.threshold and ALTERNATIVE_HASHING_THRESHOLD_DEFAULT. The relevant code is as follows:

    /**
     * Holder maintains values that can only be initialized after the virtual machine has started
     */
    private static class Holder {

        /**
         * Hash table capacity threshold that triggers the use of alternative hashing
         */
        static final int ALTERNATIVE_HASHING_THRESHOLD;

        static {
            // Read the JVM parameter -Djdk.map.althashing.threshold
            String altThreshold = java.security.AccessController.doPrivileged(
                    new sun.security.action.GetPropertyAction(
                            "jdk.map.althashing.threshold"));

            int threshold;
            try {
                // If the parameter has no value, use the default
                threshold = (null != altThreshold)
                        ? Integer.parseInt(altThreshold)
                        : ALTERNATIVE_HASHING_THRESHOLD_DEFAULT;

                // A value of -1 disables alternative hashing.
                // ALTERNATIVE_HASHING_THRESHOLD_DEFAULT is also Integer.MAX_VALUE,
                // so the JDK disables alternative hashing by default
                if (threshold == -1) {
                    threshold = Integer.MAX_VALUE;
                }

                // Any other negative value is considered illegal
                if (threshold < 0) {
                    throw new IllegalArgumentException("value must be positive integer.");
                }
            } catch (IllegalArgumentException failed) {
                throw new Error("Illegal value for 'jdk.map.althashing.threshold'", failed);
            }

            ALTERNATIVE_HASHING_THRESHOLD = threshold;
        }
    }

As mentioned earlier, inflateTable ends by calling the initHashSeedAsNeeded(capacity) method, which decides, based on the capacity, whether hashSeed needs to be initialized (hashSeed defaults to 0). Let's take a look at this method:

    /**
     * A randomizing value associated with this instance that is applied to
     * hash code of keys to make hash collisions harder to find. If 0 then
     * alternative hashing is disabled.
     */
    transient int hashSeed = 0;

    /**
     * Initializes the hash seed on demand
     */
    final boolean initHashSeedAsNeeded(int capacity) {
        // If hashSeed != 0, alternative hashing is currently in use
        boolean currentAltHashing = hashSeed != 0;
        // If the VM has started and the map capacity has reached the threshold, use alternative hashing
        boolean useAltHashing = sun.misc.VM.isBooted() &&
                (capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
        // XOR: false when both operands are the same (both false or both true), true otherwise
        boolean switching = currentAltHashing ^ useAltHashing;
        if (switching) {
            // Switch hashSeed to a random value (or back to 0)
            hashSeed = useAltHashing
                    ? sun.misc.Hashing.randomHashSeed(this)
                    : 0;
        }
        return switching;
    }

As the comment on the hashSeed field explains, hashSeed is a randomizing value applied when computing the hash code of a key, in order to make hash collisions harder to find. If hashSeed = 0, alternative hashing is disabled.

The ALTERNATIVE_HASHING_THRESHOLD maintained in Holder is the capacity threshold that triggers alternative hashing: once the container's capacity reaches this value, the container should switch to alternative hashing.

Holder attempts to read the JVM startup parameter -Djdk.map.althashing.threshold and assigns it to ALTERNATIVE_HASHING_THRESHOLD. Its value has the following meanings (an example JVM invocation follows the list):

  • ALTERNATIVE_HASHING_THRESHOLD = 1: alternative hashing is always used
  • ALTERNATIVE_HASHING_THRESHOLD = -1: alternative hashing is disabled
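
For illustration (the main class name is hypothetical), a JDK 7 JVM could be started so that any HashMap whose capacity reaches 512 switches to alternative hashing for String keys:

    java -Djdk.map.althashing.threshold=512 com.example.Main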

In the initHashSeedAsNeeded(int capacity) method, a random hashSeed is generated if the container's capacity >= ALTERNATIVE_HASHING_THRESHOLD. This seed is then used in the hash method, which is invoked during put:

    /**
     * Gets the hash code of the key and applies a supplemental hash function to
     * form the final hash code. This is critical because HashMap uses power-of-two
     * length hash tables, that otherwise encounter collisions for hashCodes that
     * do not differ in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    final int hash(Object k) {
        // If the hash seed is a random value, use alternative hashing for String keys
        // (method call chain: inflateTable() --> initHashSeedAsNeeded() --> hash();
        // initHashSeedAsNeeded() decided whether the hash seed was initialized.)
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }

        h ^= k.hashCode();

        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

Compute storage index (indexFor)

    /**
     * Returns the index into the hash table computed from the hash code
     */
    static int indexFor(int h, int length) {
        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
        return h & (length - 1);
    }

This code is pretty simple, but it has a few interesting points.
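
One point worth noting up front: because length is always a power of 2, h & (length - 1) keeps exactly the low bits of h, which matches the non-negative remainder of h divided by length but needs no division. A tiny illustrative snippet (values chosen arbitrarily):

    int h = 55;                   // a hash code
    int length = 16;              // a power-of-two capacity
    int index = h & (length - 1); // 7, i.e. the low 4 bits of h
    // Equivalent to the non-negative remainder ((h % length) + length) % length,
    // but computed with a single AND instead of a division.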

Why the capacity is designed to be a power of 2

Note that the capacity is essentially the length of the internal array, and that it is a power of 2, not merely a multiple of 2. Take a look at the following test code:

public class Main {

    static final int hash(Object k) {
        int hashSeed = 0;
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h ^= k.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        String key = "14587";
        int h = hash(key);
        int capacity = 16;
        for (int i = 0; i < 10; i++) {
            System.out.println(String.format("Hash code: %d, capacity: %d, index: %d",
                    h,                             // the same hash code
                    (capacity << i),               // different capacities
                    indexFor(h, capacity << i)));  // the computed index
        }
    }

    // key: hello
    // Hash code: 96207088, capacity: 16, index: 0
    // Hash code: 96207088, capacity: 32, index: 16
    // Hash code: 96207088, capacity: 64, index: 48
    // Hash code: 96207088, capacity: 128, index: 112
    // Hash code: 96207088, capacity: 256, index: 240
    // Hash code: 96207088, capacity: 512, index: 240
    // Hash code: 96207088, capacity: 1024, index: 240
    // Hash code: 96207088, capacity: 2048, index: 240
    // Hash code: 96207088, capacity: 4096, index: 240
    // Hash code: 96207088, capacity: 8192, index: 240

    // key: 4
    // Hash code: 55, capacity: 16, index: 7
    // Hash code: 55, capacity: 32, index: 23
    // Hash code: 55, capacity: 64, index: 55
    // Hash code: 55, capacity: 128, index: 55
    // Hash code: 55, capacity: 256, index: 55
    // Hash code: 55, capacity: 512, index: 55
    // Hash code: 55, capacity: 1024, index: 55
    // Hash code: 55, capacity: 2048, index: 55
    // Hash code: 55, capacity: 4096, index: 55
    // Hash code: 55, capacity: 8192, index: 55

    // key: 14587
    // Hash code: 48489485, capacity: 16, index: 13
    // Hash code: 48489485, capacity: 32, index: 13
    // Hash code: 48489485, capacity: 64, index: 13
    // Hash code: 48489485, capacity: 128, index: 13
    // Hash code: 48489485, capacity: 256, index: 13
    // Hash code: 48489485, capacity: 512, index: 13
    // Hash code: 48489485, capacity: 1024, index: 13
    // Hash code: 48489485, capacity: 2048, index: 1037
    // Hash code: 48489485, capacity: 4096, index: 1037
    // Hash code: 48489485, capacity: 8192, index: 1037

}

hashSeed = 0 is the default value inside HashMap. The main method computes the hash code from the key, and the index from the hash code and the array length. As the test results show, when a table holding the same hash code is expanded several times, the index computed by indexFor changes relatively little, which reduces the number of Entry moves caused by expansion.

Take the key "4" with capacities 16, 32, 64, ... as an example; the indexFor calculation works out as follows:

    The hash code of the string "4" is 55 (binary 110111)
    When length = 16: h & (length - 1) = 55 & (16 - 1) = 110111 & 001111 = 000111 (7)
    When length = 32: h & (length - 1) = 55 & (32 - 1) = 110111 & 011111 = 010111 (23)
    When length = 64: h & (length - 1) = 55 & (64 - 1) = 110111 & 111111 = 110111 (55)

Since the capacity doubles on each expansion (capacity × 2), after a certain number of expansions the mask length - 1 has 1s in all the bit positions that matter for h, so the computed index stops changing. Expansion may still move entries to new indices, but on the whole the indices stay relatively stable.
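
More concretely, with power-of-two capacities, doubling the table means each entry either keeps its old index or moves by exactly the old capacity, depending on the single new bit exposed in length - 1. A small check using the hash code 48489485 from the test output above:

    int h = 48489485;
    int oldCapacity = 1024, newCapacity = 2048;
    int oldIndex = h & (oldCapacity - 1);   // 13
    int newIndex = h & (newCapacity - 1);   // 1037
    // The only difference is the new high bit of the mask:
    // newIndex == oldIndex + (h & oldCapacity), i.e. 13 + 1024 = 1037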

Imagine instead that the capacities were 17, 33, 65, ... Then length - 1 (16, 32, 64, ...) would have only its highest (leftmost) bit set to 1, with all remaining bits 0, and different hash codes ANDed with such a length - 1 would be far more likely to produce the same index. Making every bit of length - 1 equal to 1 spreads the computed indices more uniformly and reduces hash collisions.

To summarize, the capacity is designed to be a power of 2 in order to:

  • In the put method, indexFor is called to compute the index; a power-of-two capacity makes the computed indices relatively uniform and reduces hash collisions
  • In the transfer method used during expansion, indexFor is called again to recompute indices; a power-of-two capacity keeps the recomputed indices relatively stable, reducing the number of moved elements

Capacity expansion and thread safety issues

    /**
     * Rehashes the contents of this map into a new array with a
     * larger capacity.  This method is called automatically when the
     * number of keys in this map reaches its threshold.
     *
     * If current capacity is MAXIMUM_CAPACITY, this method does not
     * resize the map, but sets threshold to Integer.MAX_VALUE.
     * This has the effect of preventing future calls.
     *
     * @param newCapacity the new capacity, MUST be a power of two;
     *        must be greater than current capacity unless current
     *        capacity is MAXIMUM_CAPACITY (in which case value
     *        is irrelevant).
     */
    void resize(int newCapacity) {
        // Cache hash table data
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }
        // Create a new hash table with the expanded capacity
        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable, initHashSeedAsNeeded(newCapacity));
        table = newTable;
        threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

    /**
     * Transfers all entries from the current hash table to the new hash table
     */
    void transfer(Entry[] newTable, boolean rehash) {
        int newCapacity = newTable.length;
        for (Entry<K,V> e : table) {
            while (null != e) {
                Entry<K,V> next = e.next;
                if (rehash) {
                    e.hash = null == e.key ? 0 : hash(e.key);
                }
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];
                newTable[i] = e;
                e = next;
            }
        }
    }

As the transfer code above shows, head insertion means a bucket's linked list ends up in reverse order after the transfer: 3 --> 7 --> 9 becomes 9 --> 7 --> 3. In a single-threaded environment, no closed loop can form.
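
A minimal, simplified simulation of this head-insertion reversal (illustrative code, not the JDK implementation; it assumes all three keys land in the same new bucket):

    public class TransferOrderDemo {
        static class Node {
            int key;
            Node next;
            Node(int key, Node next) { this.key = key; this.next = next; }
        }

        public static void main(String[] args) {
            // Old bucket chain: 3 -> 7 -> 9
            Node oldBucket = new Node(3, new Node(7, new Node(9, null)));

            Node newBucket = null;
            for (Node e = oldBucket; e != null; ) {
                Node next = e.next;
                e.next = newBucket;   // head insertion, as in transfer()
                newBucket = e;
                e = next;
            }

            // Prints: 9 -> 7 -> 3
            for (Node e = newBucket; e != null; e = e.next) {
                System.out.print(e.key + (e.next != null ? " -> " : "\n"));
            }
        }
    }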

However, in a multi-threaded environment, transfer may be called by multiple threads, and the transfer method accesses a global variable table and modifies the Entry pointed to in the subscript. Since the transfer process causes the list to be in reverse order, it is possible to have closed-loop references: 3–>7–>9–>3, and then an infinite loop when the GET method is called.