Introduction to HashMap collection

Features:

  • HashMap is an important implementation class of the Map interface. It is based on a hash table, stores data as key-value pairs, and is not thread safe.
  • Null can be used as a key, but there can be at most one null key; any number of values may be null.
  • Elements are stored and retrieved in no guaranteed order.

Underlying data structure:

  • Before JDK 1.8, the structure was an array plus linked lists: the array is the main body of storage, and the linked lists exist to resolve hash conflicts.
  • Since JDK 1.8, the structure is array + linked list + red-black tree. When a list's length exceeds the threshold (8 by default) and the array length is at least 64, the list is converted to a red-black tree to resolve hash conflicts.

Note: If a list's length exceeds 8 but the array length is less than 64, the linked list is not converted into a red-black tree; instead, the array is expanded.

Why: avoid red-black trees while the array is small. A red-black tree is a self-balancing binary search tree with a relatively complex structure: keeping it balanced requires left rotations, right rotations, color changes, and other operations. When the array is small, expanding the array is cheaper and faster than maintaining a red-black tree. In summary: to improve performance and reduce search time, a linked list is converted to a red-black tree only when the list length exceeds 8 and the array length is at least 64. For details, see the treeifyBin() method.
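To see why long chains matter at all, here is a minimal sketch (the class name TreeifyMotivation is made up for this example) that builds many distinct String keys sharing one hash value. "Aa" and "BB" are a well-known String hashCode collision, and strings concatenated from those blocks collide as well, which is exactly the degenerate case the treeify threshold guards against:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TreeifyMotivation {
    // Strings built from the blocks "Aa" and "BB" all share one hashCode,
    // because "Aa".hashCode() == "BB".hashCode() == 2112.
    public static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(4); // 16 distinct keys
        Set<Integer> hashes = new HashSet<>();
        for (String k : keys) hashes.add(k.hashCode());
        // All 16 keys collapse to a single hash value -> one bucket,
        // i.e. a chain far longer than the treeify threshold of 8.
        System.out.println(keys.size() + " keys, " + hashes.size() + " distinct hash value(s)");
    }
}
```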

HashMap storage data structure diagram:

2. The underlying process of storing data in a HashMap

1. Analyze the following code:

package hashmap_demo;

import java.util.HashMap;

public class HashMapTest {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("Liuyan", 18);
        map.put("Yang Mi", 28);
        map.put("Andy Lau", 40);
        map.put("Liuyan", 20);
        System.out.println(map); // {Yang Mi=28, Liuyan=20, Andy Lau=40} (iteration order is not guaranteed)
    }
}
2. HashMap storage process diagram:

3. Storage process analysis:

1. When HashMap<String, Integer> map = new HashMap<>(); is executed, a HashMap instance is created. Before JDK 1.8, the constructor created an Entry[] table array of length 16 to store key-value pairs. Since JDK 1.8, the array is no longer created in the constructor; instead, the Node[] table array is created on the first call to the put() method (that is, the first time an element is added to the HashMap).

Note: instance creation changed around JDK 1.8 in two main ways: the timing of array creation changed, and the array type changed from Entry[] to Node[].

2. To store Liuyan-18 in the hash table, the hash value of "Liuyan" is computed using the hashCode() method overridden in the String class, and the index of "Liuyan" in the Node[] array is then computed from that hash combined with the array length. If there is no data at that index, Liuyan-18 is inserted there directly. For example, suppose the index computed for Liuyan is 3, as shown in the figure above.

Interview question: What algorithm is used at the bottom of the hash table to calculate the index value? What other algorithms compute index values?

Answer: The hash is derived from the key's hashCode() method and spread using an unsigned right shift (>>>) and bitwise XOR (^); the index is then computed with a bitwise AND (&) against the array length minus one. Other algorithms for computing an index include the mid-square method, the remainder (modulo) method, and the pseudo-random number method.

Remainder method: 10 % 8 = 2, 11 % 8 = 3. Bit operations are the most efficient; the other methods are slower.
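As a hedged illustration of the shift/XOR/AND steps, the helper below mirrors JDK 8's HashMap.hash() spreading; the class and variable names are made up for the example:

```java
public class IndexDemo {
    // Mirrors JDK 8's HashMap.hash(): XOR the high 16 bits of hashCode()
    // into the low 16 bits, so that small tables (which only use the low
    // bits of the hash) are still influenced by the high bits.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16; // table length, always a power of two
        // "Aa".hashCode() is 2112; its spread hash is also 2112 (the high
        // 16 bits are zero), and 2112 & (16 - 1) = 0, so "Aa" lands in bucket 0.
        System.out.println(hash("Aa"));           // 2112
        System.out.println((n - 1) & hash("Aa")); // 0
    }
}
```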

3. To store Yang Mi-28, the computed index slot turns out to be empty, so it is inserted directly.

4. To store Andy Lau-40, suppose the computed index is also 3. That slot is no longer null, so the hash values of Liuyan and Andy Lau are compared. Since they are not equal, a new node is chained from that index position to store Andy Lau-40. This approach is called the zipper (separate chaining) method.

p = tab[i = (n - 1) & hash], that is, index = hash & (array length - 1). When the length is a power of two, this bitwise AND is equivalent to a modulo operation: since 51 % 16 = 3 and 19 % 16 = 3, two different hash values can map to the same index in the same array.

5. Finally, Liuyan-20 is stored in the hash table. The index computed for "Liuyan" is again 3. Because that index already holds data, Liuyan's hash value is compared with the hashes of the other nodes at that index. If two hash values are equal, a hash collision has occurred, and the String class's equals() method is called to compare the contents of the two keys:

Same: the value added later overwrites the earlier value; that is, Liuyan-20 overwrites Liuyan-18.

Different: continue comparing with the other nodes at that index position; if none matches, chain a new node to store the entry (zipper method).

Note: If the chain at an index position grows so that the list length exceeds the threshold 8 and the array length is at least 64, the list is converted to a red-black tree. This is because lookup in a linked list is O(n) while lookup in a red-black tree is O(log n), and O(n) > O(log n) once the list gets long.
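The overwrite-on-equal-key behavior described above can be observed directly through put()'s return value; a small sketch (names are illustrative):

```java
import java.util.HashMap;

public class OverwriteDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        // First put: the key is new, so put() returns null.
        Integer old = map.put("Liuyan", 18);
        System.out.println(old); // null
        // Same key again: the value is overwritten and the old value returned.
        old = map.put("Liuyan", 20);
        System.out.println(old); // 18
        System.out.println(map.get("Liuyan")); // 20
        System.out.println(map.size()); // 1 -- still a single entry
    }
}
```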

3. Capacity expansion mechanism of HashMap

1. When will HashMap be expanded?

First look at the put() method flow for adding elements:

Description:

  • In the figure above, size is the real-time number of key-value pairs in the HashMap; it is not the same as the array length;
  • threshold (critical value) = capacity (array capacity) * loadFactor (load factor); the threshold is the maximum number of elements the current array should hold. If size exceeds the threshold, resize() is called to expand the capacity to twice its previous value;
  • By default, 16 * 0.75 = 12, i.e., the HashMap expands once it stores more than 12 elements.
2. What is the size of a HashMap after expansion?
Twice the original capacity; that is, each resize doubles the array length (2n).
3. What is the default initial capacity of a HashMap?

When a HashMap is created with the no-argument constructor, the default initial capacity is 16.

/**
 * Constructs an empty HashMap with the default initial capacity
 * (16) and the default load factor (0.75).
 */
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

Source of the default initial capacity:

/** The default initial capacity - MUST be a power of two. */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;

As the source shows, the default initial capacity of a HashMap is 1 shifted left by 4 bits, i.e. 1 * 2^4 = 16. When a HashMap is initialized with the no-argument constructor, the resize() method is triggered the first time an element is put, and the table is created with capacity 16. This is similar to ArrayList initialization: the no-argument ArrayList constructor creates an empty array, and the first time an element is added, the grow() method expands it to 10.

4. Why must the specified initial capacity be a power of 2?

A HashMap can be initialized with a specified capacity.

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and the default load factor (0.75).
 *
 * @param  initialCapacity the initial capacity.
 * @throws IllegalArgumentException if the initial capacity is negative.
 */
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

Construct an empty HashMap with a specified capacity and default load factor (0.75).

When adding an element to a HashMap, the index position is first computed from the key's hash value combined with the array length. For efficient access, HashMap must minimize hash collisions and distribute data evenly, so it computes the index with the bitwise AND hash & (length - 1).

Conceptually, HashMap uses a modulo algorithm to compute the index, i.e. hash % length. However, the modulo operation is slower than a bit operation, so the implementation uses the bitwise AND hash & (length - 1) instead. The two algorithms are equivalent only when length is a power of two (2^n).
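A quick way to convince yourself of the equivalence claim (a throwaway check, not HashMap source):

```java
public class ModVsMaskDemo {
    public static void main(String[] args) {
        int length = 16; // a power of two
        for (int hash = 0; hash < 100; hash++) {
            // For non-negative hash values and a power-of-two length,
            // hash % length and hash & (length - 1) always agree.
            if (hash % length != (hash & (length - 1))) {
                throw new AssertionError("mismatch at hash = " + hash);
            }
        }
        System.out.println("hash % 16 == hash & 15 for all tested values");
    }
}
```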

5. Why does a power-of-two length distribute data evenly?

We need to know two conclusions:

  • 2^n in binary is a 1 followed by n zeros; for example, 2^4 = 16, which is 10000 in binary;
  • 2^n - 1 in binary is n ones; for example, 2^4 - 1 = 15, which is 1111 in binary.

An example of why arrays whose length is a power of two distribute data uniformly:

Bitwise AND: a result bit is 1 only if both operand bits are 1; otherwise it is 0. Suppose the array length is 2^3 = 8 and the hash value is 3: 3 & (8 - 1) = 3, so the index is 3. With a hash value of 2: 2 & (8 - 1) = 2, so the index is 2. The operations are:

3 & (8 - 1)
0000 0011  --> 3
0000 0111  --> 7
----------------
0000 0011  --> 3

2 & (8 - 1)
0000 0010  --> 2
0000 0111  --> 7
----------------
0000 0010  --> 2

Conclusion: different hash values produce different index positions, so data is distributed uniformly.

Now suppose the array length is not a power of two, say 9. The operations become:

Suppose the array length is 9 and the hash value is 3: 3 & (9 - 1) = 0, so the index is 0. With a hash value of 2: 2 & (9 - 1) = 0, so the index is also 0. The operations are:

3 & (9 - 1)
0000 0011  --> 3
0000 1000  --> 8
----------------
0000 0000  --> 0

2 & (9 - 1)
0000 0010  --> 2
0000 1000  --> 8
----------------
0000 0000  --> 0

Conclusion: both index values are 0, so a large amount of data piles up at the same index position while other slots stay empty. This produces long linked lists or red-black trees and reduces efficiency.

Note: hash % length is equivalent to hash & (length - 1) only when the array length is a power of two. Since the implementation computes the index with the bitwise AND, the array length must be a power of two.

6. What if the specified initial capacity is not a power of two?
HashMap then uses unsigned right shifts and bitwise OR operations to round it up to the smallest power of two greater than or equal to the specified capacity. For example, an initial capacity of 10 results in 16.

The source code involved in this process is as follows:

// Create a HashMap with capacity 10, which is not a power of two
HashMap<String, Integer> map = new HashMap<>(10);

// The one-argument constructor is called first
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

// this(...) delegates to the two-argument constructor
public HashMap(int initialCapacity, float loadFactor) { // initialCapacity = 10
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity); // initialCapacity = 10
}

// The tableSizeFor() method does the rounding
/** Returns a power of two size for the given target capacity. */
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

Analysis of tableSizeFor():

  • int n = cap - 1; Why subtract 1 first?
This handles the case where cap is already a power of two. If cap were already a power of two and the 1 were not subtracted, the unsigned right shifts below would return twice the value of cap.
  • For cap = 0 or cap = 1, the method simply returns 1; this edge case is not discussed further here.
  • | is the bitwise OR operation: a result bit is 1 if either operand bit is 1, and 0 only if both bits are 0.

1st operation:

int n = cap - 1;  // cap = 10, n = 9
n |= n >>> 1;     // unsigned right shift by 1 bit, then OR with n
00000000 00000000 00000000 00001001  // n = 9
00000000 00000000 00000000 00000100  // 9 >>> 1 = 4
-----------------------------------------------
00000000 00000000 00000000 00001101  // n = 13

2nd operation:

int n = 13;
n |= n >>> 2;
00000000 00000000 00000000 00001101  // n = 13
00000000 00000000 00000000 00000011  // 13 >>> 2 = 3
------------------------------------------------
00000000 00000000 00000000 00001111  // n = 15

3rd operation:

int n = 15;
n |= n >>> 4;
00000000 00000000 00000000 00001111  // n = 15
00000000 00000000 00000000 00000000  // 15 >>> 4 = 0
------------------------------------------------
00000000 00000000 00000000 00001111  // n = 15

The remaining shifts (n >>> 8 and n >>> 16) also leave n = 15, and because there is an n + 1 at the end, the final result is 16.

Conclusion: as the steps above show, if the specified initial capacity is not a power of two, the operations produce the smallest power of two greater than or equal to it.
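The whole rounding routine can be lifted out and tested on its own. The sketch below copies the JDK 8 tableSizeFor() logic into a standalone class (the class name is made up for the example):

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Copy of JDK 8's HashMap.tableSizeFor(): rounds cap up to the next
    // power of two via the "subtract 1, smear bits right, add 1" trick.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16 -- already a power of two
        System.out.println(tableSizeFor(17)); // 32
        System.out.println(tableSizeFor(0));  // 1  -- the n < 0 edge case
    }
}
```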

4. HashMap source code analysis

1. Member variables

private static final long serialVersionUID = 362498820763181265L; // serialization version number

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // default initial capacity, must be a power of two

static final int MAXIMUM_CAPACITY = 1 << 30; // maximum capacity: 2^30

static final float DEFAULT_LOAD_FACTOR = 0.75f; // default load factor
/**
 * 1. The load factor measures how full the HashMap is; the real-time load factor is size / capacity.
 * 2. If the load factor is too large, element lookup becomes slow; if too small, the array is underused.
 *    The default 0.75f is the well-tested critical value chosen by the JDK authors.
 * 3. When the number of elements reaches 75% of the array length, the HashMap is considered too crowded
 *    and needs to expand. Expansion involves rehashing, data copying, and other costly operations, so
 *    resizing should be minimized during development -- for example by specifying an initial capacity
 *    when creating the HashMap.
 * 4. The load factor can also be specified in the HashMap constructor.
 */
HashMap(int initialCapacity, float loadFactor) // constructs an empty HashMap with the given initial capacity and load factor

static final int TREEIFY_THRESHOLD = 8; // treeify when a list grows longer than 8

static final int UNTREEIFY_THRESHOLD = 6; // convert a red-black tree back to a list when it shrinks below 6 nodes

static final int MIN_TREEIFY_CAPACITY = 64; // second treeify condition: the array length must be at least 64

5. Frequently asked interview questions

1. What are the conditions for a hash collision?
A hash collision occurs when two different keys have equal hash values and therefore map to the same index.
2. How are hash conflicts resolved?
Before JDK 1.8, with linked lists; since JDK 1.8, with linked lists plus red-black trees.
3. If two keys have the same hashCode, how are they stored?
equals() is used to compare the key contents. Same: the new value overwrites the old one. Different: a new node is chained from the same index position (zipper method).
4. The underlying data structure of HashMap?

JDK 1.8: array + linked list + red-black tree, where the array is the main body and the linked lists and red-black trees exist to resolve hash conflicts, as shown in the figure below:

5. Why are red-black trees introduced in JDK1.8? Isn’t the red-black tree structure more complicated?

Before JDK 1.8, the underlying structure of HashMap was array + linked list. Even with a well-designed hash function, it is difficult to achieve a 100% uniform distribution of elements in the hash table. When a large number of elements end up in the same bucket (the same index position), a long linked list forms under that bucket, and the HashMap degenerates into a singly linked list: with n elements, traversal takes O(n) time, which is very inefficient. JDK 1.8 introduced red-black trees for this case. Traversing a red-black tree takes O(log n), and since O(n) > O(log n), this optimizes the problem.

6. Why should a list length greater than 8 be converted into a red-black tree?

We know that 8 is the threshold for converting a linked list into a red-black tree, and the source code explains the choice in a comment:

/*
 * Because TreeNodes are about twice the size of regular nodes, we use them only when
 * bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they
 * become too small (due to removal or resizing) they are converted back to plain bins.
 * In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally,
 * under random hashCodes, the frequency of nodes in bins follows a Poisson distribution
 * (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on
 * average for the default resizing threshold of 0.75, although with a large variance
 * because of resizing granularity. Ignoring variance, the expected occurrences of list
 * size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:
 *
 * 0:    0.60653066
 * 1:    0.30326533
 * 2:    0.07581633
 * 3:    0.01263606
 * 4:    0.00157952
 * 5:    0.00015795
 * 6:    0.00001316
 * 7:    0.00000094
 * 8:    0.00000006
 * more: less than 1 in ten million
 */

The comment means:

A red-black tree node occupies about twice the space of an ordinary linked-list node, and under random hash codes the number of nodes stored per bin follows a Poisson distribution. The probability that a bin holds 8 nodes is 0.00000006, so chains of more than 8 nodes are extremely unlikely.

From the above analysis, it can be concluded that:

  • If the threshold were lower than 8, red-black trees would be used too often, introducing their structural complexity for short chains;
  • If the threshold were higher than 8, long linked lists would be kept even though lookups in them are slow, and tree nodes would hardly ever be used;
  • Therefore, the threshold 8 is a scientifically reasonable value: a tradeoff between space and time.
7. Why is the load factor set to 0.75, making the default threshold 12?
  • If the load factor were 0.4, then 16 * 0.4 = 6: the map would expand after only 6 slots are filled, leaving the array utilization too low;
  • If the load factor were 0.9, then 16 * 0.9 = 14: the array would get very full before expanding, making it likely that the linked list under some index grows too long and element lookup becomes inefficient;
  • Therefore, balancing array utilization against list length, 0.75 was found to be the best value after extensive testing.
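The arithmetic behind these bullets can be sketched directly (truncating to int as resize() effectively does; the class name is made up for the example):

```java
public class LoadFactorDemo {
    public static void main(String[] args) {
        int capacity = 16;
        // threshold = capacity * loadFactor, truncated to an int
        System.out.println((int) (capacity * 0.4f));  // 6  -> resizes too early, wastes space
        System.out.println((int) (capacity * 0.75f)); // 12 -> the default tradeoff
        System.out.println((int) (capacity * 0.9f));  // 14 -> long chains before resizing
    }
}
```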