Preface

One day, Jiong Hui and his colleague Erdog decided to hold a contest for the title of "strongest HashMap expert in Room 11 on the 11th floor of Building &#¥*" (whose only two occupants are Jiong Hui and Erdog). The scene is too bloody; adults, please watch only in the company of minors.


Main Text

Erdog: I listen to you bragging like an idiot every day. It's time you learned what cruel really means.

Jiong Hui: Erdog, careful there. You can eat any old crap, but you can't say any old thing.


Erdog: Let's start with something simple. Describe the underlying data structure of HashMap.

Jiong Hui: On JDK 1.8, which we all use now, the bottom layer is composed of "array + linked list + red-black tree", as shown below. Before JDK 1.8, it was composed of "array + linked list".


Erdog: Why was it changed to "array + linked list + red-black tree"?

Jiong Hui: The main purpose is to improve lookup performance when hash collisions produce long bins: lookup in a linked list is O(n), while in a red-black tree it is O(log n).


Erdog: When is the linked list used, and when the red-black tree?

Jiong Hui: Inserts use linked-list nodes by default. When the number of nodes at the same index position reaches 9 after an insert (TREEIFY_THRESHOLD is 8): if the array length is greater than or equal to 64, the linked-list nodes are converted to red-black-tree nodes (treeifyBin); if the array length is less than 64, the conversion is not triggered and the table is resized instead, because the amount of data is still small.


For removal, when the number of nodes at the same index position drops to 6 after a removal and the nodes at that index are red-black-tree nodes, the tree is converted back to a linked list (untreeify).
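These thresholds appear as named constants in the OpenJDK 1.8 source (the trailing comments are mine):

static final int TREEIFY_THRESHOLD = 8;      // bin is treeified when it grows past this size
static final int UNTREEIFY_THRESHOLD = 6;    // tree is turned back into a list at or below this size
static final int MIN_TREEIFY_CAPACITY = 64;  // below this table size, resize instead of treeifying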


Erdog: Why is the threshold for converting a linked list to a red-black tree 8?

Jiong Hui: In any design there are two important factors to weigh: time and space. The same holds for HashMap. In short, a threshold of 8 is a trade-off between time and space.


A red-black-tree node is roughly twice the size of a linked-list node. When a bin has only a few nodes, the tree's lookup advantage is not noticeable, and it isn't worth paying twice the space for it.


Ideally, with random hash codes, the number of nodes falling into a bucket follows a Poisson distribution, which puts the probability of a bucket reaching 8 nodes at about 0.00000006 (lottery-winning odds, and do you ever win the lottery? No). That probability is low enough, and the performance benefit of a red-black tree starts to show at around 8 nodes, so 8 is a reasonable choice.
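You can reproduce that figure yourself. A minimal sketch (my own demo, not JDK code) that iterates P(k+1) = P(k) × λ / (k+1) with λ = 0.5, the parameter quoted in the HashMap source comment:

public class PoissonDemo {
    public static void main(String[] args) {
        double lambda = 0.5;            // average bucket load assumed in the JDK comment
        double p = Math.exp(-lambda);   // P(0)
        for (int k = 0; k <= 8; k++) {
            System.out.printf("P(%d) = %.8f%n", k, p);
            p = p * lambda / (k + 1);   // P(k+1) = P(k) * lambda / (k + 1)
        }
    }
}

Running it prints P(8) = 0.00000006, the number quoted above.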


Erdog: (Yo yo yo, "a trade-off between time and space", look at him posing.) Then why does switching back to a linked list use 6 instead of reusing 8?

Jiong Hui: If 8 were reused in both directions, a bin would treeify as soon as it reached 8 nodes and untreeify as soon as it dropped below 8. With the node count hovering around 8, the bin would convert back and forth between red-black tree and linked list, causing a performance loss. The gap between 8 and 6 prevents that thrashing.


Erdog: What are the important attributes of HashMap, and what are they used for?

Jiong Hui: Besides the table array that stores our nodes, HashMap has several important attributes: 1) size: the number of nodes the HashMap currently stores; 2) threshold: the resize threshold; when the node count reaches this value, a resize is triggered; 3) loadFactor: the load factor; resize threshold = capacity × loadFactor.
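In the OpenJDK 1.8 source these live as the following fields (the comments are mine):

transient Node<K,V>[] table;   // the bucket array
transient int size;            // number of key-value mappings stored
int threshold;                 // next size value at which to resize
final float loadFactor;        // the load factor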


Erdog: Does threshold have any other use besides storing the resize threshold?

Jiong Hui: When we create a HashMap object, threshold is also used to hold the initial capacity. The HashMap does not initialize the table until we insert a node for the first time, to avoid unnecessary waste of space.
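You can see this in the JDK 1.8 constructor: tableSizeFor's result is parked in threshold, and no table is allocated yet (source from OpenJDK; the comment on the last line is mine):

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity); // threshold temporarily holds the capacity
}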


Erdog: What is the default initial capacity of a HashMap? Are there any constraints on the capacity?

Jiong Hui: The default initial capacity is 16. The capacity of a HashMap must be a power of 2, so from whatever capacity we pass in, the HashMap computes the smallest power of 2 greater than or equal to it; for example, passing 9 yields a capacity of 16.


Erdog: (You're damn well talking in tongue twisters.) And how is this *@%¥#& power of 2 calculated?

Jiong Hui: Talk is cheap. Let me show you the code.

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}


Erdog: God, now he's showing off his English too. Come on, come on, explain this code to me.

Jiong Hui: Piece by piece. int n = cap - 1: first subtract 1 from cap; I'll explain why at the end.

|= (OR-assign): this operator is less common, but you've certainly seen +=, so you get the idea. For example, a |= b is equivalent to a = a | b.


>>> (unsigned right shift): a >>> b shifts a right by the number of bit positions given by b; the vacated high-order bits are filled with zeros, and the bits shifted off the right end are discarded.


Assume n is 0010 0001 (looking only at the low 8 bits); the calculation proceeds as follows:
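(Each arrow shows n after the OR-assign; all bits above the low byte are 0 in this example.)

n             = 0010 0001
n |= n >>> 1  -> 0011 0001
n |= n >>> 2  -> 0011 1101
n |= n >>> 4  -> 0011 1111
n |= n >>> 8  -> 0011 1111
n |= n >>> 16 -> 0011 1111
n + 1         = 0100 0000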

You can see that these five operations smear the highest 1 bit downward: first into 2 consecutive 1s, then 4, 8, 16, and finally up to 32. How many 1s we actually end up with depends on the size of the input, but what is certain is that after these five steps, every bit below the highest 1 is also a 1. Adding 1 at the return then yields a power of 2 greater than or equal to cap.

The cap - 1 at the start is easy to gloss over: it handles the case where cap is already a power of 2, so the method returns cap itself instead of the next power of 2.

At the bottom, computers work in binary; shifts and OR operations are extremely fast, so this method is very efficient.

PS: This is my personal favorite piece of design in HashMap, so clever that I'd like to give the author a big thumbs-up.
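A quick way to convince yourself is to wrap the method in a tiny demo (the class and main are my own scaffolding; MAXIMUM_CAPACITY is the JDK's 1 << 30):

public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(9));   // 16
        System.out.println(tableSizeFor(16));  // 16: already a power of 2, thanks to cap - 1
        System.out.println(tableSizeFor(33));  // 64
    }
}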


Erdog: You said the capacity of a HashMap must be a power of 2. Why is that?

Jiong Hui: The formula for computing the index position is (n - 1) & hash. When n is a power of 2, n - 1 is all 1s in its low bits, so ANDing any hash with n - 1 keeps exactly its low bits. This achieves the same effect as taking a modulus while distributing keys uniformly. The design rests on the identity x mod 2^n = x & (2^n - 1), and & is much more efficient than mod.
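A tiny check of that identity (my own demo; any power-of-2 n works):

public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // a power of 2
        for (int hash : new int[]{9, 17, 33, 12345}) {
            // both expressions give the same index for non-negative hashes
            System.out.println(hash % n + " == " + ((n - 1) & hash));
        }
    }
}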

As shown in the figure below, when n is not a power of 2, the probability of hash collisions increases significantly.


Erdog: You said the default initial capacity of HashMap is 16. Why 16 and not something else?

Jiong Hui: I think the main reason is that 16 is a power of 2 of reasonable size; 8 or 32 would arguably have been fine as well. In practice, when we create a HashMap, it makes the most sense to set the initial capacity according to how we intend to use it.


Erdog: You just mentioned the load factor. What is its default value?

Jiong Hui: The default load factor is 0.75.


Erdog: Why 0.75 and not something else?

Jiong Hui: (This silly kind of question again.) This too is the result of a time-space trade-off. If the value is higher, say 1, the space cost drops but the probability of hash collisions rises, increasing the lookup cost. If the value is lower, say 0.5, hash collisions drop but half the space is wasted. So 0.75 looks like a reasonable compromise.


Erdog: Why not 0.74 or 0.76?

Jiong Hui: …


Erdog: Fine, let's switch questions. What is the insertion process of a HashMap?

Jiong Hui: Talk is cheap. Let me show you the picture.
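The flow in the picture boils down to: compute the hash, lazily initialize the table, locate the bucket with (n - 1) & hash, replace the value if the key already exists, otherwise append at the tail, then check whether a resize is needed. Here is a toy sketch of that flow; it is my own simplified re-implementation, not the JDK source, and treeification and the resize body are elided:

import java.util.Objects;

public class ToyHashMap<K, V> {
    static class Node<K, V> {
        final int hash; final K key; V value; Node<K, V> next;
        Node(int hash, K key, V value) { this.hash = hash; this.key = key; this.value = value; }
    }

    Node<K, V>[] table;
    int size;
    int threshold;
    final float loadFactor = 0.75f;

    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    @SuppressWarnings("unchecked")
    public V put(K key, V value) {
        if (table == null) {                            // lazy initialization on first insert
            table = (Node<K, V>[]) new Node[16];
            threshold = (int) (16 * loadFactor);
        }
        int hash = hash(key);
        int i = (table.length - 1) & hash;              // index = (n - 1) & hash
        if (table[i] == null) {
            table[i] = new Node<>(hash, key, value);    // empty bucket: place the node directly
        } else {
            Node<K, V> e = table[i];
            while (true) {
                if (e.hash == hash && Objects.equals(e.key, key)) {
                    V old = e.value;
                    e.value = value;                    // existing key: replace the value
                    return old;
                }
                if (e.next == null) break;
                e = e.next;
            }
            e.next = new Node<>(hash, key, value);      // tail insertion, as in JDK 1.8
            // a real HashMap would treeify here once the bin grows past 8
        }
        if (++size > threshold) { /* resize() elided in this sketch */ }
        return null;
    }
}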


Erdog: In the picture there is a hash value computed for the key. How is that designed?

Jiong Hui: Take the key's hashCode and XOR it with its own high 16 bits; the result is the final hash value.

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}


Erdog: Why bring in the high 16 bits of the hashCode?

Jiong Hui: The main purpose is to let the high bits of the hashCode take part in the index calculation even when the table is small, and it adds almost no overhead.

For example, in the figure below, without the high-bit mixing, since n - 1 is 0000 0111, the result depends only on the low 3 bits of the hash value; no matter how the high bits change, the result is the same.


If we mix the high bits into the calculation, the index no longer depends only on the low bits.
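A hand-picked demonstration (my own example values): two hash codes that differ only in their high 16 bits collide without the XOR, but land apart with it.

public class HashSpreadDemo {
    static int spread(int h) { return h ^ (h >>> 16); }

    public static void main(String[] args) {
        int n = 16;
        int h1 = 0x10000, h2 = 0x20000;  // identical low 16 bits, different high bits
        System.out.println(((n - 1) & h1) + " vs " + ((n - 1) & h2));                 // 0 vs 0: collision
        System.out.println(((n - 1) & spread(h1)) + " vs " + ((n - 1) & spread(h2))); // 1 vs 2: spread apart
    }
}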


Erdog: Can you walk through the resize process?

Jiong Hui: (shows the resize flowchart)


For both red-black trees and linked lists, a node's index in the new table is determined by whether e.hash & oldCap == 0.

Please look at the following example.

Before the resize, the table capacity was 16, and nodes A and B sat at the same index position.


After the resize, the table length is 32. The new table's n - 1 differs from the old table's n - 1 by just one extra 1 in the high bits (marked in red).

Since the two nodes shared an index position in the old table, their index in the new table depends only on that extra high bit (marked in red), and its value is exactly oldCap.


1) If (e.hash & oldCap) == 0, the new index is the old index; 2) if (e.hash & oldCap) != 0, the new index is the old index + oldCap.
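A small check of the split rule with hand-picked hash values (my own demo; oldCap = 16, newCap = 32):

public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        for (int hash : new int[]{5, 21, 9, 25}) {   // 21 = 16 + 5, 25 = 16 + 9
            int oldIdx = (oldCap - 1) & hash;        // index in the old table
            int newIdx = (newCap - 1) & hash;        // index in the new table
            System.out.printf("hash=%d oldIdx=%d newIdx=%d hash&oldCap=%d%n",
                    hash, oldIdx, newIdx, hash & oldCap);
        }
    }
}

For hashes 5 and 9 the checked bit is 0 and the index stays put; for 21 and 25 it equals oldCap (16) and the index shifts by exactly oldCap.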


Erdog: Is HashMap thread-safe?

Jiong Hui: No. Under concurrency, puts can silently overwrite each other's data, and traversing while another thread modifies the map throws ConcurrentModificationException. Before JDK 1.8 there was also an infinite-loop problem.


Erdog: Tell me about the infinite-loop problem?

Jiong Hui: The root cause of the infinite loop is that JDK 1.7 used "head insertion" during resize, which reverses the order of the nodes at an index position. From JDK 1.8 onward, "tail insertion" is used: node order is preserved across a resize, and the infinite loop is gone.


The resize (transfer) code in JDK 1.7.0 is shown below.

void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];   // head insertion: point the moved node at the current head
                newTable[i] = e;        // the moved node becomes the new head, reversing the order
                e = next;
            } while (e != null);
        }
    }
}


PS: This process is hard to follow; I recommend simulating the code by hand, step by step.

Example: we have a HashMap with capacity 2 and loadFactor = 0.75. Thread 1 and thread 2 insert data into it at the same time, and both trigger the resize process, leading to the following sequence.


1) Both threads have inserted a node and are about to run the resize; before it begins, the structure is as shown in the figure below.


2) Thread 1 runs the resize and is suspended immediately after executing Entry<K,V> next = e.next. The structure as thread 1 sees it is shown below.


3) While thread 1 is suspended, thread 2 enters the resize process and runs it to completion. The structure as thread 2 sees it is shown below.


Since both threads operate on the same table, the diagram can be drawn as follows.


4) After thread 1 resumes, it completes its first loop iteration, producing the structure shown below.


5) Thread 1 continues through the second loop, with the structure shown below.

6) Thread 1 executes a third loop iteration, and at e.next = newTable[i] it forms a cycle. The structure after the third iteration is shown below.


At this point, if thread 1 calls map.get(11), the tragedy strikes: an infinite loop.


Erdog: (Damn, I didn't follow that. Awkward.) Then, to sum up: what were the main optimizations in JDK 1.8?

Jiong Hui: We just walked through them. The major optimizations in JDK 1.8 include the following:


1) The underlying data structure changed from "array + linked list" to "array + linked list + red-black tree", mainly to improve lookup performance on long lists under heavy hash collisions: O(n) -> O(log n).


2) The way the table's initial capacity is computed changed. The old way starts from 1 and shifts left until it finds a value greater than or equal to the requested capacity; the new way computes it with five "shift + OR-assign" operations.

// JDK 1.7.0
public HashMap(int initialCapacity, float loadFactor) {
    // ... omitted
    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    // ... omitted
}
// JDK 1.8.0_191
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}


3) The hash computation was optimized. The old version does a pile of seemingly random shifts and XORs, while the new one simply mixes the high 16 bits into the calculation.

// JDK 1.7.0
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
// JDK 1.8.0_191
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}


4) During resize, insertion changed from "head insertion" to "tail insertion", avoiding the infinite loop under concurrency.


5) During resize, the way a node's index in the new table is computed changed from recomputing hash & (newCapacity - 1) to checking hash & oldCap. The performance gain may be small, but the design is cleverer and more elegant.


Erdog: Besides HashMap, what other Maps have you used? How do you choose among them?

Jiong Hui: (answers with a picture)


Erdog: (Not good, this guy knows HashMap even better than I do. Better slip away fast.) It's time for me to have dinner with my girlfriend; we'll settle the score another day.

Jiong Hui: …


Follow [Programmer Jiong Hui] and reply [interview] to get interview materials.

The harder you work, the luckier you get. Sharing to [Moments] and tapping [Looking] are the biggest support for Jiong Hui.