Abstract: An in-depth analysis of how the Map interface is used and how the underlying HashMap is implemented.

This article is shared from the Huawei Cloud Community post "[Illustrated] In-depth Analysis of High-Frequency HashMap Interview Questions and the Underlying Implementation Structure!", original author: Grey Monkey Monkey.

You’ve all heard of the Map interface, right? It is a common way to store key-value pairs in Java, and I believe you are also familiar with HashMap. Ask anyone who knows a little about it and they will tell you: HashMap stores key-value pairs, and it does so using an array plus linked lists. But do you actually know how it stores data and how the underlying structure is implemented?

Many people will say that they only need to know how to use it, not how it is implemented. On the contrary, knowing how to use it is not enough; questions about the underlying implementation of HashMap come up constantly in Java development interviews. So today I’m going to take you through how the Map interface is used and how HashMap is implemented under the hood.

Read on patiently; I promise you will come away with something.

1. What is the relationship between the Map interface and the List interface?

If these two interfaces are related at all, it is only in one respect: both are collections used to store data. Beyond that, the Map interface and the List interface are not closely related. Why?

Let’s start with the List interface. As mentioned before, List is a sub-interface of the Collection interface and is used only for single-column storage of data. Its inheritance hierarchy is as follows:

The Map interface, on the other hand, is a top-level interface with a number of different implementation classes beneath it, all used to store key-value pairs. Its inheritance hierarchy is as follows:

So don’t confuse the relationship between the Map interface and the List interface.

2. What are the common implementation classes of Map?

From the inheritance structure of Map above, we can see a number of implementation classes, many of which are familiar, such as HashMap, TreeMap, and Hashtable. Interviewers often ask which implementation classes under the Map interface are commonly used and what they are for, so let's briefly introduce and analyze them.

HashMap: As mentioned above, the underlying implementation of HashMap is array + linked list + red-black tree. The default initial size of the array is 16, the load factor is 0.75, and the capacity doubles on each resize. That is, whenever 75% of the current array capacity is in use, the array is expanded to twice its size.
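A minimal sketch of those defaults (the constant names mirror the JDK source, but this is a stand-alone illustration rather than the real HashMap code):

    public class HashMapDefaultsDemo {
        // Default values as described above.
        static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // 16
        static final float DEFAULT_LOAD_FACTOR = 0.75f;

        public static void main(String[] args) {
            // Resizing is triggered once this many entries are stored: 16 * 0.75 = 12.
            int threshold = (int) (DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
            System.out.println("resize threshold = " + threshold);                               // 12
            System.out.println("capacity after one resize = " + (DEFAULT_INITIAL_CAPACITY << 1)); // 32
        }
    }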

Hashtable: Hashtable is thread-safe, but it dates back a long way and is essentially a legacy class today; it is not recommended for use in new development.

ConcurrentHashMap: This is the thread-safe Map implementation class that is widely used today. In JDK 1.7 and earlier it achieved thread safety with a segmented-locking (Segment) mechanism; since JDK 1.8, thread safety is implemented with CAS plus the synchronized keyword.
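As a quick usage sketch (a hypothetical hit-counter example; a ConcurrentHashMap can be shared across threads without external locking):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentCounterDemo {
        public static void main(String[] args) {
            Map<String, Integer> counts = new ConcurrentHashMap<>();
            // merge performs an atomic read-modify-write for the given key.
            counts.merge("page-view", 1, Integer::sum);
            counts.merge("page-view", 1, Integer::sum);
            System.out.println(counts.get("page-view")); // 2
        }
    }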

HashMap is the implementation most frequently asked about in interviews and the one most worth understanding and mastering in daily development. So let's take a closer look at how HashMap works and how it tends to come up in interviews.

3. Please explain the put process of HashMap?

We know that HashMap stores data through its put method, which takes two parameters: key and value. How is such a key-value pair actually stored? Let's walk through it.

HashMap is implemented as an array plus linked lists. The array forms the top layer of the structure, and keys that map to the same bucket have their key-value pairs chained together underneath it as a linked list.

Note that the "same" keys here are not keys that are literally equal; they are keys whose computed array index is the same.

HashMap computes the array index for a value from its key. If there is no element at that index yet, the new element is stored there directly. If the index is already occupied, the linked-list storage described above comes into play and the data is appended to the list in order. That is the basic put procedure; the stored result looks like this:
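To make that flow concrete, here is a toy sketch assuming a simplified Node class and a naive index calculation; it is purely illustrative and not the actual JDK code:

    // A toy illustration of the put flow described above (null keys are not handled).
    class SimpleHashMap<K, V> {
        static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value) { this.key = key; this.value = value; }
        }

        @SuppressWarnings("unchecked")
        private Node<K, V>[] table = (Node<K, V>[]) new Node[16];

        public void put(K key, V value) {
            int index = (key.hashCode() & 0x7fffffff) % table.length; // simplified index calculation
            if (table[index] == null) {                   // empty bucket: store directly
                table[index] = new Node<>(key, value);
                return;
            }
            for (Node<K, V> e = table[index]; ; e = e.next) {                     // occupied bucket: walk the list
                if (e.key.equals(key)) { e.value = value; return; }               // same key: replace the value
                if (e.next == null) { e.next = new Node<>(key, value); return; }  // otherwise append at the tail (JDK 1.8 style)
            }
        }
    }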

However, we sometimes store a great deal of data. If we keep storing it purely as linked lists, the lists can grow very long, which makes both deletion and insertion expensive. What should we do in that case?

This is where "treeifying" and "chaining back" come in when data is stored in a linked list. What do "treeifying" and "chaining back" mean?

When we store key-value pairs and too many of them land at the same array index, the linked list in that bucket grows long, which makes deletion and insertion inefficient. Java therefore stipulates that when the list length exceeds 8, the list is "treeified", that is, converted into a red-black tree (a self-balancing binary search tree in which values in a node's left subtree are smaller than the node and values in its right subtree are larger). Looking up an element then works like a binary search, so search efficiency improves greatly.

But what happens if we delete some nodes and the list length is no longer greater than 8? Is the red-black tree immediately turned back into a linked list? No. Only when the node count drops to 6 or below is the red-black tree converted back into a linked list. This process is called "chaining back".
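For reference, these two thresholds correspond to constants in the JDK 8 HashMap source (shown here in isolation, with paraphrased comments):

    // A bucket's linked list is converted into a red-black tree once it holds this many nodes.
    static final int TREEIFY_THRESHOLD = 8;

    // A tree bin is converted back into a linked list when it shrinks to this size or below.
    static final int UNTREEIFY_THRESHOLD = 6;

    // The JDK additionally requires the table itself to hold at least this many buckets
    // before treeifying; if it is smaller, the table is resized instead.
    static final int MIN_TREEIFY_CAPACITY = 64;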

The process is illustrated as follows:

So why treeify at length 8 and chain back at 6? Why not simply chain back as soon as the length falls below 8?

The main reason is this: if the tree were chained back as soon as an element was deleted and the length dropped below 8, then as soon as another element was added and the length reached 8 again, it would have to be re-treeified. Repeatedly chaining back and treeifying like this is particularly time-consuming and troublesome. Therefore, the rule is that a list is treeified only when its length reaches 8, and a tree is chained back only when it shrinks to 6. Keep these two thresholds in mind: treeify at 8, chain back at 6.

4. In what order is the data stored in the linked list?

Now that we know how elements in a HashMap are stored, interviewers may also ask: in a HashMap, is a new element added to a bucket's linked list inserted at the head node or at the tail node?

This is what we need to know about the storage of the linked list elements in a HashMap.

In JDK 1.7 and earlier, new nodes were inserted at the head of the list; since JDK 1.8, they are inserted at the tail.

5. How is the Hash(key) method implemented?

Now that we know how the elements in a HashMap are stored, how is the array index computed from the key?

We know that the initial capacity of a HashMap is 16. For those initial 16 slots, the simplest way to decide where a piece of data should be stored based on its key is to derive an int value from the key and take the remainder when dividing by 16:

    int hashCode = key.hashCode();
    int index = hashCode % 16;   // result falls in the range 0..15

The result always falls between 0 and 15. This is the most primitive way to compute hash(key).

In practice, however, the hash(key) computed this way is not ideal: the elements end up poorly spread across the array, and the modulo operation is relatively expensive for the computer compared with bit operations.

Therefore, to make the result as well spread out as possible, the usual way to compute the array index is to first derive a hashCode from the key, XOR the high 16 bits of that hashCode with its low 16 bits, and then AND the result with the current array length minus one. The outcome is the array index, as follows:

    int hashCode = key.hashCode();
    int hash = hashCode ^ (hashCode >>> 16);   // XOR the high 16 bits with the low 16 bits
    int index = (table.length - 1) & hash;     // AND with (length - 1) to get the array index
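For reference, the JDK 1.8 version of this hash method looks like the following (note that a null key always maps to hash 0):

    static final int hash(Object key) {
        int h;
        // Fold the high 16 bits into the low 16 bits so that both halves
        // of hashCode() influence the final bucket index.
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }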

At the same time, a few points to note:

The hash(key) calculation differs slightly between JDK 1.7 and JDK 1.8.

In JDK 1.8, the computed hash(key) is perturbed twice in total.

In JDK 1.7, the computed hash(key) is perturbed nine times: four bit-shift operations plus five XOR operations.

A "perturbation" here can simply be understood as one of these mixing operations.

That’s how the Hash(key) method is implemented.

6. Why is the capacity of a HashMap always a power of 2?

The reason the HashMap capacity is always a power of 2 is in fact tied to the hash(key) algorithm described above.

The reason is that the index produced by hash(key) & (n-1) is well spread out only when the bits of (n-1) are all 1 as far as possible. If the current array length is 16, its binary representation is 10000, so n-1 is 01111, in which every bit is 1; the same holds for any power of 2 minus 1.

So the computed indexes can only be spread out reasonably well when the array length is a power of 2.
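A quick illustration of that mask (hypothetical lengths, printed in binary):

    public class PowerOfTwoMaskDemo {
        public static void main(String[] args) {
            // For a power-of-two length, (n - 1) has all of its low bits set to 1,
            // so every bit of the hash participates in choosing the bucket.
            System.out.println(Integer.toBinaryString(16 - 1)); // 1111
            // For a non-power-of-two length such as 15, some index bits can never be 1
            // (here the lowest bit), so some buckets would never be used.
            System.out.println(Integer.toBinaryString(15 - 1)); // 1110
        }
    }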

7. How to resolve Hash conflicts?

What is a Hash conflict? When the array index computed for a key already holds an element, that is a Hash conflict. Obviously, if the index-calculation algorithm is not good enough, stored data easily piles up at the same index, causing many Hash conflicts.

So how do you resolve hash conflicts?

The key to resolving conflicts is to make the array indexes computed from the stored keys as spread out as possible: optimize hash(key) as far as possible, and keep the array length a power of 2. These are the main means by which HashMap mitigates Hash conflicts.

For details, look at the underlying source code for key parts of the HashMap below:

The underlying implementation of hash(key) (from the JDK 1.7 source):

 /**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because HashMap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

The underlying implementation of the put(key, value) method (from the JDK 1.7 source):

    /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

8. How does HashMap scale?

We mentioned above that the initial size of the HashMap array is 16, but 16 buckets clearly will not always be enough. How does HashMap expand its capacity?

This is where a parameter called the "load factor" comes in. In HashMap, the load factor is 0.75.

As mentioned above, for an array with an initial length of 16, once the amount of data stored in it reaches 16 * 0.75 = 12 entries, the array is expanded to twice its original size, that is, from 16 to 32 buckets.

9. How to store the elements after capacity expansion?

We know that after the HashMap array grows, the newly added buckets start out empty. Should the existing elements simply stay where they are and the new buckets be left unused? Obviously not; that would waste a great deal of the array's space.

Therefore, after the HashMap array has been expanded, the elements stored in the original array are redistributed and stored into the new array, so that the full array space is put to use.
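Reusing the toy Node class from the put sketch in section 3, the redistribution could be sketched roughly like this (illustrative only; the real JDK code is considerably more involved):

    // Walk the old table and recompute each node's bucket against the new, larger table.
    static <K, V> void transfer(SimpleHashMap.Node<K, V>[] oldTable,
                                SimpleHashMap.Node<K, V>[] newTable) {
        for (SimpleHashMap.Node<K, V> head : oldTable) {
            for (SimpleHashMap.Node<K, V> e = head; e != null; ) {
                SimpleHashMap.Node<K, V> next = e.next;
                int index = (e.key.hashCode() & 0x7fffffff) % newTable.length; // simplified index calculation
                e.next = newTable[index];   // head insertion here just to keep the sketch short
                newTable[index] = e;        // (JDK 1.8 actually preserves order with tail insertion)
                e = next;
            }
        }
    }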

10. Comparison of JDK1.7 and 1.8 implementations of HashMap

The implementation of HashMap in JDK1.7 and 1.8 is slightly different, so let’s compare the differences between the implementation of HashMap in JDK1.7 and 1.8 based on the above explanation.

(1) Different underlying data structures

In the put process of HashMap, JDK 1.7 had no concept of a red-black tree; buckets were stored purely as linked lists. JDK 1.8 introduced the red-black tree to optimize storage and lookup.

(2) The linked list is inserted in different ways

When inserting elements into a linked list in a HashMap, JDK1.7 inserts them at the head node, and JDK1.8 inserts them at the tail node.

(3) The calculation method of Hash(key) is different

In the Hash(key) calculation, JDK1.7 performs nine perturbations (four bit operations and five XOR operations), while JDK1.8 only performs two perturbations.

(4) The calculation method of the storage location after capacity expansion is different

When redistributing the stored data after expansion, JDK 1.7 recomputed every element's position from scratch using hash(key). JDK 1.8 instead derives the new position from the original index while walking the old table: each element ends up either at its original index or at the original index plus the old capacity.
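A minimal sketch of that JDK 1.8 rule (oldCap is the table capacity before the resize; illustrative, not the JDK source):

    // After the capacity doubles, a node either stays at its old index or moves to
    // (old index + oldCap), decided by the single hash bit that the larger mask exposes.
    static int indexAfterResize(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }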

OK, that's it for the Map interface, the underlying implementation of HashMap, and the core interview questions and analysis around them!

Click Follow to be the first to learn about fresh Huawei Cloud technologies~