HashMap problem collection

Storage and access principles

Let’s start with a few properties

size: the number of key-value pairs currently stored in the map

threshold: the resize threshold, which is calculated as capacity * loadFactor

loadFactor: the load factor; the default value is 0.75

loadFactor indicates how crowded the HashMap may become, and it affects the probability that a hash operation lands on an occupied array position. The default loadFactor is 0.75: once the HashMap holds 75% of its capacity, it is considered too crowded and needs to be expanded. You can customize loadFactor in the HashMap constructor.

capacity: the size of the table array; the default capacity is 16
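
As a quick, minimal sketch of how these fields relate (the threshold values in the comments assume the defaults described above; the rounding of 20 up to 32 is explained later in this article):

import java.util.HashMap;
import java.util.Map;

public class HashMapFieldsDemo {
    public static void main(String[] args) {
        // Default constructor: capacity = 16, loadFactor = 0.75,
        // so the resize threshold is 16 * 0.75 = 12
        Map<String, Integer> map = new HashMap<>();

        // Custom settings: a requested capacity of 20 is rounded up to the
        // next power of two (32), giving a threshold of 32 * 0.5 = 16
        Map<String, Integer> custom = new HashMap<>(20, 0.5f);

        // size counts key-value pairs; overwriting a key does not grow it
        map.put("a", 1);
        map.put("a", 2);                 // overwrite: size stays 1
        System.out.println(map.size()); // prints 1
    }
}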

The put process in JDK 1.8

Computes the hash() value of the key

A hash method is first called to compute the hash value of the key

Initialize the table array

Initializing the table array happens on the first put. At that point the code checks whether the table array is empty; if it is, the resize method is called directly to allocate the table array.

Calculates the key index

If the table is not empty, or has just been initialized in the previous step, the index for the key is computed as (n - 1) & hash, where hash is the value produced earlier by hash() and n is the length of the table array.

Determines whether the position of the index value is empty

The index computed above is then looked up in the table. If that position is empty, the entry is inserted directly. If it is not empty, the key at the current position is compared with the key being inserted; if they are equal, the original value is overwritten.

When the key in the current position is not equal to the newly inserted key

Determines whether the current node is a tree node or a regular linked list node.

If it is an ordinary linked-list node, the list is traversed to insert the node. If the list length is then greater than 8, the list is converted to a red-black tree and the node is inserted there; otherwise the node is inserted into the linked list.

If the node is a tree node, insert the node directly.

Determine whether size needs to be incremented

size is the number of key-value pairs in the container. When a brand-new key-value pair is inserted, size is incremented by 1; when a put only overwrites the value of an existing key-value pair, size is not incremented.

Determine whether capacity expansion is required

When the number of key-value pairs exceeds the threshold, the resize() method is called to expand the table.

Thread safety problems

In JDK 1.7, head insertion is used when adding values to the linked list, so operating on a HashMap from multiple threads can cause an infinite loop. The reason is that the order of the linked list is inverted after an expansion transfer, which changes the reference relationships between the nodes of the original list during the transfer.

In JDK 1.8, tail insertion avoids that particular problem, but HashMap is still not thread-safe: it can only guarantee consistency of the data within a single snapshot, and when multiple threads are involved you may read dirty data.

For example, suppose thread A computes an index for its key and thread B computes the same index, and the position is currently empty. Thread A's time slice runs out before it writes; thread B then inserts its data. When thread A resumes, it does not notice that the position has since been occupied, so it overwrites the value thread B put there, causing data inconsistency.
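
Here is a minimal sketch of this lost-update effect (illustrative only: the outcome varies from run to run, and concurrent resizing can even throw an exception):

import java.util.HashMap;
import java.util.Map;

public class UnsafePutDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();

        // Two threads insert disjoint key ranges into the same HashMap
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) map.put(i, i);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 10_000; i < 20_000; i++) map.put(i, i);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Without synchronization, writes to the same slot can overwrite
        // each other, so the printed size is usually less than 20000
        System.out.println(map.size());
    }
}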

How can hash collisions be reduced? Why must the capacity be a power of two?

  • The perturbation function reduces collisions. The principle: if two unequal objects return different hash codes, collisions become less likely, which means fewer linked-list structures and fewer calls to the equals() method, improving HashMap performance. (The perturbation is an algorithmic step inside the hash method that helps different objects return different hash codes.)
  • Using immutable objects declared final, with appropriate equals() and hashCode() methods, also reduces collisions. Immutability makes it possible to cache the hashCode of a key, which speeds up retrieval. Wrapper classes such as String and Integer are good choices as keys. Why? Because String is final and already overrides the equals() and hashCode() methods. Immutability is necessary because the hashCode must stay stable: if a key returns one hashCode when it is put in and a different one when it is looked up, the desired object can no longer be found in the HashMap.

Looking at the source code, we know the default capacity is 16. Why 16? To reduce the chance of collisions, keep the stored data uniformly distributed, and allow the index to be computed efficiently with bit operations.

As noted above, the index is computed from the array length minus 1 ANDed with the hash: (n - 1) & hash. Since the array length is required to be a power of two, the binary value after subtracting 1 ends in a run of 1s, and the AND with the hash keeps exactly the low bits of the hash. So as long as the hash value itself is well mixed, the index we get is well mixed too: bluntly put, the probability of collision drops and storage is more uniform.
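
A small sketch of this (the class name is just for illustration), showing that for power-of-two lengths the mask is equivalent to a modulo:

public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // table length, a power of two; n - 1 = 0b1111
        for (int h : new int[] {37, 1024, -7, 123456}) {
            int byMask = (n - 1) & h;        // what HashMap does
            int byMod = Math.floorMod(h, n); // sign-safe modulo
            System.out.printf("hash=%d  mask=%d  mod=%d%n", h, byMask, byMod);
        }
        // Both columns always match because masking with n - 1 keeps
        // exactly the low 4 bits, which is the value mod 16
    }
}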

Hash calculation rules

The hash value returned by the object's hashCode() method is shifted 16 bits to the right (unsigned) and XORed bitwise with the original value. The purpose is to mix the high 16 bits of the hash into the low 16 bits, so the returned value is sufficiently well distributed.

When computing the index in get and put, hashCode is first perturbed by hash(), and the index is then computed from the resulting hash value, as in the sketch below:
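
This is the shape of the perturbation in the JDK 1.8 source (HashMap.hash()):

// XOR the high 16 bits of hashCode() into the low 16 bits, so buckets
// selected by the low bits still depend on the high bits
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}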

HashMap expansion

In JDK 1.7, expansion occurs only if the number of entries has reached the threshold and the slot computed for the key-value pair currently being inserted is already occupied.

In JDK 1.8, if a list's length reaches 8 and the array length is at least 64, the list is turned into a red-black tree; if the array length is less than 64, the table is expanded instead.

How does HashMap handle an initial capacity that is not a power of two?

JDK1.7

In JDK 1.7, two methods work together to convert a requested capacity that is not a power of two into one that is.

First of all, the comment on the first method says it finds a power of two greater than or equal to toSize; that is, passing in 5 yields 8.

Stepping into that method shows that it calls another one, Integer.highestOneBit.

highestOneBit computes the largest power of two less than or equal to its argument i, so when i is 5 you get 4.

Combining the two methods: when number is 6, (number - 1) << 1 is 10:

0110 (6) - 1 = 0101; 0101 << 1 = 1010 = 10 (decimal)

Passing 10 into highestOneBit then yields the largest power of two less than or equal to 10, which is 8.

So if we pass in an initial capacity of 6, the HashMap optimizes it to 8 for us.
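
A self-contained sketch of that JDK 1.7 rounding logic (simplified: the MAXIMUM_CAPACITY clamp in the real source is omitted):

public class RoundUp17Demo {
    // Mirrors JDK 1.7's HashMap.roundUpToPowerOf2
    static int roundUpToPowerOf2(int number) {
        // (number - 1) << 1 makes highestOneBit land on the power of two
        // that is >= number, instead of the one <= number
        return (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 6, 8, 9}) {
            // prints 5 -> 8, 6 -> 8, 8 -> 8, 9 -> 16
            System.out.println(n + " -> " + roundUpToPowerOf2(n));
        }
    }
}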

JDK1.8

In JDK 1.8, HashMap simplifies the way a custom capacity is optimized.

The method, tableSizeFor, is similar to highestOneBit in JDK 1.7, and its first part is the same: a sequence of shifts and ORs turns every bit below the highest set bit into 1. JDK 1.8 then simply adds 1, producing a power of two, namely the smallest power of two greater than or equal to the requested capacity.
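
This is the shape of that method in the JDK 1.8 source (HashMap.tableSizeFor):

static final int MAXIMUM_CAPACITY = 1 << 30;

// Returns the smallest power of two >= cap
static final int tableSizeFor(int cap) {
    int n = cap - 1;   // subtract 1 so an exact power of two is not doubled
    n |= n >>> 1;      // copy the highest set bit downward...
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;     // ...until every lower bit is 1
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}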

Why red-black trees were introduced instead of binary search trees or AVL trees

Before JDK 1.8, HashMap was implemented as array + linked list. Even with a good hash function, it is difficult to achieve a 100% uniform distribution of elements. When a large number of elements land in the same bucket, the long linked list under that bucket makes the HashMap equivalent to a singly linked list: if the list has n elements, traversal takes O(n) time, and the HashMap completely loses its advantage. In response, JDK 1.8 introduced red-black trees (search time complexity O(log n)) to optimize this case.

So why not introduce binary search trees?

Red-black trees were chosen to solve the problem with binary search trees: in degenerate cases a binary search tree becomes a linear structure (just like the original linked list, with the same depth problem), and lookups become very slow. After new data is inserted, a red-black tree may need left rotations, right rotations, and recoloring to maintain balance. The tree is introduced to make lookups fast and to solve the linked list's query-depth problem. A red-black tree is an approximately balanced binary tree, and maintaining that balance has a cost, but the cost is lower than traversing a linear list. So when the list length is greater than 8 a red-black tree is used; when the list is short there is no need for a red-black tree at all, and introducing one would actually be slower.

If a balanced binary tree is wanted, why is a red-black tree better than an AVL tree?

Red-black trees and AVL trees are the most commonly used balanced binary search trees; their search, deletion, and modification operations all run in O(log n) time.

AVL trees and red-black trees compare and differ in several ways: (1) AVL trees are more strictly balanced, so they can provide faster lookups; read-heavy, search-intensive workloads suit AVL trees. (2) Red-black trees are better suited to insertion- and modification-intensive workloads. (3) In general, rotations in an AVL tree are harder to balance and debug than rotations in a red-black tree.

Conclusion: (1) AVL trees and red-black trees are both highly balanced tree data structures. They are very similar; the real difference is the number of rotations performed on any add/remove operation. (2) Both scale as O(log n), where n is the number of nodes, but in practice AVL trees are faster at search-intensive tasks: with better balance, tree traversals are shorter on average. AVL trees, on the other hand, are slower for insertion and deletion: more rotations are required to properly rebalance the structure when it is modified. (3) In an AVL tree, the heights of the two subtrees of any node differ by at most 1; in a red-black tree, the longest root-to-leaf path can be up to twice the shortest. (4) Both give O(log n) lookups, but rebalancing an AVL tree may require O(log n) rotations, whereas a red-black tree needs at most two rotations to restore balance (although O(log n) nodes may need to be examined to determine where to rotate). A rotation itself is an O(1) operation, because it just moves pointers.

Talk about red-black trees

  1. Each node is either red or black
  2. The root node is always black
  3. If a node is red, its children must be black (not necessarily vice versa)
  4. Every leaf node is a black NIL node
  5. Each path from root node to leaf node or empty child node must contain the same number of black nodes (that is, the same black height)

Why not use the hash value of hashCode() as the subscript of the table?

The hashCode() method returns an int, ranging from -2^31 to 2^31 - 1, about 4 billion possible values. The table of a HashMap, however, ranges from 16 (the default) up to 2^30 slots. It is impractical to provide that much storage on a device, and the value computed by hashCode() may fall outside the array's size range, so it cannot be used directly to locate a slot.

Interviewer: How do you solve it?

  1. HashMap implements its own hash() method: with two perturbation steps (an unsigned shift and an XOR), the high and low bits of the hash value are mixed together, which reduces the probability of hash collisions and makes the data distribution more even.
  2. The array length is guaranteed to be a power of two, and the value produced by hash() is ANDed with (array length - 1) to obtain the array index. This is more efficient than a modulo operation; moreover, h & (length - 1) is equivalent to h % length precisely when the length is a power of two; and it solves the problem that the raw hash value does not match the array's size range.

Fail-fast, a fast failure mechanism for Java collections?

Fail-fast is a Java collection error detection mechanism that can occur when multiple threads make structural changes to a collection.

For example: suppose there are two threads (thread 1 and thread 2). Thread 1 iterates over the elements of collection A with an Iterator, and at some point thread 2 changes the structure of collection A (a structural change, not simply a change to an element's contents). The program then throws a ConcurrentModificationException, which is the fail-fast mechanism at work.

Reason: the iterator accesses the collection's contents directly during traversal and uses a modCount variable while doing so. If the collection's structure changes during traversal, the value of modCount changes. Whenever the iterator advances to the next element via hasNext()/next(), it checks whether modCount equals the expected expectedModCount value: if so, traversal continues; otherwise, an exception is thrown and traversal terminates.
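
A minimal sketch that triggers the mechanism by removing from a HashMap while iterating over it:

import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        // Structural modification during iteration: the next call to
        // next() finds modCount != expectedModCount and throws
        // ConcurrentModificationException
        for (String key : map.keySet()) {
            map.remove(key);
        }
    }
}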

Solutions:

1. Add synchronized to all parts of the traversal that involve changing modCount.

2. Replace ArrayList with CopyOnWriteArrayList

ConcurrentHashMap?

ConcurrentHashMap differs from HashMap in that neither key nor value is allowed to be null.

Implementation in JDK 1.7:

In JDK 1.7, ConcurrentHashMap consists of the Segment data structure and the HashEntry array structure. A Segment is a ReentrantLock that acts as a lock in ConcurrentHashMap, while HashEntry stores the key-value pair data. A ConcurrentHashMap contains an array of Segments, each Segment contains a HashEntry array, and the HashEntry array itself is organized as array + linked list.

ConcurrentHashMap does not lock both put and get operations the way Hashtable does. Instead, it is divided into Segments: a thread accessing one Segment does not affect the other Segments.

A ConcurrentHashMap can support as many concurrently writing threads as it has Segments, provided those threads access different Segments.

In the source we can see that hash and key in HashEntry are declared final. This guarantees part of ConcurrentHashMap's immutability: nodes cannot be added or removed from the middle or tail of a hash chain, because that would require modifying a next reference. Consequently, all structural modifications can only start from the head, and the put operation adds new entries at the head of the hash chain.

value and next are declared volatile to guarantee visibility, so that reading threads always see the latest values; this is an important reason why ConcurrentHashMap's read operations do not need to lock.
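
A sketch of the HashEntry shape just described, with the JDK 1.7-style field modifiers:

// hash and key are final (immutable); value and next are volatile so that
// reading threads always see the latest values without locking
final class HashEntry<K,V> {
    final int hash;
    final K key;
    volatile V value;
    volatile HashEntry<K,V> next;

    HashEntry(int hash, K key, V value, HashEntry<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}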

The put operation

In fact, a put on ConcurrentHashMap is delegated to a specific Segment. When we put a key/value pair into the map, the key's hashCode is obtained and hashed again, and the final hash value determines the Segment into which the record should be inserted.

Then, once the Segment is found, its put operation first attempts to acquire the lock. If it cannot, scanAndLockForPut() is called to spin; when the number of spins reaches MAX_SCAN_RETRIES, the put blocks until the lock is obtained.
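
A simplified sketch of that delegation, modeled on the JDK 1.7 source (segmentShift, segmentMask, and the lazy Segment creation in the real code are abbreviated here):

// JDK 1.7-style put: the high bits of the re-hashed key pick the Segment,
// and the Segment performs the actual locked insert
public V put(K key, V value) {
    if (value == null) throw new NullPointerException();
    int hash = hash(key);                          // re-hash the hashCode
    int j = (hash >>> segmentShift) & segmentMask; // index into segments[]
    Segment<K,V> s = ensureSegment(j);             // create lazily if absent
    return s.put(key, hash, value, false);         // locked insert in Segment
}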


Implementation in JDK 1.8:

The JDK 1.8 implementation has discarded the Segment concept and is implemented with Node arrays + linked lists + red-black trees. Concurrency control uses synchronized and CAS, so the whole thing looks like an optimized, thread-safe HashMap. Although the Segment data structure can still be seen in JDK 1.8, its attributes have been simplified, and it exists only for compatibility with older versions.

The put operation

final V putVal(K key, V value, boolean onlyIfAbsent) {
    // Neither key nor value may be null
    if (key == null || value == null) throw new NullPointerException();
    // Compute the hash (spread mixes the high bits into the low bits)
    int hash = spread(key.hashCode());
    int binCount = 0;
    for (Node<K,V>[] tab = table;;) {
        Node<K,V> f; int n, i, fh;
        // If the table array is empty, initialize it (spin + CAS)
        if (tab == null || (n = tab.length) == 0)
            tab = initTable();
        else if ((f = tabAt(tab, i = (n - 1) & hash)) == null) {
            // If the bucket is empty, CAS a new node into the bucket
            if (casTabAt(tab, i, null,
                         new Node<K,V>(hash, key, value, null)))
                break;                   // no lock when adding to empty bin
        }
        else if ((fh = f.hash) == MOVED)
            // The table is being resized; help with the data transfer
            tab = helpTransfer(tab, f);
        else {
            V oldVal = null;
            // Lock only the head node of this bucket
            synchronized (f) {
                if (tabAt(tab, i) == f) {
                    if (fh >= 0) {
                        // Linked-list bin: overwrite an equal key or append
                        binCount = 1;
                        for (Node<K,V> e = f;; ++binCount) {
                            K ek;
                            if (e.hash == hash &&
                                ((ek = e.key) == key ||
                                 (ek != null && key.equals(ek)))) {
                                oldVal = e.val;
                                if (!onlyIfAbsent)
                                    e.val = value;
                                break;
                            }
                            Node<K,V> pred = e;
                            if ((e = e.next) == null) {
                                pred.next = new Node<K,V>(hash, key,
                                                          value, null);
                                break;
                            }
                        }
                    }
                    else if (f instanceof TreeBin) {
                        // Red-black tree bin: insert into the tree
                        Node<K,V> p;
                        binCount = 2;
                        if ((p = ((TreeBin<K,V>)f).putTreeVal(hash, key,
                                                              value)) != null) {
                            oldVal = p.val;
                            if (!onlyIfAbsent)
                                p.val = value;
                        }
                    }
                }
            }
            if (binCount != 0) {
                // Treeify the bin if it has grown past the threshold
                if (binCount >= TREEIFY_THRESHOLD)
                    treeifyBin(tab, i);
                if (oldVal != null)
                    return oldVal;
                break;
            }
        }
    }
    addCount(1L, binCount);
    return null;
}
  1. First compute the hash based on the key's hashCode.
  2. Check whether the table needs to be initialized.
  3. Locate the bucket (Node) for the current key; if it is empty, the new data can be written to that position, and CAS is used to attempt the write.
  4. If the current position's hash == MOVED == -1, the table is being expanded, so help with the transfer.
  5. If none of the above applies, the data is written under a synchronized lock.
  6. If the bin count reaches TREEIFY_THRESHOLD, the bin is converted to a red-black tree.

Get operation

public V get(Object key) {
    Node<K,V>[] tab; Node<K,V> e, p; int n, eh; K ek;
    // Compute the hash, then the bucket position
    int h = spread(key.hashCode());
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (e = tabAt(tab, (n - 1) & h)) != null) {
        // If the head node is the target node, return its value directly
        if ((eh = e.hash) == h) {
            if ((ek = e.key) == key || (ek != null && key.equals(ek)))
                return e.val;
        }
        // A negative hash on the head node means the bin is being moved
        // during expansion or is a red-black tree; delegate to find
        else if (eh < 0)
            return (p = e.find(h, key)) != null ? p.val : null;
        // Otherwise it is a linked list; traverse and search
        while ((e = e.next) != null) {
            if (e.hash == h &&
                ((ek = e.key) == key || (ek != null && key.equals(ek))))
                return e.val;
        }
    }
    return null;
}
  1. Compute the position from the hash value.
  2. Check that position; if the head node is the target, return its value.
  3. If the head node's hash value is negative, the bin is being expanded or is a red-black tree, so delegate to find.
  4. Otherwise it is a linked list; traverse it to find the key.

Conclusion:

In Java 7, ConcurrentHashMap uses a segmented lock, meaning that at most one thread at a time can operate on each Segment. Each Segment is a HashMap-like array structure that can be expanded, with conflicts stored in linked lists. However, the number of Segments cannot be changed once initialized.

In Java 8, ConcurrentHashMap uses synchronized locking plus the CAS mechanism. The structure also evolved from Java 7's Segment array + HashEntry array + linked list to Node array + linked list/red-black tree. Node is a HashEntry-like structure: when conflicts in a bin reach a certain size it is transformed into a red-black tree, and when the number of conflicts falls below a certain threshold it reverts to a linked list.