0x00 Preface

There is no doubt that HashMap is one of the most commonly used containers. But how well do you really know it? There are already many articles on the web summarizing HashMap; I wrote this one to record my own thoughts after reading the source. If there are any mistakes, please correct them. The following analysis is based on JDK 1.8.

0x01 Introduction in one sentence

Internally, a HashMap is an array of Node objects, each holding a key-value entry.

0x02 Overview

Let’s start with an overview of HashMap, based primarily on the comments in the HashMap source (you can skip to section 0x03 if you’re already familiar with it).

1. HashMap implements the Map interface and has all operations of Map. It has the following characteristics:

  • Allows null keys and values.
  • Broadly equivalent to Hashtable, except that Hashtable is thread-safe and HashMap is not. For this reason, HashMap clearly outperforms Hashtable.
  • Iteration order is not guaranteed, because the underlying principle is that positions are computed from the hash of the key. (Read on to see how this works.)

2. HashMap’s get and put run in constant time, assuming the hash values are spread uniformly. It is important not to set the initial capacity too high or the load factor too low.

3. Because it is not thread-safe, it can be wrapped with Collections.synchronizedMap to obtain a thread-safe Map.

4. It is fail-fast. In simple terms, if the map is found to have been structurally modified while iterating, the iterator throws a ConcurrentModificationException.
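As a quick illustration of points 3 and 4, here is a minimal sketch (the class name FailFastDemo is mine) that trips the fail-fast check by putting a new key while iterating. Wrapping the map with Collections.synchronizedMap would make it thread-safe but would not change this behavior:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        boolean threw = false;
        try {
            for (String k : map.keySet()) {
                // A structural modification mid-iteration bumps modCount,
                // so the iterator throws on its next call to next().
                map.put("c", 3);
            }
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println(threw); // prints true: the fail-fast exception fired
    }
}
```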

0x03 A few variables explained

initialCapacity in the constructor

This parameter is the initial length of the table. The default is 16.

Node<K,V>[] table

This is where the Map actually holds its elements; as you can see, it is an array of Nodes. Node is a simple linked-list node holding a key, a value, a hash field, and a next pointer.

float loadFactor

As the name suggests, the load factor. The default is 0.75, a tradeoff between space and time, reportedly arrived at through some fairly involved analysis.

int threshold

The threshold: the number of key-value pairs the Map can hold before resizing, calculated as array length * loadFactor. This should give you an idea of what happens if loadFactor is set badly. If it is set too small, the threshold is reached early and the map resizes often, leaving most slots empty and wasting space. If it is set too large, elements crowd together and queries become inefficient.

0x04 Method analysis

The constructor

HashMap has several constructors; let’s take a look at the important one.

```java
public HashMap(int initialCapacity, float loadFactor) {
    // A negative initial capacity is invalid.
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    // If it is greater than the maximum, clamp it to the maximum.
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
    this.loadFactor = loadFactor;
    // Use tableSizeFor to round the array length up.
    this.threshold = tableSizeFor(initialCapacity);
}

// Array length alignment: the table length must be a power of two.
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;  // 1
    n |= n >>> 2;  // 2
    n |= n >>> 4;  // 3
    n |= n >>> 8;  // 4
    n |= n >>> 16; // 5
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
```

A worked example of tableSizeFor:

Suppose cap is 13. Then n = 13 - 1 = 12, in binary: 0000 0000 0000 1100

```
Step 1: n |= n >>> 1
  0000 0000 0000 1100 >>> 1 = 0000 0000 0000 0110
  n = 0000 0000 0000 1110
Step 2: n |= n >>> 2
  0000 0000 0000 1110 >>> 2 = 0000 0000 0000 0011
  n = 0000 0000 0000 1111
Step 3: n |= n >>> 4
  0000 0000 0000 1111 >>> 4 = 0000 0000 0000 0000
  n = 0000 0000 0000 1111
  ...
```

From here on n stays 1111: every bit below the highest set bit is now 1. Adding 1 at the end gives 16 = 2^4. For those who don’t believe it, try a bigger example:

```
n = 0100 0110 0101 0110
Step 1: >>> 1 = 0010 0011 0010 1011; n |= gives 0110 0111 0111 1111
Step 2: >>> 2 = 0001 1001 1101 1111; n |= gives 0111 1111 1111 1111
Step 3: >>> 4 = 0000 0111 1111 1111; n |= gives 0111 1111 1111 1111
...
```

As you can see, the end result is the same: all bits below the leading 1 are set, and n + 1 is the next power of two. Binary has a lot of fun properties, and if you can take advantage of them, you can improve performance considerably.
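To check the walkthrough above, here is the same rounding logic lifted into a standalone class (the wrapper class is mine; the method body matches the JDK’s):

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Rounds cap up to the next power of two by smearing the highest set bit
    // into every lower position, then adding one.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(13)); // 16, as in the worked example
        System.out.println(tableSizeFor(16)); // 16: cap - 1 keeps exact powers in place
        System.out.println(tableSizeFor(17)); // 32
    }
}
```

The initial `cap - 1` is what keeps an exact power of two from being doubled unnecessarily.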
The Node class

A Node contains a hash, a key, a value, and a next pointer.

put

```java
public V put(K key, V value) {
    // ① Actually delegates to putVal. You might ask: doesn't the key already
    // have a hashCode method? Why compute our own hash? Hold that thought
    // for now; let's look at putVal first.
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // ③ If the table has not been initialized, or is empty, resize first.
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length; // n is the table length
    // ② Where does the key go? (n - 1) & hash locates the bucket.
    if ((p = tab[i = (n - 1) & hash]) == null)
        // Empty bucket: create a new node.
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // If the head node is the one being put, its value is replaced directly.
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // If the bucket is a red-black tree, call the tree's putTreeVal.
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // Otherwise it is a linked list: walk it node by node.
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    // Reached the end without finding the key: append a new node.
                    p.next = newNode(hash, key, value, null);
                    // If the list is now long enough, turn it into a red-black tree.
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // Found an existing node with this key: terminate the loop.
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // If a matching node was found, replace its value.
        if (e != null) {
            V oldValue = e.value;
            // Replace unless onlyIfAbsent forbids it, or the original value is null.
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    // Structural change: bump the modification count.
    ++modCount;
    // If the total number of entries exceeds the threshold, expand.
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
```

Let’s talk about ②. Why locate the bucket this way? n is the table length, always a power of two. In general, if a hash is bigger than the array length, how do we map it into the array? That’s easy: take the modulo, which yields something less than the length. The highlight: (n - 1) & hash is that modulo! The binary of n - 1 consists of high zeros followed by low ones, so ANDing the hash with it keeps only the low bits, giving a result of at most n - 1 — exactly the modulo. Rather than assert this without evidence, here’s an example:

```
n = 16, so n - 1 = 0000 0000 0000 1111
hash (arbitrary) = 1001 1101 1110 0110
(n - 1) & hash   = 0000 0000 0000 0110 = 6 < 16
```

The trick is very clever: it avoids the modulo operation entirely and greatly speeds up indexing.
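A small sketch (class name mine) confirming that for a power-of-two length the mask agrees with the modulo — for non-negative hashes; Java’s % behaves differently on negative operands, which is one more reason the mask is preferable:

```java
public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // table length, a power of two
        int[] hashes = {6, 22, 0x9DE6, Integer.MAX_VALUE};
        boolean allEqual = true;
        for (int h : hashes) {
            // Masking with n - 1 keeps only the low bits: same as h % n here.
            allEqual &= ((n - 1) & h) == (h % n);
        }
        System.out.println(allEqual);
    }
}
```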

This leads to the reason for ①. Take a look at the hash method.

```java
// The hash method.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```

Let’s talk about the hash method. You can see that if the key is null, 0 is returned; otherwise it calls key#hashCode and XORs the result with itself shifted right by 16 bits. From the analysis of ② we know the bucket index comes from ANDing the hash with n - 1 — so why compute a new hash at all? As the example in ② shows, positioning depends only on as many low bits as n - 1 has; the higher bits of the hash never participate. The problem is that this makes collisions much more likely and reduces randomness, since only a few low bits matter. The XOR folds the high 16 bits into the low 16, letting the high bits participate and hashing better. Here’s an example:

```
Two hashes that share all 16 low bits but differ in the high bits:
A = 1110 0010 0011 1101 1110 0000 0011 1101
B = 1110 0000 0011 1101 1110 0000 0011 1101

A >>> 16       = 0000 0000 0000 0000 1110 0010 0011 1101
A ^ (A >>> 16) = 1110 0010 0011 1101 0000 0010 0000 0000

B >>> 16       = 0000 0000 0000 0000 1110 0000 0011 1101
B ^ (B >>> 16) = 1110 0000 0011 1101 0000 0000 0000 0000
```

You can see that without this step the two elements would definitely land in the same bucket, but after folding in the high bits their low bits differ and they get separated (for a sufficiently large table).
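The same effect in runnable form (class and method names are mine; spread mirrors the JDK’s hash step). Two hashes agreeing on every low bit collide under a plain mask but diverge once the high bits are folded in:

```java
public class HashSpreadDemo {
    // The spreading step HashMap applies: XOR the high 16 bits into the low 16.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 1 << 16; // a power-of-two table length large enough to show the effect
        // Two hashes with identical low 16 bits but different high bits.
        int a = 0b1110_0010_0011_1101_1110_0000_0011_1101;
        int b = 0b1110_0000_0011_1101_1110_0000_0011_1101;
        // Without spreading the masks agree: a collision.
        System.out.println(((n - 1) & a) == ((n - 1) & b));
        // With spreading the high bits participate and the indexes differ.
        System.out.println(((n - 1) & spread(a)) == ((n - 1) & spread(b)));
    }
}
```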
resize

resize is used to expand the map. It corresponds to marker ③ in putVal above.

```java
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    // Get the old array length and threshold.
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // If already at the maximum, set the threshold to the maximum and return.
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // The threshold is doubled.
    }
    else if (oldThr > 0)
        // The threshold holds the initial capacity placed there by the constructor.
        newCap = oldThr;
    else {
        // If none of the above conditions is met, use the defaults.
        newCap = DEFAULT_INITIAL_CAPACITY;
        // The new threshold is the default load factor * the default array length.
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        // The new threshold is still 0: compute it, clamping to the valid range.
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // This is an expansion, not initialization, so elements must be migrated.
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                // e now references the bucket; clear the old slot for garbage collection.
                oldTab[j] = null;
                if (e.next == null)
                    // Only one element in the bucket: place it directly in the new array.
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    // ② If it is a red-black tree, use the split method to split it.
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else {
                    // ③ It is still a linked list: split it into a low and a high list.
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        // next is the element after e.
                        next = e.next;
                        // ③-1 The bit of e.hash at the old length's 1 position is 0.
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                // The low list is empty: e becomes its head.
                                loHead = e;
                            else
                                // Otherwise chain e after the low tail.
                                loTail.next = e;
                            loTail = e; // The tail points to the newest node.
                        }
                        else {
                            // Otherwise it goes to the high list; same operation.
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        // Place the low list at position j.
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        // Place the high list at position j + oldCap.
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
```

The part above is easy to understand; focus on ② and ③. Let me draw a picture. Say a map of length 4 places its elements like this (assume for the sake of the example that it can hold this many):

```
[0]
[1] -> (1 5 25 9 17 21 29 13 33)  // this list has been turned into a red-black tree,
                                  // so the tree needs to be taken apart (case ②)
[2] -> (2 -> 6 -> 10 -> 14)       // case ③: the list must be broken up and placed
                                  // into the new array
[3] -> (3)
```

Let me focus on the ③ case, continuing from the picture above.

```
Before (length 4):                After (length 8):
[0]                               [0]
[1]                               [1]
[2] -> (2 -> 6 -> 10 -> 14)  -->  [2] -> (2 -> 10)
[3]                               [3]
                                  [4]
                                  [5]
                                  [6] -> (6 -> 14)
                                  [7]

How does the migration decide? AND each hash with the old length, 4:

         2     6     10    14
binary   0010  0110  1010  1110
& 0100   0000  0100  0000  0100

For reference, the binary of the lengths 4 and 8:

         n = 4   n = 8
binary   0100    1000
n - 1    0011    0111
```

You can see that a power-of-two length has exactly one 1 bit in binary, with zeros everywhere else, so n - 1 is a run of low 1 bits. The new position is just hash & (2n - 1), which in binary terms admits one extra bit on the left. So the new index can differ from the old one only in that one extra bit. This is why ③-1 tests e.hash & oldCap: if the bit at the old length’s 1 position is 0, the element stays at index j; otherwise it moves to j + oldCap, because the capacity doubled, making the new index the old index plus the old array length.
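A sketch (class name mine) verifying the claim: recomputing the index from scratch in the doubled table gives exactly "old index, plus oldCap iff the e.hash & oldCap bit is set":

```java
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 4, newCap = 8;
        int[] hashes = {2, 6, 10, 14}; // the list from the example; hash == key here
        for (int h : hashes) {
            int oldIdx = h & (oldCap - 1);
            // The split rule resize uses: stay put, or move by exactly oldCap.
            int splitIdx = (h & oldCap) == 0 ? oldIdx : oldIdx + oldCap;
            // Recompute from scratch in the new table for comparison.
            int newIdx = h & (newCap - 1);
            System.out.println(splitIdx + " " + newIdx);
        }
    }
}
```

Both columns agree: 2 and 10 stay at index 2, while 6 and 14 move to index 2 + 4 = 6, which is why resize never needs to rehash anything.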

get

The get method mirrors put: put writes a mapping and get reads one. Internally it calls the getNode method.

remove

remove is also very similar to put; it just finds the matching node and unlinks it.

containsKey
```java
public boolean containsKey(Object key) {
    // getNode is the same internal method that get calls, so containsKey
    // simply asks whether getNode finds anything.
    return getNode(hash(key), key) != null;
}
```
size

Directly returns the size field, the number of key-value pairs.

clear

Iterates through the array, assigning null to each slot one by one.

containsValue

Iterates over the array and each bucket’s linked list, checking whether any value is equal.

Red-black tree related methods

These methods are new in Java 8. If a bucket’s list grows too long, traversing it becomes inefficient, so the map internally converts it into a red-black tree. This article does not explain the tree in detail; that part will be described in the TreeMap analysis.

0x05 Take a sip of water and sum up

This article has covered some of the core parts of HashMap, but not all of them. The thread-safety problems around resize are not covered, nor is the red-black tree code. Thread safety will be explained in a separate article later, and red-black trees are deferred to the TreeMap analysis. If anything in this article is wrong, please point it out; thank you very much.