preface

A HashMap is a data structure that accesses storage locations directly based on the key. That is, it locates records by computing a function of the key that maps it to a position in a table, which speeds up lookups. This mapping function is called a hash function, and the array of records is called a hash table.

A HashMap design

A HashMap should contain the following functionality:

  1. put(key, value): Inserts (key, value) pairs into a hash map. If the value for the key already exists, update it.
  2. get(key): Returns the value of the given key, or null if the key is not included in the map.
  3. remove(key): If the key exists in the mapping, delete the key-value pair.

There are two key issues in implementing a HashMap: the hash function and collision handling.

  • Hash function: if the key is k, its value is stored at position f(k), so the value can be located directly from the key without comparisons.
  • Collision handling: if k1 ≠ k2 but f(k1) = f(k2), that is, different keys are mapped to the same hash address, the phenomenon is called a collision.

If, for every key in the key set, the probability of being mapped to any address in the address range is equal, the hash function is called a uniform hash function; it spreads keys randomly over the address range and thus reduces collisions.

Construct the hash function

A hash function makes access to a collection of data faster and more efficient: data elements can be located directly through it.

Several common methods:

  1. Direct addressing: hash(k) = a*k + b, where a and b are constants.
  2. Random number method
  3. Mid-square method: the middle bits of the square of the key are taken as the hash address.
  4. Divisor remainder method: the hash address is obtained by taking the key modulo a number p no larger than the hash table length m, namely
    hash(k) = k % p, (p ≤ m)
    . The modulus can be taken of the key directly, or after first applying the random number or mid-square method. The choice of p is very important: generally a prime (or m itself) is chosen, and a poor choice of p easily causes collisions. For why a prime is the better choice, see this article. A minimal sketch of this method follows this list.
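
A minimal sketch of the divisor remainder method, assuming non-negative integer keys; the table length m = 12 and prime p = 11 are arbitrary illustrative choices:

    // A minimal sketch of the divisor remainder method.
    // M (table length) and P (largest prime not exceeding M) are illustrative values.
    class DivisionHash {
        static final int M = 12; // hash table length
        static final int P = 11; // largest prime not exceeding M

        static int hash(int key) {
            return key % P;      // hash address in [0, P)
        }

        public static void main(String[] args) {
            System.out.println(hash(80)); // 3
            System.out.println(hash(11)); // 0
        }
    }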

Handling collisions

When a collision occurs, the key must still be stored and found at a distinct address, so either another hash function has to be chosen or the collision has to be resolved. Since the chance of never having a collision is very small, collisions are usually handled. The common methods are as follows:

  • Open addressing:
    hash_i = (hash(key) + d_i) % m, i = 1, 2, …, k (k ≤ m - 1)
    , where hash(key) is the hash function, m is the hash table length, d_i is the increment sequence, and i is the number of collisions that have occurred so far. The increment sequence can be chosen as follows:

  1. d_i = 1, 2, 3, …, (m - 1), referred to as linear probing; more generally
    d_i = a*i + b
    , where a and b are constants and a ≠ 0. With a = 1, when a collision occurs the table is probed cell by cell from the original address until an empty cell is found, and the record is stored in that empty cell.

  2. d_i = 1², 2², 3², …, k² (k ≤ m/2), referred to as quadratic probing. Compared with linear probing, the i-th probe after a collision is made at an offset of d_i = i²; if that cell is empty, the record is stored there.

  3. d_i = a sequence of pseudo-random numbers, referred to as pseudo-random probing.

The following figure shows linear probing filling a hash table:

The keys are {80, 11, 40}, the linear probing increment is d_i = i, and the hash function is the key modulo 10.

When 40 is inserted, address 0 is already occupied by 80, so take i = 1 and probe address 1, which is already occupied by 11; then take i = 2 and probe address 2, which is empty, so 40 is placed in the empty cell at address 2.
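
A minimal sketch of this insertion sequence (the class name and the array-based table below are just for illustration):

    // Minimal linear-probing sketch for the example above: table length 10, hash(k) = k % 10.
    class LinearProbingDemo {
        static Integer[] table = new Integer[10];

        static void insert(int key) {
            int base = key % 10;
            for (int i = 0; i < table.length; i++) {    // d_i = i, probe at most m cells
                int addr = (base + i) % table.length;
                if (table[addr] == null) {              // empty cell found
                    table[addr] = key;
                    return;
                }
            }
            throw new IllegalStateException("hash table is full");
        }

        public static void main(String[] args) {
            for (int k : new int[]{80, 11, 40}) insert(k);
            // 80 -> address 0, 11 -> address 1, 40 collides at 0 and 1 and lands at address 2
            System.out.println(java.util.Arrays.toString(table));
        }
    }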

The size of the table is crucial: choosing 10 is more likely to cause collisions than choosing a prime such as 11, because a prime modulus tends to spread the results more evenly across the table.

As the figure suggests, if the hash function distributes keys unevenly so that they form blocks in the table, both linear probing and quadratic probing suffer from clustering: keys hashed into a block need multiple probes to find an empty cell to insert into, which wastes time.

  • Separate chaining: keys that hash to the same address are linked together in a linked list; the lists are independent of each other.
  • Double hashing: use two hash functions to compute the address and choose the one with fewer collisions.
  • Rehashing: hash_i = hash_i(key), i = 1, 2, …, k, where each hash_i is a different hash function. When the address produced by one hash function collides, the next hash function is used to compute a new address, until no collision occurs. This method does not cluster easily, but it increases computation time.

Implementing a HashMap with separate chaining

Here we use Java's LinkedList to implement the chains (buckets):

    // These nested classes live inside an enclosing class (not shown) and require
    // java.util.Iterator, java.util.LinkedList and java.util.List to be imported.
    public static class MyHashMap<K, V> {
        private static final int DEFAULT_BUCKET_SIZE = 10; // The default number of buckets is 10

        private Bucket<K, V>[] buckets;

        public MyHashMap() {
            buckets = new Bucket[DEFAULT_BUCKET_SIZE];
            for (int i = 0; i < buckets.length; i++) {
                buckets[i] = new Bucket<>();
            }
        }

        public void put(K key, V value) {
            // floorMod keeps the index non-negative even for negative hashCodes
            int hashKey = Math.floorMod(key.hashCode(), buckets.length);
            this.buckets[hashKey].put(key, value);
        }

        public V get(K key) {
            int hashKey = Math.floorMod(key.hashCode(), buckets.length);
            return buckets[hashKey].get(key);
        }

        public void remove(K key) {
            int hashKey = Math.floorMod(key.hashCode(), buckets.length);
            buckets[hashKey].remove(key);
        }
    }

    private static class Pair<K, V> {
        public K key;
        public V value;

        public Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private static class Bucket<K, V> {
        private List<Pair<K, V>> bucket;

        public Bucket() {
            bucket = new LinkedList<>();
        }

        public V get(K key) {
            for (Pair<K, V> pair : bucket) {
                if (pair.key.equals(key)) {
                    return pair.value;
                }
            }
            return null;
        }

        public void put(K key, V value) {
            boolean found = false;
            for (Pair<K, V> pair : bucket) {
                if (pair.key.equals(key)) {
                    pair.value = value;
                    found = true;
                }
            }
            if (!found) {
                this.bucket.add(new Pair<>(key, value));
            }
        }

        public void remove(K key) {
            for (Iterator<Pair<K, V>> it = this.bucket.iterator(); it.hasNext(); ) {
                if (it.next().key.equals(key)) {
                    it.remove();
                    break;
                }
            }
        }
    }

    /** Test */
    public void test() {
        MyHashMap<Integer, String> hashMap = new MyHashMap<>();
        hashMap.put(80, "A");
        hashMap.put(11, "B");
        hashMap.put(40, "D");
        hashMap.put(25, "C");
        System.out.println(hashMap.get(40)); // prints "D"
    }

Complexity analysis

  • Time complexity: O(N/K), where N is the number of possible values (keys) and K is the predefined number of hash addresses.

    • Assuming the values are evenly distributed, the average length of each linked list is N/K.
    • For each operation, the worst case is traversing an entire list, so the time complexity is O(N/K).
  • Space complexity: O(K + M), where K is the predefined number of hash addresses and M is the number of values that have been inserted into the HashMap.

Our simple HashMap is complete. Let's analyze its problems:

  1. Query, insert and delete on a singly linked list are all O(N), so the efficiency is not high.
  2. The number of hash addresses is fixed at 10, so the linked lists grow longer as more elements are inserted. The time complexity is O(N/k) with k fixed at 10 while N keeps increasing, so the efficiency of the structure keeps dropping.

The first problem can be solved with a more efficient data structure, such as a red-black tree, which reduces the time complexity from O(N) to O(log N).

The second question: since the time complexity is O(N/K), should we simply choose a larger K? Increasing K does reduce the time complexity, but the space complexity is O(K + M), so increasing K also increases the space used.

Taking both into account, N and K should be kept in a certain proportion: K should not be so small relative to N that the time complexity becomes too high, nor so large relative to N that space is wasted. This keeps the time complexity relatively stable.

The load factor

The load factor is defined as: α = number of elements stored in the table / length of the hash table

You can see that the load factor is in fact N/K.

α indicates how full the hash table is. Since the table length is a constant, the larger α is, the more elements have been inserted and the more likely collisions become; conversely, the smaller α is, the fewer elements have been inserted and the less likely collisions are. Therefore, when space is plentiful and time matters more, a smaller α can be chosen, and vice versa.

For open addressing, the load factor is particularly important and should be kept strictly below roughly 0.7-0.8; above 0.8, CPU cache misses rise sharply. Some hash table implementations therefore cap the load factor; the Java standard library, for example, uses 0.75 and resizes the hash table once it is exceeded.
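
As a worked example with Java's defaults: with an initial capacity of 16 and a load factor of 0.75, the resize threshold is 16 × 0.75 = 12, so the table is grown (to 32, with a new threshold of 24) when the 13th entry is inserted.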

Java8 HashMap

The Java 8 HashMap is the hash table implementation in the JDK. It includes many optimizations, so let's see how it addresses the problems mentioned above.

Bucket structure

Let’s take a look at the data structure stored in the hash address:

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

        Node(int hash, K key, V value, Node<K,V> next) { ... }
        public final K getKey()               { ... }
        public final V getValue()             { ... }
        public final String toString()        { ... }
        public final int hashCode()           { ... }
        public final V setValue(V newValue)   { ... }
        public final boolean equals(Object o) { ... }
    }

The member variable table is an array of Node key-value entries, so the hash table stores Node objects. As the comment says, the array is resized as necessary and its length is always a power of two. Given that the divisor remainder method above suggests a prime table length, this is an unconventional design. The reasons are to optimize the modulo computation and resizing and to reduce collisions: when HashMap computes the index into the table, the high bits of the hash are also mixed into the calculation.

put(key, value)

Since the hash table stores Node objects, and a node could be either a list node or a tree node, which one does Java actually use?

    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;            // initialization
        if ((p = tab[i = (n - 1) & hash]) == null)  // 1
            // If the hash address is empty, insert directly
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;  // The first node in this bucket already has the same key
            else if (p instanceof TreeNode)
                // If the node is a red-black tree node, insert into the red-black tree
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        // Reached the end of the list, insert
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            // If the list length >= TREEIFY_THRESHOLD (8), convert the list to a red-black tree
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;  // p moves to the next node
                }
            }
            if (e != null) { // A node with this key already exists
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null) // onlyIfAbsent controls whether an existing value is overwritten
                    e.value = value;  // update the value
                afterNodeAccess(e);   // template method, implemented by subclasses
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            // If the number of stored elements exceeds threshold, resize
            resize();
        afterNodeInsertion(evict);  // template method
        return null;
    }

As you can see, a bucket is not fixed to one data structure: the linked list is converted to a red-black tree when the number of nodes reaches 8. With fewer than 8 nodes, the average number of comparisons in the list is less than 8/2 = 4, while a red-black tree search takes at most about log2(8) = 3, so converting only becomes worthwhile once the list length reaches 8. Even then the tree's average search cost is only slightly lower than the list's, and converting to a tree structure also takes time. Let's look at the analysis in the JDK:

    * Because TreeNodes are about twice the size of regular nodes, we
    * use them only when bins contain enough nodes to warrant use
    * (see TREEIFY_THRESHOLD). And when they become too small (due to
    * removal or resizing) they are converted back to plain bins. In
    * usages with well-distributed user hashCodes, tree bins are
    * rarely used. Ideally, under random hashCodes, the frequency of
    * nodes in bins follows a Poisson distribution
    * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
    * parameter of about 0.5 on average for the default resizing
    * threshold of 0.75, although with a large variance because of
    * resizing granularity. Ignoring variance, the expected
    * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
    * factorial(k)). The first values are:
    *
    * 0:    0.60653066
    * 1:    0.30326533
    * 2:    0.07581633
    * 3:    0.01263606
    * 4:    0.00157952
    * 5:    0.00015795
    * 6:    0.00001316
    * 7:    0.00000094
    * 8:    0.00000006
    * more: less than 1 in ten million

The JDK also takes into account that tree nodes occupy about twice as much space as regular list nodes, so red-black trees are used to speed up queries only when the list length reaches a certain value, and they are converted back to a list when the length drops below another value. If the hash function is well designed, list lengths follow a Poisson distribution; as the data shows, the probability of a list of length 8 in a HashMap is only 0.00000006. The Poisson distribution function is shown below:

The horizontal axis is k, the number of occurrences; the vertical axis is the probability that X equals k.

So the probability that a list in a HashMap grows beyond length 8 is tiny, which means the probability of an operation costing more than O(8) is very low. Since constant factors are ignored in complexity analysis, we can say that each HashMap operation is effectively O(1).
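
As a quick sanity check of the figures quoted above, here is a throwaway calculation (not JDK code) that evaluates exp(-0.5) * 0.5^k / k! for k = 0..8:

    // Reproduces the probabilities quoted in the JDK comment:
    // P(bin size = k) = exp(-0.5) * 0.5^k / k! for a Poisson(0.5) distribution.
    class PoissonCheck {
        public static void main(String[] args) {
            double factorial = 1.0;
            for (int k = 0; k <= 8; k++) {
                if (k > 0) factorial *= k;
                double p = Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
                System.out.printf("%d: %.8f%n", k, p); // k = 8 prints about 0.00000006
            }
        }
    }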

Look again at comment 1 in putVal: p = tab[i = (n - 1) & hash]. Here (n - 1) & hash is equivalent to hash % n, because tab.length (n) is always a power of two. Viewed in binary, a power of two minus one is a mask of all the bits below its highest bit, so the result of the AND is always less than n and equals the remainder.

For example, 8 = 0b1000 and 8 - 1 = 0b0111; since all higher bits of the mask are 0, hash & (8 - 1) is at most 7 and is always a valid index. For the CPU, a modulo operation is much more expensive than a bitwise AND.
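
A small check of this equivalence; Math.floorMod is used below so that negative hash codes give the same non-negative result as the bit mask:

    // For a power-of-two table length n, (n - 1) & hash equals the non-negative remainder of hash mod n.
    class MaskVsMod {
        public static void main(String[] args) {
            int n = 16; // must be a power of two
            for (int hash : new int[]{80, 11, 40, -7, Integer.MIN_VALUE}) {
                int byMask = (n - 1) & hash;
                int byMod  = Math.floorMod(hash, n);
                System.out.println(hash + ": " + byMask + " == " + byMod); // always equal
            }
        }
    }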

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The hash function XORs the high 16 bits of hashCode() into the low 16 bits, so that both the high and the low bits take part in the hash. This increases randomness, makes the distribution more uniform, effectively reduces collisions, and is also cheap to compute.

The reason this increases randomness is that, as mentioned above, the table index is obtained by (n - 1) & hash. When n is small, the high bits of n - 1 are all 0, as shown in the figure below: only the low 4 bits are effective in the AND, so the high bits of the hash would not take part in the computation at all, making the results relatively concentrated.
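
A toy illustration of this effect; the two hash codes below are made up so that they differ only in their high 16 bits:

    // Two made-up hash codes that differ only in their high 16 bits.
    // Without spreading, both land in the same bucket of a 16-slot table;
    // after XOR-ing in the high bits (as HashMap.hash does), they separate.
    class SpreadDemo {
        static int spread(int h) {
            return h ^ (h >>> 16);
        }

        public static void main(String[] args) {
            int n = 16;
            int h1 = 0x0001_0005;
            int h2 = 0x0002_0005;
            System.out.println(((n - 1) & h1) + " vs " + ((n - 1) & h2));                 // 5 vs 5: collision
            System.out.println(((n - 1) & spread(h1)) + " vs " + ((n - 1) & spread(h2))); // 4 vs 7: spread apart
        }
    }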

Why use ^ rather than | or &?

Let's start with the truth tables for AND, OR, and XOR:

From the truth tables you can see that & outputs 0 with 75% probability, | outputs 0 with only 25% probability, while ^ outputs 0 and 1 with 50% probability each. As mentioned above, the more uniform and random the hash result, the fewer the collisions, so ^ is chosen here.

resize()

As seen in the put method, resize is performed when the number of stored entries exceeds the threshold.

Before looking at the hashing and resizing process, let's take a look at a few fields of HashMap.

    // Default initial capacity
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
    // Default load factor
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    // Threshold that triggers resizing: capacity x load factor
    int threshold;
    // Load factor
    final float loadFactor;
    // The number of times the hash table structure has changed, mainly used for fail-fast iteration
    int modCount;

The load factor has already been discussed above and is not repeated here. Note that modCount is not incremented when the structure of the hash table does not change; for example, in put, if an existing key is found and only its value is updated, the method returns without incrementing modCount.

    @Override
    public void forEach(BiConsumer<? super K, ? super V> action) {
        Node<K,V>[] tab;
        if (action == null)
            throw new NullPointerException();
        if (size > 0 && (tab = table) != null) {
            int mc = modCount; // Cache modCount at the start of the traversal
            for (Node<K,V> e : tab) {
                for (; e != null; e = e.next)
                    action.accept(e.key, e.value);
            }
            if (modCount != mc)
                // If another thread changed the structure of the HashMap, modCount was incremented,
                // which triggers a ConcurrentModificationException
                throw new ConcurrentModificationException();
        }
    }

You can see that if another thread adds or removes entries from the HashMap while it is being traversed, a ConcurrentModificationException is thrown.

    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length; // Length of the hash table before resizing
        int oldThr = threshold;
        int newCap, newThr = 0;
        if (oldCap > 0) {
            if (oldCap >= MAXIMUM_CAPACITY) { // If the capacity has reached MAXIMUM_CAPACITY = 1 << 30, set threshold to Integer.MAX_VALUE = (1 << 31) - 1
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // Double the capacity
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1;  // double threshold
        }
        else if (oldThr > 0) // First put after the HashMap was created with a parameterized constructor
            newCap = oldThr;  // 0 Set the new bucket capacity
        else {               // First put after the HashMap was created with the no-argument constructor
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        threshold = newThr;
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {
            // Go through all old buckets and redistribute their entries into the new buckets
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    if (e.next == null)  // The bucket has only one node; place it directly in the new bucket
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        // Red-black tree handling
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // A list with more than one node
                        Node<K,V> loHead = null, loTail = null; // Head and tail of the list that keeps the original index
                        Node<K,V> hiHead = null, hiTail = null; // Head and tail of the list that moves to the new index
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) { // 1 Original index
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {  // 2 New index
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;  // The original index stays in place
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead; // The new index is old index + oldCap
                        }
                    }
                }
            }
        }
        return newTab;
    }

The comments in the code are already quite detailed; pay particular attention to the numbered ones.

Note 0: newCap equals the threshold that was initialized when the HashMap was created with a parameterized constructor, and the table length must be a power of two. Take a look at the source:

    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;  // 1
        n |= n >>> 2;  // 2
        n |= n >>> 4;  // 3
        n |= n >>> 8;  // 4
        n |= n >>> 16; // 5
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }


You can see that threshold equals the value returned by tableSizeFor(cap).

Take an arbitrary number such as 8250 as an example: after the five shift-and-OR steps, every bit below the highest set bit of n = cap - 1 becomes 1, and the final + 1 turns the result into the next power of two (16384 for 8250). The initial n = cap - 1 handles the case where cap is already a power of two; without the - 1, the method would return twice cap.
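
A quick way to check this behaviour is to copy the tableSizeFor logic above into a small standalone class:

    // Standalone copy of the tableSizeFor logic, just to observe its output.
    class TableSizeForDemo {
        static final int MAXIMUM_CAPACITY = 1 << 30;

        static int tableSizeFor(int cap) {
            int n = cap - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        }

        public static void main(String[] args) {
            System.out.println(tableSizeFor(8250)); // 16384, the next power of two
            System.out.println(tableSizeFor(16));   // 16, thanks to the initial cap - 1
            System.out.println(tableSizeFor(17));   // 32
        }
    }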

Optimization in resize

Notes 1 and 2: this part is cleverly designed to optimize resize performance.

Since oldCap is a power of two, only one bit of its binary representation is 1, so ANDing any hash with oldCap gives only two possible results: 0 or oldCap.

So the new bucket index is either the old bucket index or the old bucket index + oldCap, and the two outcomes are equally likely. This is important because the entries stay evenly distributed after resizing, and there is no need to rehash each key to find its new position, which greatly improves efficiency.
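
A small illustration of this rule, using an arbitrary oldCap of 16 and a few example hashes:

    // When the table grows from oldCap to 2 * oldCap, each entry either stays at its
    // old index (hash & oldCap == 0) or moves to old index + oldCap.
    class ResizeIndexDemo {
        public static void main(String[] args) {
            int oldCap = 16, newCap = 32;
            for (int hash : new int[]{5, 21, 40, 53}) {
                int oldIndex = hash & (oldCap - 1);
                int newIndex = hash & (newCap - 1);
                String where = (hash & oldCap) == 0 ? "stays at oldIndex" : "moves to oldIndex + oldCap";
                System.out.println(hash + ": old=" + oldIndex + ", new=" + newIndex + " (" + where + ")");
            }
        }
    }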

In addition, if the new position were not derived this way, the target bucket in the new table might already contain a list, leaving two choices: tail insertion, which requires traversing the list and is inefficient, or head insertion, which reverses the original order of the nodes and is therefore not order-preserving.

Converting the red-black tree back to a linked list

In the red-black tree branch of resize(), the TreeNode#split method is called:

    static final int UNTREEIFY_THRESHOLD = 6;

    final void split(HashMap<K,V> map, Node<K,V>[] tab, int index, int bit) {
        TreeNode<K,V> b = this;
        // Relink into lo and hi lists, preserving order
        TreeNode<K,V> loHead = null, loTail = null;
        TreeNode<K,V> hiHead = null, hiTail = null;
        int lc = 0, hc = 0;
        // ... splitting of the tree nodes into the lo and hi lists is omitted
        if (loHead != null) {
            if (lc <= UNTREEIFY_THRESHOLD)
                // Convert the red-black tree back to a linked list when the node count is at most UNTREEIFY_THRESHOLD
                tab[index] = loHead.untreeify(map);
            else {
                tab[index] = loHead;
                if (hiHead != null) // (else is already treeified)
                    loHead.treeify(tab);
            }
        }
        if (hiHead != null) {
            if (hc <= UNTREEIFY_THRESHOLD)
                tab[index + bit] = hiHead.untreeify(map);
            else {
                tab[index + bit] = hiHead;
                if (loHead != null)
                    hiHead.treeify(tab);
            }
        }
    }

    // untreeify (a TreeNode method) converts the red-black tree back to a linked list
    final Node<K,V> untreeify(HashMap<K,V> map) {
        Node<K,V> hd = null, tl = null;
        // Since TreeNode extends Node, the conversion is very neat: each TreeNode is
        // replaced by a plain Node, dropping the tree-specific fields
        for (Node<K,V> q = this; q != null; q = q.next) {
            Node<K,V> p = map.replacementNode(q, null);
            if (tl == null)
                hd = p;
            else
                tl.next = p;
            tl = p;
        }
        return hd;
    }

From the split method you can see that the red-black tree is converted back to a linked list when the number of nodes is at most 6 (UNTREEIFY_THRESHOLD). This looks odd at first: the threshold for turning a list into a red-black tree is 8, so why is the threshold for turning the tree back into a list 6?

It may seem strange at first, but it is easy to understand. In terms of efficiency there is no real difference between 6 and 8, but imagine if both thresholds were 8: a HashMap that repeatedly inserts and deletes elements so that the list length hovers around 8 would convert between tree and list constantly, which is inefficient. The gap between the two thresholds avoids that.

Why is the table transient

The table field is declared transient, so the table array is not serialized directly.

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

To see why, let's look at what happens during serialization and deserialization.

writeObject

writeObject is the method executed during serialization:

    private void writeObject(java.io.ObjectOutputStream s)
        throws IOException {
        int buckets = capacity();
        // Write out the threshold, loadfactor, and any hidden stuff
        s.defaultWriteObject();     // ObjectOutputStream's default serialization
        s.writeInt(buckets);        // 1
        s.writeInt(size);           // 2
        internalWriteEntries(s);
    }

The main points are comments 1 and 2, which write out the table capacity and the number of entries actually stored.

readObject

readObject is the method executed during deserialization:

    private void readObject(java.io.ObjectInputStream s)
        throws IOException, ClassNotFoundException {
        // Read in the threshold (ignored), loadfactor, and any hidden stuff
        s.defaultReadObject();      // ObjectInputStream's default deserialization
        reinitialize();             // initialization
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new InvalidObjectException("Illegal load factor: " +
                                             loadFactor);
        s.readInt();                // The capacity written in writeObject is read and ignored
        int mappings = s.readInt(); // The actual number of entries written in writeObject
        if (mappings < 0)
            throw new InvalidObjectException("Illegal mappings count: " +
                                             mappings);
        else if (mappings > 0) { // (if zero, use defaults)
            // Compute a proper capacity and resize threshold from the actual number of entries
            float lf = Math.min(Math.max(0.25f, loadFactor), 4.0f);
            float fc = (float)mappings / lf + 1.0f;
            int cap = ((fc < DEFAULT_INITIAL_CAPACITY) ?
                       DEFAULT_INITIAL_CAPACITY :
                       (fc >= MAXIMUM_CAPACITY) ?
                       MAXIMUM_CAPACITY :
                       tableSizeFor((int)fc));
            float ft = (float)cap * lf;
            threshold = ((cap < MAXIMUM_CAPACITY && ft < MAXIMUM_CAPACITY) ?
                         (int)ft : Integer.MAX_VALUE);
            SharedSecrets.getJavaObjectInputStreamAccess().checkArray(s, Map.Entry[].class, cap);

            Node<K,V>[] tab = (Node<K,V>[])new Node[cap];
            table = tab;
            // Deserialize the keys and values in a loop and put them into the HashMap
            for (int i = 0; i < mappings; i++) {
                K key = (K) s.readObject();
                V value = (V) s.readObject();
                putVal(hash(key), key, value, false, false);
            }
        }
    }

Because a HashMap never shrinks automatically, part of the table may hold no data, and serializing the array directly would waste space. Therefore only the number of entries and the actual Node entries are serialized; during deserialization the table is re-initialized and the entries are re-inserted.

summary

  1. HashMap operations perform no locking and take no care of per-thread caches (visibility), so HashMap is not thread-safe.
  2. Resizing is an O(N) operation and is expensive compared with the other operations, especially when there is a large amount of data. Therefore, when using a HashMap, estimate the amount of data to be stored first and pass a rough capacity at construction time to reduce the number of resizes.
  3. Avoid changing the load factor. Only when the usage scenario is very well understood should a larger or smaller load factor be chosen, depending on whether time or space matters more.
  4. A HashMap never shrinks automatically, so sometimes using several HashMaps instead of one saves space.
  5. Powers of two are very useful numbers, not only inside HashMap; their properties can enable unexpected optimizations in everyday development.

reference

The Java 8 series reimagines HashMap