Blog: bugstack.cn

Settle down, share, grow — so that both you and others can gain something! 😄

One, foreword

Thanks to Doug Lea, HashMap has become one of the most frequently used — and most frequently interviewed — APIs in the world.

HashMap first appeared in JDK 1.2, and its underlying implementation is based on a hash algorithm. A HashMap allows null keys and null values; a null key always hashes to 0. A HashMap does not guarantee the order of key-value pairs, which means the order can change after certain operations. It is also important to note that HashMap is not thread-safe, which can be problematic in a multithreaded environment.

After several generations of optimization, HashMap's source code has grown fairly complex and touches a lot of knowledge points, including in JDK 1.8: 1. the perturbation function, 2. the hash table, 3. initial capacity, 4. the load factor, 5. expansion and element splitting, 6. linked lists and their conversion to trees, 7. red-black trees, 8. lookup, 9. insertion, 10. deletion, 11. traversal, 12. locking, and so on. Because so much knowledge is involved, it needs to be split up; in this section we will focus on the first five, that is, the use of the data structure.

Data structures are closely related to mathematics. While learning, I recommend downloading the corresponding source code and verifying things experimentally. The process may be a bit brain-burning, but after working through it you can understand this material without rote memorization.

Two, Resource Download

The source code and resources involved in this chapter are in the project interview-04, including:

  1. The 100,000-word test data, in the doc folder
  2. The perturbation function Excel charts, also in the doc folder
  3. The test source code, in the interview-04 project

They can be obtained by following the public account: bugstack wormhole stack, replying "download", then opening the link you receive and finding ID: 19.

Three, Source Code Analysis

1. Write the simplest HashMap

The best way to learn about HashMap is to understand what kind of data structure it uses to store data. After iterating through multiple versions, the HashMap code looks complex at first glance — like going from wearing only shorts to long johns plus a trench coat. So let's first look at a HashMap in just its shorts, the basics, and then analyze its source code.

Problem: Suppose we have a set of 7 strings that need to be stored in an array, but we require O(1) time to fetch each element. That is, instead of looping through the array, you jump straight to an array index and get the element.

Solution: If we need to fetch an element from an array by index, then we need to calculate an index for each string. What could produce a number from a string? The most direct way is hashCode, but the range of hashCode values is too large, [-2147483648, 2147483647], to use directly. So we combine the hashCode with the array length to get a position that actually falls inside the array. If two elements compute the same index, then that array slot holds both strings.

So that’s the basic idea of hashing a string into an array, and we’re going to do that in code.

1.1 Code Implementation

// Initializes a set of strings
List<String> list = new ArrayList<>();
list.add("jlkk");
list.add("lopi");
list.add("小傅哥");
list.add("e4we");
list.add("alpo");
list.add("yhjk");
list.add("plop");

// Define the array to store them
String[] tab = new String[8];

// Loop and store
for (String key : list) {
    int idx = key.hashCode() & (tab.length - 1);  // Calculate index position
    System.out.println(String.format("key=%s idx=%d", key, idx));
    if (null == tab[idx]) {
        tab[idx] = key;
        continue;
    }
    tab[idx] = tab[idx] + "->" + key;
}
// Output the test results
System.out.println(JSON.toJSONString(tab));

The code as a whole is quite simple and mainly does the following:

  1. Initializes a collection of 7 strings.
  2. Defines an array to hold the strings. Note the length is 8, a power of 2, so that length - 1 is 0111 in binary — all 1s in the low bits — which is exactly what makes it usable for hashing.
  3. Loops over the data, computing each string's position in the array: key.hashCode() & (tab.length - 1).
  4. When putting a string into the array, if the slot is already occupied, concatenates the strings to simulate a linked list.
  5. Finally outputs the storage result.

The test results

key=jlkk idx=2
key=lopi idx=4
key=小傅哥 idx=7
key=e4we idx=5
key=alpo idx=2
key=yhjk idx=0
key=plop idx=5
Test result: ["yhjk",null,"jlkk->alpo",null,"lopi","e4we->plop",null,"小傅哥"]
  • The test first prints the computed array index for each element, and some positions repeat.
  • In the final output, positions 1, 3, and 6 are empty, while positions 2 and 5 each hold two linked elements, e.g. e4we->plop.
  • This achieves our most basic requirement: hash the string elements into the array, then fetch each string back by its index. This is the most basic principle of HashMap, and with this foundation the source implementation will be easier to understand.
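To make the O(1) retrieval concrete: fetching is just the reverse of storing — recompute the index from the key's hashCode and read the slot directly. A minimal sketch (the class and method names here are made up for illustration):

```java
public class SimpleHashGet {

    // Same index math as the storage loop above: mask the hashCode
    // with (length - 1), which works because length is a power of 2
    public static int indexFor(String key, int length) {
        return key.hashCode() & (length - 1);
    }

    public static void main(String[] args) {
        String[] tab = new String[8];
        String key = "yhjk";
        tab[indexFor(key, tab.length)] = key;

        // Retrieval recomputes the same index instead of scanning the array
        System.out.println(tab[indexFor(key, tab.length)]); // prints "yhjk"
    }
}
```

For "yhjk" the index comes out as 0, matching the test output above.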

1.2 Hash Diagram

If the test results above haven't formed a clear picture of the data structure in your mind, the following hash diagram should help:

  • This is what the code above does: hash each string element into the array.
  • The yellow indices hold no element, the green indices hold one element, and the red indices hold two elements.

1.3 What are the problems with this simple HashMap

Above we implemented a simple HashMap — or rather, not yet a HashMap, just a rudimentary hash store. But what problems does such a data structure have in practice?

  1. All elements hash to some index position, and if too many hash collisions land on one position, the hash table degenerates and the desired performance is not achieved.
  2. The index formula requires the array length to be a power of 2, so how should the array size be initialized?
  3. The smaller the array, the more collisions; the larger the array, the fewer collisions. How do you trade time against space?
  4. With only 7 elements in the list, two positions already hold 2 strings each.
  5. As elements are added, the array length becomes insufficient and must be expanded — how are the original elements split into their new locations?

These problems can be summarized as: the perturbation function, initial capacity, the load factor, the expansion method, and the use of linked lists and their conversion to red-black trees. We'll analyze them one by one.

2. Perturbation function

When a HashMap stores an element, there is a piece of code that processes the hash value. This is the Java 8 hash perturbation function, used to optimize the hash distribution.

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

2.1 Why use a perturbation function

In theory a string's hashCode is an int and could be used as an array subscript without collisions. But hashCode's range is [-2147483648, 2147483647] — nearly 4.3 billion values — and no one can initialize an array that big, nor would it fit in memory.

The default is DEFAULT_INITIAL_CAPACITY = 1 << 4, so the hash value cannot be used as a subscript directly. It has to be reduced modulo the array length to get a usable subscript, which is what we did above.
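The mask (length - 1) & hash only behaves like a modulo because the length is a power of 2: the mask keeps exactly the low bits of the hash. A quick sketch to check this (the class name is illustrative):

```java
public class MaskVsMod {

    // For a power-of-two n, (h & (n - 1)) equals the non-negative
    // remainder of h mod n -- even when h is negative
    public static boolean sameAsMod(int h, int n) {
        return (h & (n - 1)) == Math.floorMod(h, n);
    }

    public static void main(String[] args) {
        System.out.println(sameAsMod(Integer.MIN_VALUE, 16)); // true
        System.out.println(sameAsMod("jlkk".hashCode(), 16)); // true
    }
}
```

This also sidesteps the sign problem: in Java, h % n can be negative for a negative hash, while the mask never is.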

(h = key.hashCode()) ^ (h >>> 16): shifting the hash right by 16 bits — exactly half its width — and then XOR-ing with the original hash mixes the high and low bits of the original value, increasing randomness. The calculation is illustrated below.

  • In plain English, the purpose of using perturbation functions is to increase randomness, make data elements more evenly hashed, and reduce collisions.
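To watch that mixing happen, here is a small sketch that applies the JDK 8 style perturbation to a sample key and compares the bucket index before and after (the key and table size are arbitrary choices):

```java
public class PerturbDemo {

    // JDK 8 style perturbation: XOR the high 16 bits into the low 16 bits
    public static int perturb(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h = "jlkk".hashCode();
        System.out.println(Integer.toBinaryString(h));
        System.out.println(Integer.toBinaryString(perturb(h)));

        // Only the low bits survive (length - 1) & hash, so mixing in the
        // high bits can change which bucket the key lands in
        System.out.println((16 - 1) & h);
        System.out.println((16 - 1) & perturb(h));
    }
}
```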

2.2 Experimental verification of the perturbation function

As the analysis above shows, the perturbation function XORs the high and low halves of the hash value, mixing the high bits of the original hash code into the low bits to increase their randomness.

But without experimental data this is still just theory — we don't know whether the hash actually becomes more random. So let's run an experiment, set up like this:

  1. Choose a thesaurus of 100,000 words.
  2. Define an array of 128 slots.
  3. Compute the subscript each of the 100,000 words maps to across the 128 slots, both with and without perturbation.
  4. Count the number of words in each slot and plot the curves. If the curve under the perturbation function fluctuates less, the perturbation function is shown to be effective.
2.2.1 Perturbation code test

Perturbation function contrast method

public class Disturb {

    public static int disturbHashIdx(String key, int size) {
        return (size - 1) & (key.hashCode() ^ (key.hashCode() >>> 16));
    }

    public static int hashIdx(String key, int size) {
        return (size - 1) & key.hashCode();
    }
}
  • disturbHashIdx computes the subscript with the perturbation function.
  • hashIdx computes the subscript without the perturbation function.

Unit testing

// 100,000 words have been initialized into words
@Test
public void test_disturb() {
    Map<Integer, Integer> map = new HashMap<>(16);
    for (String word : words) {
        // Use the perturbation function
        int idx = Disturb.disturbHashIdx(word, 128);
        // Do not use perturbation function
        // int idx = Disturb.hashIdx(word, 128);
        if (map.containsKey(idx)) {
            Integer integer = map.get(idx);
            map.put(idx, ++integer);
        } else {
            map.put(idx, 1);
        }
    }
    System.out.println(map.values());
}

The code above counts the distribution of subscript values under the two functions; the statistics were then put into Excel to generate charts.
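If you prefer a single number to a chart, the standard deviation of the slot counts works as a summary: the flatter the distribution, the smaller it is. A sketch of that calculation, using random integers in place of the word list (which ships with the project resources):

```java
import java.util.Random;

public class SpreadStat {

    // Standard deviation of the slot counts: lower means a more even spread
    public static double stdDev(int[] counts) {
        double mean = 0;
        for (int c : counts) mean += c;
        mean /= counts.length;
        double var = 0;
        for (int c : counts) var += (c - mean) * (c - mean);
        return Math.sqrt(var / counts.length);
    }

    public static void main(String[] args) {
        int[] buckets = new int[128];
        Random r = new Random(42); // fixed seed so the run is repeatable
        for (int i = 0; i < 100_000; i++) {
            int h = r.nextInt();
            buckets[(buckets.length - 1) & (h ^ (h >>> 16))]++;
        }
        System.out.println(stdDev(buckets));
    }
}
```

Running the same statistic over the perturbed and unperturbed word indices gives the chart comparison below in one number each.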

2.2.2 Perturbation function hash diagrams

The two charts below show the subscript distribution without and with the perturbation function respectively. Experimental data:

  1. 100,000 non-repeating words
  2. 128 slots, equivalent to an array of length 128

Without the perturbation function

With the perturbation function

  • Comparing the two charts, the data distribution is more uniform after applying the perturbation function.
  • A uniform distribution means the hash works better, reducing hash collisions and making data storage and retrieval more efficient.

3. Initial capacity and load factor

As we saw both in our mini-HashMap example and in HashMap's default initial size, the hash array length needs to be a power of 2, because only a power of 2 minus 1 yields a mask like 01111 with all low bits set.

So what happens if we pass a value that is not a power of 2, say new HashMap<>(17)?

3.1 Finding the smallest power of 2

In the initialization of a HashMap, there is a method like this;

public HashMap(int initialCapacity, float loadFactor) {
    ...
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}
  • The threshold is computed by tableSizeFor from the initial capacity.
  • The method finds the smallest power of 2 greater than or equal to the initial value. For 17, it finds 32.

The method that computes the threshold size:

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
  • MAXIMUM_CAPACITY = 1 << 30 is the upper bound — the largest capacity a map can have.
  • At first glance the right shifts by 1, 2, 4, 8, 16 may be a little dizzying 😵. They exist to fill every lower bit position of the binary value with 1. Once all the low bits are 1, the value is a power of 2 minus 1, and adding 1 to the final result returns the power of 2.

Here is the threshold calculation process for an initial value of 17, shown step by step in the diagram for easier understanding:
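The same bit-smearing can be traced in code — a sketch that mirrors the JDK 8 method with MAXIMUM_CAPACITY inlined, with the intermediate values for 17 noted in the comments:

```java
public class TableSize {

    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Smallest power of 2 greater than or equal to cap
    public static int tableSizeFor(int cap) {
        int n = cap - 1;   // 17 -> 16 = 10000
        n |= n >>> 1;      // 10000 | 01000 = 11000
        n |= n >>> 2;      // 11000 | 00110 = 11110
        n |= n >>> 4;      // 11110 | 00001 = 11111 = 31
        n |= n >>> 8;      // already all 1s, no change
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(17)); // prints 32
        System.out.println(tableSizeFor(16)); // prints 16: already a power of 2
    }
}
```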

3.2 Load factor

static final float DEFAULT_LOAD_FACTOR = 0.75f;

What does a load factor do?

The load factor can be understood like a truck's load limit: once the cargo exceeds a certain threshold, you need to move some of it onto a new truck.

In a HashMap, the load factor decides how much data triggers an expansion. Recall the HashMap example above: we stored 7 elements, yet ended up with 3 empty slots and 2 slots holding 2 elements each. So even when the number of elements exceeds the array size, the array may not fill evenly — some positions collide heavily and can only chain their elements into lists, losing the O(1) performance of the array.

Therefore a reasonable expansion point must be chosen. The default is 0.75, meaning that when the map reaches 3/4 of its capacity it expands promptly, reducing hash collisions.

While 0.75 is the default, it can be adjusted when creating a HashMap. For example, to trade more space for time, you can lower the load factor to reduce collisions.
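A minimal sketch of adjusting it at construction time (the 0.5f here is just an example value):

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // Capacity 16 with the default 0.75f resizes at 12 entries;
        // with 0.5f it resizes at 8, trading memory for fewer collisions
        Map<String, String> sparser = new HashMap<>(16, 0.5f);
        sparser.put("key", "value");
        System.out.println(sparser.get("key")); // prints "value"
    }
}
```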

4. Splitting elements on expansion

Why expand? Because the array length becomes insufficient. The most immediate problem with expansion is that elements need to be split into the new array. In JDK 1.7 the hash value had to be recalculated during the split, but JDK 1.8 optimized this away — no recalculation is needed, which improves splitting performance. A very clever design.

4.1 Test Data

@Test
public void test_hashMap() {
    List<String> list = new ArrayList<>();
    list.add("jlkk");
    list.add("lopi");
    list.add("jmdw");
    list.add("e4we");
    list.add("io98");
    list.add("nmhg");
    list.add("vfg6");
    list.add("gfrt");
    list.add("alpo");
    list.add("vfbh");
    list.add("bnhj");
    list.add("zuio");
    list.add("iu8e");
    list.add("yhjk");
    list.add("plop");
    list.add("dd0p");
    for (String key : list) {
        int hash = key.hashCode() ^ (key.hashCode() >>> 16);
        System.out.println("String: " + key + "\tIdx(16): " + ((16 - 1) & hash) + "\tBit value: " + Integer.toBinaryString(hash) + " - " + Integer.toBinaryString(hash & 16) + "\t\tIdx(32): " + ((32 - 1) & hash));
        System.out.println(Integer.toBinaryString(key.hashCode()) + " " + Integer.toBinaryString(hash) + " " + Integer.toBinaryString((32 - 1) & hash));
    }
}

The test results

String: jlkk    Idx(16): 3    hash & 16: 10000    Idx(32): 19
String: lopi    Idx(16): 14   hash & 16: 0        Idx(32): 14
String: jmdw    Idx(16): 7    hash & 16: 0        Idx(32): 7
String: e4we    Idx(16): 3    hash & 16: 10000    Idx(32): 19
String: io98    Idx(16): 4    hash & 16: 10000    Idx(32): 20
String: nmhg    Idx(16): 13   hash & 16: 0        Idx(32): 13
String: vfg6    Idx(16): 8    hash & 16: 0        Idx(32): 8
String: gfrt    Idx(16): 1    hash & 16: 10000    Idx(32): 17
String: alpo    Idx(16): 7    hash & 16: 0        Idx(32): 7
String: vfbh    Idx(16): 1    hash & 16: 0        Idx(32): 1
String: bnhj    Idx(16): 0    hash & 16: 0        Idx(32): 0
String: zuio    Idx(16): 8    hash & 16: 10000    Idx(32): 24
String: iu8e    Idx(16): 8    hash & 16: 0        Idx(32): 8
String: yhjk    Idx(16): 8    hash & 16: 0        Idx(32): 8
String: plop    Idx(16): 9    hash & 16: 0        Idx(32): 9
String: dd0p    Idx(16): 14   hash & 16: 0        Idx(32): 14
  • Here we take some random strings and compute their index allocation in arrays of length 16 and 32, to see which elements are re-routed to a new address.
  • At the same time, observe a very important detail 🕵: AND the original hash value with the old capacity 16. If the result is 0, the subscript stays unchanged; if not, the new position is the old position plus 16. {This area needs to be understood carefully against the experimental data.}
  • This eliminates the need to recalculate the hash value of every element in the array.

4.2 Data Migration

  • This diagram shows how the elements of the original length-16 array are transferred into the length-32 array.
  • The yellow element zuio has hash & oldCap != 0, so it is migrated to subscript position 24.
  • Recalculating the hash from scratch confirms it indeed lands at position 24; and because resizing just extends the index by one binary bit, the simplified check above is enough to determine the new position.

Four, Summary

  • If you stuck with this section and followed the experiments, you should have learned the five topics of this chapter: 1. the hash table implementation, 2. the perturbation function, 3. initial capacity, 4. the load factor, and 5. element splitting on expansion.
  • Personally, I knew this material before, but had never verified it — I only knew the concepts. Writing this interview-manual column deepened my learning; backing the theory with data lets each knowledge point be understood in more depth.
  • The next chapter will continue exploring the other knowledge points of HashMap, so that when you understand it, you really understand it. All right, that's it — thank you for reading. If anything is not clearly described, or there are points you don't understand, feel free to discuss them with me.

Five, Recommended Reading

  • Why does HashCode use 31 as a multiplier?
  • What Do Interviewers Ask Me?
  • Work two years resume to write like this, who want you!
  • Be reasonable, as long as you are a programmer who love to toss about, graduation job really do not need to spend money training!
  • College four years to graduation work 5 years of learning route resources summary
  • Source code analysis | Mybatis interface has no implementation class why can add and delete execution