Default capacity selection for HashMap

Collection is often used in the daily development of Java development, and as a typical K-V structure of data structure, HashMap must be familiar to Java developers.

In daily development, we often create a HashMap as follows:

However, did you remember that in the above code we did not specify the capacity of a HashMap? What is the default capacity of a newly created HashMap? Why is that?

This article will analyze this problem.

What is capacity

In Java, there are two relatively simple data structures for storing data: arrays and linked lists. The characteristics of arrays are: easy to address, insert and delete difficult; The characteristics of linked lists are: difficult to address, easy to insert and delete. A HashMap is a combination of an array and a list, taking advantage of the best of both. We can think of it as an array of lists.

In a HashMap, there are two key fields that are easily confused: size and capacity, where Capacity is the capacity of the Map and size is called the number of elements in the Map.

A HashMap is a bucket. Capacity is how many elements the bucket can currently hold, and size is how many elements the bucket already holds.

Such as the following code:

Output result:

Above we define a new HashMap, put an element into it, and then reflect capacity and size, which have a capacity of 16 and a number of elements 1.

From the previous example, we found that when we create a HashMap, if we do not specify its capacity, we get a Map with a default capacity of 16. How do we get this capacity? And why is that number?

Capacity and hashing

In order to explain the reason for this default capacity, we need to know what this capacity is for in the first place.

As we know, capacity is the number of “buckets” in a HashMap. Therefore, when we want to put an element into a HashMap, we need to calculate which bucket to put it into through a certain algorithm. This process is called hash, which corresponds to the hash method in the HashMap.

We know that the hash method’s function is to locate the k-V in the list array based on the Key. The input of the hash method should be an Object Key and the output should be an array index of type int. If you could design this method, what would you do?

Instead, we simply call the Object’s hashCode() method, which returns an integer and modulates the size of the HashMap.

If it were that simple, the capacity setting of a HashMap would be much simpler, but the implementation of the hash method of a HashMap is somewhat complicated in terms of efficiency.

The realization of the hash

Here’s how to implement hash methods in a HashMap.

It is implemented by two methods, int hash(Object k) and int indexFor(int h, int Length).

Hash: This method essentially converts Object to an integer.

IndexFor: This method primarily converts hash generated integers into indices in linked list arrays.

To focus the focus of this article, let’s just look at the indexFor method. Let’s take a look at the implementation details in Java 7 (Java8 does not have a separate method, but the algorithm for querying subscripts is the same as Java 7) :

The indexFor method essentially replaces hashCode with subscripts in a linked list array. The two parameters h represent the hashcode value of the element, and length represents the capacity of the HashMap. So what does return h & (length-1) mean?

In fact, he’s just taking modules. The main consideration in Java for using bit-arithmetic (&) instead of modulo arithmetic (%) is efficiency.

Bit operation (&) is much more efficient than substitute mod operation (%), mainly because bit operation directly operates on memory data, does not need to convert to decimal, so processing speed is very fast.

So why is it possible to use the bit operator (&) to implement the modulo operator (%)? This works as follows:

If n is 3, then 2 to the third is equal to 8, which is equal to 1000 in base 2. 2 to the third minus 1 is 7, which is 0111.

So X ^ 2^3-1 is the same thing as taking the last three digits of base 2 of X.

In base 2, X / 8 is the same thing as X >> 3, that is, if you move X to the right by 3 bits, you get the quotient of X / 8, and what you remove is X % 8, which is the remainder.

I don’t know if you understand the above explanation, but it doesn’t matter if you don’t, you just need to remember this technique. Or you could try it out with a couple of examples.

The operation process is as follows:

So, return h & (length-1); If length is 2 to the n, then you can do modulo.

Therefore, bitwise operation is more efficient than modulo operation since bitwise operation directly operates on memory data and does not need to be converted to decimal. Therefore, HashMap uses bitwise operation instead of modulo operation when calculating the index of an element to be stored in an array. The reason why we can make equivalent substitution is that the capacity of HashMap must be 2^n.

So, if it’s 2 to the n, why does it have to be 16? Why can’t it be four, eight or 32?

There is no official JDK explanation for this choice of default capacity, and I haven’t found anything useful online about it. (If anyone has relevant authoritative information or ideas, please feel free to leave a comment.)

According to the authors, this should be an Experience Value, and since a default Value of 2^n must be set as an initial Value, there is a trade-off between efficiency and memory usage. This value should be neither too small nor too large.

If it is too small, frequent capacity expansion may occur, affecting efficiency. It’s too big and it wastes space. It’s not worth it.

So, 16 was adopted as an empirical value.

Static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16, which is deliberately written as 1<<4, reminds developers that this place is a power of 2. It is interesting to note that AKA 16 is also new in 1.8.

So, how does a HashMap guarantee that its capacity can be 2 to the n? What if the user sets it up himself?

For this part, the HashMap is compatible in two places where it can change its capacity: when the specified capacity is initialized and when it is expanded.

Specifies capacity initialization

When we set the initialCapacity using a HashMap(int initialCapacity), the HashMap does not necessarily take the value we passed in directly, but evaluates it to a new value to improve hash efficiency. (1->1, 3->4, 7->8, 9->16)

In JDK 1.7 and JDK 1.8, HashMap initializes this capacity at different times. In JDK 1.8, when the HashMap constructor is called to define a HashMap, the capacity is set. In JDK 1.7, you don’t do this until the first PUT operation.

Take a look at how the JDK finds the first power of 2 greater than the specified value passed in:

The purpose of the above algorithm is quite simple: based on the capacity value passed in by the user (cap in the code), it calculates the first power of 2 greater than it and returns it.

Pay attention to the changes in the blue font in the examples above and you may notice some patterns. 5->8, 9->16, 19->32, 37->64 are mainly through two stages.

Corresponding to the above code, Step1:

Corresponding to the above code, Step2:

Step 2 is relatively simple, which is to make a judgment of the limit value, and then add the value obtained in Step 1 to 1.

What about Step 1? It’s essentially shifting a binary number to the right, and then taking or to the original value. The purpose is to set all subsequent bits to 1, starting with the first non-0 bit for the binary of a number.

Take any binary number and run through the above formula to find the purpose:

By a couple of unsigned right shifts and bitwise or operations, we convert 1100 1100 1100 to 1111 1111 1111, add 1111 1111 1111 plus 1, and we get 1 0000 0000 0000, that’s the first power of 2 greater than 1100 1100 1100.

Ok, so we’ve now explained the code for Step 1 and Step 2. You can convert a number to the first power of 2 larger than itself.

But there’s a special case where this formula doesn’t work, where these numbers are powers of 2 by themselves. If the number four follows the formula. The result will be 8, but this problem has been solved. See the verification method and JDK solution in a later article. I’m not going to expand it here.

In short, the HashMap calculates the first power of 2 greater than that number using unsigned right shifts and bitwise or operations based on the initialized capacity passed in by the user.

capacity

In addition to specifying the capacity of the HashMap during initialization, the capacity may change during expansion.

HashMap has a capacity expansion mechanism, that is, when the capacity expansion conditions are met, the capacity will be expanded. The expansion condition of a HashMap is that when the number of elements in the HashMap (size) exceeds the threshold, the HashMap will be automatically expanded.

In HashMap, threshold = loadFactor * capacity.

LoadFactor is the loadFactor that indicates how full the HashMap is. The default value is 0.75f. 0.75 has the advantage that 0.75 is exactly 3/4, and capacity is a power of 2. So, the product of two numbers is an integer.

For a default HashMap, expansion is triggered by default when its size is greater than 12(16*0.75).

Here is a section of the resize method in HashMap:

As can be seen from the above code, the size of the table after expansion is doubled. After this step is performed, the table after expansion will be adjusted. This part is not the focus of this article and omitted.

It can be seen that when the number of elements (size) in the HashMap exceeds the threshold, it will be automatically expanded to double the original capacity, that is, from 16 to 32, 64, 128…

Therefore, the capacity of the HashMap is guaranteed to always be a power of 2 by ensuring that the initial capacity is always a power of 2 and that the capacity is also expanded by twice the previous capacity.

conclusion

A HashMap is a data structure. During the putting process, elements need to perform hash operations to calculate the specific location of the element in the HashMap.

In fact, the process of hash operation is to hashcode the Key of the target element and then modulo the capacity of the Map. In order to improve the efficiency of modulo extraction, JDK engineers use bit operation instead of modulo extraction, which requires that the capacity of the Map must be a power of 2.

As a default, too large or too small is not appropriate, so 16 is used as an appropriate experience value.

To ensure that the Map’s capacity is a power of two in all cases, HashMap is limited in both places.

First, if the user has specified an initial capacity, the HashMap calculates the first power of 2 above that number as the initial capacity.

In addition, when you expand, you also multiply, so that 4 becomes 8 and 8 becomes 16.

In this article, by analyzing why the default capacity of HashMap is 16, we delved into the principle of HashMap and analyzed the principle behind it. From the code, we can find that JDK engineers use all kinds of bit operations to the extreme, and try to optimize the efficiency in various ways. Worth learning!

Default capacity selection for HashMap

What is capacity

Capacity and hashing

The realization of the hash

Specifies capacity initialization

capacity

conclusion

Related Posts

All WiFi is a potential “third eye” on your privacy

Overwatch’s architecture is designed to synchronize with the web

How does small white become hacker?