The following content is the author’s original work, the copyright belongs to the author, such as reprint deduction, please leave a message on the “Light change” wechat public account application, reprint the article, please indicate the source at the beginning.

We wrote HashCode because we hear it all the time. But do you really know hashCode? Where will it be used? How should it be written?

By the end of this article, you should be able to see a different HashCode.

The purpose of using HashCode is to use one object to find another. For data structures that use hashes, such as HashSet, HashMap, LinkedHashSet, LinkedHashMap, keys will not be handled properly without good hashcode() and equals() methods that overwrite keys.

Override the hashCode () method on Person in the following code and see what happens.

Hashcode @override public int hashCode() {return age; } @Test public void testHashCode() { Set<Person> people = new HashSet<Person>(); Person person = null; for (int i = 0; i < 3 ; i++) { person = new Person("name-" + i, i); people.add(person); } person.age = 100; System.out.println(people.contains(person)); people.add(person); System.out.println(people.size()); }Copy the code

The results are not true and 3 as expected, but false and 4! The HashSet cannot find the person object after changing the person.age. Overwriting hahcode affects the storage and query of the HashSet.

So how does HashCode affect the storage and query of a HashSet? And what impact will it have?

The inside of a HashSet is implemented using a HashMap, and all collection elements put into the HashSet are converted to the keys of the HashMap. A HashMap is stored using a hash table, that is, an array + a linked list + a red-black tree (red-black tree has been added to JDK1.8). The storage structure diagram is as follows:


HashMap stores schematic diagrams of structures

The default array length is 16, and each element in the array stores the head of a linked list. The node structure of the linked list is as follows:

static class Node<K,V> implements Map.Entry<K,V> { final int hash; final K key; V value; Node<K,V> next; . }Copy the code

Each Node holds a hashcode of the hash—- key object. If the keys are not stored in any particular order, the search is compared to each element of the array through equals(), and the time is O(n). The larger the array, the less efficient it is.

So the bottleneck lies in the key query speed, how to quickly locate the storage location through the key?

A HashMap maps the hash value of the key to the index of the array, using the hash function of the key object to generate a value that serves as the index of the array, so that we can quickly locate the storage location using the key. If the hash function is perfectly designed, and each position in the array has fewer values, then we can find the elements we need in O(1) time, so we don’t have to traverse the list. This greatly improves query speed.

So how does HashMap get array subscripts based on HashCode? It can be broken down into the following steps:

  • The first step:h = key.hashCode()
  • The second step:h ^ (h >>> 16)
  • Step 3:(length - 1) & hash

Analysis of the

The first step is to get the hashcode value for the key;

The second step is to set the high or low bits of the hashcode of the key to 16 bits, so that even if the length of the array table is relatively small, both the high and low bits can be used in the Hash calculation without too much overhead.

The third step is to modulo the hash value and the array length so that the elements are relatively evenly distributed. When length is always 2 to the n, h & (length-1) operation is equivalent to modulo length, so that modulo operation is converted into displacement operation faster.

However, the default array initialization size of a HashMap is 16. When the array length is much smaller than the number of keys, different keys may produce the same array subscript, which is a hash collision!

There are open address method, chain address method and public overflow zone method to solve hash conflict.

Open addressing is looking for the next empty hash address in the event of a conflict. The process can be described as follows:

fi(key) = (f(key) + di) mod m (di= 1, 2, 3,… ,m-1)

Key collection for,67,56,16,25,37,22,29,15,47,48,34 {12}, for example, table long n = 12, f (key) = key mod and 12.

There is no conflict in the first 5 computations, direct deposit. As shown in the table

The array subscript key
0 12
1 25
2
3
4 16
5
6
7 67
8 56
9
10
11

When key = 37, f(37) = 1, which conflicts with 25. Apply the formula f(37) = (f(37) + 1) mod 12 = 2, so 37 is stored at 2 in the array. As shown in the table

The array subscript key
0 12
1 25
2 37
3
4 16
5
6
7 67
8 56
9
10
11

At key = 48, it conflicts with 0 where 12 is. If you keep looking, you don’t get a vacancy until f(48) = (f(48) + 6 mod 12 = 6. As shown in the table

The array subscript key
0 12
1 25
2 37
3
4 16
5 29
6 48
7 67
8 56
9
10 22
11 47

Therefore, in the process of conflict resolution, conflicts of 48 and 37 will also occur, that is, there will be accumulation, and the efficiency of searching or saving will be greatly reduced.

If the hash table space is [0 ~ m-1], set a one-dimensional Array Array[m] composed of M pointer components. All data elements with hash address I are inserted into the linked list with pointer Array[I].

The basic idea is to create a single linked list for each Hash value and insert records into the list when conflicts occur. As shown in the figure:


Chain address method

The benefits of linked lists are as follows:

  1. The remove operation is efficient. Only the change of Pointers is maintained, and no shift operation is required
  2. When rehashing, elements that were scattered in the same slot may be scattered in different places. Arrays need to be shifted, whereas linked lists only need to maintain Pointers. However, this also introduces the performance cost of traversing a single linked list.

The public overflow method is where we put a separate public overflow area for all the conflicting keys. For example, if {37,48,34} had conflicts in the previous example, store them in the overflow table. As shown in the figure.


Common overflow method

In the search, first compared with the base table, if equal, the search is successful, if not equal, in the overflow table for sequential search. The public overflow method is suitable for situations where conflicting data is rare.

HashMap resolves conflicts using the chain address method. The overall flow chart (not considering capacity expansion) is as follows:


HashMap stores the process diagram

Now that we understand that hashcode and hash conflicts are the solution, how do we design our own hashcode() methods?

The book Effective Java gives the following guidance for overwriting hashCode () :

  • Assign a non-zero constant value to the int variable result

  • Computes an int hash code C for each meaningful field F within the object

Domain types To calculate
boolean c = (f ? 0:1)
Byte, char, short, int c = (int)f
long c = (int)(f ^ (f >>> 32))
float c = Float.floatToIntBits(f)
double long l = Double.doubleToIntLongBits(f)
c = (int)(l ^ (l >>> 32))
Object c = f.hashcode()
An array of Apply the above rules to each element
boolean c = (f ? 0:1)
boolean c = (f ? 0:1)
  • The hash code is computed by mergingresult = 37 * result + c

Modern ides can automatically generate hashCode methods by right-clicking the context menu. For example, hashcode generated by IDEA looks like this:

@Override public int hashCode() { int result = name ! = null ? name.hashCode() : 0; result = 31 * result + age; return result; }Copy the code

In enterprise code, however, it is best to use a third-party library such as Apache Commons to generate hashocde methods. The advantage of using third-party libraries is that you can validate the tried code repeatedly. The following code shows how to use Apache Commons Hash Code to generate hashCode for a custom class build.

public int hashCode(){ HashCodeBuilder builder = new HashCodeBuilder(); builder.append(mostSignificantMemberVariable); . builder.append(leastSignificantMemberVariable); return builder.toHashCode(); }Copy the code

As the code shows, the most important signed member variables should be passed first followed by the less important ones.

conclusion

From the above analysis, we should design HashCode () with the following considerations:

  • Whenever hashCode () is called on the same object, the same value should be generated.
  • Hashcode () tries to use meaningful identifying information within the object.
  • Good hashcode() should produce evenly distributed hash values.

Thank youawakeningandThe birdsValuable advice and hard proofreading.

Follow public account

If the article helps you, please give the author a candy bar. You can follow our public account and publish high-quality articles regularly.




Light change: Wechat official account