Hello, I’m Son Ya. More than ten years of technical career, all the way from technical novice to technical director to JVM expert to entrepreneurship. Technology stack such as assembly, C language, C++, Windows kernel, Linux kernel. Especially like to study the virtual machine bottom implementation, have an in-depth study of JVM. The articles shared tend to be hardcore, very hard.

JVMS, memory pools, garbage collection algorithms, synchronized, thread pools, NIO, tricolor tagging algorithms…

What are you going to talk about today? The string constant pool, which is exactly how strings in Java code are stored in the JVM.

When analyzing a problem, we need to distill a highly abstract problem into a problem that pursues the essence. When it comes to reasoning, the secret of one move against the enemy is to respond to all changes with the same change. These two pieces of code are the basis for you to play through Java strings, and the go term is “eye,” which means the string circle in the Java world, and everything is derived from these two pieces of code.

Problem analysis

There are two ways to study things: researcher’s point of view and designer’s point of view. The researcher’s point of view is that we start from the point of view of learning, to pursue the track of things, dig down along the track, speak human words is to read the source code to understand the designer’s idea. The designer’s perspective is that we imagine that we are going to do something, how we are going to do it, what are the options ahead, analyze the pros and cons of each option, and finally make a choice.

I believe that most of the time, we study problems from the perspective of researchers. Today, let’s change our thinking and study this problem from the perspective of designers.

To boil it down to the essence of the problem: if we were writing a JVM, how would we handle strings? The problem is simple: use a hashtable, or hashtable. Note that there are two types of Hashtable in the Java world: Java hashTable and HashMap, implemented purely in Java; Hotspot hashtable, pure C++ implementation. Today we are looking at hashtable in Hotspot source code.

A lot of friends see C++, the heart of fear… Or thinking: Is this something I can see for free? Is this all I can see at this level? I really understand…

Well, don’t be afraid, there is no source code today! Only son teeth learned for a long time after AE draw a small dynamic, to help you accurately, quickly and accurately pinched. Well, don’t lose heart if you want to see the source code, pure source code, meet you. As long as you want to see, I deliberately overturned car is not impossible, message reply: want to see overturned car

hashtable

For the sake of completeness, I’m going to cover Hashtable in detail. If you think you are familiar with this area, it is also recommended to read it, there will be a harvest

First up, how does hashtable work? What are the nouns and phenomena involved

Let’s talk about the underlying implementation of Hashtable. There are two main ways: array + single linked list, array + red-black tree. You have to say there is a third way: array + single linked list + red black tree. All of a sudden, you may have a lot of questions: Why do you do this? What are the differences between different implementations? Don’t worry, it’s all covered.

Let’s see how arrays and singly linked lists work. What we’re drawing here is an array plus a singly linked list.

Step 1: When the string ziya arrives, it is greeted by the hash algorithm. The hash algorithm is also called the hash algorithm. There are many ways to implement the hash algorithm. The result is a buckets index. If ziya is calculated using the hash algorithm, buckets index is equal to 2.

Step 2: Buckets’ data structure is an array. The first step is to generate the index of the array. In this case, the string ziya will be encapsulated into a linked list Node Node, where the string ziya is the first element of index=2. The linked list Node Node encapsulated by the string ziya will be used as the head of the linked list, and the memory address of Node will be stored at the location of index=2.

Buckets is an array. Anyone who has written C++ code knows that you cannot create an array without specifying its length. How big is that? Hotspot source code is 20011. Why 20011? I also have this question, but I tried to find the answer, failed, but I am sure that the number should have a special meaning. You can leave a message if you know about it. Of course I will not give up the pursuit of truth!

In terms of development, you can store a maximum of 20011 buckets indexes if the buckets index obtained by the hash algorithm is not the same. What if there are more than 20011 strings? In this case, different strings will get the same buckets index after the hash algorithm, which is called hash collision.

For example, the string [handwritten] and the string [JVM] in the diagram show buckets index 0 based on the hash algorithm. Write the memory address of the Node where index=0. Then encapsulate the string [JVM] as a linked list Node and insert it into the end of the list. Note: Linked list nodes are not contiguous in memory!

A string is written to the bucket index, and then the string is encapsulated as a linked list Node. If the incoming string is the first element of the index, write the memory address of the node to the index position, if not the first element, insert to the end of the list.

A special case

Look at the picture before you speak

In this diagram, the depth of the linked list corresponding to Buckets index=0 is too deep (345), so that if the string we are looking for happens to be [internal work], it will take 345 times to go through the linked list regardless of other operations. For today’s high-performance computers, 345 times may not seem too slow, but performance degrades with data growth, meaning that one day you will reach a performance bottleneck and need to optimize.

This led to an array + red-black tree structure, which later evolved into an array + linked list + red-black tree structure. I’ll explain why: Red-black trees perform better than linked lists for large numbers of data. Why the array + list + red-black tree structure? Note the premise: the case of large amounts of data. When the amount of data is relatively small, such as the underlying implementation of HashMap, the threshold is 8, that is, in the case of severe hash collision, the data is less than 8, the performance difference between the two is not big, but the linked list is relatively easier to implement, take up less space…

If the hash collision is severe, changing the underlying storage container is one idea. Are there any other ideas? There are! Changing the hash algorithm is what Hotspot source code does. The default algorithm is java_lang_String::hash_code, and AltHashing::murmur3_32 is used when rehash is triggered.

Hotspot triggers rehash if it looks up a string more than 100 times.

String of a string

There are two tables associated with strings in the JVM, SymbolTable and StringTable. When we say string constant pool, we mean StringTable. But StringTable runs closely with SymbolTable. Everything this article talks about is the underlying principle for SymbolTable. This article is long enough to cover the connection between the two.

The underlying SymbolTable implementation is based on hashTable, with an array + linked list structure, and is implemented by switching hash algorithms when enough data is stored and hash collisions are severe. The default algorithm is java_lang_String::hash_code, and AltHashing::murmur3_32 is used when rehash is triggered.

Here’s a question: Switching hash algorithms won’t work right away, until it’s safe. Why do you think? Welcome to leave comments and discuss