I. Introduction

Compiled languages such as C++ typically compile source code directly into machine code that the CPU can execute. To achieve its "compile once, run anywhere" property, Java splits the process in two: source code is first compiled by javac into a platform-independent intermediate form called bytecode, and an interpreter then translates the bytecode into machine code at run time. As a result, Java has generally lagged behind fully compiled languages such as C++ in performance.

To close this gap, the JVM adds a Just-In-Time (JIT) compiler alongside the interpreter. When a program starts, the interpreter kicks in first and code executes immediately. As the program runs, the JIT compilers come into play, compiling more and more code into optimized native code for faster execution. The interpreter also serves as a fallback: if an aggressive compiler optimization turns out to be invalid, execution can be deoptimized back to interpreted mode so the program keeps running correctly.

The JIT compiler greatly improves the speed of Java programs, and because it compiles only hot code, it saves a great deal of compilation time and space compared with static ahead-of-time compilation. JIT compilers are now very mature, and in peak performance Java is even comparable to compiled languages. Research in this area continues, looking for smarter ways to speed up programs by combining different compilation strategies.

II. The Java execution process

The Java execution process can be divided into two parts. First, javac compiles source code into bytecode, performing lexical analysis, syntax analysis and semantic analysis along the way; in compiler theory this stage is called front-end compilation. The bytecode then needs no further compilation to run: it is executed directly by the interpreter. While interpreting, the virtual machine collects profiling information about the program, and on the basis of that information the compiler gradually comes into play, performing back-end compilation from bytecode into machine code. Not all code gets compiled, however; only code the JVM identifies as hot code is eligible.

What does it take to be considered hot code? The JVM sets a threshold: when a method or block of code is called more times than the threshold within a certain period, it is compiled and stored in the CodeCache. The next time this code is encountered, the machine code is read from the CodeCache and executed directly, improving the program's performance. The overall execution process is roughly shown in the figure below:
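As a small illustration (the class and method names here are invented for the demo), a method that is invoked enough times becomes hot and gets JIT-compiled; running the program with the real HotSpot flag -XX:+PrintCompilation prints a log line when that happens:

```java
// Hypothetical demo class. Run with `java -XX:+PrintCompilation HotLoop`
// and the compilation log will eventually show HotLoop::compute being
// compiled once its invocation counter crosses the JIT threshold.
public class HotLoop {
    // A small, pure method -- a typical candidate for hot-code compilation.
    static long compute(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call the method many times so its invocation counter
        // passes the compilation threshold.
        for (int i = 0; i < 20_000; i++) {
            total += compute(100);
        }
        System.out.println(total);
    }
}
```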

1. Compilers in the JVM

The JVM integrates two kinds of compilers, the Client Compiler and the Server Compiler, which serve different purposes: the Client Compiler favors startup speed and local optimizations, while the Server Compiler favors global optimizations and achieves better peak performance, at the cost of slower startup since global analysis takes longer. The two compilers suit different scenarios and work together inside the virtual machine.

Client Compiler

HotSpot VM ships with a Client Compiler called C1. C1 starts quickly but produces less optimized code than the Server Compiler. C1 does three things:

  • It applies simple, reliable local optimizations, such as basic optimizations on the bytecode, method inlining and constant propagation, and forgoes many of the time-consuming global optimizations.
  • It builds the bytecode into a High-level Intermediate Representation (HIR). HIR is platform-independent and usually graph-structured, which makes it well suited to program optimization by the JVM.
  • Finally, it lowers HIR into a Low-level Intermediate Representation (LIR), on top of which it performs register allocation, peephole optimization (a local optimization pattern: within one or a few basic blocks, the compiler examines the generated code and, using the characteristics of the CPU's own instructions, applies conversion rules likely to improve performance) and other passes, and then emits the machine code.
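As a concrete illustration of constant propagation combined with constant folding (a source-level sketch; the JIT actually works on bytecode and IR, not source):

```java
public class ConstantPropagation {
    // Before optimization: y depends on x, but x is known to hold the constant 5.
    static int before() {
        int x = 5;
        int y = x * 2;   // constant propagation rewrites this as 5 * 2
        return y;        // constant folding then collapses the method to `return 10;`
    }

    // What the compiler effectively produces after both optimizations.
    static int after() {
        return 10;
    }

    public static void main(String[] args) {
        System.out.println(before() == after()); // the two are equivalent
    }
}
```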

Server Compiler

The Server Compiler focuses on time-consuming global optimizations, and even performs some unreliable, aggressive optimizations based on runtime information. It takes longer to warm up and suits long-running background services; its peak performance is usually more than 30% higher than the Client Compiler's. HotSpot currently ships two Server Compilers: C2 and Graal.

C2 Compiler

In Hotspot VM, the default Server Compiler is the C2 Compiler.

The C2 compiler performs its compilation optimizations on a graph data structure called the Ideal Graph, which combines control flow with data flow. The Ideal Graph represents the dependencies between data flow and instructions in the current program. With this structure, some optimization steps, especially those involving floating blocks of code, become much less complex.

The Ideal Graph is built while parsing the bytecode: nodes are added to an initially empty graph according to the instructions encountered. Nodes in the graph usually correspond to instructions, and related instructions are grouped so the JVM can apply optimization techniques to them, such as Global Value Numbering and constant folding. After parsing completes, some dead-code elimination is performed. Once the Ideal Graph has been generated, further global optimizations are applied on top of it using the collected runtime information; if the JVM decides that global optimization is unnecessary at this stage, it skips it.

With or without global optimization, the Ideal Graph is then converted into a MachNode Graph closer to the machine level, from which the machine code is generated after operations such as register allocation and peephole optimization. The Ideal Graph and the various global optimizations are covered in more detail in a later chapter. The Server Compiler's compilation and optimization process is shown in the figure below:

Graal Compiler

Starting with JDK 9, Hotspot VM is integrated with a new Server Compiler, the Graal Compiler. Graal has several key features compared to the C2 compiler:

  • As mentioned earlier, the JVM collects information about the program while interpreting, and the compiler uses that information for prediction-based aggressive optimizations, such as branch prediction, which selectively compiles the branches the program is most likely to take. Graal leans on such optimizations more than C2 does, so Graal's peak performance is generally better than C2's.
  • Graal is written in Java, which makes it friendlier to the Java language, especially to newer features such as lambdas and streams.
  • Deeper optimizations, such as virtual function inlining, partial escape analysis, etc.

The Graal compiler can be enabled with the JVM parameters -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. When enabled, it replaces C2 in HotSpot and responds to the compilation requests C2 was responsible for.

2. Tiered compilation

Prior to Java 7, developers had to choose a compiler according to the nature of the service. For services that need to start quickly or will not run for long, the fast-compiling C1 could be selected with the -client flag; for long-running services that need peak performance, C2 could be selected with the -server flag. Java 7 introduced tiered compilation, which combines the advantages of C1 and C2 to balance startup speed against peak performance. Tiered compilation divides the JVM's execution state into five levels:

  0. Interpreted execution.
  1. Execute C1-compiled code without profiling.
  2. Execute C1-compiled code with profiling of method invocation counts and loop back-edge counts only.
  3. Execute C1-compiled code with full profiling.
  4. Execute C2-compiled code.

Profiling means collecting data that reflects the program's execution state; the most basic statistics are the number of method invocations and the number of loop back-edge executions.

C2-compiled code typically runs over 30% faster than C1 code. Among the C1 levels, execution efficiency is level 1 > level 2 > level 3. Levels 1 and 4 are terminal states: once a method reaches a terminal state, the JVM issues no further compilation requests for it as long as the compiled code is not invalidated. At run time, the JVM chooses different compilation paths for each method depending on how the service behaves, starting from interpreted execution and proceeding until a terminal state is reached. Here are some common compilation paths:

  • Path ① in the figure represents the common case: a hot method goes from interpreted execution to level 3 C1 compilation, and finally to level 4 C2 compilation.
  • If a method is small (such as the getter/setter methods common in Java services) and level 3 profiling collects no valuable data, the JVM concludes that the method would be equally efficient as C1 or C2 code and follows path ②: after level 3 it skips C2 compilation and settles on level 1 C1 code.
  • If C1 is busy, the JVM follows path ③: the program is profiled during interpreted execution, and based on that information the method is compiled directly by C2 at level 4.
  • As noted above, C1 execution efficiency is level 1 > level 2 > level 3, with level 3 generally more than 35% slower than level 2. So when C2 is busy, path ④ is taken: the method is first compiled by C1 at level 2 and later at level 3, reducing the time it spends at the slower level 3.
  • If the compiler makes aggressive optimizations such as branch prediction and a prediction turns out to be wrong at run time, it deoptimizes and falls back to interpreted execution, which is path ⑤ in the figure.

Overall, C1 compiles faster and C2 compiles better code, and the different compilation paths of tiered compilation are how the JVM finds the sweet spot for the current service based on its runtime behavior. Since JDK 8, tiered compilation has been enabled by default.
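The JVM's active JIT compiler can be inspected at run time through the standard java.lang.management API; a small sketch (the exact compiler name printed varies by JDK build):

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitInfo {
    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        // On HotSpot with tiered compilation this typically reports a name
        // like "HotSpot 64-Bit Tiered Compilers" (varies by JDK build).
        System.out.println("JIT compiler: " + jit.getName());
        if (jit.isCompilationTimeMonitoringSupported()) {
            // Cumulative time the JVM has spent in JIT compilation so far.
            System.out.println("Total compilation time (ms): "
                    + jit.getTotalCompilationTime());
        }
    }
}
```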

3. Just-in-time compilation triggers

The Java virtual machine triggers just-in-time compilation based on the number of method invocations and the number of loop back-edge executions. A back edge is a concept from control flow graphs; it can be understood simply as an instruction that jumps backwards, as in this code:

Back edge

public void nlp(Object obj) {
    int sum = 0;
    for (int i = 0; i < 200; i++) {
        sum += i;
    }
}

Compiling the above code produces the bytecode below, in which the bytecode at offset 18 jumps back to the bytecode at offset 4. During interpreted execution, the Java virtual machine increments the method's back-edge counter each time this instruction runs.

The bytecode

public void nlp(java.lang.Object);
    Code:
       0: iconst_0
       1: istore_1
       2: iconst_0
       3: istore_2
       4: iload_2
       5: sipush        200
       8: if_icmpge     21
      11: iload_1
      12: iload_2
      13: iadd
      14: istore_1
      15: iinc          2, 1
      18: goto          4
      21: return

During just-in-time compilation, the compiler identifies the head and tail of the loop. In the bytecode above, the head and tail of the loop body are the bytecodes at offsets 11 and 15 respectively. The compiler inserts a back-edge counter increment at the tail of the loop body to count the loop iterations.

When the sum of the method invocation count and the back-edge count exceeds the threshold specified by -XX:CompileThreshold (default 1,500 for C1; 10,000 for C2), just-in-time compilation is triggered.

When tiered compilation is enabled, -XX:CompileThreshold is ignored, and compilation is instead triggered when either of the following conditions is met:

  • The method invocation count exceeds the threshold specified by -XX:TierXInvocationThreshold multiplied by a coefficient.
  • The method invocation count exceeds the threshold specified by -XX:TierXMinInvocationThreshold multiplied by a coefficient, and the sum of the method invocation count and the back-edge count exceeds the threshold specified by -XX:TierXCompileThreshold multiplied by a coefficient.

Tiered compilation trigger condition

i > TierXInvocationThreshold * s || (i > TierXMinInvocationThreshold * s && i + b > TierXCompileThreshold * s)

where i is the method invocation count and b is the back-edge count.

Just-in-time compilation is triggered when either condition is satisfied, where s is a coefficient the JVM adjusts dynamically based on the current number of methods waiting to be compiled and the number of compilation threads.
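The trigger condition can be expressed as a simple predicate; a sketch with made-up threshold values (not the JVM's actual defaults):

```java
public class TierThreshold {
    // i: invocation count, b: back-edge count, s: dynamic scaling coefficient.
    // The three thresholds stand in for -XX:TierXInvocationThreshold,
    // -XX:TierXMinInvocationThreshold and -XX:TierXCompileThreshold.
    static boolean shouldCompile(long i, long b, double s,
                                 long invocationThreshold,
                                 long minInvocationThreshold,
                                 long compileThreshold) {
        return i > invocationThreshold * s
                || (i > minInvocationThreshold * s
                    && i + b > compileThreshold * s);
    }

    public static void main(String[] args) {
        // With made-up thresholds: 201 invocations alone are enough...
        System.out.println(shouldCompile(201, 0, 1.0, 200, 100, 2000));
        // ...and so is a loop-heavy method with fewer calls but many back edges.
        System.out.println(shouldCompile(150, 2000, 1.0, 200, 100, 2000));
    }
}
```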

III. Compilation optimization

The just-in-time compiler performs a series of optimizations on the running service: analysis during bytecode parsing, local optimizations on some intermediate form of the code during compilation, global optimizations based on a program dependence graph, and finally machine code generation.

1. Intermediate Representation

In compiler theory, a compiler is usually divided into a front end and a back end: the front end produces an Intermediate Representation (IR) through lexical, syntax and semantic analysis, and the back end optimizes the IR and generates the target code.

Java bytecode is itself a kind of IR, but its structure is complex, and bytecode-form IR is ill-suited to global analysis and optimization. Modern compilers generally adopt graph-structured IRs, of which Static Single Assignment (SSA) IR is currently one of the most common. Its defining property is that each variable is assigned exactly once, and may only be used after it has been assigned. An example:

SSA IR

{
  a = 1;
  a = 2;
  b = a;
}

In the code above, a human can easily see that the assignment a = 1 is redundant, but the compiler cannot see this directly. A traditional compiler needs data-flow analysis, scanning from back to front to determine which assignments are overwritten. With SSA IR, however, the compiler can recognize redundant assignments easily.

The pseudo-code of SSA IR form of the above code can be expressed as:

SSA IR

{
  a_1 = 1;
  a_2 = 2;
  b_1 = a_2;
}

Since each variable in SSA IR is assigned only once, the variable a in the code becomes a_1 and a_2 in SSA IR. The compiler can then simply scan these variables and see that a_1 is never used, so its assignment is redundant.
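Because every SSA name is assigned exactly once, finding redundant assignments reduces to finding names that are never read. A toy sketch of that scan (the instruction encoding and the liveOut set are invented for illustration; real compilers work on IR graphs, not maps):

```java
import java.util.*;

public class DeadStoreScan {
    // defs maps each SSA name to the list of operand names its definition uses.
    // liveOut holds names that are still needed after this fragment.
    static List<String> deadStores(Map<String, List<String>> defs,
                                   Set<String> liveOut) {
        Set<String> used = new HashSet<>(liveOut);
        for (List<String> operands : defs.values()) {
            used.addAll(operands);       // every operand read counts as a use
        }
        List<String> dead = new ArrayList<>();
        for (String name : defs.keySet()) {
            if (!used.contains(name)) {
                dead.add(name);          // assigned but never read: a dead store
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        Map<String, List<String>> defs = new LinkedHashMap<>();
        defs.put("a_1", List.of());        // a_1 = 1
        defs.put("a_2", List.of());        // a_2 = 2
        defs.put("b_1", List.of("a_2"));   // b_1 = a_2
        // b_1 is the fragment's result (live-out); a_1 is never read anywhere.
        System.out.println(deadStores(defs, Set.of("b_1"))); // [a_1]
    }
}
```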

In addition, SSA IR can be useful for other optimizations, such as the following Dead Code Elimination example:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 2;
  int b = 0;
  if (2 > 1) {
    a = 1;
  } else {
    b = 2;
  }
  add(a, b);
}

The corresponding SSA IR pseudocode is:

DeadCodeElimination

a_1 = 2;
b_1 = 0;
if true:
  a_2 = 1;
else:
  b_2 = 2;
add(a_2, b_1);

By evaluating the constant condition, the compiler can see that the b_2 assignment will never be used and the else branch will never execute. After dead-code elimination, the code becomes:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 1;
  int b = 0;
  add(a, b);
}

Each compiler optimization can be viewed as a graph transformation that takes an IR graph and outputs a transformed IR graph; the whole optimization process is a sequence of such graph-node optimizations.

Intermediate representations in C1

As mentioned earlier, the C1 compiler internally uses the high-level intermediate representation HIR and the low-level intermediate representation LIR for various optimizations, both of which are SSA forms.

HIR is a control-flow-graph structure composed of many basic blocks, each containing many SSA instructions. The structure of a basic block is shown below:

Here, predecessors holds the block's predecessor basic blocks (since there may be several of them, it is a BlockList, an expandable array of BlockBegin nodes); likewise, successors holds the several successor BlockEnd nodes. The remaining parts are the block body, which stores the instructions the program executes, and a next pointer to the next block to execute.

The construction from bytecode to HIR is ultimately done by GraphBuilder. GraphBuilder first traverses the bytecode to find all basic blocks and stores them as a linked-list structure; at this point each block contains only a BlockBegin, with no concrete instructions. In a second step, GraphBuilder uses a ValueStack to simulate the operand stack and local variable table, simulating the execution of the bytecode, constructing the corresponding HIR and filling in the previously empty basic blocks. The process of constructing HIR from a simple block of bytecode is illustrated below:

Bytecode constructs HIR

Bytecode        Local variables   Operand stack   HIR
5: iload_1      [i1, i2]          [i1]
6: iload_2      [i1, i2]          [i1, i2]
7: imul         [i1, i2]          [i3]            i3: i1 * i2
8: istore_3     [i1, i2, i3]      []

As the simulation runs: executing iload_1 pushes variable i1 onto the operand stack; executing iload_2 pushes i2; executing the multiply instruction imul pops the top two values, constructs the HIR instruction i3: i1 * i2, and pushes the result i3 back onto the stack.
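The stack-simulation step can be sketched in a few lines of Java; the HIR node naming (i1, i2, i3) follows the example above, while the class and its string-based encoding are invented purely for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HirBuilderSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private final StringBuilder hir = new StringBuilder();
    private int nextId = 3; // i1 and i2 already name the two locals in the example

    void iload(int slot) {   // simulate iload: push the local onto the operand stack
        stack.push("i" + slot);
    }

    void imul() {            // simulate imul: pop two values, emit an HIR node, push result
        String right = stack.pop();
        String left = stack.pop();
        String result = "i" + nextId++;
        hir.append(result).append(": ").append(left).append(" * ").append(right);
        stack.push(result);
    }

    String istore() {        // simulate istore: pop the result; return the HIR built so far
        stack.pop();
        return hir.toString();
    }

    public static void main(String[] args) {
        HirBuilderSketch b = new HirBuilderSketch();
        b.iload(1);  // 5: iload_1
        b.iload(2);  // 6: iload_2
        b.imul();    // 7: imul -> constructs i3: i1 * i2
        System.out.println(b.istore()); // i3: i1 * i2
    }
}
```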

Most of C1's optimizations are performed on HIR. Once they are complete, C1 converts the HIR into LIR, which, like HIR, is an IR used internally by the compiler; since optimization eliminates some intermediate nodes along the way, LIR is simpler in form.

Sea-of-Nodes IR

The Ideal Graph in the C2 compiler is an intermediate representation called Sea-of-Nodes, which is also in SSA form. Its biggest characteristic is that it removes the concept of variables and operates directly on values. The Ideal Graph Visualizer (IGV) can display concrete IR graphs, which makes them easier to understand. Take this code as an example:

example

public static int foo(int count) {
  int sum = 0;
  for (int i = 0; i < count; i++) {
    sum += i;
  }
  return sum;
}

The corresponding IR diagram is as follows:

Several consecutive nodes in the figure are grouped into the same basic block, labeled B0, B1 and so on. The Start node (0) in basic block B0 is the method entry, and the Return node (21) in B3 is the method exit. Red bold edges are control flow, blue edges are data flow, and edges in other colors are special control or data flow. Fixed nodes are attached to control-flow edges, while the rest are floating nodes (nodes that can be placed anywhere as long as their data dependencies are satisfied; the process of placing floating nodes is called Schedule).

This graph has a lightweight edge structure: an edge is represented simply by a pointer to another node. A node is an instance of a Node subclass and holds an array of pointers specifying its input edges. The advantage of this representation is that changing a node's input edge is very fast: to redirect an input edge, you simply store a pointer to the new node into the node's pointer array.

Relying on this graph structure, the JVM can Schedule the floating nodes using the collected runtime information to achieve the best compilation result.

Phi And Region Nodes

The Ideal Graph is an SSA IR. Since it has no concept of variables, a problem arises: different execution paths may assign different values to the same variable. For example, the following code returns 5 or 6 in the two branches of the if statement, so the value read afterwards depends on which path was executed.

example

int test(int x) {
  int a = 0;
  if (x == 1) {
    a = 5;
  } else {
    a = 6;
  }
  return a;
}

To solve this problem, the concept of Phi nodes is introduced: a Phi node selects among different values according to the execution path taken. The above code can thus be represented by the following graph:

A Phi node stores the values produced on the different paths, while a Region node, according to the conditions of the different paths, determines which of the Phi node's values the current execution path should use. The SSA pseudocode with a Phi node is:

Phi Nodes

int test(int x) {
  a_1 = 0;
  if(x == 1){
    a_2 = 5;
  }else {
    a_3 = 6;
  }
  a_4 = Phi(a_2,a_3);
  return a_4;
}

Global Value Numbering

Global Value Numbering (GVN) is one optimization technique that Sea-of-Nodes makes very easy.

GVN assigns a unique number to each computed value and then iterates over the instructions looking for optimization opportunities; it is a technique for finding and eliminating equivalent computations. If the same multiplication of the same operands appears several times in a program, the just-in-time compiler can merge the multiplications into one, reducing the size of the generated machine code; if they occur on the same execution path, GVN also saves the redundant multiplication operations themselves. In Sea-of-Nodes, because only the concept of values exists, the GVN algorithm becomes very simple: the just-in-time compiler only needs to check whether a floating node has the same operation number and the same input IR nodes as an existing floating node to merge the two floating nodes into one. Take this code as an example:

GVN

a = 1;
b = 2;
c = a + b;
d = a + b;
e = d;

GVN numbers values using a hash: a = 1 gets number 1, b = 2 gets number 2, and c = a + b gets number 3; these numbers are stored in a hash table. When d = a + b is processed, the compiler finds that a + b already has a number in the hash table and simply fetches the computed value instead of recomputing it; the final e = d can likewise be fetched from the hash table and reused.

GVN can be understood as Common Subexpression Elimination (CSE) performed on the IR graph; the difference is that GVN compares values directly, whereas CSE relies on the lexical analyzer to decide whether two expressions are the same.
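A hedged sketch of the value-numbering table: keys combine the operation with the operands' value numbers, so structurally identical computations map to the same number (the string-key encoding is a simplification invented for this demo):

```java
import java.util.HashMap;
import java.util.Map;

public class GvnSketch {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextNumber = 0;

    // Returns the value number for (op, operands); reuses an existing number
    // when the same computation has been seen before.
    int numberOf(String op, int... operandNumbers) {
        StringBuilder key = new StringBuilder(op);
        for (int n : operandNumbers) {
            key.append(',').append(n);
        }
        return table.computeIfAbsent(key.toString(), k -> nextNumber++);
    }

    public static void main(String[] args) {
        GvnSketch gvn = new GvnSketch();
        int a = gvn.numberOf("const:1");  // a = 1
        int b = gvn.numberOf("const:2");  // b = 2
        int c = gvn.numberOf("+", a, b);  // c = a + b
        int d = gvn.numberOf("+", a, b);  // d = a + b -- same key, number reused
        System.out.println(c == d);       // the second add is eliminated
    }
}
```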

2. Method inlining

Method inlining is the optimization whereby, when a method call is encountered during compilation, the body of the target method is brought into the compilation scope and substituted for the original call. Most JIT optimizations build on inlining, which makes it one of the most important parts of the just-in-time compiler.

A large number of getter/setter methods exist in Java services. Without method inlining, every getter/setter call requires the program to save the current execution position, create and push a stack frame for the getter/setter, access the field, pop the stack frame, and finally resume the caller. With the call inlined, only the field access remains. In the C2 compiler, method inlining happens while the bytecode is being parsed: when the compiler encounters a method-call bytecode, it decides, based on some threshold parameters, whether the call should be inlined; if so, it proceeds to parse the target method's bytecode. Take this example (from the web):

Method inline process

public static boolean flag = true;
public static int value0 = 0;
public static int value1 = 1;

public static int foo(int value) {
    int result = bar(flag);
    if (result != 0) {
        return result;
    } else {
        return value;
    }
}

public static int bar(boolean flag) {
    return flag ? value0 : value1;
}

IR diagram of bar method:

Inlined IR diagram:

Inline does more than just copy the IR graph nodes of the called method into the IR graph of the caller method.

The parameters of the called method are replaced by the arguments passed at the call site. In the example above, the 1 P(0) node in the bar method is replaced by the 3 LoadField node in the foo method.

In the caller's IR graph, data dependencies on the method-call node become dependencies on the called method's return value. If there are multiple return nodes, a Phi node is generated to aggregate the return values and serve as the replacement for the original call node. In the figure, the 8 == node and the 12 Return node, which were connected to edges of the original 5 Invoke node, are redirected to the newly generated 24 Phi node.

If the called method can throw exceptions of some type, and the caller happens to have a handler for that exception type that covers the call site, the just-in-time compiler needs to connect the called method's exception-throwing paths to the caller's exception handler.

Method inline conditions

Most of the compiler’s optimizations are based on method inlining. So in general, the more methods that are inlined, the more efficient the generated code will be. But for just-in-time compilers, the more methods that are inlined, the longer the compile time, and the later the program reaches peak performance.

The inlining depth can be adjusted with the virtual machine parameter -XX:MaxInlineLevel, and direct recursive calls are inlined up to one level (adjustable with -XX:MaxRecursiveInlineLevel). Some common inlining-related parameters are shown in the following table:

Virtual function inlining

Inlining is the JIT's primary means of improving performance, but virtual functions make inlining difficult, because at the inlining stage it is not known which method will actually be called. Suppose, for example, we have a data-processing interface whose method has three implementations: add, sub and multi. The JVM stores all of a class's virtual functions in a Virtual Method Table (VMT) held in the class object, and each instance holds a pointer to its class's VMT. At run time, the program first loads the instance, finds the VMT through it, and looks up the address of the target method in the VMT. A virtual call therefore performs worse than a direct call that jumps straight to the method's address. Unfortunately, all non-private member function calls in Java are virtual calls.
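The add/sub/multi interface described above might look like the following (all names are illustrative); each call through the interface dispatches via the VMT at run time:

```java
public class VirtualDispatch {
    interface Op { int apply(int a, int b); }

    static class Add   implements Op { public int apply(int a, int b) { return a + b; } }
    static class Sub   implements Op { public int apply(int a, int b) { return a - b; } }
    static class Multi implements Op { public int apply(int a, int b) { return a * b; } }

    // The compiler cannot know statically which implementation `op` refers to,
    // so without profiling information it must emit a genuine virtual call here,
    // which also blocks inlining of apply().
    static int compute(Op op, int a, int b) {
        return op.apply(a, b);
    }

    public static void main(String[] args) {
        System.out.println(compute(new Add(), 6, 7));    // dispatches to Add
        System.out.println(compute(new Multi(), 6, 7));  // dispatches to Multi
    }
}
```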

The C2 compiler is smart enough to detect this and optimize for virtual calls. Take the following code example:

virtual call

public class SimpleInliningTest {
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
        }
        Thread.sleep(1000);
    }

    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }

    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }

    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After JIT compiler optimization, disassemble to get the following assembly code:

0x0000000113369d37: callq 0x00000001132950a0
  ; OopMap{off=476}
  ; *invokevirtual methodCall          // a virtual call
  ; - SimpleInliningTest::invokeMethod@1 (line 18)
  ; {optimized virtual_call}           // the virtual call has been optimized

The JIT has optimized the virtual call to the methodCall method, as shown by optimized virtual_call, and an optimized call can then be inlined. However, C2's capability here is limited: for a virtual call with multiple implementations, C2 can do nothing.

For example, in the following code, we add an implementation:

A multi-implementation virtual call

public class SimpleInliningTest {
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        VirtualInvoke2 obj2 = new VirtualInvoke2();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
            invokeMethod(obj2);
        }
        Thread.sleep(1000);
    }

    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }

    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }

    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }

    private static class VirtualInvoke2 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After disassembling, we get the following assembly code:

Assembly code

0x000000011f5f0a37: callq 0x000000011f4fd2e0
  ; OopMap{off=28}
  ; *invokevirtual methodCall          // a virtual call
  ; - SimpleInliningTest::invokeMethod@1 (line 20)
  ; {virtual_call}                     // the virtual call was not optimized

You can see that multiple implementations of the virtual call are not optimized and are still virtual_call.

The Graal compiler collects execution information for this case. For example, if the interface method call turns out to be add half the time and sub half the time, the JVM inlines the add function whenever it encounters add and the sub function whenever it encounters sub, improving the efficiency of both paths. If an unusual case appears later, the JVM deoptimizes, marks that location, and switches back to interpreted execution.

3. Escape analysis

Escape analysis is “a static analysis that determines the dynamic range of a pointer by analyzing where a pointer can be accessed in a program.” The just-in-time compiler of the Java virtual machine performs escape analysis on the newly created object to determine whether it has escaped from a thread or method. The just-in-time compiler determines whether an object has escaped in two ways:

  1. Whether the object is stored into the heap (a static field, or an instance field of a heap object). Once an object is stored in the heap, other threads can obtain a reference to it, and the compiler can no longer track all the code locations that use it.
  2. Whether the object is passed into unknown code, that is, code the just-in-time compiler could not inline. Since the compiler cannot be sure such a method call will not store the receiver or the arguments into the heap, the receiver and arguments of the call are considered to have escaped.

Escape analysis is usually performed on an inline basis, and the just-in-time compiler can use the results of escape analysis to perform optimizations such as lock elimination, on-stack allocation, and scalar substitution. The following code is an example of an object not escaping:

public class Example {
    public static void main(String[] args) {
        example();
    }

    public static void example() {
        Foo foo = new Foo();
        Bar bar = new Bar();
        bar.setFoo(foo);
    }
}

class Foo {}

class Bar {
    private Foo foo;

    public void setFoo(Foo foo) {
        this.foo = foo;
    }
}

In this example, two objects, foo and bar, are created, and one is passed as an argument to a method of the other. The method setFoo() stores a reference to the received Foo object; if the Bar object were on the heap, that reference would escape. But here the compiler can determine via escape analysis that the Bar object itself does not escape the call to example(), which means the reference to Foo cannot escape either. The compiler can therefore safely allocate both objects on the stack.
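One rough way to observe this (a sketch, not a rigorous benchmark) is to allocate a non-escaping object in a hot loop and compare wall-clock times with the real HotSpot flags -XX:+DoEscapeAnalysis (the default) and -XX:-DoEscapeAnalysis; the class below is invented for the demo:

```java
public class EscapeDemo {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // `p` never leaves this method, so escape analysis allows the JIT to
    // replace the allocation with plain scalars (see scalar replacement below).
    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);
            total += p.x + p.y;
        }
        return total;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = sum(1_000_000);
        long elapsed = System.nanoTime() - start;
        System.out.println(result + " in " + elapsed + " ns");
    }
}
```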

Lock elimination

Lock elimination builds on escape analysis; you will encounter it again when studying Java concurrent programming.

If the just-in-time compiler can prove that a lock object does not escape, then locking and unlocking that object is pointless, because no other thread can ever acquire the lock object. In that case, the just-in-time compiler eliminates the lock and unlock operations on the non-escaping lock object. In fact, the compiler only needs to prove that the lock object does not escape the thread to perform lock elimination; due to the limitations of just-in-time compilation in the Java virtual machine, this condition is strengthened to proving that the lock object does not escape the method currently being compiled. In practice, however, lock elimination based on escape analysis is rare.
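A common textbook illustration (not from the original article; the class and method names are hypothetical) is synchronizing on a freshly created local object. Since the lock object never escapes the method, no other thread can contend on it, and the JIT may elide the lock and unlock operations entirely:

```java
public class LockElisionExample {
    // Each call synchronizes on a brand-new local object. The object never
    // escapes sum(), so the synchronization has no observable effect and is
    // a candidate for lock elimination after escape analysis.
    public static int sum(int[] values) {
        Object lock = new Object();
        int total = 0;
        synchronized (lock) { // provably uncontended: may be elided
            for (int v : values) {
                total += v;
            }
        }
        return total;
    }
}
```

The program's result is unchanged either way; lock elimination only removes the needless monitor enter/exit overhead.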

On-stack allocation

We all know that Java objects are allocated on the heap, and the heap is visible to all threads. The JVM therefore has to manage the allocated heap memory and reclaim the memory of objects that are no longer referenced. If escape analysis proves that a newly created object does not escape, the JVM could simply allocate it on the stack and reclaim its memory automatically by popping the current method’s stack frame when the method containing the new statement exits. That way, the garbage collector never has to deal with the object.

The Hotspot virtual machine, however, does not actually allocate on the stack; instead it uses scalar replacement. A scalar is a variable that can hold only a single value, such as a primitive type in Java. An aggregate, by contrast, can hold multiple values at once; the classic example is a Java object. The compiler reduces heap allocation by splitting an unescaped aggregate into multiple scalars within the method. Here is an example of scalar replacement:

Scalar replacement

public class Example {
    @AllArgsConstructor
    static class Cat {
        int age;
        int weight;
    }

    public static void example() {
        Cat cat = new Cat(1, 10);
        addAgeAndWeight(cat.age, cat.weight);
    }
}

After escape analysis, the cat object does not escape the call to example(), so the aggregate cat can be decomposed into the two scalars age and weight. The pseudocode after scalar replacement:

public class Example {
    @AllArgsConstructor
    static class Cat {
        int age;
        int weight;
    }

    public static void example() {
        int age = 1;
        int weight = 10;
        addAgeAndWeight(age, weight);
    }
}

Partial escape analysis

Partial escape analysis is another of Graal’s applications of profile-based prediction. Normally, if an object is found to escape a method or thread, the JVM does not optimize it; the Graal compiler, however, still analyzes the program’s execution paths, using escape analysis to determine on which paths the object escapes and on which it does not. Based on this information, it then applies lock elimination and stack-allocation optimizations on the paths where the object does not escape.
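A minimal sketch of the situation partial escape analysis targets (hypothetical names; the branch condition and sink field are illustrative, not from the original article). The object escapes on one branch only, so a partial-escape-aware compiler such as Graal can sink the allocation into that branch and scalar-replace the object on the common path:

```java
public class PartialEscapeExample {
    static Object sink; // escape target reachable from other threads

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The Point escapes only when store == true. On the store == false path
    // the compiler can avoid the heap allocation entirely and work with the
    // scalars x and y directly.
    public static int distanceSquared(int x, int y, boolean store) {
        Point p = new Point(x, y);
        if (store) {
            sink = p; // escapes on this path only
        }
        return p.x * p.x + p.y * p.y;
    }
}
```

A whole-method escape analysis would give up here, since the object escapes on some path; partial escape analysis still wins on the paths where it does not.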

4. Loop Transformations

As mentioned in the article introducing the C2 compiler, the C2 compiler performs many global optimizations after building the Ideal Graph, including loop transformations. The two most important transformations are loop unrolling and loop separation.

Loop unrolling

Loop unrolling is a loop transformation technique that tries to improve a program’s execution speed at the cost of its binary code size.

Loop unrolling reduces computational overhead by reducing or eliminating the instructions that control the loop, such as those that advance the pointer to the next array index or instruction. If the compiler can calculate these indexes ahead of time and build them into the machine-code instructions, the program does not have to perform the calculation at run time. In other words, some loops can be rewritten as repetitive, independent statements. Take this loop:

Loop unrolling

public void loopRolling() {
    for (int i = 0; i < 200; i++) {
        delete(i);
    }
}

The code above loops 200 times; after loop unrolling we get the following code:

Loop unrolling

public void loopRolling() {
    for (int i = 0; i < 200; i += 5) {
        delete(i);
        delete(i + 1);
        delete(i + 2);
        delete(i + 3);
        delete(i + 4);
    }
}

This unrolling reduces the number of loop iterations, and the computations within each iteration can take advantage of the CPU’s pipeline to improve efficiency. Of course, this is just an example; the JVM evaluates the benefits of unrolling before deciding whether to apply it.

Loop separation

Loop separation is another loop transformation technique. It separates one or more particular iterations of a loop and executes them outside the loop. Take the following code:

Loop separation

int a = 10;
for (int i = 0; i < 10; i++) {
    b[i] = x[i] + x[a];
    a = i;
}

You can see that a equals i - 1 in every iteration except the first, where a equals 10. So the special case can be separated out into the following code:

Loop separation

b[0] = x[0] + x[10];
for (int i = 1; i < 10; i++) {
    b[i] = x[i] + x[i - 1];
}

This equivalent transformation eliminates the need for the a variable inside the loop, reducing overhead.

5. Peephole optimization and register allocation

As mentioned above, peephole optimization is the last optimization step before the program is converted into machine code. Peephole optimization examines adjacent instructions in the intermediate code (or object code) generated by the compiler and replaces certain instruction combinations with more efficient ones; common examples include strength reduction and constant folding. The example below shows strength reduction:

Strength reduction

y1 = x1 * 3;
After strength reduction: y1 = (x1 << 1) + x1;

The compiler replaces the multiplication with a shift and an addition, using cheaper, more efficient instructions.
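The transformation above can be checked by hand. The following sketch (illustrative method names, not from the original article) writes out both forms; the compiler performs this rewrite at the instruction level, but the arithmetic equivalence is the same:

```java
public class StrengthReductionExample {
    // Source form: a multiplication by the constant 3.
    public static int timesThreeMul(int x) {
        return x * 3;
    }

    // Hand-written equivalent of the strength-reduced form:
    // x * 3 == (x * 2) + x == (x << 1) + x, a shift plus an add,
    // which is cheaper than a multiply on many CPUs.
    public static int timesThreeShift(int x) {
        return (x << 1) + x;
    }
}
```

Both methods return identical results for every int input; only the instruction cost differs.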

Register allocation is also a compilation optimization, commonly used in the C2 compiler. By keeping frequently used variables in registers, which the CPU can access far faster than memory, it improves the program’s running speed.

Register allocation and peephole optimization are the last steps of program optimization. After register allocation and peephole optimization, the program is converted into machine code and stored in the codeCache.

Four, the practice

Just-in-time compilers are complicated, and there is little real-world tuning experience available online. Here are some of the adjustments our team has made.

1. Key compilation-related parameters

  • -XX:+TieredCompilation: enables tiered compilation; on by default since JDK 8
  • -XX:CICompilerCount=N: the number of compiler threads; once set, the JVM divides the threads between compilers automatically, C1:C2 = 1:2
  • -XX:TierXBackEdgeThreshold: the OSR (on-stack replacement) threshold for tier X
  • -XX:TierXMinInvocationThreshold: the invocation-count threshold for each tier when tiered compilation is enabled
  • -XX:TierXCompileThreshold: the compile threshold when tiered compilation is enabled
  • -XX:ReservedCodeCacheSize: the maximum size of the codeCache
  • -XX:InitialCodeCacheSize: the initial size of the codeCache

-XX:TierXMinInvocationThreshold is one of the parameters controlling when compilation is triggered under tiered compilation. Tier-X just-in-time compilation is triggered when a method’s invocation count exceeds the threshold specified by -XX:TierXInvocationThreshold multiplied by a coefficient, or when the invocation count exceeds the threshold specified by -XX:TierXMinInvocationThreshold multiplied by the coefficient and the sum of the invocation count and the loop back-edge count exceeds the threshold specified by -XX:TierXCompileThreshold multiplied by the coefficient. When tiered compilation is enabled, the thresholds are multiplied by this coefficient, which is determined from the number of methods waiting to be compiled and the number of compiler threads. Lowering the thresholds increases the number of compiled methods, so some frequently used methods that would otherwise not be compiled can be optimized, improving performance.
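The trigger condition described above can be written out as a small sketch. This is illustrative only: the real logic lives in HotSpot’s C++ compilation policy, and the parameter names below merely stand in for the corresponding -XX:TierX... flags:

```java
public class TierThresholdSketch {
    // Sketch of the tier-X compile trigger: compile if the invocation count
    // alone clears the scaled invocation threshold, OR if it clears the
    // scaled minimum threshold AND invocations + back edges together clear
    // the scaled compile threshold. "scale" models the dynamic coefficient
    // the JVM derives from queued methods and compiler-thread count.
    public static boolean shouldCompile(long invocations, long backEdges,
                                        long invocationThreshold,
                                        long minInvocationThreshold,
                                        long compileThreshold,
                                        double scale) {
        return invocations > invocationThreshold * scale
            || (invocations > minInvocationThreshold * scale
                && invocations + backEdges > compileThreshold * scale);
    }
}
```

For example, with thresholds of 2000 / 1000 / 5000 and scale 1.0, a method called 1500 times with 4000 loop back edges qualifies via the second clause even though it never cleared the plain invocation threshold.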

Because compilation is complex and the JVM adjusts thresholds dynamically to maintain performance, manually tuning compile-related parameters is not recommended, except in specific cases: for example, if the codeCache fills up and compilation stops, you can increase the codeCache size appropriately; or if some very hot methods are not being inlined and drag down performance, you can adjust the inlining depth or the inlinable method size.

2. Use JITwatch to analyze compilation logs

Adding the parameters -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintCodeCache -XX:+PrintCodeCacheOnCompilation -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=LogPath outputs compilation, inlining, and codeCache information to a file. However, the printed compilation logs are voluminous and complex, and it is hard to extract information from them directly, so you can use the JITwatch tool to analyze them. On the JITwatch home page, choose Open Log, select the log file, and click Start to begin the analysis.

As shown in the figure above, area 1 lists the project’s Java classes, including imported third-party dependencies. In area 2, Timeline graphs the JIT compilation timeline, Histo presents information as histograms, TopList ranks objects and data generated during compilation, Cache shows free codeCache space, NMethod lists native methods, and Threads lists the JIT compiler threads. Area 3 is JITwatch’s display of the log analysis results, and its Suggestions tab offers code-optimization advice, as shown in the figure below:

We can see that when the read method of ZipInputStream is called, the method is not marked as hot and is “too large” to be inlined. The inline directive of -XX:CompileCommand can force methods to be inlined, but use it with caution: unless you are sure that inlining the method will improve performance, forcing it only puts extra pressure on the compiler threads and the codeCache.

The -allocs and -locks markers in area 3 show the optimizations the JVM applied to the code after escape analysis, including on-stack allocation and lock elimination.

3. Use the Graal compiler

Because the JVM dynamically adjusts compilation thresholds based on the current number of methods awaiting compilation and the number of compiler threads, there is not much room for tuning this part in a real service, and the JVM’s defaults are more than adequate.

To improve performance, we tried the latest Graal compiler in our service. Simply adding -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler starts Graal in place of the C2 compiler and lets it respond to C2 compile requests. Note, however, that the Graal compiler is not compatible with ZGC; it can only be used with G1.

As mentioned earlier, Graal is a just-in-time compiler written in Java, bundled with the JDK since Java 9 as an experimental just-in-time compiler. The Graal compiler also comes with GraalVM, a high-performance execution environment supporting multiple programming languages. It can run on a traditional OpenJDK, be compiled ahead of time (AOT) into an executable, and even be embedded in a database.

As mentioned several times before, Graal’s optimizations are based on assumptions. When an assumption turns out to be wrong, the Java virtual machine uses a mechanism called deoptimization to switch from the machine code generated by the just-in-time compiler back to interpreted execution; if necessary it even discards the machine code, recollects the program profile, and compiles again later.

These aggressive measures give Graal a peak performance better than C2’s, and Graal performs even better for languages such as Scala and Ruby. Twitter has used Graal extensively in its services to improve performance, and the enterprise edition of GraalVM brought it a 22% performance improvement.

Performance with the Graal compiler

In our online service, after enabling Graal compilation, TP9999 dropped from 60ms to 50ms, a decrease of 10ms (16.7%).

Peak performance is higher during operation. You can see that the Graal compiler provides some performance improvement for this service.

Problems with the Graal compiler

The Graal compiler optimizes more aggressively, so it does more compilation at startup, and the Graal compiler itself needs to be just-in-time compiled, so the service performs poorly right after startup.

Solutions we considered: JDK 9 introduced the jaotc tool, and GraalVM’s Native Image can statically compile code to greatly improve service startup speed. However, GraalVM uses its own garbage collector, a very primitive copying collector whose performance does not compare well with excellent newer collectors such as G1 and ZGC. In addition, GraalVM’s support for some Java features is configuration-based: reflection, for example, requires listing every class that needs reflection in a JSON file, which would be a lot of work for services that use reflection heavily. We are still researching these options.

Five, the summary

This article mainly introduced the principles of JIT just-in-time compilation, some practical tuning experience at Meituan, and the results of using the most advanced just-in-time compiler. JIT is a mature technique for improving the performance of interpreted languages and is used in many languages. For Java services the JVM itself already does more than enough, but we should continue to deepen our understanding of JIT optimization principles and the latest compilation techniques, so as to compensate for JIT’s weaknesses, improve the performance of Java services, and strive for excellence.

Vi. References

  • In-depth Understanding of the Java Virtual Machine
  • Proceedings of the Java™ Virtual Machine Research and Technology Symposium Monterey, California, USA April 23 — 24, 2001
  • Visualization of Program Dependence Graphs by Thomas Wurthinger
  • In-depth Dismantling of Java Virtual Machine by Zheng Yudi
  • The JIT Profile artifact JITWatch

Author’s brief introduction

Heng Zhi, Hao Tian, Xue Chao, all from Meituan AI platform/search and NLP department.

Recruitment information

Meituan search and NLP department is looking for search, dialogue, NLP algorithm engineer in Beijing/Shanghai. If you are interested, please send your resume to [email protected] (please specify the email title: Search and NLP Department).
