As we mentioned in the last post, the JIT compiler is the key to LuaJIT's performance. In this article, we will walk through how the JIT works.

In LuaJIT, just-in-time compilation translates Lua bytecode into machine instructions. In other words, once the JIT has generated machine instructions, they are executed directly, with no need to interpret the Lua bytecode anymore. That is, the interpreted mode and the JIT mode share the same input: Lua bytecode. With the same bytecode as input, there is often a significant performance difference between the two modes (an order of magnitude is quite common), which makes the JIT well worth the effort.

JIT can be broken down into the following steps:

  1. Counting: find out which code is hot
  2. Recording: record the hot code path and generate SSA IR code
  3. Generating: optimize the SSA IR code and generate machine instructions
  4. Executing: run the newly generated machine instructions
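The four steps above can be sketched as a hypothetical driver loop. The threshold and the IR/codegen stand-ins below are made up for illustration; this is not LuaJIT's actual control flow.

```python
# Hypothetical four-phase JIT driver: count -> record -> generate -> execute.

THRESHOLD = 56  # assumed hotness threshold, not LuaJIT's real value

def run(bytecode, times):
    counter, compiled = 0, None
    modes = []
    for _ in range(times):
        if compiled is not None:
            modes.append("jit")      # 4. execute the compiled code
            compiled(bytecode)
            continue
        modes.append("interp")
        counter += 1                 # 1. count interpreted executions
        if counter >= THRESHOLD:
            ir = [("IR", bc) for bc in bytecode]           # 2. record IR
            compiled = lambda bc, n=len(ir): f"{n} ops"    # 3. generate code
    return modes

modes = run(["ADD", "RET"], 60)  # first 56 runs interpret, the rest run "JIT" code
```

The key property the sketch shows: compilation is triggered by execution itself, so cold code never pays the compilation cost.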

What the JIT compiles

Before going any further, let’s introduce a basic concept. LuaJIT’s JIT is trace-based: a trace is a stream of executed bytecode that can span functions. In contrast, Java’s JIT is method-based; although it does inline functions, inlining is more restrictive (only very small functions are inlined by the JIT).

In my opinion, a tracing JIT can theoretically cover more ground than a method-based JIT, and in some cases it can be more powerful. However, its engineering implementation is much more complex, so the actual industrial results are hard to compare (many other factors affect the effectiveness of a JIT, such as the quality of the optimizer).

Take this small example:

local debug = false
local function bar()
  return 1
end

local function foo()
  if debug then
    print("some debug log works")
  end
  
  return bar() + 1
end

When foo() is compiled by the JIT, there are two obvious benefits:

  1. print("some debug log works") is never actually executed, so it does not appear in the recorded bytecode stream. In other words, no machine code is generated for it at all, so the generated machine code can be smaller.
  2. bar() is compiled inline, so there is no function-call overhead (yes, at the machine-instruction level, the overhead of a function call is worth taking into account)
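Both effects can be illustrated with a toy recorder. The "IR" strings below are invented for illustration; real LuaJIT IR looks different, but the shape of the result is the same: the untaken debug branch collapses into a guard, and bar() leaves no CALL behind because it is inlined into the trace.

```python
# Toy illustration of what the recorder keeps from foo() above.

trace = []

def foo_recorded(debug):
    trace.append("GUARD debug == false")  # replaces the whole if-branch
    assert debug is False
    inlined_bar = 1                       # bar()'s body, inlined: no CALL op
    trace.append("KSHORT 1")
    trace.append("ADD +1")
    return inlined_bar + 1

result = foo_recorded(False)
# the print(...) line never appears in the trace, and neither does a CALL
```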

Counting

Next, let’s go through the phases of the JIT one by one. Counting is easy to understand: one of the defining features of a JIT is that it only compiles hot code (if everything were compiled up front, that would be AOT).

A JIT usually has two counting entry points:

  1. Function calls: when a function has been executed a certain number of times, the JIT is triggered to compile that function
  2. Loop iterations: when a loop body has been executed a certain number of times, the JIT is triggered to compile that loop body

That is, it identifies hot functions and hot loops.
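The two counters can be sketched as follows. The thresholds are assumed values for illustration; real LuaJIT uses small decrementing counters, and its hotloop/hotcall defaults differ from these numbers.

```python
# Sketch of hot-counting with made-up thresholds.

HOT_CALL = 56   # assumed: calls before a function is considered hot
HOT_LOOP = 56   # assumed: iterations before a loop is considered hot

call_count, loop_count = {}, {}

def on_call(func_id):
    call_count[func_id] = call_count.get(func_id, 0) + 1
    return "compile" if call_count[func_id] >= HOT_CALL else "interpret"

def on_loop_back(loop_id):
    loop_count[loop_id] = loop_count.get(loop_id, 0) + 1
    return "compile" if loop_count[loop_id] >= HOT_LOOP else "interpret"
```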

Since LuaJIT is trace-based, a trace can also exit, which gives a third counting entry point: trace exits. If a trace frequently exits through the same snapshot (snap), JIT compilation is started from that snap and a side trace is generated (we’ll talk about snaps later).

Recording

When a function/loop is hot enough, the JIT compiler is activated. The first step is recording. The core of recording is: generate IR code while interpreting.

The specific process is as follows:

  1. Add a hook to bytecode interpretation by modifying the dispatch table
  2. In the hook, generate the corresponding IR code for the bytecode currently being executed; the hook also decides whether to complete or abort the recording
  3. Continue interpreting the bytecode

Everything from the start of recording to the end of recording forms the basic unit of a trace: the stream of bytecode that was interpreted, which is exactly the execution stream the trace needs to accelerate.
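The "generate IR while interpreting" idea can be sketched as a minimal hook. The bytecode and IR names here are invented; the point is only the shape: the hook emits one IR instruction per bytecode, right before that bytecode is interpreted as usual.

```python
# Minimal sketch of a record hook installed via the dispatch table.

def interpret_and_record(bytecodes):
    ir, acc = [], 0
    for op, arg in bytecodes:
        ir.append((op.upper(), arg))  # hook: emit IR for this bytecode
        if op == "load":              # then interpret it as usual
            acc = arg
        elif op == "add":
            acc += arg
    return ir, acc

ir, acc = interpret_and_record([("load", 1), ("add", 2)])
```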

Since recording captures one real execution stream, the trace does not assume that every subsequent execution will take the same branches. Instead, guards are added to the IR, and snapshots are recorded at appropriate points; a snapshot contains some context information. If execution later exits through a snapshot, the context is restored from that snapshot.
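A rough sketch of the guard/snapshot mechanism, with all names and the snapshot layout invented for illustration: a guard verifies the assumption made at record time, and when it fails, interpreter context is restored from the snapshot and execution falls back to the interpreter.

```python
# Hypothetical guard/snapshot sketch.

class SnapshotExit(Exception):
    pass

def guard(cond, snap):
    if not cond:
        raise SnapshotExit(snap)

def run_trace(debug_flag):
    snap = {"pc": 6, "slots": {"debug": debug_flag}}  # context saved for the exit
    try:
        guard(debug_flag is False, snap)  # the trace assumed debug == false
        return 2                          # fast path: bar() + 1, fully inlined
    except SnapshotExit as e:
        snap = e.args[0]
        return f"resume interpreter at pc {snap['pc']}"

fast = run_trace(False)   # guard holds: stay on the trace
slow = run_trace(True)    # guard fails: restore context, interpret
```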

One more detail: not all bytecodes can be JIT-compiled (see LuaJIT’s NYI list for details). For NYI bytecodes, LuaJIT has the power of stitching. For example, FUNCC (a call into a C function) supports stitching, so the code before and after the FUNCC is recorded as two traces. The execution then becomes: run the machine code of trace 1 => interpret the FUNCC => run the machine code of trace 2. The two traces are glued together around the C call; that is what stitching looks like.

Generating

With the IR code and related information in hand, it can be optimized and machine code can be generated from it.

There are two steps here:

  1. LuaJIT’s IR code is in static single assignment form (SSA), a common intermediate representation in optimizers. Many classic optimization algorithms can be applied to it, such as dead-code elimination, loop-invariant hoisting, and so on.
  2. Generating machine instructions from the IR involves two main tasks: register allocation, and translating each IR operation into machine instructions, for example translating an IR ADD into a machine ADD instruction.
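As one concrete example of the SSA optimizations mentioned above, here is a toy dead-code elimination pass. The IR shape (dst, op, operands) is invented for illustration; walking backwards, an instruction is kept only if its result is live.

```python
# Toy dead-code elimination over SSA-style IR.

def eliminate_dead_code(ir, live_out):
    """ir: list of (dst, op, operands); live_out: values needed afterwards."""
    live, kept = set(live_out), []
    for dst, op, srcs in reversed(ir):
        if dst in live:
            kept.append((dst, op, srcs))
            live.discard(dst)
            live.update(s for s in srcs if isinstance(s, str))  # operands become live
    kept.reverse()
    return kept

ir = [
    ("t1", "ADD", ("a", 1)),
    ("t2", "MUL", ("a", 2)),  # t2 is never used: dead, will be removed
    ("t3", "ADD", ("t1", 1)),
]
optimized = eliminate_dead_code(ir, live_out={"t3"})
```

SSA makes this kind of pass simple because each value has exactly one definition, so "is this result ever used?" is a purely local question.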

For each guard, an if ... jump instruction is generated; the jump target is a stub instruction that completes the exit through the corresponding snapshot.

Here we can see why JIT-generated machine code can be more efficient:

  1. Based on the execution-flow assumptions made at recording time, it can generate CPU-branch-friendly instructions; in the ideal case the CPU effectively executes instructions sequentially
  2. Optimizations have been applied to the SSA IR code
  3. Registers are used more efficiently (more registers are available because there is no longer the burden of maintaining the interpreter’s own state)

Executing

After the machine instructions are generated, the corresponding bytecode is patched; for example, FUNCF is changed to JFUNCF traceno. The next time the interpreter executes the JFUNCF, it jumps to the machine instructions of the corresponding trace, completing the switch from interpreted mode to JIT mode. This is the main way execution enters JIT-compiled instructions.
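A sketch of that mode switch (the FUNCF/JFUNCF names follow LuaJIT, but the dispatch table and prototype layout here are made-up stand-ins): after compilation, the function-header bytecode is patched so the next call dispatches straight into the compiled trace.

```python
# Sketch: patching FUNCF to JFUNCF so calls enter the compiled trace.

traces = {7: lambda: "ran machine code of trace 7"}

proto = {"header": ("FUNCF",)}  # prototype still in interpreted mode

def finish_compile(proto, traceno):
    proto["header"] = ("JFUNCF", traceno)  # patch the header bytecode

def call(proto):
    op = proto["header"][0]
    if op == "FUNCF":
        return "interpret function body"
    traceno = proto["header"][1]           # JFUNCF: jump into JIT mode
    return traces[traceno]()

before = call(proto)
finish_compile(proto, 7)
after = call(proto)
```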

While a trace is executing, if a guard check fails, execution exits the trace through the corresponding snapshot.

In addition, exits from a trace are counted. If the number of exits through a snapshot reaches the hotside threshold, a side trace is generated starting from that snapshot. The next time execution exits through that snapshot, it jumps directly into the side trace.
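The side-trace trigger can be sketched like this. HOTSIDE is an assumed value, not LuaJIT's actual default: exits are counted per snapshot, and once one snapshot's count crosses the threshold, a side trace is compiled from it and later exits jump straight into it.

```python
# Hypothetical side-trace trigger: count exits per snapshot.

HOTSIDE = 10  # assumed threshold

exit_count, side_traces = {}, {}

def on_exit(snap_id):
    if snap_id in side_traces:
        return "jump to side trace"       # hot exit already compiled
    exit_count[snap_id] = exit_count.get(snap_id, 0) + 1
    if exit_count[snap_id] >= HOTSIDE:
        side_traces[snap_id] = f"machine code from snap {snap_id}"
        return "compile side trace"
    return "restore snapshot, interpret"  # cold exit: fall back

history = [on_exit("snap3") for _ in range(12)]
```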

As a result, for hot code with branches, the end effect is still full JIT coverage, just not all at once: it is reached step by step, as needed.

Finally

As a small embedded language, Lua itself is relatively elegant and lightweight, and the implementation of LuaJIT inherits these qualities. In a single-threaded environment, the JIT compiler consumes time on the current workflow’s thread, so the efficiency of the JIT compiler itself also matters: it would be unacceptable for the JIT compiler to block the workflow for long stretches. Balance is important here, too.

By contrast, Java’s JIT compilation is done in separate JIT compiler threads, which allows for deeper optimization; Java’s C2 JIT compiler applies relatively heavy optimizations.

JIT is a great piece of technology, and understanding the basic process and principles behind it is well worth the effort.

I hear that the JS V8 engine also has a deoptimization process; I am still very curious about it and will dig into it when I find the time.