In the previous article, “How Lua Code Works,” we looked at how the standard Lua virtual machine runs Lua code.

Today we introduce another virtual machine implementation of the Lua language, LuaJIT, which implements the Lua 5.1 language standard (and supports some Lua 5.2 features). This means that the same Lua 5.1 compliant code can run on either the standard Lua virtual machine or LuaJIT.

LuaJIT focuses on high performance, so let’s take a look at how it achieves that.

Interpreter mode

LuaJIT has two modes of operation. The first is interpreter mode, which is similar to the standard Lua virtual machine but with some improvements.

As with the standard Lua virtual machine, the Lua source code is first compiled into bytecode, and the bytecode instructions are then interpreted and executed one by one. The compiled bytecode, however, is only similar to standard Lua bytecode, not identical. In terms of its model, LuaJIT is also a virtual-register-based machine, although the implementation differs.

Interpreting and executing bytecode

There is not much difference in how Lua source code is compiled to bytecode, but LuaJIT greatly improves how the bytecode is interpreted and executed.

Lua interprets and executes bytecode in the C function luaV_execute, while LuaJIT executes bytecode with handwritten assembly. We often simply assume that handwritten assembly is more efficient, but that depends on the quality of the code written.
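To give a feel for the C approach, here is a minimal sketch of a switch-based interpreter loop. It is not the code of luaV_execute; the opcodes (TOY_LOADI, TOY_ADD, TOY_HALT), the TOY_INS macro, the instruction layout, and the eight-register file are all hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical toy opcodes; these are not Lua or LuaJIT opcodes. */
enum { TOY_LOADI, TOY_ADD, TOY_HALT };

/* Pack a toy instruction: opcode in the low byte, then operands a, b, c. */
#define TOY_INS(op, a, b, c) \
    ((uint32_t)(op) | ((uint32_t)(a) << 8) | \
     ((uint32_t)(b) << 16) | ((uint32_t)(c) << 24))

/* Run toy bytecode until TOY_HALT and return the register it names.
 * Returns -1 if an unknown opcode is encountered. */
static int32_t toy_execute(const uint32_t *pc)
{
    int32_t reg[8] = {0};                        /* toy virtual registers */
    for (;;) {
        const uint32_t i = *pc++;                /* fetch, then advance the PC */
        const uint32_t a = (i >> 8) & 0xFF;      /* decode the operands */
        const uint32_t b = (i >> 16) & 0xFF;
        const uint32_t c = (i >> 24) & 0xFF;
        switch (i & 0xFF) {                      /* dispatch on the opcode */
        case TOY_LOADI: reg[a & 7] = (int32_t)b; break;          /* reg[a] = b */
        case TOY_ADD:   reg[a & 7] = reg[b & 7] + reg[c & 7]; break;
        case TOY_HALT:  return reg[a & 7];
        default:        return -1;
        }
    }
}
```

A C compiler decides how the fetch, decode, and dispatch steps are laid out in machine code and which registers they use. Writing the same loop in assembly gives the author direct control over those choices.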

Comparing the generated machine code

This time, we compare the machine code generated on each side to see how efficient the handwritten assembly is.

Let’s compare the implementation of the “bytecode parsing” part. In both Lua and LuaJIT, bytecode instructions have a fixed length of 32 bits. The basic logic of bytecode parsing is as follows: read a 32-bit instruction from the address held in the PC register that the virtual machine maintains, then extract the OP opcode and the corresponding operands.

LuaJIT

The “bytecode parsing” source in LuaJIT, shown below, is not written as raw assembly; it uses some macros to improve readability.

mov RCd, [PC]
movzx RAd, RCH
movzx OP, RCL
add PC, 4
shr RCd, 16

The final machine instructions generated on x86_64 are as follows, which are very concise.

mov    eax, DWORD PTR [rbx]   # rbx holds the PC; load the 32-bit bytecode into eax
movzx  ecx, ah                # bits 8-15 are the value of operand A
movzx  ebp, al                # the low 8 bits are the OP opcode
add    rbx, 0x4               # PC points to the next bytecode
shr    eax, 0x10              # shift right by 16 bits; the result is the value of operand C

Lua

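In Lua’s luaV_execute function, the C source that does the “bytecode parsing” looks roughly like this.

const Instruction i = *pc++;   /* fetch the instruction and advance pc */
ra = RA(i);                    /* stack address that operand A refers to */
GET_OPCODE(i)                  /* extract the opcode */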

After GCC compilation, we can find the following corresponding machine instructions from the executable file. Because GCC optimizes the whole function, the order of instructions is not as straightforward and registers are not used uniformly, so it can look a bit messy. Below are the machine instructions that I have extracted, and the order has been adjusted for the convenience of reading, instead of keeping the original order.

mov    ebx, DWORD PTR [r14]   # r14 holds the PC; load the 32-bit bytecode into ebx
lea    r12, [r14+0x4]         # PC points to the next bytecode; store it in r12
mov    r14, r12               # then copy it to r14
mov    edx, ebx               # copy the bytecode to edx
and    edx, 0x3f              # the low 6 bits are the OP opcode
mov    eax, ebx               # copy the bytecode to eax
shr    eax, 0x6               # shift right by 6 bits
movzx  eax, al                # the low 8 bits are now operand A (bits 7-14 of the bytecode)
...    [r11+rax*1]            # r11 holds BASE; this gives the address on the Lua stack that corresponds to operand A

Comparison and analysis

Bytecode parsing is the most fundamental operation in the interpreter, since it runs for every instruction executed. Comparing the generated machine code shows that LuaJIT's implementation takes noticeably fewer instructions for this step.

Handwritten assembly makes better use of registers, but the gain does not come from handwritten assembly alone. LuaJIT's bytecode design also takes efficiency into account. The OP code is exactly 8 bits wide, so it can be read directly through the low-byte registers that the CPU hardware provides, such as AL, which saves some bit-manipulation instructions.
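To make the difference concrete, here is a minimal C sketch of the two field extractions. It assumes the layouts suggested by the listings above: a 6-bit opcode followed by an 8-bit operand A for Lua, and an 8-bit opcode followed by an 8-bit operand A for LuaJIT. The function names are made up for illustration and do not appear in either code base.

```c
#include <stdint.h>

/* Assumed Lua-style layout: 6-bit opcode, then 8-bit operand A. */
static uint32_t lua_style_op(uint32_t i) { return i & 0x3F; }         /* mask 6 bits */
static uint32_t lua_style_a(uint32_t i)  { return (i >> 6) & 0xFF; }  /* shift, then mask */

/* Assumed LuaJIT-style layout: 8-bit opcode, then 8-bit operand A. */
static uint32_t jit_style_op(uint32_t i) { return i & 0xFF; }         /* lowest byte */
static uint32_t jit_style_a(uint32_t i)  { return (i >> 8) & 0xFF; }  /* second byte */
```

With byte-aligned fields, the lowest byte and the second byte of the instruction correspond to the x86 byte registers al and ah, so each field can be read with a single movzx. The 6-bit layout needs an explicit mask for the opcode and a shift before operand A can be read.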

JIT

JIT (just-in-time compilation) is the other mode in which LuaJIT runs Lua code, and it is one of LuaJIT’s key weapons for performance. The main principle is to dynamically generate more efficient machine instructions at runtime to improve performance.

We will continue with this in the next post…