
In the previous article, "How does Lua code run", we introduced how the standard Lua virtual machine runs Lua code.

Today we introduce another virtual machine implementation of the Lua language: LuaJIT. LuaJIT implements the Lua 5.1 language standard (and is also compatible with parts of Lua 5.2). This means that code complying with the Lua 5.1 standard can run on either the standard Lua virtual machine or LuaJIT.

LuaJIT focuses on high performance. Next, let's take a look at how LuaJIT improves performance.

Interpreter mode

LuaJIT has two operating modes. The first is interpreter mode, which is similar to the standard Lua virtual machine but comes with some improvements.

Like the standard Lua virtual machine, LuaJIT first compiles Lua source code into bytecode, and then interprets and executes the bytecode instructions one by one.
However, the compiled bytecode is similar to standard Lua's, not identical.
Architecturally, LuaJIT is also a register-based virtual machine, although the specific implementation differs.
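To make "register-based" and "executed one by one" concrete, here is a minimal sketch of such an interpreter loop in C. The instruction set (OP_LOADK, OP_ADD, OP_HALT), the ENCODE macro, and the functions run and run_demo are all hypothetical and invented for illustration; they are not the real Lua or LuaJIT bytecode.

```c
#include <stdint.h>

/* Hypothetical toy opcodes, not the real Lua or LuaJIT instruction set. */
enum { OP_LOADK, OP_ADD, OP_HALT };

/* Toy 32-bit fixed-length encoding (an assumption of this sketch):
   op in bits 0-7, a in bits 8-15, b in bits 16-23, c in bits 24-31. */
#define ENCODE(op, a, b, c) \
    ((uint32_t)(op) | ((uint32_t)(a) << 8) | \
     ((uint32_t)(b) << 16) | ((uint32_t)(c) << 24))

/* Runs the program and returns the value left in virtual register 0. */
int64_t run(const uint32_t *code)
{
    int64_t reg[256] = {0};      /* virtual registers */
    const uint32_t *pc = code;   /* virtual program counter */

    for (;;) {
        uint32_t i  = *pc++;            /* fetch one instruction, advance pc */
        uint32_t op = i & 0xff;         /* decode the opcode and operands */
        uint32_t a  = (i >> 8) & 0xff;
        uint32_t b  = (i >> 16) & 0xff;
        uint32_t c  = i >> 24;

        switch (op) {                   /* dispatch on the opcode */
        case OP_LOADK: reg[a] = (int64_t)b;       break; /* R[a] = constant b */
        case OP_ADD:   reg[a] = reg[b] + reg[c];  break; /* R[a] = R[b] + R[c] */
        case OP_HALT:  return reg[0];
        default:       return -1;                        /* unknown opcode */
        }
    }
}

/* Example program: R1 = 20; R2 = 22; R0 = R1 + R2; halt. */
int64_t run_demo(void)
{
    const uint32_t program[] = {
        ENCODE(OP_LOADK, 1, 20, 0),
        ENCODE(OP_LOADK, 2, 22, 0),
        ENCODE(OP_ADD,   0, 1, 2),
        ENCODE(OP_HALT,  0, 0, 0),
    };
    return run(program);
}
```

Calling run_demo() returns 42. The fetch and decode steps at the top of the loop run for every single instruction, which is why the rest of this article looks so closely at how cheaply they can be done.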

Interpreting bytecode

The step from Lua source code to bytecode does not differ much between the two, but LuaJIT makes relatively large improvements in how the bytecode is interpreted and executed.

Standard Lua interprets bytecode in the luaV_execute function, which is written in C, whereas LuaJIT's interpreter is written by hand in assembly.
We usually assume that hand-written assembly is more efficient, but that also depends on the quality of the code that is written.

Compare the final generated machine code

This time we compare the machine code that each side finally produces, to see for ourselves how the hand-written assembly is more efficient.

We compare the implementations of "bytecode decoding".
First of all, the bytecode of both Lua and LuaJIT is 32-bit fixed-length. The basic logic of bytecode decoding is:
Read the 32-bit bytecode from the address held in the PC register maintained inside the virtual machine, and then extract the OP opcode and the corresponding operands.

LuaJIT

Below is the "bytecode decoding" code from the LuaJIT source.
This is not raw assembly: some macros are used to improve readability.

mov RCd, [PC]
movzx RAd, RCH
movzx OP, RCL
add PC, 4
shr RCd, 16

The final machine instructions generated on x86_64 are as follows, which is very concise.

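mov    eax,DWORD PTR [rbx]  # rbx holds the PC value; read the 32-bit bytecode into eax
movzx  ecx,ah               # bits 9-16 (ah) are the value of operand A
movzx  ebp,al               # the low 8 bits (al) are the OP opcode
add    rbx,0x4              # point PC at the next bytecode
shr    eax,0x10             # shift right by 16 bits, leaving the upper 16 bits (the RC operand field)

Lua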

In Lua's luaV_execute function, roughly the following C code does the "bytecode decoding" work.

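const Instruction i = *pc++;  /* fetch the 32-bit instruction and advance pc */
ra = RA(i);                   /* stack address that operand A refers to */
GET_OPCODE(i)                 /* extract the opcode, which is then used for dispatch */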

After compiling with gcc, we can find the corresponding machine instructions in the executable.
Because gcc optimizes the function as a whole, the instruction order is less intuitive and register usage is less uniform, so the result looks a bit messy.
The following are the machine instructions I extracted. For readability I have also rearranged them, so the original order is not preserved.

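mov    ebx,DWORD PTR [r14]   # r14 holds the PC value; read the 32-bit bytecode into ebx
lea    r12,[r14+0x4]         # point PC at the next bytecode and store it in r12
mov    r14,r12               # copied back to r14 later (r14 is used for other things in between)

mov    edx,ebx               # copy ebx to edx
and    edx,0x3f              # the low 6 bits are the OP opcode

# bits 7-14 are the value of operand A
mov    eax,ebx               # copy ebx to eax
shr    eax,0x6               # shift right by 6 bits
movzx  eax,al                # the low 8 bits are now the value of operand A

# what follows uses the operand and is no longer bytecode decoding, but it is part of RA(i)
shl    rax,0x4               # rax * 16
lea    r9,[r11+rax*1]        # r11 holds BASE; compute the address of operand A's slot on the Lua stack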
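The listing above boils down to a few shifts and masks. Here is a rough C sketch of the same steps, based only on what the listing shows: a 6-bit opcode in the lowest bits, the 8-bit operand A right above it, and 16-byte stack slots. The function names lua51_opcode, lua51_arg_a, and slot_offset are made up for this sketch and are not the macros used in Lua's source.

```c
#include <stdint.h>

/* Low 6 bits: the opcode (and edx,0x3f). */
uint32_t lua51_opcode(uint32_t i) { return i & 0x3f; }

/* Next 8 bits: operand A (shr eax,0x6 followed by movzx eax,al). */
uint32_t lua51_arg_a(uint32_t i) { return (i >> 6) & 0xff; }

/* Byte offset of slot A from BASE, assuming 16-byte stack slots (shl rax,0x4). */
uint64_t slot_offset(uint32_t a) { return (uint64_t)a << 4; }
```

Because the opcode is 6 bits wide, operand A does not start on a byte boundary, so extracting it takes a shift first.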

Comparative analysis

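Bytecode decoding is the most basic operation in the Lua interpreter, because it is performed for every instruction executed.
Comparing the final machine code, we can clearly see that LuaJIT's implementation is more efficient: it decodes an instruction in five machine instructions, while the gcc-compiled Lua code above takes eight for the same work, before the two extra instructions for RA(i).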

Hand-written assembly can make better use of registers, but the gain does not come from hand-written assembly alone.
LuaJIT's bytecode was designed with efficiency in mind: the opcode is exactly 8 bits wide, so it can be extracted directly with the 8-bit sub-registers that the CPU hardware provides, such as al. Operand A is likewise 8 bits wide and byte-aligned, so it can be read through ah.
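To see why byte-aligned fields help, here is a rough C sketch of decoding the layout shown in the LuaJIT listing earlier: opcode in the low 8 bits, operand A in the next 8 bits, and the remaining operand field in the upper 16 bits. The function names are hypothetical. Since every field starts on a byte boundary, a compiler can typically implement the first two functions with a single movzx from an 8-bit sub-register, with no separate mask instruction.

```c
#include <stdint.h>

/* Low 8 bits: the opcode (movzx ebp,al in the listing). */
uint32_t jit_opcode(uint32_t i) { return (uint8_t)i; }

/* Bits 8-15: operand A (movzx ecx,ah in the listing). */
uint32_t jit_arg_a(uint32_t i) { return (uint8_t)(i >> 8); }

/* Upper 16 bits: the remaining operand field (shr eax,0x10 in the listing). */
uint32_t jit_arg_rc(uint32_t i) { return i >> 16; }
```

For the instruction word 0x00030201, these return 1, 2, and 3 respectively.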

JIT

Just-In-Time (JIT) compilation is the other mode in which LuaJIT runs Lua code, and it is LuaJIT's real performance weapon.
The main idea is to generate more efficient machine instructions dynamically at run time in order to improve performance.

We will continue in the next article...


doujiang24

Core developer of OpenResty.