
In the last article we mentioned that the JIT compiler is LuaJIT's killer feature for performance. In this article, we will introduce the JIT.

JIT stands for just-in-time compilation. In LuaJIT it works like this: Lua bytecode is compiled into machine instructions at run time, so the bytecode no longer has to be interpreted; the machine instructions produced by the JIT compiler are executed directly.
In other words, interpreter mode and JIT mode share the same input: Lua bytecode. Getting an obvious performance difference out of the same bytecode input (a gap of an order of magnitude is fairly common) takes real skill.

JIT can be divided into several steps:

  1. Counting: figure out which code paths are hot
  2. Recording: record the hot code path and generate SSA IR code
  3. Generation: optimize the SSA IR code and generate machine instructions
  4. Execution: run the newly generated machine instructions

The unit of JIT compilation

Before going further, let us first introduce a basic concept.
LuaJIT's JIT is trace-based: a trace is a stream of executed bytecode, and it can span functions.
By comparison, Java's JIT is method-based. It also does function inlining, but with more restrictions (typically only small functions are inlined into the JIT-compiled method).

Personally, I think a tracing JIT theoretically has more room to play than a method-based JIT; at least for some cases it should be able to do better.
However, it is much more complex to implement, so the final effect in industrial practice is hard to call (many other factors affect how well a JIT does, such as the optimizer).

Take this small example:

local debug = false
local function bar()
  return 1
end

local function foo()
  if debug then
    print("some debug log works")
  end
  
  return bar() + 1
end

When the foo() function is JIT compiled, there are two obvious advantages (a sketch of how to observe this follows the list):

  1. print("some debug log works") is not actually executed, so it is not recorded into the trace and no machine code is generated for it at all; the generated machine code is therefore smaller (and the smaller the generated machine code, the higher the CPU cache hit rate)
  2. bar() is compiled inline, so there is no function call overhead (yes, at the machine instruction level, function call overhead is worth taking into account)
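
To make this concrete, here is a rough way to observe it (my own sketch; the driver loop, file name and iteration count are assumptions, while jit.v and jit.dump are standard LuaJIT modules that print trace summaries and the recorded bytecode/IR):

-- trace_demo.lua: make foo() hot and inspect what gets traced.
-- Run with:  luajit -jv trace_demo.lua      (one-line summary per trace)
--      or:   luajit -jdump trace_demo.lua   (recorded bytecode, IR and more)
local debug = false

local function bar()
  return 1
end

local function foo()
  if debug then
    print("some debug log works")
  end
  return bar() + 1
end

local sum = 0
for i = 1, 1000 do  -- comfortably above the default hotness threshold
  sum = sum + foo()
end
print(sum)

In the dump there should be a single trace covering the loop, foo() and the inlined bar(), with no machine code for the print() branch.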

Counting

Next, we introduce each stage of the JIT one by one.
Counting is easy to understand. A major characteristic of a JIT is that it only compiles hot code (if everything were compiled, that would be AOT instead).

A JIT usually counts two kinds of events:

  1. Function calls: when the number of calls of a function reaches a threshold, JIT compilation of that function is triggered
  2. Loop iterations: when the number of iterations of a loop body reaches a threshold, JIT compilation of that loop body is triggered

In other words, it identifies hot functions and hot loops.

However, LuaJIT is trace-based, so a trace may also exit partway through. This adds a third statistic, on trace exits:
if a trace frequently exits through the same snapshot (snapshots are introduced later), JIT compilation starts from that snapshot and generates a side trace. All of these thresholds can be tuned, as sketched below.
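
For reference (this snippet is mine, but hotloop and hotexit are documented jit.opt parameters and the values shown are LuaJIT's documented defaults):

-- Tune the JIT hotness counters; 56 and 10 are the documented defaults.
-- Command-line equivalent:  luajit -Ohotloop=56 -Ohotexit=10 app.lua  (app.lua is a placeholder)
jit.opt.start(
  "hotloop=56",  -- iterations/calls before a loop or function counts as hot
  "hotexit=10"   -- taken exits from a snapshot before a side trace is compiled
)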

Recording

When a function/loop is hot enough, the JIT compiler starts to work.
The first step is recording. The core of recording is: generate IR code while interpreting the bytecode.

The specific process is:

  1. Modify DISPATCH to install a hook on bytecode interpretation
  2. In the hook, generate the corresponding IR code for the bytecode currently being executed; this is also where recording is judged to be complete, or aborted early
  3. Continue interpreting the bytecode

Everything from the start of recording to its completion forms the basic unit: a trace. The bytecode stream interpreted during this period is the execution flow the trace is meant to speed up.

Because what is recorded is a real execution flow, for branching code the trace certainly does not assume that every later execution will take the same branch; instead, a guard is added to the IR.
A snapshot is also recorded at appropriate points, and the snapshot contains some context information.
If execution later exits through that snapshot, the context is restored from the snapshot.
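
A minimal sketch of code that produces such a guard (my own example; the file name and numbers are arbitrary): the branch is not taken while the loop is being recorded, so the trace only needs a guard that exits through a snapshot if the condition ever becomes true.

-- guard_demo.lua: the "i % 1000 == 0" branch is taken only rarely.
-- While the hot loop is recorded the branch is not entered, so the trace
-- contains just a guard; taking the branch later exits through a snapshot.
local hits = 0
for i = 1, 100000 do
  if i % 1000 == 0 then  -- rare branch: compiled as guard + snapshot exit
    hits = hits + 1
  end
end
print(hits)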

A few additional details:
not every bytecode or library function can be JIT compiled (see the LuaJIT NYI list).
When a NYI is encountered, LuaJIT can also stitch traces. For example, FUNCC supports stitching, so a C function call in the middle of hot code is recorded as two traces, and the end result is: JIT-execute the machine code of trace 1 => interpret the C function (FUNCC) => JIT-execute the machine code of trace 2. Gluing the two traces together like this is the effect of stitching.
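
To illustrate stitching (my own sketch; whether a given builtin is compiled or NYI depends on the LuaJIT version, so treat os.clock() here as an assumed non-compiled C function):

-- stitch_demo.lua: assuming os.clock() is an uncompiled (NYI) C function,
-- the hot loop body is recorded as two traces stitched around the FUNCC call:
-- JIT trace 1 => interpreted C call => JIT trace 2.
-- Run with:  luajit -jv stitch_demo.lua
local t = 0
for i = 1, 10000 do
  t = t + os.clock()  -- C function call: interpreted, traces stitched around it
end
print(t)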

Generation

Once the IR code and related information are available, it can be optimized and machine code can be generated from it.

There are two steps here (a sketch of how to poke at them follows the list):

  1. Optimize the IR code
    LuaJIT's IR is in static single assignment form (SSA), a common intermediate representation in optimizing compilers. Many classic optimization algorithms can be applied to it, such as dead code elimination, loop-invariant code motion, and so on.
  2. Generate machine instructions from the IR code
    There are two main tasks here: register allocation, and translating IR operations into machine instructions, for example translating IR ADD into a machine ADD instruction.
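
One way to poke at this stage (my own sketch; dce and loop are documented LuaJIT -O optimization flags, and hot_loop.lua is a placeholder) is to toggle individual passes and compare the IR that -jdump prints:

-- Compare the IR with and without a pass (flags as documented for LuaJIT -O):
--   luajit -jdump=i -O+loop hot_loop.lua   -- loop optimization on (the default)
--   luajit -jdump=i -O-loop hot_loop.lua   -- loop optimization off
-- The same flags can also be set from Lua code, e.g.:
jit.opt.start("-dce")  -- turn off dead code elimination for traces compiled after this
jit.flush()            -- discard already-compiled traces so new ones use the new settings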

For a guard in the IR, an if ... jump style instruction is generated, and the stub instructions after the jump take care of exiting through the corresponding snapshot.

Here we can see why the machine code generated by the JIT can be more efficient:

  1. Based on the execution flow observed during recording, branch-prediction-friendly instructions can be generated; ideally the CPU just executes the instructions sequentially
  2. The SSA IR code has been optimized
  3. Registers are used more efficiently (there is no interpreter state to maintain at this point, so more registers are available)

Execution

After the machine instructions are generated, the corresponding bytecode is patched; for example, FUNCF is changed to JFUNCF traceno.
The next time the interpreter executes JFUNCF, it jumps to trace traceno, completing the switch from interpreter mode to JIT mode. This is also the main way JIT-compiled instructions are entered.

There are two ways to exit a trace:

  1. Normal exit after the trace finishes, after which execution returns to interpreter mode
  2. A guard in the trace fails, so the trace exits halfway; in this case the context is restored from the corresponding snapshot, and interpretation continues from there

In addition, when execution exits from a trace, the number of exits is counted.
If the number of exits through a snapshot reaches the hot side-exit threshold (hotexit), a side trace is generated starting from that snapshot.
The next time execution exits through that snapshot, it jumps directly to this side trace.

In this way, hot code with branches also ends up effectively fully covered by the JIT; it just is not covered all at once, but step by step, as needed. The sketch below shows this progression.
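
A small sketch of that progression (my own example; numbers and file name are arbitrary): the loop takes one branch most of the time, so the root trace covers it first, and once the other branch's snapshot exit becomes hot a side trace is compiled for it.

-- side_trace_demo.lua: run with  luajit -jv side_trace_demo.lua
-- Expect a root trace for the common branch first, then, after enough
-- snapshot exits, a side trace covering the rarer branch.
local rare, common = 0, 0
for i = 1, 100000 do
  if i % 17 == 0 then  -- rarer branch: at first only a guarded snapshot exit
    rare = rare + 1
  else                 -- common branch: covered by the root trace
    common = common + 1
  end
end
print(rare, common)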

Finally

As a small embedded language, Lua itself is quite refined and lightweight, and the LuaJIT implementation inherits these characteristics.
In a single-threaded environment, the JIT compiler runs on the time budget of the current workflow, so the efficiency of the JIT compiler itself also matters a great deal.
Having the JIT compiler block the workflow for a long time would be unacceptable, so striking a balance here is very important as well.

By comparison, Java's JIT compilation is done by separate JIT compiler threads, which can afford deeper optimization; Java's C2 JIT compiler applies relatively heavy optimizations.

JIT is a great piece of technology, and working out the basic process and principles behind it is quite a brain teaser.

I have heard that the V8 JavaScript engine even has a deoptimization process, which I am quite curious about; I will look into it when I have time.


doujiang24
Core developer of OpenResty.