About Author: Longbie
1. Introduction
With the development of IT infrastructure, modern data processing systems need to handle ever larger data volumes and support increasingly complex algorithms. This growth in data volume and algorithm complexity poses severe performance challenges for data analysis systems. In recent years, many performance optimization techniques have appeared in databases, big data systems, and AI platforms, spanning architecture, compilation technology, and high-performance computing. As a representative compilation optimization technique, this article introduces LLVM-based code generation technology (Codegen for short).
LLVM is a very popular open source compiler framework that supports multiple languages and hardware targets. Developers can build their own compilation toolchain on top of LLVM, compiling different languages or logic into executables that run on a variety of hardware. For Codegen, we mainly care about the format of LLVM IR and the APIs for generating it. In the rest of this article, we first introduce LLVM IR, then the principles and usage scenarios of Codegen technology, and finally the typical application scenarios of Codegen in AnalyticDB PostgreSQL, the cloud-native data warehouse developed by Alibaba Cloud.
2. Introduction to LLVM IR and hands-on tutorial
In compiler theory and practice, the IR (Intermediate Representation) is a very important concept. For a compiler, the journey from a high-level language down to assembly passes through many stages and many intermediate forms, and there are many kinds of compilation optimization techniques, each playing a different role in the process. The IR, however, is a clear watershed: compilation and optimization above the IR do not need to care about details of the underlying hardware, such as the instruction set or register file size, while everything below the IR must deal with the hardware. LLVM is most famous for its IR design. Thanks to this clever design, LLVM can support different languages upward and different hardware downward, and different languages can reuse the optimization passes at the IR layer.
The figure above shows the overall framework of LLVM. LLVM divides the entire compilation process into three stages: (1) the front end, which converts a high-level language into IR; (2) the middle end, which optimizes at the IR layer; (3) the back end, which converts the IR into the assembly language of the target hardware platform. This gives LLVM excellent extensibility. For example, if you want to implement a language called toyc and run it on the ARM platform, you only need to implement a toyc -> LLVM IR front end; all the other LLVM modules can be reused. Or, if you want to support a new hardware platform, you only need to implement the LLVM IR -> new hardware back end, and the hardware immediately supports many existing languages. The IR is therefore where LLVM is most competitive, and it is also the core of learning to use LLVM Codegen.
2.1 Basic knowledge of LLVM IR
The IR format of LLVM is very similar to assembly. If you have studied assembly language, learning to program in LLVM IR is easy. If you have never learned assembly, don't worry: assembly itself is not hard to learn. The hard part of assembly is not learning it but engineering with it, because the development difficulty of assembly grows exponentially with engineering complexity. LLVM IR development has the same character. Next, we need to understand the three most important parts of the IR: the instruction format, Basic Blocks & the CFG, and SSA. For the complete LLVM IR reference, see https://llvm.org/docs/LangRef.html .
Instruction format. LLVM IR provides a three-address-code instruction format similar to assembly language. The following code snippet is a very simple function implemented in LLVM IR. Its input is five integers of type i32 (int32), and it computes and returns their sum. LLVM IR supports basic data types such as i8, i32, and floating-point types. Variable names in LLVM IR start with "%". By default, %0 is the first parameter of the function, %1 the second, and so on. Machine-generated variables are usually numbered; when writing IR by hand, you can choose any suitable naming scheme. An LLVM IR instruction consists of an operator, a type, inputs, and a return value. For example, in "%6 = add i32 %0, %1", the operator is "add", the type is "i32", the inputs are "%0" and "%1", and the return value is "%6". In general, the IR supports a set of basic instructions, and the compiler builds complex operations from them. For example, an expression such as "A * B + C" in C is implemented in LLVM IR by a multiplication and an addition instruction, possibly plus some type conversion instructions.
define i32 @ir_add(i32, i32, i32, i32, i32) {
  %6 = add i32 %0, %1
  %7 = add i32 %6, %2
  %8 = add i32 %7, %3
  %9 = add i32 %8, %4
  ret i32 %9
}
Basic Block & CFG. After understanding the IR instruction format, we next need to understand two concepts: the Basic Block (BB) and the Control Flow Graph (CFG). The figure below (left) shows a simple C function, the middle shows the corresponding LLVM IR compiled with clang, and the right shows the CFG drawn with graphviz. Using this figure, we explain the concepts of Basic Block and CFG.
The high-level languages we usually work with all have branch and jump statements. For example, C has keywords such as for, while, and if, all of which compile down to branches and jumps; developers implement different logic through them. Assembly language usually implements control logic through two kinds of jump instructions, conditional and unconditional, and LLVM IR is the same. For example, "br label %7" in LLVM IR means unconditionally jump to the label named %7. "br i1 %10, label %11, label %22" is a conditional jump: if %10 is true, jump to the label named %11, otherwise jump to the label named %22.
With jump instructions understood, we can introduce the Basic Block. A Basic Block is a serially executed instruction stream that contains no jumps except possibly as its final instruction. The first instruction at the entry of a Basic Block is called the leader. Every Basic Block except the first has a name (label); the first one may also be labeled, but the label is sometimes unnecessary. The code in this example contains five Basic Blocks. The Basic Block concept addresses control logic: it lets us divide code into blocks, and in compilation optimization, some optimizations target a single Basic Block while others span multiple Basic Blocks.
The CFG (Control Flow Graph) is simply the graph formed by the Basic Blocks and the jump relationships between them. For example, the code shown in the figure above has five Basic Blocks; the arrows indicate the jumps between them, and together they form a CFG. If a Basic Block has only one outgoing arrow to another block, the jump at its end is unconditional; otherwise it is conditional. The CFG is a relatively simple and fundamental concept in compilation theory. One step beyond the CFG is the DFG (Data Flow Graph), on which many advanced compiler optimization algorithms are based, but for developers using LLVM for Codegen, understanding the CFG is enough.
SSA. SSA stands for Static Single Assignment, a very fundamental concept in compilation technology. SSA is a concept you must grasp to learn LLVM IR, and it is also the hardest one to understand. Careful readers of the IR listed above will notice that each "variable" is assigned only once; that is the core idea of SSA. From the compiler's perspective, the compiler does not care about "variables"; it is designed around "data". Each write to a "variable" produces a new data version, and the compiler's optimizations revolve around data versions. Next, we use the following C code to explain this idea.
The figure above (left) shows a simple piece of C code, and the figure above (right) is its SSA form, i.e., "the code as the compiler sees it". In C, data is stored in variables, so variables are the center of data manipulation: developers care about a variable's lifetime, when it is assigned, and when it is used. The compiler, however, only cares about the flow of data, so every assignment produces a new value. For example, the code on the left has only one variable a, but the code on the right has four, because the data in a goes through four versions: each assignment generates a new variable, and the phi node at the end generates one more. In SSA, each variable represents one version of the data. In other words, high-level languages are variable-centric, while SSA form is data-centric. Keep this in mind when writing IR: an IR variable is not like a high-level-language variable; it represents one version of the data. The phi node is an important concept in SSA. In this example, the value of a_4 depends on which branch was executed: if the first branch ran, then a_4 = a_1, and so on. A phi node selects the appropriate data version by checking which Basic Block control jumped from. LLVM IR requires developers to write phi nodes themselves; loops and conditional branches need many handwritten phi nodes, which is the logically trickiest part of writing LLVM IR.
2.2 Learn to use LLVM IR to write programs
The best way to become familiar with LLVM IR is to write a few programs in it. Before starting, it is recommended to spend 30 minutes to an hour skimming the official manual ( https://llvm.org/docs/LangRef.html ) to get familiar with the available instructions. Next, we walk through the whole LLVM IR programming process with two simple examples.
The following is a function that sums an array in a loop. It contains three Basic Blocks: loop, loop_body, and final. Here loop is the entry of the function, loop_body is the loop body, and final is the exit. The two phi instructions in the loop block carry the accumulated result and the loop variable.
define i32 @ir_loopadd_phi(i32*, i32) {
  br label %loop
loop:
  %i = phi i32 [0, %2], [%newi, %loop_body]
  %res = phi i32 [0, %2], [%new_res, %loop_body]
  %break_flag = icmp sge i32 %i, %1
  br i1 %break_flag, label %final, label %loop_body
loop_body:
  %addr = getelementptr inbounds i32, i32* %0, i32 %i
  %val = load i32, i32* %addr, align 4
  %new_res = add i32 %res, %val
  %newi = add i32 %i, 1
  br label %loop
final:
  ret i32 %res
}
The following is an array bubble sort. This function contains two loops. Implementing a loop in LLVM IR is already somewhat involved, and two nested loops are more involved still. If you can implement bubble sort in LLVM IR, you have basically understood the core logic of LLVM IR.
define void @ir_bubble(i32*, i32) {
  %r_flag_addr = alloca i32, align 4
  %j = alloca i32, align 4
  %r_flag_ini = add i32 %1, -1
  store i32 %r_flag_ini, i32* %r_flag_addr, align 4
  br label %out_loop_head
out_loop_head:
  ; check break
  store i32 0, i32* %j, align 4
  %tmp_r_flag = load i32, i32* %r_flag_addr, align 4
  %out_break_flag = icmp sle i32 %tmp_r_flag, 0
  br i1 %out_break_flag, label %final, label %in_loop_head
in_loop_head:
  ; check break
  %tmpj_1 = load i32, i32* %j, align 4
  %in_break_flag = icmp sge i32 %tmpj_1, %tmp_r_flag
  br i1 %in_break_flag, label %out_loop_tail, label %in_loop_body
in_loop_body:
  ; read & swap
  %tmpj_left = load i32, i32* %j, align 4
  %tmpj_right = add i32 %tmpj_left, 1
  %left_addr = getelementptr inbounds i32, i32* %0, i32 %tmpj_left
  %right_addr = getelementptr inbounds i32, i32* %0, i32 %tmpj_right
  %left_val = load i32, i32* %left_addr, align 4
  %right_val = load i32, i32* %right_addr, align 4
  ; swap check
  %swap_flag = icmp sge i32 %left_val, %right_val
  %left_res = select i1 %swap_flag, i32 %right_val, i32 %left_val
  %right_res = select i1 %swap_flag, i32 %left_val, i32 %right_val
  store i32 %left_res, i32* %left_addr, align 4
  store i32 %right_res, i32* %right_addr, align 4
  br label %in_loop_end
in_loop_end:
  ; update j
  %tmpj_2 = load i32, i32* %j, align 4
  %newj = add i32 %tmpj_2, 1
  store i32 %newj, i32* %j, align 4
  br label %in_loop_head
out_loop_tail:
  ; update r_flag
  %tmp_r_flag_1 = load i32, i32* %r_flag_addr, align 4
  %new_r_flag = sub i32 %tmp_r_flag_1, 1
  store i32 %new_r_flag, i32* %r_flag_addr, align 4
  br label %out_loop_head
final:
  ret void
}
We can compile the LLVM IR above into an object file with the clang compiler, link it against a program written in C, and then call it like a normal function. The examples above use only basic data types such as i32 and i64; LLVM IR also supports aggregate types such as struct, with which more complex functions can be implemented.
2.3 Implement Codegen using LLVM API
A compiler essentially calls various APIs to generate code from its input, and LLVM Codegen is no exception. Inside LLVM, a function is a class, a Basic Block is a class, and instructions and variables are classes as well. Implementing codegen with the LLVM API means using LLVM's internal data structures to build the desired IR.
// Note: createArith, createBB, createIfElse, ValList, and BBList are helper
// functions and typedefs defined elsewhere in the surrounding codegen code;
// Builder is an llvm::IRBuilder<>, and fooFunc is the llvm::Function being built.
Value *constant = Builder.getInt32(16);
Value *Arg1 = fooFunc->arg_begin();                // first argument of the function
Value *val = createArith(Builder, Arg1, constant); // emits an arithmetic instruction
Value *val2 = Builder.getInt32(100);
Value *Compare = Builder.CreateICmpULT(val, val2, "cmptmp");
Value *Condition = Builder.CreateICmpNE(Compare, Builder.getInt1(0), "ifcond");
ValList VL;
VL.push_back(Condition);
VL.push_back(Arg1);
BasicBlock *ThenBB = createBB(fooFunc, "then");
BasicBlock *ElseBB = createBB(fooFunc, "else");
BasicBlock *MergeBB = createBB(fooFunc, "ifcont");
BBList List;
List.push_back(ThenBB);
List.push_back(ElseBB);
List.push_back(MergeBB);
Value *v = createIfElse(Builder, List, VL);        // wires up the if/else CFG and phi node
The above is an example of implementing codegen with the LLVM API; in essence, it is the process of writing IR in C++. If you already know how to write IR, you only need to become familiar with this set of APIs. The APIs expose the basic structures: instructions, functions, basic blocks, the LLVM IRBuilder, and so on; we simply call the corresponding functions to create these objects. Generally, we first generate the function prototype, including the function name, parameter list, and return type. Then, based on what the function must do, we determine which Basic Blocks are needed and the jump relationships between them, and generate the corresponding (still empty) Basic Blocks. Finally, we fill each Basic Block with instructions in order. This process is logically the same as writing LLVM IR by hand.
3. Codegen technical analysis
If we use the method described above to generate some simple functions and write corresponding C versions for performance comparison, we will find that LLVM IR is not faster than C. On the one hand, the machine ultimately executes assembly, and C itself is very close to assembly; programmers who understand the low level can often predict from C code roughly what assembly will be generated. On the other hand, modern compilers perform many optimizations, some of which greatly reduce the programmer's optimization burden. Therefore, using LLVM IR for Codegen does not yield better performance than handwritten C, and LLVM Codegen has some obvious disadvantages. To really use LLVM well, we need to be familiar with its characteristics.
3.1 Disadvantage analysis
Disadvantage 1: development is difficult. In actual development, almost no project uses assembly as its main development language, because development in assembly is simply too hard; interested readers can try writing a quicksort in assembly to get a feel for it. Even foundational software such as databases and operating systems uses assembly only in a few places. LLVM IR development has similar problems. For example, the most complex example shown above is bubble sort: a developer can write a bubble sort in C in a few minutes, but may need an hour to write one in LLVM IR. In addition, LLVM IR handles complex data structures, such as structs and classes, poorly: beyond the basic data types, adding a complex data structure in LLVM IR is very difficult. In actual development, using Codegen therefore makes the development difficulty rise exponentially.
Disadvantage 2: debugging is difficult. Developers usually debug code by single-stepping through it, but LLVM IR does not support this. Once the code has a problem, you can only eyeball the LLVM IR over and over again by hand. If you know assembly, you can single-step through the generated assembly, but the mapping between assembly and IR is not one-to-one, so this only reduces the debugging difficulty to some extent; it does not solve the problem.
Disadvantage 3: runtime cost. Generating LLVM IR is usually fast, but the generated IR must then be optimized by LLVM's tooling and compiled into a binary, and this process takes time (think of how long GCC takes to compile). During database development, our rule of thumb is that each function costs about 10 ms-100 ms of codegen time, most of it spent in the two steps of optimizing the IR and lowering IR to assembly.
3.2 Applicable scenarios
After understanding the shortcomings of LLVM Codegen, we can weigh its advantages and choose suitable scenarios. The following are the scenarios our team found suitable for LLVM Codegen during development.
Scenario 1: languages such as Java/Python. As mentioned above, LLVM IR is not faster than C, but it is faster than languages such as Java or Python. For example, Java sometimes calls C functions through JNI to improve performance; in the same way, Java can call functions generated via LLVM IR to improve performance.
Scenario 2: hardware and language compatibility. LLVM supports a variety of back ends, such as X86, ARM, and GPU. For scenarios where the language does not natively support the hardware, LLVM can be used to bridge the gap. For example, if our system is developed in Java and wants to use the GPU, we can consider generating GPU code with LLVM IR and calling it through JNI. This approach supports not only NVIDIA GPUs but also AMD GPUs, and the generated IR can likewise be executed on the CPU.
Scenario 3: logic simplification. Taking the database as an example, the execution engine must make many judgments about data types and algorithm logic during execution. This is mainly because many of the data types and logic in SQL cannot be determined when the database is developed, only at runtime; this mode of processing is called "interpreted execution". With LLVM we can generate code at runtime, when the data types and logic are already known, and delete the unnecessary judgment operations from the LLVM IR to improve performance.
4. Application of LLVM in the database
In our database, the team uses LLVM to process expressions. Next, we compare PostgreSQL with the cloud-native data warehouse AnalyticDB PostgreSQL to explain how LLVM is applied.
To interpret and execute expressions, PostgreSQL adopts a scheme of "assembling functions". PostgreSQL implements a large number of C functions, such as addition, subtraction, and comparisons, for the various data types. During plan generation, the database selects the corresponding functions according to the expression's operators and data types, saves their pointers, and calls them at execution time. So for a filter condition such as "a > 10 and b < 5", assuming a and b are both int32, PostgreSQL actually calls a combination like "Int8AndOp(Int32GT(a, 10), Int32LT(b, 5))", assembled like building blocks. This scheme has two obvious performance problems. On the one hand, it incurs many function calls, and function calls themselves have a cost. On the other hand, it requires a unified function interface, which forces some type conversions inside and outside each function, an additional overhead. AnalyticDB PostgreSQL instead uses LLVM codegen to produce minimal code. Once the SQL arrives, the database already knows the expression's operators and input data types, so it only needs to select the corresponding IR instructions: the expression above takes just three IR instructions, which we wrap into a function to be called during execution. This turns multiple function calls into a single call and greatly reduces the total number of instructions.
// Sample SQL
select count(*) from table where a > 10 and b < 5;
// PostgreSQL interpreted execution: multiple function calls
result = Int8AndOp(Int32GT(a, 10), Int32LT(b, 5));
// AnalyticDB PostgreSQL: minimal low-level code generated with LLVM codegen
%res1 = icmp sgt i32 %a, 10
%res2 = icmp slt i32 %b, 5
%res = and i1 %res1, %res2
In a database, expressions appear mainly in a few places: filter conditions, which usually appear in the where clause; output lists, which follow select; and the more complex conditions inside operators such as join and agg. Expression processing therefore appears throughout the modules of the execution engine. In AnalyticDB PostgreSQL, the development team abstracted an expression processing framework and used LLVM Codegen to handle these expressions, thereby improving the overall performance of the execution engine.
5. Summary
As a popular open source compiler framework, LLVM has in recent years been used to accelerate databases, AI systems, and more. Because compiler theory has a high barrier to entry, LLVM is not easy to learn, and from an engineering point of view one also needs an accurate understanding of LLVM's engineering and performance characteristics to find suitable acceleration scenarios. The Alibaba Cloud database team's cloud-native data warehouse, AnalyticDB PostgreSQL, implements a runtime expression processing framework based on LLVM, which effectively improves the system's performance on complex data analysis.
[About Us]
The OLAP platform team of the Alibaba Cloud Database Division focuses on building world-leading, full-stack, large-scale OLAP database products, including the analytical database AnalyticDB and the data lake analytics service Data Lake Analytics. These products serve the key businesses of many customers on Alibaba public cloud and proprietary cloud, as well as many data analysis workloads within Alibaba Group.
[Job Responsibilities]
1. Develop and optimize cloud-native database technologies, covering database access features, product stability, and core product technology.
2. Database SQL engine development, SQL optimizer development, SQL execution core algorithm optimization and Code Generation framework development.
3. Research, analysis and implementation of cutting-edge database technologies, including distributed computing architecture, intelligent diagnosis and analysis, HTAP database architecture, etc.
4. Work place can include Beijing, Hangzhou, Shenzhen.
[Job Requirements]
1. Familiar with C/C++ or Java programming on the Linux platform; familiar with inter-process communication, memory management, and networking.
2. Good at high-performance server-side programming, proficient in distributed computing and mainstream big data system architecture, familiar with distributed consensus protocol algorithms such as raft or paxos.
3. Experience in database kernel development such as PostgreSQL/MySQL/Impala/Presto/Greenplum is preferred, and experience in database SQL engine, storage, transaction and other kernel development is preferred.
4. Those who have published papers in top database conferences such as VLDB, SIGMOD, ICDE are preferred.
5. Good communication and teamwork skills, able to write various technical documents proficiently.