Introduction

LLVM is a collection of open source projects that provide compiler infrastructure. It is written in C++ and consists of modular, reusable compiler components and toolchains for building compiler front ends and back ends. It is designed for compile-time, link-time, run-time, and "idle-time" optimization of programs written in arbitrary programming languages.

The name LLVM originally stood for Low Level Virtual Machine, which leads people unfamiliar with it to assume that it is a virtual machine similar to the JVM (Java Virtual Machine). In fact, the project's scope is not limited to building a virtual machine: it includes a series of compilation tools and low-level tool technologies such as the LLVM intermediate representation (LLVM IR), the LLVM debugging tools, the LLVM C++ standard library, and so on.

A traditional static compiler uses a three-phase design whose main components are the front end, the optimizer, and the back end.

Figure: Traditional static compiler design

The front end is responsible for functions such as lexical analysis, syntax analysis, semantic analysis, and generation of intermediate code.
The optimizer is responsible for making various transformations to try to improve the runtime of the code, such as eliminating redundant computations, and is usually more or less language and target independent.
The back end (also known as the code generator) is responsible for mapping the code to the target instruction set. In addition to generating correct code, it is also responsible for generating good code that takes advantage of the unusual features of the supported architecture. Common parts of a compiler back end include instruction selection, register allocation, and instruction scheduling.

This model applies equally to interpreters and JIT compilers. The JVM is also an implementation of this model, using Java bytecode as the interface between the front end and the optimizer.

LLVM is designed to support multiple source languages and multiple target architectures, and it provides an intermediate representation suited to this purpose. If a compiler uses this common representation in its optimizer, a front end can be written for any language that can be compiled to it, and a back end can be written for any target that can be compiled from it.

Figure: LLVM architecture design

With this design, porting a compiler to support a new source language only requires implementing a new front end, while the existing optimizer and back end are reused; similarly, adding support for a new target architecture only requires implementing a new back end. If the front end and back end were coupled together, as in many traditional designs, implementing a new source language or supporting a new target architecture would require starting from scratch, and supporting N targets and M source languages would require N*M compilers instead of M front ends plus N back ends.
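The decoupling described above can be illustrated with a toy example in C. This is only a sketch under stated assumptions: the `Instr` struct, the two front ends, and the two back ends are hypothetical and far simpler than anything in LLVM. The point is that every front end produces the same representation and every back end consumes it, so any front end can be paired with any back end.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy shared IR (hypothetical): one binary operation on two constants. */
typedef struct {
    char op;        /* '+' or '-' */
    int lhs, rhs;
} Instr;

typedef int  (*FrontEnd)(const char *src, Instr *out);
typedef void (*BackEnd)(const Instr *in, char *buf, size_t n);

/* Only '+' and '-' exist in this toy IR. */
int valid_op(char op) {
    return op != '\0' && strchr("+-", op) != NULL;
}

/* Front end 1: infix source such as "3 + 4". Returns 1 on success. */
int parse_infix(const char *src, Instr *out) {
    return sscanf(src, "%d %c %d", &out->lhs, &out->op, &out->rhs) == 3
        && valid_op(out->op);
}

/* Front end 2: prefix source such as "+ 3 4". Returns 1 on success. */
int parse_prefix(const char *src, Instr *out) {
    return sscanf(src, " %c %d %d", &out->op, &out->lhs, &out->rhs) == 3
        && valid_op(out->op);
}

/* Back end 1: text for an imaginary stack machine. */
void emit_stack(const Instr *in, char *buf, size_t n) {
    assert(valid_op(in->op));
    snprintf(buf, n, "push %d; push %d; %s",
             in->lhs, in->rhs, in->op == '+' ? "add" : "sub");
}

/* Back end 2: text for an imaginary register machine. */
void emit_register(const Instr *in, char *buf, size_t n) {
    assert(valid_op(in->op));
    snprintf(buf, n, "%s r0, %d, %d",
             in->op == '+' ? "add" : "sub", in->lhs, in->rhs);
}

/* The driver never needs to know which front end or back end it was given. */
int compile(FrontEnd fe, BackEnd be, const char *src, char *buf, size_t n) {
    Instr ir;
    if (!fe(src, &ir)) return 0;   /* parse error */
    be(&ir, buf, n);
    return 1;
}
```

For example, `compile(parse_prefix, emit_register, "+ 3 4", buf, sizeof buf)` and `compile(parse_infix, emit_register, "3 + 4", buf, sizeof buf)` fill `buf` with the same text. Two front ends and two back ends give four working combinations from four components, rather than four separately written compilers.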

LLVM IR

LLVM provides an intermediate representation (IR) suitable for compiler systems. A large number of transformations and optimizations are implemented around it. After transformation and optimization, the IR can be converted into assembly code for the target platform.

The IR is independent of any specific source language, target instruction set, and source-language type system. Its instructions are in static single assignment (SSA) form, that is, each variable is assigned a value exactly once. This simplifies the analysis of dependencies between variables.
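To make the single-assignment idea concrete, here is a small sketch in C rather than in IR. The function names are made up for illustration. The first version reuses one variable; the second gives every assignment a fresh name, which is what SSA form requires.

```c
#include <assert.h>

/* Not SSA: x is assigned three times, so what "x" means depends on
   where in the function you are looking. */
int compute_reassign(int a, int b) {
    int x = a + b;
    x = x * 2;
    x = x - a;
    return x;
}

/* SSA style: each name is assigned exactly once, so every use refers
   to exactly one definition. */
int compute_ssa(int a, int b) {
    const int x1 = a + b;
    const int x2 = x1 * 2;
    const int x3 = x2 - a;
    return x3;
}

/* Sanity check: renaming does not change the result. */
void check_equivalent(int a, int b) {
    assert(compute_reassign(a, b) == compute_ssa(a, b));
}
```

In the SSA version, a tool that wants to know where `x2` comes from has only one place to look, which is the dependency-analysis benefit mentioned above.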

Here is a simple example of LLVM IR:

define i32 @add1(i32 %a, i32 %b) {
entry:
  %tmp1 = add i32 %a, %b
  ret i32 %tmp1
}

define i32 @add2(i32 %a, i32 %b) {
entry:
  %tmp1 = icmp eq i32 %a, 0
  br i1 %tmp1, label %done, label %recurse

recurse:
  %tmp2 = sub i32 %a, 1
  %tmp3 = add i32 %b, 1
  %tmp4 = call i32 @add2(i32 %tmp2, i32 %tmp3)
  ret i32 %tmp4

done:
  ret i32 %b
}

The C code corresponding to the IR above is:

unsigned add1(unsigned a, unsigned b) {
  return a+b;
}

unsigned add2(unsigned a, unsigned b) {
  if (a == 0) return b;
  return add2(a-1, b+1);
}
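The two functions compute the same sum: add2 moves one unit from a to b on each recursive call until a reaches zero. The sketch below repeats both definitions so it is self-contained and adds two hypothetical helpers, `agree_up_to` and `check_wraparound`, that compare them. C unsigned arithmetic wraps around on overflow, which is also how the plain add and sub instructions in the IR behave.

```c
#include <assert.h>

unsigned add1(unsigned a, unsigned b) {
    return a + b;
}

unsigned add2(unsigned a, unsigned b) {
    if (a == 0) return b;
    return add2(a - 1, b + 1);
}

/* Returns 1 if add1 and add2 agree for every a in [0, limit].
   Keep limit small: add2 recurses once per unit of a. */
int agree_up_to(unsigned limit, unsigned b) {
    for (unsigned a = 0; a <= limit; ++a) {
        if (add1(a, b) != add2(a, b)) return 0;
    }
    return 1;
}

/* Unsigned overflow wraps around, and both versions wrap identically. */
void check_wraparound(void) {
    const unsigned max = (unsigned)-1;   /* largest unsigned value */
    assert(add1(5, max) == 4);
    assert(add2(5, max) == 4);
}
```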

As you can see from this example, LLVM IR is a strongly typed, RISC-like instruction set. Like a real RISC instruction set, it supports linear sequences of simple instructions such as addition, subtraction, comparison, and branching. These instructions are in three-address form, which means they take some number of inputs and produce a result in a different register. LLVM IR supports labels and generally looks like an unusual form of assembly language.

Unlike most RISC instruction sets, LLVM IR is strongly typed using a simple type system (e.g., i32 is a 32-bit integer, i32** is a pointer to a pointer to a 32-bit integer), and some details of the machine are abstracted away. For example, the calling convention is abstracted through the call and ret instructions and explicit arguments. Another significant difference from machine code is that LLVM IR does not use a fixed set of named registers; it uses an infinite set of temporaries named with the % character.
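The three-address form and the unbounded supply of temporaries described above can be sketched with a toy interpreter in C. Everything here is an assumption made for illustration: the `Instr` layout, the `run` function, and the pseudo-IR in the comments are not LLVM's API or exact syntax. Each instruction reads two temporaries and writes its result to a new one.

```c
#include <assert.h>

enum Op { OP_ADD, OP_SUB, OP_MUL };

/* One three-address instruction: t[dst] = t[lhs] op t[rhs]. */
typedef struct {
    enum Op op;
    int dst, lhs, rhs;   /* indices of temporaries, like %0, %1, ... */
} Instr;

/* Executes a straight-line program of n >= 1 instructions over the
   temporaries t[0..ntemps-1] and returns the value written by the
   last instruction. */
int run(const Instr *prog, int n, int *t, int ntemps) {
    assert(n >= 1);
    for (int i = 0; i < n; ++i) {
        const Instr *in = &prog[i];
        assert(in->dst >= 0 && in->dst < ntemps);
        assert(in->lhs >= 0 && in->lhs < ntemps);
        assert(in->rhs >= 0 && in->rhs < ntemps);
        switch (in->op) {
        case OP_ADD: t[in->dst] = t[in->lhs] + t[in->rhs]; break;
        case OP_SUB: t[in->dst] = t[in->lhs] - t[in->rhs]; break;
        case OP_MUL: t[in->dst] = t[in->lhs] * t[in->rhs]; break;
        }
    }
    return t[prog[n - 1].dst];
}

/* (a + b) * (a - b) broken into three simple instructions.
   Temporaries 0 and 1 hold the inputs a and b. */
int diff_of_squares(int a, int b) {
    const Instr prog[] = {
        { OP_ADD, 2, 0, 1 },   /* %2 = add %0, %1 */
        { OP_SUB, 3, 0, 1 },   /* %3 = sub %0, %1 */
        { OP_MUL, 4, 2, 3 },   /* %4 = mul %2, %3 */
    };
    int t[5] = { a, b, 0, 0, 0 };
    return run(prog, 3, t, 5);
}
```

Because no instruction overwrites an existing temporary, the program needs a fresh slot for every result. A real back end later maps these temporaries onto the limited registers of the target during register allocation.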

LLVM IR has three equivalent representations: human-readable textual assembly, in-memory C++ objects, and serialized bitcode.

Compilation

LLVM allows code to be compiled statically, as with the traditional GCC system, or the intermediate representation can be left for later compilation to machine code through a just-in-time (JIT) compilation mechanism, similar to Java.

The LLVM type system includes basic types (such as integers and floating-point numbers) and five derived types (pointers, arrays, vectors, structures, and functions). Type constructs in a concrete language are represented by combining these types. For example, a class in C++ can be represented by a combination of structures, functions, and arrays of function pointers.
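As a rough picture of how a class can be built from structures, functions, and an array of function pointers, here is a hand-written C sketch. It assumes a hypothetical C++ class `Shape` with one virtual method `area()`; the names and layout are invented for illustration and do not reproduce what Clang actually emits, which follows the platform's C++ ABI and is more involved.

```c
#include <assert.h>

struct Shape;

/* A method is a plain function that takes the object as its first argument. */
typedef int (*Method)(const struct Shape *self);

/* Slot numbers in the per-class array of function pointers. */
enum { SLOT_AREA = 0, SLOT_COUNT };

/* The object: a pointer to its class's function-pointer array, then data. */
struct Shape {
    const Method *vtable;
    int w, h;
};

int rect_area(const struct Shape *s)     { return s->w * s->h; }
int triangle_area(const struct Shape *s) { return s->w * s->h / 2; }

/* One array of function pointers per class, shared by all its objects. */
const Method RECT_VTABLE[SLOT_COUNT]     = { rect_area };
const Method TRIANGLE_VTABLE[SLOT_COUNT] = { triangle_area };

struct Shape make_rect(int w, int h) {
    struct Shape s = { RECT_VTABLE, w, h };
    return s;
}

struct Shape make_triangle(int w, int h) {
    struct Shape s = { TRIANGLE_VTABLE, w, h };
    return s;
}

/* A "virtual call": load the function pointer from the array, then call it. */
int area(const struct Shape *s) {
    assert(s && s->vtable);
    return s->vtable[SLOT_AREA](s);
}
```

Calling `area` on a value from `make_rect(4, 3)` dispatches to `rect_area`, while the same call on a value from `make_triangle(4, 3)` dispatches to `triangle_area`, using only structures, functions, and function pointers.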

LLVM provides Clang as its official compiler front end, supporting C, C++, Objective-C, and Objective-C++. Mainly sponsored by Apple, Clang aims to replace the C/Objective-C compiler in the GCC system with one that integrates more easily with integrated development environments (IDEs) and has better support for multithreading. Many GCC front ends have been modified to work with LLVM, and LLVM currently supports compiling languages such as Ada, C, C++, D, Fortran, Haskell, Julia, Objective-C, Rust, and Swift.


张凯强