Introduction
While reading the Babel documentation, I came across The Super Tiny Compiler. I found the explanations in its comments quite easy to follow, so they are reproduced here.
Why care about the compiler
In their daily work, most people never have to think about compilers, and that is perfectly normal. Still, compilers are all around you: many of the tools you use are built on concepts borrowed from compilers.
Compilers are scary
Compilers really are scary. But that is our own fault (the people who write compilers). We took something reasonably straightforward and made it so complicated and intimidating that most people think of it as completely unapproachable, something only the nerdiest of nerds can understand.
Where should I start?
Start by writing the simplest compiler possible. This compiler is very small: with all the comments removed, it is only about 200 lines of code.
We are going to write a compiler that converts LISP-like function calls into C-like function calls.
If you are not familiar with one or the other, here is a brief introduction.
If we have two functions, `add` and `subtract`, they would be written like this:
example | LISP | C |
---|---|---|
2 + 2 | (add 2 2) | add(2, 2) |
4 - 2 | (subtract 4 2) | subtract(4, 2) |
2 + (4 - 2) | (add 2 (subtract 4 2)) | add(2, subtract(4, 2)) |
Easy, right? Good, because this is exactly what we are going to compile. It is neither complete LISP nor complete C, but it is enough syntax to demonstrate the main parts of most modern compilers.
Three stages of compilation
Most compilers break down into three primary stages: parsing, transformation, and code generation.
- Parsing: take the raw code and turn it into a more abstract representation of that code.
- Transformation: take the abstract representation and manipulate it to do whatever the compiler wants.
- Code generation: take the transformed representation and turn it into new code.
Parsing
Parsing is usually divided into two phases: lexical analysis and syntactic analysis.
- Lexical analysis: take the raw code and split it into individual pieces called tokens, using a tokenizer (also called a lexer). The tokens form an array of small objects, each describing one isolated piece of the syntax: a number, a name, a punctuation mark, an operator, and so on.
- Syntactic analysis: take the tokens and reshape them into a representation that describes each part of the syntax and how the parts relate to one another. This is known as an intermediate representation or abstract syntax tree (AST). An AST is a deeply nested object that represents the code in a way that is easy to work with and carries a lot of information.
For example, the following syntax:
(add 2 (subtract 4 2))
The tokens might look like this:
[
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'subtract' },
  { type: 'number', value: '4' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' },
]
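As a rough illustration of lexical analysis, here is a minimal sketch of a tokenizer that produces tokens in the shape shown above. The function name `tokenizer` and the exact character rules (digits for numbers, ASCII letters for names) are assumptions made for this example, not a definitive implementation.

```javascript
// Hypothetical tokenizer: splits LISP-style input into 'paren', 'name'
// and 'number' tokens, matching the token shapes shown above.
function tokenizer(input) {
  const tokens = [];
  let current = 0;

  while (current < input.length) {
    let char = input[current];

    // Parentheses each become their own 'paren' token.
    if (char === '(' || char === ')') {
      tokens.push({ type: 'paren', value: char });
      current++;
      continue;
    }

    // Whitespace only separates tokens, so skip it.
    if (/\s/.test(char)) {
      current++;
      continue;
    }

    // A run of digits becomes a single 'number' token.
    if (/[0-9]/.test(char)) {
      let value = '';
      while (current < input.length && /[0-9]/.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'number', value });
      continue;
    }

    // A run of letters becomes a single 'name' token.
    if (/[a-z]/i.test(char)) {
      let value = '';
      while (current < input.length && /[a-z]/i.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'name', value });
      continue;
    }

    throw new TypeError('Unexpected character: ' + char);
  }

  return tokens;
}

const tokens = tokenizer('(add 2 (subtract 4 2))');
console.log(tokens.length); // 9
```

Number values stay as strings here, which keeps the tokens identical to the listing above.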
The abstract syntax tree might look like this:
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2',
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4',
      }, {
        type: 'NumberLiteral',
        value: '2',
      }]
    }]
  }]
}
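To make syntactic analysis more concrete, the following is a minimal sketch of a recursive-descent parser that turns the token array into the AST shape above. The name `parser`, the inner `walk` helper, and the error messages are assumptions for illustration only.

```javascript
// Hypothetical parser: consumes tokens and builds the nested AST.
function parser(tokens) {
  let current = 0;

  // Parses one expression starting at tokens[current].
  function walk() {
    let token = tokens[current];
    if (token === undefined) {
      throw new SyntaxError('Unexpected end of input');
    }

    // A number token maps directly to a NumberLiteral node.
    if (token.type === 'number') {
      current++;
      return { type: 'NumberLiteral', value: token.value };
    }

    // An opening paren starts a call: ( name param* )
    if (token.type === 'paren' && token.value === '(') {
      token = tokens[++current];
      if (!token || token.type !== 'name') {
        throw new SyntaxError('Expected a function name after "("');
      }
      const node = { type: 'CallExpression', name: token.value, params: [] };

      token = tokens[++current];
      // Collect params (possibly nested calls) until the closing paren.
      while (token && !(token.type === 'paren' && token.value === ')')) {
        node.params.push(walk());
        token = tokens[current];
      }
      if (!token) {
        throw new SyntaxError('Missing closing ")"');
      }
      current++; // skip the closing paren
      return node;
    }

    throw new SyntaxError('Unexpected token: ' + token.value);
  }

  const ast = { type: 'Program', body: [] };
  while (current < tokens.length) {
    ast.body.push(walk());
  }
  return ast;
}

// The tokens for (add 2 (subtract 4 2)), as listed earlier.
const tokens = [
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'subtract' },
  { type: 'number', value: '4' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' },
];

const ast = parser(tokens);
console.log(JSON.stringify(ast, null, 2));
```

The recursion in `walk` is what produces the nesting: each nested call expression is parsed by a nested call to `walk`.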
Transformation
The next stage of compilation is transformation. This step simply takes the AST from the previous step and changes it. It can manipulate the AST within the same language, or translate it into an entirely new language.
Let us see how to convert an AST.
You may notice that some elements in our AST look very similar. They are objects with a `type` property, and each such object is called an AST node. These nodes carry properties that describe one isolated part of the tree.
We can have a node for a `NumberLiteral`:
{
  type: 'NumberLiteral',
  value: '2',
}
Or a node for a `CallExpression`:
{
  type: 'CallExpression',
  name: 'subtract',
  params: [...nested nodes go here...],
}
When transforming the AST, we can manipulate nodes by adding, removing, or replacing their properties. We can also add new nodes, remove nodes, or leave the existing AST alone and create an entirely new one based on it.
Because we are targeting a new language, we are going to create an entirely new AST specific to that language.
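As a rough sketch of what "creating a new AST" can mean, the function below builds a separate tree from the LISP-style one without modifying it. The target node shapes used here (`ExpressionStatement`, `callee`, `Identifier`, `arguments`) are assumptions chosen to resemble a C-like call; they are not defined anywhere in this article, and `transform` is a hypothetical name.

```javascript
// Hypothetical transformer: returns a new AST and leaves the input untouched.
// Assumed target shapes: CallExpression { callee, arguments } wrapped in an
// ExpressionStatement at the top level.
function transform(node) {
  switch (node.type) {
    case 'Program':
      return {
        type: 'Program',
        // Each top-level expression becomes a statement in the new tree.
        body: node.body.map(child => ({
          type: 'ExpressionStatement',
          expression: transform(child),
        })),
      };

    case 'CallExpression':
      return {
        type: 'CallExpression',
        callee: { type: 'Identifier', name: node.name },
        arguments: node.params.map(child => transform(child)),
      };

    case 'NumberLiteral':
      return { type: 'NumberLiteral', value: node.value };

    default:
      throw new TypeError('Unknown node type: ' + node.type);
  }
}

const newAst = transform({
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'subtract',
    params: [
      { type: 'NumberLiteral', value: '4' },
      { type: 'NumberLiteral', value: '2' },
    ],
  }],
});
console.log(JSON.stringify(newAst, null, 2));
```

This version recurses directly over the tree for brevity; a visitor-based transformer would produce the same kind of result.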
Traversal
To reach all of these nodes, we need to be able to traverse them. The traversal visits every node of the AST, depth-first.
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}
For the AST above, the traversal goes like this:
- Program: start at the top level of the AST
- CallExpression (add): move to the first element of the Program's body
- NumberLiteral (2): move to the first element of the params of CallExpression (add)
- CallExpression (subtract): move to the second element of the params of CallExpression (add)
- NumberLiteral (4): move to the first element of the params of CallExpression (subtract)
- NumberLiteral (2): move to the second element of the params of CallExpression (subtract)
If we were manipulating this AST directly instead of creating a separate one, we would probably introduce all sorts of abstractions here. For what we are trying to do, though, simply visiting every node of the tree is enough.
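The visiting order listed above can be reproduced with a short recursive function. This is a minimal sketch; `visitOrder` is a hypothetical helper that only records a label for each node as it is reached.

```javascript
// The AST for (add 2 (subtract 4 2)).
const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
};

// Hypothetical helper: records each node depth-first, parent before children.
function visitOrder(node, order = []) {
  if (node.type === 'Program') {
    order.push('Program');
    node.body.forEach(child => visitOrder(child, order));
  } else if (node.type === 'CallExpression') {
    order.push(`CallExpression (${node.name})`);
    node.params.forEach(child => visitOrder(child, order));
  } else if (node.type === 'NumberLiteral') {
    order.push(`NumberLiteral (${node.value})`);
  } else {
    throw new TypeError('Unknown node type: ' + node.type);
  }
  return order;
}

console.log(visitOrder(ast).join('\n'));
// Program
// CallExpression (add)
// NumberLiteral (2)
// CallExpression (subtract)
// NumberLiteral (4)
// NumberLiteral (2)
```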
Visitors
The basic idea here is to create a "visitor" object with methods that accept different node types.
var visitor = {
  NumberLiteral() {},
  CallExpression() {},
};
When we traverse the AST, we call the matching method on this visitor whenever we "enter" a node of that type.
To make this useful, we also pass in the node itself and a reference to its parent node.
var visitor = {
  NumberLiteral(node, parent) {},
  CallExpression(node, parent) {},
};
There is also the possibility of calling things on "exit". Imagine our tree structure like this:
- Program
  - CallExpression
    - NumberLiteral
    - CallExpression
      - NumberLiteral
      - NumberLiteral
As we traverse downward, we eventually reach branches with dead ends. Each time we finish a branch of the tree, we "exit" it. So on the way down the tree we "enter" each node, and on the way back up we "exit".
-> Program (enter)
  -> CallExpression (enter)
    -> NumberLiteral (enter)
    <- NumberLiteral (exit)
    -> CallExpression (enter)
      -> NumberLiteral (enter)
      <- NumberLiteral (exit)
      -> NumberLiteral (enter)
      <- NumberLiteral (exit)
    <- CallExpression (exit)
  <- CallExpression (exit)
<- Program (exit)
In order to support this feature, our visitor will look like this in the end:
var visitor = {
  NumberLiteral: {
    enter(node, parent) {},
    exit(node, parent) {},
  }
};
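A minimal sketch of a traverser that drives such a visitor might look like the following. The names `traverser` and `traverseNode` are assumptions for this example; the visitor below simply logs every enter and exit so the order can be compared with the trace shown earlier.

```javascript
// Hypothetical traverser: walks the AST depth-first and calls the visitor's
// enter/exit methods for each node type that has them.
function traverser(ast, visitor) {
  function traverseNode(node, parent) {
    const methods = visitor[node.type];

    if (methods && methods.enter) {
      methods.enter(node, parent);
    }

    // Descend into whichever property holds this node's children.
    switch (node.type) {
      case 'Program':
        node.body.forEach(child => traverseNode(child, node));
        break;
      case 'CallExpression':
        node.params.forEach(child => traverseNode(child, node));
        break;
      case 'NumberLiteral':
        break; // leaf node, nothing to descend into
      default:
        throw new TypeError('Unknown node type: ' + node.type);
    }

    if (methods && methods.exit) {
      methods.exit(node, parent);
    }
  }

  // The root node has no parent.
  traverseNode(ast, null);
}

const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
};

// A visitor that records every enter and exit.
const log = [];
const logging = {
  enter(node) { log.push('-> ' + node.type); },
  exit(node) { log.push('<- ' + node.type); },
};
traverser(ast, {
  Program: logging,
  CallExpression: logging,
  NumberLiteral: logging,
});
console.log(log.join('\n'));
```

Because `enter` runs before the children are visited and `exit` runs after, a parent's exit always comes after the exits of all of its children.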
Code Generation
The last stage of compilation is code generation. Sometimes compilers do things here that overlap with transformation, but for the most part code generation just means taking the AST and turning it into a string of code.
Code generators work in several different ways. Some compilers reuse the tokens from earlier, and others create a separate representation of the code so that they can print nodes linearly. From what I can tell, though, most use the same AST we just created, and that is what we will focus on.
Our code generator will effectively know how to "print" every node type of the AST, and it will call itself recursively to print nested nodes until everything has been printed into one long string of code.
That's all! Those are the different parts of a compiler. Not every compiler looks exactly like what I have described here. Compilers serve many different purposes, and they may need more steps than I have covered. But you should now have a general high-level idea of what most compilers look like.
Now that I have explained all of this, you should be able to write your own compiler, right? Just kidding, that is what I am here to help with. So let's get started!
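The recursive "printing" described above can be sketched in a few lines. As a simplifying assumption, this hypothetical `codeGenerator` prints C-style calls directly from the LISP-style AST used throughout this article, rather than from a separately transformed AST.

```javascript
// Hypothetical code generator: turns each node into a string and recurses
// into nested nodes.
function codeGenerator(node) {
  switch (node.type) {
    // Each top-level expression becomes one statement on its own line.
    case 'Program':
      return node.body.map(child => codeGenerator(child) + ';').join('\n');

    // name(param, param, ...), with each param printed recursively.
    case 'CallExpression':
      return (
        node.name +
        '(' +
        node.params.map(child => codeGenerator(child)).join(', ') +
        ')'
      );

    // Numbers print as their own value.
    case 'NumberLiteral':
      return node.value;

    default:
      throw new TypeError('Unknown node type: ' + node.type);
  }
}

const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
};

const output = codeGenerator(ast);
console.log(output); // add(2, subtract(4, 2));
```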
Compiler example
See the-compiler-js.