
Preface

A series of articles on the principles of Go compilation, trying to understand the whole process by which a Go source file (.go) is compiled by the compiler, that is, the eleven stages shown below

Follow the public account "IT monkey circle" and reply "Go compilation principle series 1" in the background to get the PDF version

Untitled.png
Image source: "Analysis of the underlying principles of the Go language"

This series of articles will first share, from the perspective of compilation principles, the usual stages of compiling a high-level language and what each stage does; then switch back to how the Go compiler compiles Go source files and what is unique about it; and finally, since the author is also familiar with PHP, roughly share how PHP code is parsed and executed. PHP happens to be an interpreted language, so it can be compared with a compiled language like Go.

Long text warning!!!

In summary, this series of articles will contain the following topics

  1. Overview of compilation principles
  2. Basic knowledge of lexical analysis & grammatical analysis
  3. Go compilation process-lexical analysis
  4. Go compilation process-syntax analysis
  5. Go compilation process-abstract syntax tree construction
  6. Go compilation process-type checking
  7. Go compilation process-variable capture
  8. Go compilation process-function inlining
  9. Go compilation process-escape analysis
  10. Go compilation process-closure rewriting
  11. Go compilation process-traversal function
  12. Go compilation process-SSA generation
  13. Go compilation process-machine code generation
  14. Interpretation and execution of PHP code-lexical & grammatical analysis
  15. Interpretation and execution of PHP code-opcode
  16. Interpretation and execution of PHP code-Zend
  17. Compiled language and interpreted language comparison
  18. Summary

In order to avoid the content being too boring, I will try to include diagrams wherever relevant

Introduction to the compilation phase of traditional compilers

We know that code written in a high-level language can be understood by us, but not by the computer. It must first be translated into a form the computer can execute; the software that performs this translation is collectively called a compiler

Compilation principles, in fact, describe how to design and implement a compiler, and the ideas of compiler design can also be applied to many fields other than compilers themselves

The most familiar example is the template engine used in PHP to separate interface design from code. The template engine compiles a template into executable PHP code. If you understand compilation technology, it is easier to master these template engines, and even to write template engines better suited to your domain's needs

Ideas from compilation principles also appear in database software and big data platforms. So the point of learning compilation principles is not necessarily to write a compiler; the same goes for learning other computer fundamentals

Language processor

This part mainly covers what compilers and interpreters are, what other programs may be involved in translating a source program into target machine code, and what each of those programs does

Translator

A compiler is actually a program. At a macro level, it can read a program written in one language (the source language) and translate it into an equivalent program written in another language (the target language).
Untitled 1.png

Note: If the target program is an executable machine language program, then it can be called by the user to process input and produce output

Untitled 2.png

Interpreter

An interpreter is another common kind of language processor. It does not produce a target program through translation; from the user's point of view, the interpreter directly executes the operations specified in the source program on the input provided by the user. In mapping user input to output, a machine-language target program produced by a compiler is usually much faster than an interpreter. However, an interpreter can usually give better error diagnostics than a compiler, because it executes the program statement by statement

Untitled 3.png

Example

The Java language processor combines compilation and interpretation. A Java source program is first compiled into an intermediate representation called bytecode, and a virtual machine then interprets and executes the bytecode. One advantage of this arrangement is that bytecode compiled on one machine can be interpreted and executed on another; the migration between machines can be done over a network

To process input to output faster, some just-in-time (JIT) compilers translate the bytecode into machine language immediately before running the intermediate program on the input, and then execute that machine code

Untitled 4.png

Besides the compiler, several other programs are needed to create an executable target program. For example, a source program may be divided into multiple modules stored in different files. The task of aggregating the source files is usually done by a program called a preprocessor, which is also responsible for expanding abbreviated forms called macros into source-language statements (as in C and C++)

The preprocessed source program is then passed as input to the compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to output and debug. The assembly-language program is then processed by a program called an assembler, which generates relocatable machine code

The starting position in memory of the machine code generated by the assembler is not fixed; all addresses in the code are relative to this starting position: start address + relative address = absolute address (for more on what relocatable machine code is, refer to this article)

Large programs are often divided into multiple parts that are compiled separately, so the relocatable machine code must be linked with other relocatable object files and library files to form the code that actually runs on the machine. Code in one file may refer to a location in another file; the linker resolves these external memory addresses (an external memory address is the address of a data object or procedure in another file that code in the current file references). Finally, the loader puts all the executable object files into memory for execution

Untitled 5.png

The structure of a compiler

This part roughly covers what the steps of the compiler's compilation process are and what each step does. It may lean toward theory, but I will try to combine it with examples to make it easier to understand, and to point out where these designs and algorithms show up in everyday work

Compiler structure overview

The following example is adapted from the Compilation Principles course of Harbin Institute of Technology

How does a compiler translate a high-level-language program into a machine-language program? We can start by looking at how we manually translate English into Chinese

In the room, he broke a window with a hammer

This English sentence can be viewed as the source language, and Chinese as the target language. Our translation process is roughly divided into two steps

Untitled 6.png

The process of obtaining a sentence's meaning by analyzing the source language is semantic analysis. Semantic analysis usually starts by identifying the sentence constituents. First grasp the core predicate verb of the sentence, because once the meaning of the predicate verb is known, half of the sentence's meaning is known. The predicate verb in the sentence above is "broke". Knowing the action is hitting, we then want to know: who performed the action? What was hit? With what tool? Why? With what result? And so on

These can all be obtained by analyzing the context of "broke". In the sentence above, "broke" is in the active voice, so its subject "he" is the actual performer of the action, and its object "window" is the recipient. Conversely, if it were in the passive voice ("be broken"), then its subject would be the recipient of the action

"With a hammer" is the complement, indicating the tool used in the action, and "in the room" is the adverbial, indicating the place where the action occurs. In this way we can work out the semantic relationships between these nominal components and the predicate verb "broke" (this is essentially our process of semantic analysis), as in the picture below

Untitled 7.png

The node in the center of the figure represents the action described in the sentence. The surrounding four nodes correspond to the entities in the sentence: he, window, hammer, and room. Four edges are drawn from the middle node to the surrounding nodes; the information on each edge describes the relationship between that entity and the core predicate verb: he is the agent of the action, window is the recipient of the action, hammer is the tool used, and room is the location where the action takes place

Rendered from this graph, the Chinese translation is: in the room, he hit a window with a hammer. This completes the translation process. The graph above is a kind of intermediate representation, independent of any specific language; that is, English can be represented by this graph, and so can Chinese, Japanese, French, and Italian. With this graph, whatever the target language is, you can translate from it. The intermediate representation is therefore very important: it acts as a bridge

From the analysis above, we can see that to perform semantic analysis we must first divide the sentence into its constituents. The subject and object are usually noun phrases, and the adverbial and complement are usually prepositional phrases; so, to divide the sentence into constituents, you need to recognize the various phrases in it. This process is called syntax analysis. And to recognize the phrases in a sentence, you need to know the part of speech of each word

Untitled 8.png

For example, an article plus a noun can form a noun phrase, and a pronoun by itself can also form a noun phrase. Therefore, to identify the various phrases in a sentence, the key is to determine the part of speech of each word; this process is lexical analysis

In summary, to translate a sentence, lexical analysis is needed first, then syntax analysis on top of it, and then semantic analysis. In other words, the first concrete step of translation is lexical analysis: find out the part of speech of each word in the sentence

Untitled 9.png

Then perform grammatical analysis

Untitled 10.png

Then there is semantic analysis. According to the structure of the sentence, it analyzes what components each phrase plays in the sentence, so as to determine the semantic relationship between each nominal component and the core predicate verb

Untitled 11.png

Finally get the intermediate representation

Untitled 7.png

A compiler's compilation process goes through the same stages as the translation above

Untitled 12.png

Lexical analysis, syntax analysis, semantic analysis, and intermediate code generation make up the compiler front end, which is related to the source language. Target code generation and machine-related code optimization make up the compiler back end, which is related to the target language

We can think of the compiler as a black box that maps source programs to semantically equivalent target programs. This mapping is divided between two components: the compiler front end and the compiler back end

Compiler front end

The compiler front end decomposes the source program into its constituent elements and imposes a grammatical structure on them, then uses this structure to create an intermediate representation of the source program. If the front end detects that the source program does not follow the correct syntax or is semantically inconsistent, it must provide useful messages so the user can correct the program. The front end also collects information about the source program and stores it in a symbol table, which is passed to the compiler back end together with the intermediate representation (a small illustration using Go's standard library follows)
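To make this concrete, here is a minimal sketch of "front end" work using Go's standard library: go/parser performs the lexical and syntax analysis to build the AST, and go/types performs the semantic checking and records what each identifier defines, which is symbol-table-like information. This is the exported tooling API, not the compiler's internal front end (which lives under cmd/compile/internal), and the demo source string is my own example:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/importer"
	"go/parser"
	"go/token"
	"go/types"
)

const src = `package demo

var rate = 60 * 0.5
`

func main() {
	// Lexical + syntax analysis: source text -> AST.
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}

	// Semantic analysis: type-check the AST and record which object
	// each identifier defines (roughly, symbol table information).
	conf := types.Config{Importer: importer.Default()}
	info := &types.Info{Defs: map[*ast.Ident]types.Object{}}
	if _, err := conf.Check("demo", fset, []*ast.File{f}, info); err != nil {
		panic(err)
	}
	for ident, obj := range info.Defs {
		if obj != nil {
			fmt.Printf("%s: %v\n", ident.Name, obj.Type()) // rate: float64
		}
	}
}
```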

Compiler back end

The back-end part of the compiler constructs the target program that the user expects based on the intermediate representation and the information in the symbol table

<aside>
💡 Tips: Some compilers have a machine-independent optimization step between the front end and the back end. The purpose of this step is to transform the intermediate representation so that the back end can produce a better target program. Optimization is optional

Tips: The above stages are the logical organization of the compiler. In the process of implementation, multiple stages may be combined. For example, the result of semantic analysis is usually directly expressed in the form of intermediate code, so these two stages are usually implemented together

</aside>

Lexical analysis

The task of lexical analysis is to scan the characters of the source program from left to right, line by line, to identify each word and determine its type (morpheme). Each recognized word is converted into a unified internal representation: a lexical unit (token) of the form

<token-name, attribute-value>  (i.e., <category code, attribute value>)

This lexical unit is passed on to the next step, syntax analysis. In a lexical unit:

  • token-name: the type of the recognized word. Just as every word in natural language has a part of speech, the words of a programming language basically fall into the types in the following table
| # | Word type | Examples | Category code scheme | Remark |
| --- | --- | --- | --- | --- |
| 1 | Keyword | if, else, for, then... | One word, one code | Once the programming language is given, its keywords are fixed, so each keyword can be assigned its own category code (Go's category codes are defined in src/cmd/compile/internal/syntax/tokens.go) |
| 2 | Identifier | variable names, array names, function names... | One code for all | Identifiers are an open set; it is impossible to enumerate them all in advance, so all identifiers share one category code (in Go it is _Name). To distinguish identifiers, the token's second component, the attribute value, is used: it is a pointer to a record in the symbol table (the symbol table is described in detail below) |
| 3 | Constant | integer, floating point, character, boolean... | One type, one code | Like identifiers, constants cannot all be enumerated, but the types of constants are limited, so each type of constant is assigned a category code. To distinguish different constants of the same type, the token's attribute value is used as well |
| 4 | Operator | arithmetic (+ - * /), relational (> < = ≠ ≤ ≥), logical (& \| ~) | One word, one code, or one type, one code | Can be determined in advance |
| 5 | Delimiter | ; ( ) { } ... | One word, one code | Can be determined in advance |

  • attribute-value: points to the entry about this lexical unit in the symbol table. Symbol table entry information will be used by semantic analysis and code generation steps
Performing lexical analysis on the following statement yields the result below
for(i:=0;i<10-2.5;i=i-1){println(i)}

1      for      < _For, - >
2      (        < _Lparen, - >
3      i        < _Name, addr >
4      :=       < _Define, - >
5      0        < INT, addr>
6      ;        < _Semi, - >
......
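For comparison, here is a minimal sketch of the same idea using the standard library's go/scanner package (a separate implementation from the compiler's internal scanner in src/cmd/compile/internal/syntax, but it does the same job); the scanned statement is a simplified variant of the one above:

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte(`for i := 0; i < 10; i = i - 1 { println(i) }`)

	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil, 0) // nil: no error handler needed here

	// Each Scan call yields the next token's position, its kind
	// (the category code) and its literal text (the attribute value).
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%-12v %-8v %q\n", fset.Position(pos), tok, lit)
	}
}
```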

Syntax analysis

The syntax analyzer recognizes various phrases from the token sequence output by the lexical analyzer and constructs a syntax analysis tree (parse tree)

Suppose a source file contains the following assignment statement

position = initial + rate * 60   (1.1)

The characters in this assignment statement can be grouped into the following morphemes and mapped into the following lexical units, which are then passed to the syntax analysis stage

  1. position is a morpheme mapped to the lexical unit <id, 1>, where id is the abstract symbol for an identifier, and 1 points to the symbol table entry for position. The symbol table entry for an identifier stores information about it, such as its name and type
  2. The assignment symbol = is a morpheme mapped to the lexical unit <=>. Because this lexical unit needs no attribute value, we omit the second component. An abstract symbol such as assign could also be used as the lexical unit name, but for convenience we use the morpheme itself as the name
  3. initial is a morpheme mapped to the lexical unit <id, 2>, where 2 points to the symbol table entry for initial
  4. + is a morpheme mapped to the lexical unit <+>
  5. rate is a morpheme mapped to the lexical unit <id, 3>, where 3 points to the symbol table entry for rate
  6. * is a morpheme mapped to the lexical unit <*>
  7. 60 is a morpheme mapped to the lexical unit <60>

<aside>
💡 Tips: Spaces separating morphemes will be ignored by the lexical analyzer

</aside>

After lexical analysis, the assignment statement (1.1) is represented as the following sequence of lexical units

**<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>**   (1.2)

In this representation, the lexical unit names =, +, and * are abstract symbols representing the assignment, addition, and multiplication operators (in Go, for example, abstract symbols such as _Assign and _Operator)

As can be seen from the parse tree, an identifier or a constant by itself can form an expression; an expression plus another expression, or multiplied by another expression, forms a larger expression; and an identifier, followed by an assignment sign and then an expression, forms an assignment statement

The compiler uses this grammatical structure to help analyze the source program and generate the target program

(You don’t need to look at the picture below)

Untitled 13.png

Analysis tree of variable declaration statement

Grammar (a grammar is composed of a series of rules):

<D> →  <T><IDS>;
<T> → int | real | char | bool
<IDS> → id | <IDS>, id

D: the first letter of "declaration"; it denotes a declaration statement
T: the first letter of "type"; it denotes a type
IDS: short for "Identifier Sequence"; it denotes a sequence of identifiers

From the first rule above, a declaration statement D consists of a type T followed by an identifier sequence and a semicolon. T can be int, real, char, or bool; the vertical bar in the second rule means "or". From the third rule, an identifier id by itself forms an identifier sequence, and an identifier sequence followed by a comma and another identifier id also forms an identifier sequence IDS

According to this grammar, suppose there is such a piece of code

int a, b, c;

According to the above grammar, you can get its analysis tree

Untitled 14.png

It can be seen that the identifier a by itself forms an identifier sequence IDS, and an IDS followed by a comma and another identifier forms a larger IDS

How does the syntax analyzer construct a parse tree for the input source program according to the grammar rules? This requires a detailed understanding of the grammar-related rules in compilation theory, which I will not go into here; if you are interested, see Chapter 4 of the Compilers book. The parser uses a top-down recursive-descent algorithm to scan the input efficiently without backtracking; a small sketch of the idea follows
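As an illustration of the idea (not the actual Go parser), here is a tiny recursive-descent parser in Go for the toy declaration grammar above. Note that the left-recursive rule IDS → IDS , id has to be rewritten as a loop, since recursive descent cannot handle left recursion directly; all names here are made up for the sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// parser walks a token slice produced by a (here, trivial) lexer.
type parser struct {
	toks []string
	pos  int
}

func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

func (p *parser) next() string {
	t := p.peek()
	p.pos++
	return t
}

// parseD implements D -> T IDS ";", with IDS -> id ("," id)*.
func (p *parser) parseD() ([]string, error) {
	switch t := p.next(); t {
	case "int", "real", "char", "bool": // T
	default:
		return nil, fmt.Errorf("expected a type, got %q", t)
	}
	ids := []string{p.next()} // first id
	for p.peek() == "," {     // the left recursion, as iteration
		p.next() // consume ","
		ids = append(ids, p.next())
	}
	if t := p.next(); t != ";" {
		return nil, fmt.Errorf("expected ';', got %q", t)
	}
	return ids, nil
}

func main() {
	p := &parser{toks: strings.Fields("int a , b , c ;")}
	ids, err := p.parseD()
	fmt.Println(ids, err) // [a b c] <nil>
}
```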

<aside>
💡 Unexpected bonus: I happened to do a binary-tree problem on LeetCode last week, and the grammar-rule + recursive-descent approach from compilation principles solved it: 297. Serialize and Deserialize Binary Tree

</aside>

Semantic Analysis

Semantic analyzer

The semantic analyzer uses the information in the syntax tree and the symbol table to check whether the source program is consistent with the semantics defined by the language. It also collects type information and stores it in the syntax tree or the symbol table for use in the subsequent intermediate code generation

Statements in high-level-language programs fall roughly into two categories: declaration statements and executable statements. A declaration statement declares data objects or procedures and names them, i.e., introduces identifiers

For declaration statements, the main task of semantic analysis is to collect the attribute information of the identifiers. The attributes of an identifier are:

  • Kind: simple variable, compound variable (array, map), function...
  • Type: integer, character, boolean...
  • Storage location and length: the data objects and procedures declared in a program are allocated storage in memory, so each has a storage location and a required memory size
  • Value
  • Scope
  • Parameter and return value information: this is for functions (number of parameters, parameter types, parameter passing method, return value type, etc.)

Suppose there is such a piece of code

var x [8]int
var i, j int
......

Untitled 15.png

In the semantic analysis stage, the collected attribute information of these identifiers will be stored in a data structure called the symbol table. Each identifier corresponds to a record in the symbol table.

Untitled 16.png

The symbol table usually has an accompanying string table that stores the identifiers and character constants used in the program. The Name field is then split into two parts: one stores the identifier's starting position in the string table, and the other stores the identifier's length (for example, the identifier SIMPLE has length 6, the identifier SYMBLE also has length 6, and the identifier TABLE has length 5)

<aside>
💡 Question: an interesting one. Why design a data structure such as the string table, instead of storing the Name string directly in the Name field?

</aside>

Semantic check

An important part of semantic analysis is semantic checking, for example:

  • Use of a variable or function that has not been declared
  • Repeated declaration of a variable or function name
  • Operand types that do not match (for example, adding an array name to a function name; though there may also be legal type conversions)
  • Mismatch between an operator and its operands (an array subscript that is not an integer; a function call whose argument types or count do not match)

A programming language may allow some type conversions, called automatic type conversion (coercion). For example, a binary arithmetic operator may be applied to a pair of integers or a pair of floating-point numbers; if it is applied to a floating-point number and an integer, the compiler can convert the integer to a floating-point number

The figure above actually contains an automatic type conversion. Assume position, initial, and rate have been declared as floating-point types, while the morpheme 60 itself is an integer. The semantic analyzer's type checker finds that the operator * is applied to a floating-point number, rate, and an integer, 60; in this case, the integer can be converted into a floating-point number, as the sketch below illustrates
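Here is a toy sketch of that check: a bottom-up type inference over a two-type expression tree that wraps the int operand in an inttofloat node when an operator mixes the types. The node and type names are my own assumptions, not the book's or the Go compiler's structures:

```go
package main

import "fmt"

type Type int

const (
	Int Type = iota
	Float
)

func (t Type) String() string {
	if t == Int {
		return "int"
	}
	return "float"
}

// Expr is a leaf (Op == "leaf") or an operator node.
type Expr struct {
	Op          string
	Name        string
	Typ         Type
	Left, Right *Expr
}

// check infers the type of e bottom-up, inserting an explicit
// inttofloat conversion node when an operator mixes int and float.
func check(e *Expr) Type {
	if e.Op == "leaf" {
		return e.Typ
	}
	lt, rt := check(e.Left), check(e.Right)
	switch {
	case lt == rt:
		e.Typ = lt
	case lt == Int: // coerce the left operand
		e.Left = &Expr{Op: "inttofloat", Left: e.Left, Typ: Float}
		e.Typ = Float
	default: // coerce the right operand
		e.Right = &Expr{Op: "inttofloat", Left: e.Right, Typ: Float}
		e.Typ = Float
	}
	return e.Typ
}

func main() {
	// rate * 60, where rate is declared float and 60 is an integer
	e := &Expr{Op: "*",
		Left:  &Expr{Op: "leaf", Name: "rate", Typ: Float},
		Right: &Expr{Op: "leaf", Name: "60", Typ: Int},
	}
	fmt.Println(check(e)) // float (60 was wrapped in inttofloat)
}
```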

Intermediate code generation

In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can take many forms. The syntax tree is one such intermediate representation, usually used during syntax analysis and semantic analysis. Another common one is three-address code

After syntax analysis and semantic analysis of the source program are complete, many compilers generate an explicit low-level or machine-like intermediate representation. We can think of this representation as a program for some abstract machine. The intermediate representation should have two important properties: it should be easy to generate, and it should be easy to translate into the target machine's language

Three-address code

This intermediate representation consists of a set of assembly-like instructions, each with at most three operands. Each operand acts like a register. For the example above, the intermediate code generator outputs the following three-address code sequence

position = initial + rate * 60

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3                (1.3)
  1. Each three-address assignment instruction has at most one operator on its right-hand side. These instructions therefore fix the order in which operations are performed; in source program (1.1), the multiplication must be completed before the addition
  2. The compiler generates temporary names to hold the values computed by three-address instructions (a small generator sketch follows this list)
  3. Some three-address instructions have fewer than three operands (such as the first and last instructions of sequence (1.3) above)
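The following is a minimal sketch of such a generator in Go: a post-order walk over an expression tree that emits one instruction per operator and returns the address holding each subresult. It deliberately omits the inttofloat coercion from (1.3); the types and names are invented for the sketch:

```go
package main

import "fmt"

// node is an expression tree node: op is "+", "*", etc.,
// or "" for a leaf holding an identifier or a constant.
type node struct {
	op          string
	name        string
	left, right *node
}

var temp int

func newTemp() string {
	temp++
	return fmt.Sprintf("t%d", temp)
}

// gen emits three-address instructions in post-order and returns
// the address holding the node's value, so the emitted sequence
// fixes the evaluation order.
func gen(n *node) string {
	if n.op == "" {
		return n.name
	}
	l, r := gen(n.left), gen(n.right)
	t := newTemp()
	fmt.Printf("%s = %s %s %s\n", t, l, n.op, r)
	return t
}

func main() {
	// position = initial + rate * 60  (ids as in (1.2))
	rhs := &node{op: "+",
		left: &node{name: "id2"},
		right: &node{op: "*",
			left:  &node{name: "id3"},
			right: &node{name: "60"},
		},
	}
	fmt.Printf("id1 = %s\n", gen(rhs))
	// Output:
	// t1 = id3 * 60
	// t2 = id2 + t1
	// id1 = t2
}
```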

Commonly used three-address instructions:

| # | Instruction type | Form | Remark |
| --- | --- | --- | --- |
| 1 | Assignment | x = y op z / x = op y | op is a binary operator; y and z are the addresses of the two operands, and x is the address where the result is stored (in the second form, op is a unary operator) |
| 2 | Copy | x = y | |
| 3 | Conditional jump | if x relop y goto n | |
| 4 | Unconditional jump | goto n | |
| 5 | Parameter passing | param x | |
| 6 | Function call | call p, n | p is the name of the function, n is the number of parameters |
| 7 | Function return | return x | Jump to the instruction corresponding to address x |
| 8 | Array reference | x = y[i] | y is the name of the array and represents its base address; i is the offset of the array element, not the subscript |
| 9 | Array assignment | x[i] = y | |
| 10 | Address and pointer operations | x = &y / x = *y / *x = y | |

In fact, the names (identifiers) in the source program can be used directly as addresses in three-address instructions, because each identifier's address is stored in the symbol table and can be looked up by name. Constants and compiler-generated temporaries can also be used as addresses in three-address instructions

Common representations of three-address instructions:

  • Quaternion
  • Ternary
  • Indirect ternary

Here we mainly look at the quaternion. It has the form (op, y, z, x): the first component corresponds to the operator of the three-address instruction, and the following three components are its three addresses

Quaternion representation of three-address instructions

| Three-address instruction | Quaternion representation | Remark |
| --- | --- | --- |
| x = y op z | (op, y, z, x) | The last component of the quaternion is the target address of the three-address instruction; the second and third components are the operand addresses |
| x = op y | (op, y, _, x) | |
| x = y | (=, y, _, x) | |
| if x relop y goto n | (relop, x, y, n) | |
| goto n | (goto, _, _, n) | |
| param x | (param, _, _, x) | |
| call p, n | (call, p, n, _) | |
| return x | (return, _, _, x) | |
| x = y[i] | (=[], y, i, x) | |
| x[i] = y | ([]=, y, x, i) | |
| x = &y | (=&, y, _, x) | |
| x = *y | (=*, y, _, x) | |
| *x = y | (*=, y, _, x) | |

In fact, the quaternion representation of a three-address instruction is similar to the intermediate representation of natural language above. Given an action such as "hit", it involves the agent, the recipient, the tool, and the location. In a three-address instruction, the operator plays the role of the sentence's core predicate verb, and the operands play the semantic roles, except that there are at most three operands

You will find that, apart from the assignment instructions, each instruction has only one operator, that is, it performs only one action. A three-address instruction sequence therefore uniquely determines the order in which the operations are completed

Intermediate code generation example

while a < b do
    if c < 5 do
        while x > y do
            z = x + 1;
    else x = y;

The analysis tree generated by the above code is as follows:

Untitled 17.png

The above analysis tree is translated into the intermediate code like this

Instruction number: instruction

100: (j<, a, b, 102)  // conditional jump (j is short for jump): if a < b, jump to instruction 102; otherwise fall through to 101
101: (j, -, -, 112)   // unconditional jump to instruction 112 (exits the whole outer while loop)
102: (j<, c, 5, 104)  // conditional jump: if c < 5, jump to instruction 104; otherwise fall through to 103
103: (j, -, -, 110)   // unconditional jump to instruction 110
104: (j>, x, y, 106)  // conditional jump: if x > y, jump to instruction 106; otherwise fall through to 105
105: (j, -, -, 100)   // unconditional jump to instruction 100
106: (+, x, 1, t1)    // add 1 to the value of x and assign the result to t1, then fall through to 107
107: (=, t1, -, z)    // assign the value of t1 to z
108: (j, -, -, 104)   // unconditional jump to instruction 104
109: (j, -, -, 100)   // unconditional jump to instruction 100
110: (=, y, -, x)     // copy instruction: assign the value of y to x, then fall through to 111
111: (j, -, -, 100)   // unconditional jump to instruction 100
112:

As for how the compiler generates intermediate code from the analysis tree, this involves more abstract concepts such as grammars, context-free grammars, and regular expressions. My main purpose here is to share what each stage does, without studying or implementing it in depth; if you are interested, see Chapter 6 of the Compilers book

Code optimization

The machine-independent code optimization step aims to improve the intermediate code so that better target code can be generated. "Better" usually means faster, but there can be other goals, such as shorter or more energy-efficient target code. For example, a simple and direct algorithm generates the intermediate code (1.3), using one instruction for each operator in the tree-shaped intermediate representation produced by the semantic analyzer

Using a simple intermediate-code generation algorithm followed by a code optimization step is a reasonable way to produce high-quality target code. The optimizer can determine that the conversion of 60 from integer to floating point can be done once and for all at compile time, so replacing the integer 60 with the floating-point number 60.0 eliminates the corresponding inttofloat operation. Moreover, t3 is used only once, to pass its value to id1, so the optimizer can shorten sequence (1.3) into the shorter instruction sequence below (a toy sketch of such a pass follows the code)

t1 = id3 * 60.0
id1 = id2 + t1       (1.4)
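As a taste of what such a pass looks like, here is a toy Go sketch of the first rewrite only: it folds the compile-time inttofloat conversion and propagates the resulting constant into its uses (the copy elimination of t3 is left out; the instruction encoding is an assumption made for the sketch):

```go
package main

import (
	"fmt"
	"strconv"
)

// instr encodes "dst = a op b"; b is empty for unary ops and copies.
type instr struct {
	dst, op, a, b string
}

func isConst(s string) bool {
	_, err := strconv.ParseFloat(s, 64)
	return err == nil
}

// fold removes inttofloat instructions whose operand is a constant
// and substitutes the folded constant into later operands.
func fold(code []instr) []instr {
	consts := map[string]string{}
	var out []instr
	for _, in := range code {
		if in.op == "inttofloat" && isConst(in.a) {
			v, _ := strconv.ParseFloat(in.a, 64)
			consts[in.dst] = strconv.FormatFloat(v, 'f', 1, 64)
			continue // instruction eliminated at compile time
		}
		if c, ok := consts[in.a]; ok {
			in.a = c
		}
		if c, ok := consts[in.b]; ok {
			in.b = c
		}
		out = append(out, in)
	}
	return out
}

func main() {
	code := []instr{ // sequence (1.3)
		{dst: "t1", op: "inttofloat", a: "60"},
		{dst: "t2", op: "*", a: "id3", b: "t1"},
		{dst: "t3", op: "+", a: "id2", b: "t2"},
		{dst: "id1", op: "=", a: "t3"},
	}
	for _, in := range fold(code) {
		if in.op == "=" {
			fmt.Printf("%s = %s\n", in.dst, in.a)
		} else {
			fmt.Printf("%s = %s %s %s\n", in.dst, in.a, in.op, in.b)
		}
	}
	// t2 = id3 * 60.0
	// t3 = id2 + t2
	// id1 = t3
}
```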

The amount of code optimization done by different compilers varies greatly. Those that optimize the most, the so-called "optimizing compilers", spend a considerable amount of time in the optimization phase. But some simple optimization methods can greatly improve the target program's running efficiency without slowing compilation down too much

Code generation

The code generator takes the intermediate representation of the source program as input and maps it to the target language. If the target language is machine code, a register or memory location must be selected for each variable the program uses; the intermediate instructions are then translated into a sequence of machine instructions that accomplish the same task. A key aspect of code generation is the judicious allocation of registers to hold the values of variables

For example, the intermediate code in (1.4) can be translated into the machine code below (R1, R2 are registers)

LDF      R2,   id3
MULF     R2,   R2,  #60.0
LDF      R1,   id2
ADDF     R1,   R1,  R2
STF      id1,  R1                    (1.5)

The first operand of each instruction specifies a target address. The F in each instruction tells us that it deals with floating-point numbers. Code (1.5) loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0. The hash sign # indicates that 60.0 should be treated as an immediate value. The third instruction loads id2 into register R1, and the fourth adds to it the value previously computed in R2. Finally, the value in register R1 is stored at the address of id1

The discussion above ignores the important issue of storage allocation for the identifiers in the source program. In fact, the runtime storage organization depends on the language being compiled. The compiler makes storage-allocation decisions during intermediate code generation or during code generation

Symbol table management

One of the important functions of the compiler is to record the names of variables used in the source program and to collect information about each name's attributes. These attributes provide information such as a name's storage allocation, its type, and its scope (where in the program the name's value may be used). For a function name, the information also includes the number and types of its parameters, how each parameter is passed (for example, by value or by reference), and the return type

The symbol table data structure creates a record entry for each variable name; the fields of the record are the name's attributes. This data structure should let the compiler quickly find the record for each name, and quickly store and retrieve the data in the record (fast lookup and insertion: what data structure comes to mind?)

The symbol table is a data structure used by the compiler to save various information about the source program. The information is collected incrementally during the front-end stages and used by the back-end stages to generate target code. A symbol table entry contains information related to an identifier, such as its character string (or morpheme), its type, its storage location, and other relevant information. Symbol tables usually need to support multiple declarations of the same identifier within a program

The scope of a declaration is the part of the program in which that declaration is in effect. Scopes are implemented by establishing a separate symbol table for each scope. Each program block with declarations (such as a block in C, i.e., a function or a brace-delimited section of one) has its own symbol table, with an entry for every declaration in the block. This approach works equally well for other language constructs that establish scopes; for example, each class can have its own symbol table, with an entry for each of its fields and methods. A sketch of such chained scope tables follows the tip below

<aside>
💡 Tips: The symbol table entry is created and used by the lexical analyzer, syntax analyzer and semantic analyzer during the analysis phase

</aside>
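Here is a small Go sketch of that idea: one symbol table per scope, each chained to its enclosing scope, so lookup walks outward while declaration stays local. The field names are illustrative assumptions, not any particular compiler's layout:

```go
package main

import "fmt"

type Symbol struct {
	Name string
	Type string
}

// Scope is one symbol table, chained to the enclosing scope.
type Scope struct {
	parent  *Scope
	symbols map[string]*Symbol
}

func NewScope(parent *Scope) *Scope {
	return &Scope{parent: parent, symbols: map[string]*Symbol{}}
}

// Declare adds a name to the current scope only; redeclaring in the
// same scope is an error, while shadowing an outer scope is fine.
func (s *Scope) Declare(sym *Symbol) error {
	if _, ok := s.symbols[sym.Name]; ok {
		return fmt.Errorf("%s redeclared in this block", sym.Name)
	}
	s.symbols[sym.Name] = sym
	return nil
}

// Lookup searches the current scope first, then enclosing scopes.
func (s *Scope) Lookup(name string) *Symbol {
	for sc := s; sc != nil; sc = sc.parent {
		if sym, ok := sc.symbols[name]; ok {
			return sym
		}
	}
	return nil
}

func main() {
	global := NewScope(nil)
	global.Declare(&Symbol{Name: "x", Type: "int"})

	block := NewScope(global)                          // e.g. a brace-delimited block
	block.Declare(&Symbol{Name: "x", Type: "float64"}) // shadows outer x

	fmt.Println(block.Lookup("x").Type)  // float64
	fmt.Println(global.Lookup("x").Type) // int
}
```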

Lexical analysis in detail

The role of the lexical analyzer

The main task of the lexical analyzer is to read the input characters of the source program, group them into morphemes, and produce as output a sequence of lexical units, one per morpheme. This sequence is sent to the parser for syntax analysis. The lexical analyzer usually also interacts with the symbol table: when it finds a morpheme that is an identifier, it adds that morpheme to the symbol table, and in some cases it reads information about the identifier's kind from the symbol table to decide which lexical unit to send to the parser

You can understand the interaction between the lexical analyzer and the syntax analyzer through the figure below. The interaction is usually implemented by having the syntax analyzer call the lexical analyzer. The call, indicated by the getNextToken command in the figure, causes the lexical analyzer to keep reading characters from its input until it recognizes the next morpheme, from which it produces the next lexical unit and returns it to the syntax analyzer
Untitled 18.png

Since the lexical analyzer is the part of the compiler that reads the source program, it also performs some tasks other than recognizing morphemes:

  • Filter out comments and blanks in the source program
  • Associate the error message generated by the compiler with the location of the source program. For example, the lexical analyzer can be responsible for recording the number of newline characters encountered in order to assign a line number to each error message
  • If the source program uses a macro preprocessor, the expansion of the macro can also be done by the lexical analyzer

Lexical and grammatical analysis

There are several reasons for dividing the analysis part of the compilation process into lexical analysis and grammatical analysis stages.

  • The most important consideration is that it simplifies the compiler's design. Separating lexical analysis from syntax analysis usually lets us simplify at least one of the tasks. For example, a parser that had to treat whitespace and comments as grammatical units would be far more complicated than one that can assume they have already been filtered out by the lexical analyzer. And when designing a new language, separating lexical and syntactic concerns leads to a clearer overall language design (much like network layering)
  • It improves the compiler's efficiency. A separate lexical analyzer lets us apply techniques specialized for the lexical task that would not suit parsing. In addition, specialized buffering techniques for reading input characters can significantly speed up the compiler (also one of the motivations for network layering)
  • It enhances the compiler's portability. Peculiarities of input devices can be confined to the lexical analyzer

Lexical units, patterns, morphemes

  • A lexical unit consists of a lexical unit name and an optional attribute value. The lexical unit name is an abstract symbol representing a kind of lexical unit (as explained in the lexical analysis section above; in Go, for example, the abstract symbol for the assignment sign = is _Assign), such as a particular keyword, or the sequence of input characters forming an identifier. Lexical unit names are the input symbols processed by the parser. We usually refer to a lexical unit by its name
  • A pattern describes the possible forms of the morphemes of a lexical unit. When the lexical unit is a keyword, its pattern is simply the character sequence that makes up the keyword. For identifiers and some other lexical units, the pattern is a more complex structure that matches many strings (in fact, you can think of it as a regular expression: different kinds of morphemes, such as variable names or operator symbols, have different patterns)
  • A morpheme is a character sequence in the source program that matches the pattern of some lexical unit and is identified by the lexical analyzer as an instance of that lexical unit (that is, the lexical analyzer checks which pattern each piece of the source program matches; whatever lexical unit's pattern it matches, it is recognized as an instance of that unit)

Example

The figure below shows some common lexical units, informal descriptions of their patterns, and some sample morphemes. To illustrate how these concepts are used, suppose there is this C statement

printf("Total = %d\n", score);

Untitled 19.png
Picture source: "Principle of Compilation"

Both printf and score are morphemes matching the pattern of the lexical unit id, and "Total = %d\n" is a morpheme matching the pattern of literal

In many programming languages, the following categories cover most of the lexical units (a sketch of such patterns as regular expressions follows the list):

  1. One lexical unit for each keyword. The pattern of a keyword is the keyword itself
  2. Lexical units for the operators. A lexical unit can represent a single operator or a class of operators, like comparison in the figure above
  3. One lexical unit representing all identifiers
  4. One or more lexical units representing constants, such as numbers and literal strings
  5. One lexical unit for each punctuation symbol, such as left and right parentheses, comma, and semicolon
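To show how such patterns can drive recognition, here is a toy Go lexer that encodes a few of these categories as regular expressions and repeatedly matches a prefix of the remaining input; the pattern set and the priority order (keyword before id) are simplifications I made for the sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// Each pattern is anchored so it only matches at the start of the
// remaining input; earlier entries take priority (keyword before id).
var patterns = []struct {
	name string
	re   *regexp.Regexp
}{
	{"keyword", regexp.MustCompile(`^(if|else|for|while)\b`)},
	{"id", regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*`)},
	{"number", regexp.MustCompile(`^[0-9]+(\.[0-9]+)?`)},
	{"comparison", regexp.MustCompile(`^(<=|>=|==|!=|<|>)`)},
}

func main() {
	input := "count <= 60"
	for input != "" {
		if input[0] == ' ' {
			input = input[1:] // whitespace separates morphemes
			continue
		}
		matched := false
		for _, p := range patterns {
			if m := p.re.FindString(input); m != "" {
				fmt.Printf("<%s, %q>\n", p.name, m) // lexical unit + morpheme
				input = input[len(m):]
				matched = true
				break
			}
		}
		if !matched { // no pattern matches any prefix: a lexical error
			fmt.Printf("lexical error at %q\n", input)
			break
		}
	}
}
```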

Lexical unit attributes

If more than one morpheme can match a pattern, the lexical analyzer must provide the subsequent compiler stages with additional information about which morpheme was matched. For example, both 0 and 1 match the pattern of the lexical unit number, but the code generator needs to know which one appeared in the source program. Therefore, in many cases the lexical analyzer returns to the parser not only a lexical unit name but also an attribute value describing that unit's morpheme. The lexical unit name affects decisions made during parsing, while the attribute value affects the translation of the unit after parsing

Assume that a lexical unit has at most one associated attribute value (of course, this attribute value may be structured data combining several kinds of information). The most important example is the lexical unit id, with which we usually associate a great deal of information. Generally, the information related to an identifier, such as its morpheme, its type, and the position where it first appears (needed when issuing an error message about it), is kept in the symbol table. Therefore, the attribute value of an identifier is a pointer to that identifier's entry in the symbol table

Lexical errors

Without the help of other components, it is hard for a lexical analyzer to detect errors in the source code. For example, when the lexical analyzer processes the following C fragment, it runs into a problem

fi(a == f(x))

When it first encounters fi, the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid morpheme for the lexical unit id, the lexical analyzer must return the id lexical unit to the parser and let another stage of the compiler (in this example, the parser) handle the error caused by the transposed letters

However, suppose the patterns of all lexical units fail to match any prefix of the remaining input; the lexical analyzer then cannot continue. The simplest error-recovery strategy in this situation is "panic mode" recovery: keep deleting characters from the remaining input until the lexical analyzer can find a well-formed lexical unit at the beginning of what remains. This recovery technique may confuse the parser

Other error recovery actions that may be taken include:

  1. Delete a character from the remaining input
  2. Insert a missing character into the remaining input
  3. Replace one character with another
  4. Swap two adjacent characters

These transformations can be tried when attempting to repair erroneous input. The simplest strategy is to check whether some prefix of the remaining input can be transformed into a legal morpheme by a single transformation. This strategy makes sense because, in practice, most lexical errors involve a single character. A more general correction strategy is to compute the minimum number of transformations needed to convert the source program into one consisting only of legal morphemes, but in practice this is too expensive to be worthwhile. A brute-force sketch of the single-transformation strategy follows
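Here is that simplest strategy in Go: enumerate every string one transformation away from a suspicious morpheme and see whether any of them is a legal keyword. The keyword set is a stand-in chosen for the example:

```go
package main

import "fmt"

var keywords = map[string]bool{"if": true, "for": true, "func": true}

const letters = "abcdefghijklmnopqrstuvwxyz"

// candidates returns every string one edit away from s, using the
// four transformations above: delete, insert, replace, transpose.
func candidates(s string) []string {
	var out []string
	for i := range s {
		out = append(out, s[:i]+s[i+1:]) // delete one character
	}
	for i := 0; i <= len(s); i++ {
		for _, c := range letters {
			out = append(out, s[:i]+string(c)+s[i:]) // insert one
		}
	}
	for i := range s {
		for _, c := range letters {
			out = append(out, s[:i]+string(c)+s[i+1:]) // replace one
		}
	}
	for i := 0; i+1 < len(s); i++ {
		out = append(out, s[:i]+string(s[i+1])+string(s[i])+s[i+2:]) // swap adjacent
	}
	return out
}

func main() {
	for _, c := range candidates("fi") {
		if keywords[c] {
			fmt.Println("possible repair:", c) // possible repair: if
		}
	}
}
```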
