
Preface

Follow the public account "IT monkey circle" and reply "Go compilation principle series 1" in the background to get the PDF version of this article.

In the previous article on compiler theory, I did not explain how lexical analysis converts the characters of a source file into lexical units, or which techniques and tools it uses. Nor did I go into the common grammars and parsing methods used in the syntax-analysis stage. So, in this article you can learn:

  • How does the lexical analyzer translate the characters in our source files into lexical units? (nondeterministic finite automata & deterministic finite automata)
  • What are the common lexical analyzers? How do they work?
  • Context-free grammar
  • Some grammatical rules in Go language
  • Abstract Syntax Tree Generation
  • Some methods of grammatical analysis (top-down, bottom-up)

💡 Tips: The content below may be fairly abstract, especially the grammar basics and the two methods of processing a grammar in the syntax-analysis part. But I will describe each step clearly with tables or diagrams, and I believe you will gain something from reading it.

Basics of Lexical Analysis

Token

As mentioned in the previous compiler-theory article, the task of lexical analysis is to scan the source content from left to right, character by character, recognize individual words and determine each word's type, and then convert the recognized words into a unified in-machine representation: the lexical unit (token).

For example, it determines whether these words are keywords, identifiers, constants, delimiters, or operators. Because these types can be enumerated in any programming language, they are generally predefined. For example, in Go, _Name is used to represent identifiers and _Operator is used to represent operators.

_Name and _Operator are what we usually call tokens; the lexical analyzer parses the content of the source file into this token form.

How does the lexical analyzer recognize each word in the source file and decide whether it is a keyword, an operator, or a delimiter? This is where finite automata come in.

Nondeterministic finite automata & Deterministic finite automata

I will not introduce the abstract concepts inside finite automata (alphabets, sentences, symbols, and so on) in detail here, only how they work; their implementation is not the focus of this article. Those interested in finite automata can see Chapter 3 of "Principles of Compilation".

Finite automata are divided into two categories: nondeterministic finite automata (NFA) and deterministic finite automata (DFA). The following introduces how these two kinds of finite automata work.

Nondeterministic finite automata (NFA)

In fact, the code we write in a programming language can be regarded as a long string. What the lexical analysis process needs to do is to identify which are keywords, which are operators, which are identifiers, etc. in this long string.

If we had to do this ourselves, the easiest approach that comes to mind is regular expressions. In fact, the lexical analyzer's parsing process does use regular expressions to split the long string, and the split-out substrings (which may be identifiers, operators, and so on) are then matched to the corresponding tokens.

Suppose there is a regular expression (a|b)*abb, and you are given a string: how do you determine whether the string satisfies the regular expression?

If you are familiar with backtracking, you know this can be done with a simple backtracking algorithm. But another method is used here: the nondeterministic finite automaton (NFA).

According to the regular expression given above, the following finite automata can be drawn:

The numbers in the red circles indicate the states

  • In state 0, on the character a or b, the machine can transition back to state 0
  • State 0 can also transition to state 1 on a
  • State 1 transitions to state 2 on b
  • State 2 transitions to state 3 on b (3 is the final state)

The state machine has four states in total. States 0, 1, and 2 are drawn with a single circle, and state 3 is drawn with a double circle. A single circle represents an intermediate state, and a double circle represents the final state. Each arrow with the character above it indicates which state the machine moves to on that input.

You can feed the following strings into the state machine above and see whether they can reach the final state. If a string can, it matches; if it cannot, it does not match.

Can match: abb, aabb, babb, aababb
Cannot match: a, ab, bb, acabb
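
Incidentally, you can check these examples with Go's standard regexp package; a minimal sketch (the ^ and $ anchors are mine, added so the whole string has to match):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // anchor the pattern so the entire string must match (a|b)*abb
    re := regexp.MustCompile(`^(a|b)*abb$`)
    for _, s := range []string{"abb", "aabb", "babb", "aababb", "a", "ab", "bb", "acabb"} {
        fmt.Println(s, re.MatchString(s))
    }
}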

From the above, an NFA can serve the purpose of matching the strings in a source file. It seems we only need to write the corresponding regular expressions for identifiers, keywords, constants, and so on, let the automaton split the source file into strings, and then match each string to its corresponding token.

But it has a flaw. For example, try matching abb against the state machine above: when state 0 sees a it may stay in state 0, when it then sees b it may again stay in state 0, and on the final b it may still be in state 0, so the match fails even though abb does satisfy the regular expression (a|b)*abb. The reason is that when state 0 sees a, the state it transitions to is not uniquely determined, which is why this is called a nondeterministic finite automaton.

To solve this problem, the deterministic finite automaton was introduced.

Deterministic finite automata (DFA)

For the same regular expression (a|b)*abb, its deterministic finite automaton can be drawn as shown below:

  • State 0 transitions to state 1 on a
  • State 0 stays in state 0 on b
  • State 1 stays in state 1 on a
  • State 1 transitions to state 2 on b
  • State 2 transitions to state 1 on a
  • State 2 transitions to state 3 on b (3 is the final state)
  • State 3 transitions to state 1 on a
  • State 3 transitions to state 0 on b

0, 1, and 2 are intermediate states, and 3 is the final state. The difference from the nondeterministic finite automaton is that in every state, each input leads to exactly one next state. You can verify this finite automaton with the strings given above.
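
To make this concrete, here is a small sketch of the DFA above written as a transition table in Go (my own illustration, not code from any real lexer):

package main

import "fmt"

// matchABB reports whether s matches (a|b)*abb by running the DFA above.
// States 0, 1, 2 are intermediate states; state 3 is the final state.
func matchABB(s string) bool {
    // trans[state][input] = next state, mirroring the bullet list above
    trans := [4]map[byte]int{
        {'a': 1, 'b': 0}, // state 0
        {'a': 1, 'b': 2}, // state 1
        {'a': 1, 'b': 3}, // state 2
        {'a': 1, 'b': 0}, // state 3
    }
    state := 0
    for i := 0; i < len(s); i++ {
        next, ok := trans[state][s[i]]
        if !ok {
            return false // character outside the alphabet, e.g. 'c'
        }
        state = next
    }
    return state == 3 // accept only if we end in the final state
}

func main() {
    for _, s := range []string{"abb", "aabb", "babb", "aababb", "a", "ab", "bb", "acabb"} {
        fmt.Println(s, matchABB(s))
    }
}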

In this way, our problem can be solved with a DFA. However, to perform lexical analysis on a source file this way, we would have to write a lot of regular expressions and manually implement a finite state machine for each of them.

To solve this problem, many lexical analyzer generator tools have appeared that save us from implementing finite automata by hand. The following briefly introduces two common ones.

Lexical analyzer generation tools

re2c

We can write a file that conforms to the re2c rules, and then generate a .c file through re2c, and then compile and execute the .c file

If you have not installed re2c, you need to install it first. After the download is complete, the installation steps are:

1. Unpack: tar -zxvf re2c-1.1.1.tar.gz
2. Enter the extracted directory: cd re2c-1.1.1
3. ./configure
4. make && make install

Next, let's write a re2c source file. Suppose I want to identify whether a number is binary, octal, decimal, or hexadecimal; here is how the re2c source looks (the re2c source file here uses the .l extension):

#include <stdio.h> //header file; the standard I/O functions used below come from here
enum num_t { ERR, BIN, OCT, DEC, HEX }; //defines five enum values for the possible results
static num_t lex(const char *YYCURSOR) //return type is num_t; the body below looks like one line of code plus a big comment, but that comment block starting with !re2c is actually the core re2c code
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        end = "\x00";
        bin = '0b'[01]+; //these are all regular expressions
        oct = "0"[0-7]*;
        dec = [1-9][0-9]*;
        hex = '0x'[0-9a-fA-F]+;
        *       { return ERR; }
        bin end { return BIN; } //if the input starts with the binary pattern and ends with the end pattern, return the binary enum value; the other rules work the same way
        oct end { return OCT; }
        dec end { return DEC; }
        hex end { return HEX; }
    */
}
int main(int argc, char **argv) //iterate over the arguments, call lex on each one, and switch on its return value to report which kind of number it is
{
    for (int i = 1; i < argc; ++i) {
        switch(lex(argv[i])) {
            case ERR: printf("error\n");break;
            case BIN: printf("binary\n");break;
            case OCT: printf("octal\n");break;
            case DEC: printf("decimal\n");break;
            case HEX: printf("hexadecimal\n");break;
        }
    }
    return 0;
}

Note: if you paste the code and it does not work properly, try removing the comments I added.

Then process this .l file:

# re2c integer.l -o integer.c
# g++ integer.c -o integer
# ./integer 0b10    (this should output "binary")

You can also try numbers in other bases; it will correctly tell you which base each number is in.

Now open the integer.c file we just generated; its content is as follows:

/* Generated by re2c 1.1.1 on Thu Dec  9 23:09:54 2021 */
#line 1 "integer.l"
#include <stdio.h>
enum num_t { ERR, BIN, OCT, DEC, HEX };
static num_t lex(const char *YYCURSOR)
{
    const char *YYMARKER;
    
#line 10 "integer.c"
{
    char yych;
    yych = *YYCURSOR;
    switch (yych) {
    case '0':    goto yy4;
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':    goto yy5;
    default:    goto yy2;
    }
yy2:
    ++YYCURSOR;
yy3:
#line 14 "integer.l"
    { return ERR; }
#line 32 "integer.c"
yy4:
    yych = *(YYMARKER = ++YYCURSOR);
    switch (yych) {
    case 0x00:    goto yy6;
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':    goto yy8;
    case 'B':
    case 'b':    goto yy11;
    case 'X':
    case 'x':    goto yy12;
    default:    goto yy3;
    }
yy5:
    yych = *(YYMARKER = ++YYCURSOR);
    switch (yych) {
    case 0x00:    goto yy13;
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':    goto yy15;
    default:    goto yy3;
    }
yy6:
    ++YYCURSOR;
#line 16 "integer.l"
    { return OCT; }
#line 71 "integer.c"
yy8:
    yych = *++YYCURSOR;
    switch (yych) {
    case 0x00:    goto yy6;
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':    goto yy8;
    default:    goto yy10;
    }
yy10:
    YYCURSOR = YYMARKER;
    goto yy3;
yy11:
    yych = *++YYCURSOR;
    if (yych <= 0x00) goto yy10;
    goto yy18;
yy12:
    yych = *++YYCURSOR;
    if (yych <= 0x00) goto yy10;
    goto yy20;
yy13:
    ++YYCURSOR;
#line 17 "integer.l"
    { return DEC; }
#line 101 "integer.c"
yy15:
    yych = *++YYCURSOR;
    switch (yych) {
    case 0x00:    goto yy13;
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':    goto yy15;
    default:    goto yy10;
    }
yy17:
    yych = *++YYCURSOR;
yy18:
    switch (yych) {
    case 0x00:    goto yy21;
    case '0':
    case '1':    goto yy17;
    default:    goto yy10;
    }
yy19:
    yych = *++YYCURSOR;
yy20:
    switch (yych) {
    case 0x00:    goto yy23;
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':
    case 'A':
    case 'B':
    case 'C':
    case 'D':
    case 'E':
    case 'F':
    case 'a':
    case 'b':
    case 'c':
    case 'd':
    case 'e':
    case 'f':    goto yy19;
    default:    goto yy10;
    }
yy21:
    ++YYCURSOR;
#line 15 "integer.l"
    { return BIN; }
#line 160 "integer.c"
yy23:
    ++YYCURSOR;
#line 18 "integer.l"
    { return HEX; }
#line 165 "integer.c"
}
#line 19 "integer.l"

}
int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        switch(lex(argv[i])) {
            case ERR: printf("error\n");break;
            case BIN: printf("binary\n");break;
            case OCT: printf("octal\n");break;
            case DEC: printf("decimal\n");break;
            case HEX: printf("hexadecimal\n");break;
        }
    }
    return 0;
}

This code actually implements the deterministic finite automaton described above. The code is very simple; you can trace the example input 0b10 through it and see how it is recognized as binary.

Therefore, we only need to provide some regular expressions, and re2c implements the deterministic finite automaton (DFA) for us. In this way, we can easily do lexical analysis just by providing regular expressions.

lex

The lexical analyzer generation tool lex is introduced in Chapter 3 of "Principles of Compilation". I will only introduce it briefly here; for more detail, please go to that chapter.

The principle of lex is the same as that of re2c: write the source file (also ending in .l) according to lex's rules, then generate a .c file from it. The generated .c file is a C implementation of a finite state machine built from the regular expressions written in the .l file.

Here I borrow a piece of lex code from the compiler-principles posts in Wang Jing's blog "Faith-Oriented Programming". This code can perform lexical analysis on a simple Go source file.

%{
#include <stdio.h>
%}

%%
package      printf("PACKAGE ");
import       printf("IMPORT ");
\.           printf("DOT ");
\{           printf("LBRACE ");
\}           printf("RBRACE ");
\(           printf("LPAREN ");
\)           printf("RPAREN ");
\"           printf("QUOTE ");
\n           printf("\n");
[0-9]+       printf("NUMBER ");
[a-zA-Z_]+   printf("IDENT ");
%%

This is just a set of regular definitions that match the keywords, numbers, identifiers, and so on in a Go source file. Again, use the lex command to compile the .l file above into a .c file. That .c file implements a finite state machine (built from the regular expressions provided in the .l file) that can match the symbols defined in the .l file.

Suppose we have the following Go code:

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Learn Compile")
}

After processing by the lexical analyzer, it becomes:

PACKAGE  IDENT

IMPORT  LPAREN
    QUOTE IDENT QUOTE
RPAREN

IDENT  IDENT LPAREN RPAREN  LBRACE
    IDENT DOT IDENT LPAREN QUOTE IDENT IDENT QUOTE RPAREN
RBRACE
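
As a side note, Go's standard library ships its own scanner, go/scanner, which emits tokens much like the output above (this is the library scanner, not the compiler-internal one in cmd/compile). A minimal sketch of driving it:

package main

import (
    "fmt"
    "go/scanner"
    "go/token"
)

func main() {
    src := []byte(`package main

import (
    "fmt"
)

func main() {
    fmt.Println("Learn Compile")
}
`)

    fset := token.NewFileSet()
    file := fset.AddFile("main.go", fset.Base(), len(src))

    var s scanner.Scanner
    s.Init(file, src, nil, 0) // nil: no error handler, 0: default mode (skip comments)

    for {
        pos, tok, lit := s.Scan()
        if tok == token.EOF {
            break
        }
        // print position, token kind, and literal text (if any)
        fmt.Printf("%-12v %-10v %q\n", fset.Position(pos), tok, lit)
    }
}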

With these basics, it will be much easier to read Go's lexical-analysis source code later. Because lexical analysis and syntax analysis in Go are actually driven together, let's also go over the basics of syntax analysis here, so we know what will help us when reading the syntax-analysis part of Go's source code.

Basics of grammatical analysis

Grammar

When a language is designed, each programming language has a set of precise rules describing the syntactic structure of its programs. For example, in C, a program is composed of functions, a function is formed from declarations and statements, a statement is composed of expressions, and so on. The syntax of programming-language constructs can be described using a context-free grammar or BNF (Backus-Naur Form) notation.

  1. The grammar gives a precise and easy-to-understand language convention for a programming language
  2. For certain types of grammar, we can automatically construct an efficient parser, which can determine the grammatical structure of a source program
  3. A properly designed grammar gives structure to a language. This structure helps translate the source program into correct target code and also helps detect errors

Here I briefly describe again the role the parser plays during compilation (covered in detail in the previous article), to make what follows easier to understand.

The syntax analyzer obtains a string of lexical units from the lexical analyzer and verifies that this string can be generated by the grammar of the source language. For a well-formed program, the syntax analyzer constructs a parse tree and passes it to other parts of the compiler for further processing. In fact, the parse tree does not need to be constructed explicitly, because the checking and translation of the source program can be interleaved with parsing, so the parser and other parts of the front end can be implemented as a single module.

Parsers can be roughly divided into three types: general, top-down, and bottom-up. General parsing methods such as the Cocke-Younger-Kasami algorithm and the Earley algorithm can parse any grammar, but they are very inefficient and cannot be used in production compilers.

Below is a detailed introduction to context-free grammars and the two main ways of processing a grammar.

Context-free grammar

The description below may seem very abstract; don't worry, the example that follows will make it clear (the more general something is, the more abstract it tends to be).

A context-free grammar (grammar for short) is composed of terminal symbols, non-terminal symbols, a start symbol, and a set of productions.

  1. A terminal symbol is a basic symbol that strings are composed of. What we usually call lexical units can be understood as terminal symbols; for example, if, for, ), and ( are all terminal symbols
  2. A non-terminal symbol is a syntactic variable that represents a set of strings. The sets of strings represented by non-terminal symbols are used to define the language generated by the grammar. Non-terminal symbols give the language its hierarchical structure, and this hierarchy is the key to syntax analysis and translation
  3. In a grammar, one non-terminal symbol is designated as the start symbol. The set of strings represented by this symbol is the language generated by the grammar. By convention, the productions of the start symbol are listed first
  4. The productions of a grammar describe how terminal and non-terminal symbols are combined into strings. Each production consists of the following elements:
    a. A head, which is a non-terminal symbol. The production defines part of the set of strings represented by this head
    b. An arrow "→" pointing to the right
    c. A body composed of zero or more terminal and non-terminal symbols. The components of the body describe one way of constructing the strings corresponding to the non-terminal at the head

For a relatively simple example, suppose the following grammar is defined (below, | means "or"):

S -> ABC
A -> c|B
B -> a|r
C -> n|y

In the grammar defined above, S, A, B, and C are non-terminal symbols, and c, a, r, n, and y are terminal symbols. Each line above is a production, and S is the start symbol. This set of productions, together with the non-terminal symbols, terminal symbols, and start symbol, forms a context-free grammar. The grammar above can match can, arn, aan, cry, ray, and so on.
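
As a tiny sketch (my own illustration, not part of any parser), a recognizer for this toy grammar only needs to check what each non-terminal can ultimately derive:

package main

import (
    "fmt"
    "strings"
)

// matchS checks a string against the toy grammar
// S -> ABC, A -> c|B, B -> a|r, C -> n|y
func matchS(s string) bool {
    return len(s) == 3 &&
        strings.IndexByte("car", s[0]) >= 0 && // A derives c, a, or r
        strings.IndexByte("ar", s[1]) >= 0 && // B derives a or r
        strings.IndexByte("ny", s[2]) >= 0 // C derives n or y
}

func main() {
    for _, s := range []string{"can", "arn", "aan", "cry", "ray", "abb"} {
        fmt.Println(s, matchS(s))
    }
}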

The point of understanding all of the above is to be able to read Go's parser implementation later, so here is the grammar taken from Go's parser. If you understood the grammar concepts above, you can easily understand the grammar Go uses for parsing (Go's syntax-analysis source code is in src/cmd/compile/internal/syntax/parser.go):

SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .
PackageClause  = "package" PackageName .
PackageName    = identifier .

ImportDecl       = "import" ( ImportSpec | "(" { ImportSpec ";" } ")" ) .
ImportSpec       = [ "." | PackageName ] ImportPath .
ImportPath       = string_lit .

TopLevelDecl  = Declaration | FunctionDecl | MethodDecl .
Declaration   = ConstDecl | TypeDecl | VarDecl .

ConstDecl = "const" ( ConstSpec | "(" { ConstSpec ";" } ")" ) .
ConstSpec = IdentifierList [ [ Type ] "=" ExpressionList ] .

TypeDecl  = "type" ( TypeSpec | "(" { TypeSpec ";" } ")" ) .
TypeSpec  = AliasDecl | TypeDef .
AliasDecl = identifier "=" Type .
TypeDef   = identifier Type .

VarDecl = "var" ( VarSpec | "(" { VarSpec ";" } ")" ) .
VarSpec = IdentifierList ( Type [ "=" ExpressionList ] | "=" ExpressionList ) .

Because every Go source file is eventually parsed into an abstract syntax tree, the topmost structure, or start symbol, of the syntax tree is SourceFile. The meaning of the SourceFile production is also very simple:

  1. First, there is a package clause (package)
  2. Then come zero or more import declarations (ImportDecl); in this notation the braces mean the part inside can be repeated. For Go's complete grammar, see the Go language specification, which is very detailed and describes every construct in the language
  3. Finally come zero or more top-level declarations (TopLevelDecl)

The above does not list all of Go's productions; for example, functions and methods are not listed, as they are more complicated. Take the PackageClause production as an example of how to read these rules:

PackageClause  = "package" PackageName . 
PackageName    = identifier .

It means that a valid package clause must begin with the keyword package, followed by an identifier (the package name).
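
To see how this grammar maps onto the tree the parsing stage produces, here is a small sketch using the standard library's go/parser (again the library parser, not the compiler-internal one); the fields of the resulting ast.File line up with the SourceFile production above:

package main

import (
    "fmt"
    "go/parser"
    "go/token"
)

func main() {
    src := `package main

import "fmt"

func main() { fmt.Println("Learn Compile") }
`
    fset := token.NewFileSet()
    f, err := parser.ParseFile(fset, "main.go", src, 0)
    if err != nil {
        panic(err)
    }

    fmt.Println("package name:", f.Name.Name) // PackageClause = "package" PackageName
    for _, imp := range f.Imports {
        fmt.Println("import:", imp.Path.Value) // ImportDecl
    }
    fmt.Println("declarations:", len(f.Decls)) // import and top-level declarations
}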

We now know that syntax analysis checks the source file against a defined grammar. But how does the parser actually implement this analysis based on the grammar? This requires algorithms: the three kinds mentioned above, general, top-down, and bottom-up.

Abstract syntax tree

In fact, in the syntax-analysis stage we not only need to judge whether a given string conforms to the prescribed grammar, but also need to work out which structure the string conforms to, that is, how the string is derived from the start symbol by applying productions. From this derivation process, a syntax tree is generated.

Suppose there is such a grammar

Sentence  -> Subject Predicate Object
Subject   -> you
Subject   -> I
Subject   -> he
Predicate -> type
Predicate -> write
Object    -> code
Object    -> bug

For example, take the sentence "I write bug": only when we know which word is the subject, which the predicate, and which the object can we understand the meaning of the sentence. Below is its syntax tree.

Another example is the following grammar that can match expressions

Expr -> Expr op Expr | (Expr) | number
op   -> + | - | * | /

Suppose there is an expression 6 + (60 - 6). According to the grammar above, we get the following derivation process:

Expr -> Expr + Expr
Expr -> number + Expr
Expr -> number + ( Expr - Expr )
Expr -> number + ( number - number )
Expr -> 6 + ( 60 - 6 )

Expressed as a parse tree, it is:

Removing the redundant nodes and condensing further, we get the following abstract syntax tree. A tree structure like this can be processed recursively to generate the target code.
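
As a minimal sketch of what "processing the tree recursively" can look like (my own illustration, not the compiler's actual representation), here is an AST for this expression grammar in Go with a recursive walk that evaluates it; a compiler would emit target code in the same recursive fashion:

package main

import "fmt"

// Expr is any node of the abstract syntax tree.
type Expr interface{}

// Number is a leaf node holding a literal value.
type Number struct {
    Value int
}

// BinOp is an interior node: an operator with two subtrees.
type BinOp struct {
    Op          byte // '+', '-', '*', '/'
    Left, Right Expr
}

// eval walks the tree recursively and computes its value.
func eval(e Expr) int {
    switch n := e.(type) {
    case Number:
        return n.Value
    case BinOp:
        l, r := eval(n.Left), eval(n.Right)
        switch n.Op {
        case '+':
            return l + r
        case '-':
            return l - r
        case '*':
            return l * r
        default:
            return l / r
        }
    }
    panic("unknown node")
}

func main() {
    // the AST for 6 + (60 - 6); the parentheses disappear, only structure remains
    tree := BinOp{Op: '+', Left: Number{Value: 6}, Right: BinOp{Op: '-', Left: Number{Value: 60}, Right: Number{Value: 6}}}
    fmt.Println(eval(tree)) // 60
}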

Now look at a different case: suppose there is an expression 6 + 20 * 3. Analyzing it with the grammar above, you will find that it can be derived in two different ways.

This is ambiguity: a string that conforms to the grammar can be derived in two different ways. Conversely, if every string in a grammar has a unique derivation, the grammar is unambiguous. Obviously, the grammar used by a programming language must be unambiguous.

To eliminate ambiguity, we can rewrite the grammar, turning the ambiguous productions into unambiguous ones. However, this kind of rewriting is not only difficult, but the rewritten productions often differ greatly from the originals and are much harder to read.

Another way is to use operator precedence: without rewriting the productions, when an ambiguity arises during derivation, the priority of the symbols is used to choose the required derivation. This is what programming languages basically all do, and the Go syntax-analysis code we will look at later also works this way. I will not go into more detail here; if you are interested, take a look at Go's syntax-analysis code (location: src/cmd/compile/internal/syntax/parser.go → func fileOrNil).
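
As a rough sketch of the precedence idea (an illustration only, not Go's actual parser code): a hand-written expression parser can pass down the minimum precedence it is willing to accept, so that 6 + 20 * 3 is grouped as 6 + (20 * 3) without rewriting the grammar:

package main

import "fmt"

// Precedence table: * and / bind tighter than + and -.
var prec = map[string]int{"+": 1, "-": 1, "*": 2, "/": 2}

// exprParser walks a pre-tokenized expression (lexical analysis is assumed done).
type exprParser struct {
    toks []string
    pos  int
}

func (p *exprParser) peek() string {
    if p.pos < len(p.toks) {
        return p.toks[p.pos]
    }
    return ""
}

func (p *exprParser) next() string { t := p.peek(); p.pos++; return t }

// parseExpr parses and evaluates operators whose precedence is at least minPrec.
func (p *exprParser) parseExpr(minPrec int) int {
    var left int
    fmt.Sscan(p.next(), &left) // operand: a number (error handling omitted)
    for {
        op := p.peek()
        pr, ok := prec[op]
        if !ok || pr < minPrec {
            return left
        }
        p.next()
        right := p.parseExpr(pr + 1) // higher-precedence operators bind tighter
        switch op {
        case "+":
            left += right
        case "-":
            left -= right
        case "*":
            left *= right
        case "/":
            left /= right
        }
    }
}

func main() {
    p := &exprParser{toks: []string{"6", "+", "20", "*", "3"}}
    fmt.Println(p.parseExpr(1)) // prints 66, i.e. 6 + (20 * 3)
}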

For a given grammar, deriving an arbitrary string from it is relatively simple and straightforward. But deriving a specified string, or, given a string, working out how it is derived from the start symbol, is not that simple; that is the task of syntax analysis. The analysis can use either of two methods, top-down or bottom-up. Because these two methods involve a lot of material, I will only briefly cover the parts that help us study Go's syntax-analysis source code; for more detail on top-down and bottom-up analysis, see Chapter 3 of "Principles of Compilation".

Grammatical analysis method

General grammatical analysis method

General parsing methods such as the Cocke-Younger-Kasami algorithm and the Earley algorithm can parse any grammar, but they are very inefficient and cannot be used in production compilers.

These two general parsing methods are not described in detail here; if you are interested, you can look them up yourself.

Top-down grammatical analysis

Top-down analysis starts from the start symbol, repeatedly selects a suitable production to expand a non-terminal symbol in the intermediate string, and finally expands all the way to the given string.

Take the following grammar as an example, and analyze the string code:

S –> AB
A –> oA | cA | ε
B –> dB | e

Note: ε denotes the empty string

  1. At the beginning, the start symbol S has only one production, S → AB, so we can only choose this production. Replacing S with the right-hand side of the production gives the intermediate string AB

Middle string | Production
S             | S → AB
AB            |

  2. For this intermediate string, expand its leftmost non-terminal A. A has three productions: A → oA, A → cA, and A → ε. Comparing the string code we want to analyze with the intermediate string AB, we find we can only choose the production A → cA, otherwise code cannot be derived from this intermediate string. So replace the A in the intermediate string with cA, giving the intermediate string cAB

Middle string | Production
S             | S → AB
AB            | A → cA
cAB           |

  3. Then continue trying to expand the A in cAB. Again, only the production A → oA can be chosen

Middle string | Production
S             | S → AB
AB            | A → cA
cAB           | A → oA
coAB          |

  4. Continue to expand A, and we find A can only choose the production A → ε (otherwise code cannot be derived)

Middle string | Production
S             | S → AB
AB            | A → cA
cAB           | A → oA
coAB          | A → ε
coB           |

  5. Finally, expand the non-terminal B, which has two productions: B → dB and B → e. Following the same method as above, we get:

Middle string | Production
S             | S → AB
AB            | A → cA
cAB           | A → oA
coAB          | A → ε
coB           | B → dB
codB          | B → e
code          | Finish

The above is the top-down parsing process. If you look back at the Go grammar given in the context-free grammar section, can you picture Go's parsing process? If you read Go's syntax-analysis source code (location: src/cmd/compile/internal/syntax/parser.go → func fileOrNil), you will find that Go uses exactly this top-down method.
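
To make the top-down process concrete, here is a minimal recursive-descent recognizer in Go for the toy grammar above (a sketch of the technique, not Go's parser): each non-terminal becomes a function, and each function picks a production by looking at the next character, exactly as in the table:

package main

import "fmt"

// Recognizer for the grammar
//   S -> AB
//   A -> oA | cA | ε
//   B -> dB | e
type rd struct {
    input string
    pos   int
}

func (p *rd) peek() byte {
    if p.pos < len(p.input) {
        return p.input[p.pos]
    }
    return 0
}

func (p *rd) parseA() {
    switch p.peek() {
    case 'o', 'c': // A -> oA or A -> cA
        p.pos++
        p.parseA()
    default: // A -> ε
    }
}

func (p *rd) parseB() bool {
    switch p.peek() {
    case 'd': // B -> dB
        p.pos++
        return p.parseB()
    case 'e': // B -> e
        p.pos++
        return true
    default:
        return false
    }
}

// parseS applies S -> AB and requires that all input is consumed.
func (p *rd) parseS() bool {
    p.parseA()
    return p.parseB() && p.pos == len(p.input)
}

func main() {
    for _, s := range []string{"code", "ccode", "de", "oce", "cod"} {
        fmt.Println(s, (&rd{input: s}).parseS())
    }
}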

Bottom-up grammatical analysis

Bottom-up analysis starts from the given string, repeatedly selects a suitable production to fold a substring of the intermediate string into a non-terminal symbol, and finally folds everything into the start symbol.

Assume the following grammar; the string we want to analyze is aaab:

S –> AB
A –> Aa | ε
B –> b | bB

  1. Start with the first character a on the left and compare it against all the productions in the grammar; no production's right-hand side matches it exactly. But after a bit of thought we notice that we can insert an empty string ε at the very beginning of aaab; then we can apply A → ε, then A → Aa, and so on. So first insert the empty string, giving the intermediate string εaaab

Middle string | Production
aaab          | Insert ε
εaaab         |

  2. Now the leftmost ε of the intermediate string matches the production A → ε. Use this production to fold ε into A, giving Aaaab

Middle string | Production
aaab          | Insert ε
εaaab         | A → ε
Aaaab         |

  3. In the intermediate string Aaaab, the leading Aa matches A → Aa, and only this production, so apply it to fold Aa into A, giving Aaab

Middle string | Production
aaab          | Insert ε
εaaab         | A → ε
Aaaab         | A → Aa
Aaab          |

  4. Follow the same steps to keep folding substrings of the intermediate string into non-terminal symbols, until finally everything folds into the start symbol S. The whole process is:

Middle string | Production
aaab          | Insert ε
εaaab         | A → ε
Aaaab         | A → Aa
Aaab          | A → Aa
Aab           | A → Aa
Ab            | B → b
AB            | S → AB
S             | Finish

The above is the general process of bottom-up parsing. It is presented here mainly for ease of understanding and does not go any deeper, but it is enough for reading Go's syntax-analysis source code later.
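
Purely as an illustration (this is not a general shift-reduce parser, just a replay of the folds in the table above), the following Go snippet performs the same reductions on the input aaab:

package main

import (
    "fmt"
    "strings"
)

// Replays the folding steps for the grammar S -> AB, A -> Aa | ε, B -> b | bB
// on the input "aaab", printing each intermediate string.
func main() {
    s := "aaab"
    fmt.Println(s)

    s = "A" + s // insert ε at the front and fold it into A (A -> ε)
    fmt.Println(s)

    for strings.HasPrefix(s, "Aa") { // repeatedly fold "Aa" into A (A -> Aa)
        s = strings.Replace(s, "Aa", "A", 1)
        fmt.Println(s)
    }

    s = strings.Replace(s, "b", "B", 1) // fold "b" into B (B -> b)
    fmt.Println(s)

    s = strings.Replace(s, "AB", "S", 1) // fold "AB" into S (S -> AB)
    fmt.Println(s) // prints "S": the input folds back to the start symbol
}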

Backtracking and ambiguity in the process of grammatical analysis

The examples given above may be special, because at each step of the derivation only one production satisfied the conditions. In real analysis, we may encounter the following two situations:

  1. No production is applicable
  2. More than one production is applicable

If the second situation occurs, we often need backtracking to deal with it: first try the first production that satisfies the conditions; if the target string can be derived, that production works. If the first situation is encountered during the derivation, go back to that point and choose another production that satisfies the conditions.

If the first situation still occurs after all productions have been tried, the string does not conform to the grammar. If more than one production can derive the target string, the grammar is ambiguous.

Backtracking analysis is generally very slow, so grammars are usually carefully constructed to avoid backtracking.

Syntax parser generation tool

Common parser generation tools include Yacc and bison, and their principle is similar to that of the lexical analyzer generators introduced above: you write a source file (with the .y suffix) according to the tool's rules (essentially, the source file provides the grammar, just as we provided regular expressions to the lexer tools), and then use the tool's command to generate a .c file from the .y file.

The generated .c file contains the code of a parser built from the grammar rules we provided; we only need to compile and run it.

Because this article is already very long, I won't give an example here. For these two parser generation tools, you can go to their official websites to download and install them, and try them yourself.

The process is exactly the same as the lexical analyzer tool introduced above

Summary

This article mainly covered nondeterministic and deterministic finite automata in lexical analysis and showed how commonly used lexical analyzer generator tools work. It then covered the grammar basics involved in syntax analysis, abstract syntax tree generation, and the two parsing methods.

Thanks for reading; I hope you gained something from it.

