
Preface

In the last article, I introduced the core technique behind lexical analysis, the deterministic finite automaton (DFA), along with the usage and working principles of two common lexical analyzers. With that background, reading Go's lexical analysis source code becomes much easier.

This article mainly contains the following contents:

  1. Where the Go compiler's entry file is, and what the compilation entry does
  2. Where lexical analysis sits in the Go compilation process, and what its detailed flow looks like
  3. Writing a test Go source file, running lexical analysis on it, and examining the result

Source code analysis

Go's compilation entry

To understand more clearly how the Go compilation process reaches the lexical analysis step, let's first look at where the compiler's entry file is and roughly what it does.

The Go compiler's entry file is at: src/cmd/compile/main.go -> gc.Main(archInit)

Enter the gc.Main(archInit) function. It is fairly long; the first part mainly reads the arguments passed in from the command line and updates the compilation options and configuration. Further down you will see this line of code:

lines := parseFiles(flag.Args())

This is the entry point of lexical and syntax analysis. It performs lexical and syntax analysis on the given files and produces syntax trees, which are later built into an abstract syntax tree and then go through type checking and other operations; those steps will be covered in a later article.

Open parseFiles(flag.Args()) and you can see the following (I omitted the later part of the function; we mainly care about the lexical analysis here):

func parseFiles(filenames []string) uint {
    noders := make([]*noder, 0, len(filenames))
    // Limit the number of simultaneously open files.
    sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

    for _, filename := range filenames {
        p := &noder{
            basemap: make(map[*syntax.PosBase]*src.PosBase),
            err:     make(chan syntax.Error),
        }
        noders = append(noders, p)

        go func(filename string) {
            sem <- struct{}{}
            defer func() { <-sem }()
            defer close(p.err)
            base := syntax.NewFileBase(filename)

            f, err := os.Open(filename)
            if err != nil {
                p.error(syntax.Error{Msg: err.Error()})
                return
            }
            defer f.Close()

            p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches) // errors are tracked via p.error
        }(filename)
    }
    ......
}

We know that during Go compilation, every source file is parsed into its own syntax tree. As the first few lines above show, the compiler spawns a goroutine per source file, but limits how many files can be open at the same time:

sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)
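
As an aside, this buffered channel is the classic counting-semaphore pattern. Here is a minimal standalone sketch of the same idea (the file names and the per-file work are made up for illustration, not taken from the compiler):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    // A buffered channel used as a counting semaphore, mirroring parseFiles:
    // at most GOMAXPROCS(0)+10 files are being processed at any one time.
    sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

    files := []string{"a.go", "b.go", "c.go"} // hypothetical file names

    var wg sync.WaitGroup
    for _, name := range files {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot before doing the work
            defer func() { <-sem }() // release the slot when done
            fmt.Println("parsing", name)
        }(name)
    }
    wg.Wait()
}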

It then iterates over the source files and starts a goroutine per file to perform lexical and syntax analysis, which is what the for loop and go func express. Inside go func, it first initializes the source file's base information, mainly recording the file name plus line and column information, so that if an error is encountered during lexical or syntax analysis, the location of the error can be reported. This mainly involves the following structures:

type PosBase struct {
    pos       Pos
    filename  string
    line, col uint32
}

type Pos struct {
    base      *PosBase
    line, col uint32
}

The next step is to open the source file and initialize the syntax analyzer. The syntax analyzer is involved here because, in the Go compiler, lexical analysis and syntax analysis are bundled together: initializing the syntax analyzer also initializes the lexical analyzer. We can step into the syntax.Parse function called inside go func:

func Parse(base *PosBase, src io.Reader, errh ErrorHandler, pragh PragmaHandler, mode Mode) (_ *File, first error) {
    defer func() {
        if p := recover(); p != nil {
            if err, ok := p.(Error); ok {
                first = err
                return
            }
            panic(p)
        }
    }()

    var p parser
    p.init(base, src, errh, pragh, mode) // initialization
    p.next() // the lexical analyzer scans the source file, producing tokens
    return p.fileOrNil(), p.first // the syntax analyzer parses the token stream obtained from the lexical analyzer
}

You can see that the syntax analyzer's initialization is called:

p.init(base, src, errh, pragh, mode)

Stepping into p.init, we see the following line, which initializes the lexical analyzer:

p.scanner.init(... the parameters that initialize the lexical analyzer)

The syntax analyzer calls the lexical analyzer's init method through p.scanner. Looking at the parser's struct, you can see that the scanner struct is embedded in it (this article focuses on the lexical analyzer, so the meaning of the parser's other fields will be explained in detail in the article on syntax analysis):

// parser (syntax analyzer) struct
type parser struct {
    file  *PosBase 
    errh  ErrorHandler
    mode  Mode
    pragh PragmaHandler
    scanner // the lexical analyzer (scanner) is embedded here

    base   *PosBase 
    first  error 
    errcnt int  
    pragma Pragma  

    fnest  int
    xnest  int 
    indent []byte
}
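
As a minimal standalone illustration of what this embedding buys (a sketch of Go's embedding mechanism itself, not the compiler's actual types), the outer struct gains the embedded struct's methods, which is why the parser can drive the scanner directly:

package main

import "fmt"

// toy scanner with a single method
type scanner struct {
    pos int
}

func (s *scanner) next() {
    s.pos++
    fmt.Println("scanned token number", s.pos)
}

// toy parser that embeds the scanner, like the real parser struct above
type parser struct {
    scanner // embedded: parser gets next() as a promoted method
}

func main() {
    var p parser
    p.next()         // promoted method: identical to p.scanner.next()
    p.scanner.next() // explicit form, as in p.scanner.init(...)
}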

After clarifying the relationship between grammatical analysis and lexical analysis, let’s look at the specific process of lexical analysis below.

Lexical analysis process

The code location of lexical analysis is:

src/cmd/compile/internal/syntax/scanner.go

The lexical analyzer is implemented through a structure, and its structure is as follows:

type scanner struct {
    source // source is itself a struct; it records information about the source file being scanned, such as the byte slice of its contents and the character currently being scanned together with its position (recall that lexical analysis scans the source file character by character, from left to right)
    mode   uint // controls whether comments are parsed
    nlsemi bool // if set '\n' and EOF translate to ';'

    // current token, valid after calling next()
    line, col uint     // position of the character currently being scanned; both start at 0
    blank     bool     // line is blank up to col (not used during lexical analysis, but used during syntax analysis)
    tok       token    // the token matched for the current string (the token type lists all token kinds supported by Go)
    lit       string   // the source text of the token; e.g. if "if" is recognized in the source, its token is _If and its lit is "if"
    bad       bool     // true if a syntax error occurred; the lit obtained may then be incorrect
    kind      LitKind  // if the matched string is a literal, this identifies which kind of literal it is, e.g. INT, FLOAT or RUNE
    op        Operator // much like kind: if the recognized token is an operator, this identifies which operator it is
    prec      int      // valid if tok is _Operator, _AssignOp, or _IncOp
}

type source struct {
    in   io.Reader
    errh func(line, col uint, msg string)

    buf       []byte // byte slice holding the contents of the source file
    ioerr     error  // error reported while reading the file
    b, r, e   int    // buffer indices (see comment above)
    line, col uint   // position of the character currently being scanned
    ch        rune   // the character currently being scanned
    chw       int    // width of ch
}

After knowing the meaning of each field in the structure of the lexical parser, let’s take a look at the types of tokens in Go.

Token

A token is the smallest lexical unit with independent meaning in a programming language. Tokens mainly include keywords, user-defined identifiers, operators, delimiters, comments, and so on, all of which can be found in src/cmd/compile/internal/syntax/tokens.go. I have excerpted part of the file below (all tokens are defined as constants):

const (
    _    token = iota
    _EOF       // EOF

    // names and literals
    _Name    // name
    _Literal // literal

    // operators and operations
    // _Operator is excluding '*' (_Star)
    _Operator // op
    _AssignOp // op=
    _IncOp    // opop
    _Assign   // =
    ......

    // delimiters
    _Lparen    // (
    _Lbrack    // [
    _Lbrace    // {
    _Rparen    // )
    _Rbrack    // ]
    _Rbrace    // }
    ......

    // keywords
    _Break       // break
    _Case        // case
    _Chan        // chan
    _Const       // const
    _Continue    // continue
    _Default     // default
    _Defer       // defer
    ......

    // empty line comment to exclude it from .String
    tokenCount //
)

Each token has three important attributes: its token type, its text as it appears in the source code, and the position where it appears. Comments and semicolons are two special kinds of tokens. Ordinary comments generally do not affect a program's semantics, so they can be ignored in many cases (the mode field in the scanner struct indicates whether comments should be parsed).

All tokens are divided into four categories:

  1. Special tokens. For example: _EOF
  2. Tokens for basic literals. For example: IntLit, FloatLit, ImagLit
  3. Operators. For example: Add // + , Sub // - , Mul // *
  4. Keywords. For example: _Break // break , _Case // case (see the sketch after this list for how keywords are told apart from ordinary identifiers)
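
The compiler's own scanner makes the keyword-vs-identifier decision inside its ident() method, using the internal token constants shown above. As a quick standalone sketch of the same idea using the public go/token package instead (so the constants do not match the internal ones one to one):

package main

import (
    "fmt"
    "go/token"
)

func main() {
    // token.Lookup returns the keyword token for a keyword, or IDENT otherwise.
    for _, word := range []string{"break", "case", "chan", "myVar", "fmt"} {
        tok := token.Lookup(word)
        if tok.IsKeyword() {
            fmt.Printf("%-6s => keyword token %s\n", word, tok)
        } else {
            fmt.Printf("%-6s => identifier (%s)\n", word, tok)
        }
    }
}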

Lexical analysis implementation

The lexical analyzer has two core methods: nextch() and next().

We know that lexical analysis reads the source character by character; the nextch() function is what continuously reads the source file's content, character by character, from left to right.

The following is part of the nextch() function; it mainly fetches the next unprocessed character and updates the scan position:

func (s *source) nextch() {
redo:
    s.col += uint(s.chw)
    if s.ch == '\n' {
        s.line++
        s.col = 0
    }

    // fast common case: at least one ASCII character
    if s.ch = rune(s.buf[s.r]); s.ch < sentinel {
        s.r++
        s.chw = 1
        if s.ch == 0 {
            s.error("invalid NUL character")
            goto redo
        }
        return
    }

    // slower general case: add more bytes to buffer if we don't have a full rune
    for s.e-s.r < utf8.UTFMax && !utf8.FullRune(s.buf[s.r:s.e]) && s.ioerr == nil {
        s.fill()
    }

    // EOF
    if s.r == s.e {
        if s.ioerr != io.EOF {
            // ensure we never start with a '/' (e.g., rooted path) in the error message
            s.error("I/O error: " + s.ioerr.Error())
            s.ioerr = nil
        }
        s.ch = -1
        s.chw = 0
        return
    }

......
}

The next() function then, based on the scanned characters, splits out strings using the deterministic finite automaton idea introduced in the previous article and matches them to the corresponding tokens. Part of the core code of next() is as follows:

func (s *scanner) next() {
    nlsemi := s.nlsemi
    s.nlsemi = false

redo:
    // skip white space
    s.stop()
    startLine, startCol := s.pos()
    for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
        s.nextch()
    }

    // token start
    s.line, s.col = s.pos()
    s.blank = s.line > startLine || startCol == colbase
    s.start()
    if isLetter(s.ch) || s.ch >= utf8.RuneSelf && s.atIdentChar(true) {
        s.nextch()
        s.ident()
        return
    }

    switch s.ch {
    case -1:
        if nlsemi {
            s.lit = "EOF"
            s.tok = _Semi
            break
        }
        s.tok = _EOF

    case '\n':
        s.nextch()
        s.lit = "newline"
        s.tok = _Semi

    case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
        s.number(false)

    case '"':
        s.stdString()
......
}

A complete description of what these two methods do is:

  1. The lexical analyzer calls nextch() to fetch the next unprocessed character
  2. Based on the scanned character, next() decides what kind of token to try to match. For example, if the character a is scanned, it tries to match an identifier by calling s.ident() (which also determines whether the identifier is a keyword)
  3. If the scanned character is a digit, it tries to match a basic literal (such as IntLit, FloatLit or ImagLit)
  4. Once next() recognizes a token, it is handed to the parser, and the parser obtains the following token by calling the lexical analyzer's next() again. In other words, the lexical analyzer does not translate the whole source file into tokens in one pass and then hand them over; instead, the syntax analyzer pulls tokens one at a time through next() as it needs them (see the sketch after this list)
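
To make point 4 concrete, here is a minimal standalone sketch of that pull model (a toy scanner over a pre-split word list, purely for illustration, not the compiler's code):

package main

import "fmt"

// toyScanner hands out one "token" per call to next, like the real scanner.
type toyScanner struct {
    words []string // pretend these are the tokens of a source file
    i     int
}

// next returns the next token, or "" once the input is exhausted (EOF).
func (s *toyScanner) next() string {
    if s.i >= len(s.words) {
        return ""
    }
    w := s.words[s.i]
    s.i++
    return w
}

func main() {
    s := &toyScanner{words: []string{"package", "Token", ";", "func", "testScanner", "(", ")"}}
    // The "parser" pulls tokens on demand instead of receiving them all at once.
    for tok := s.next(); tok != ""; tok = s.next() {
        fmt.Println("parser got:", tok)
    }
}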

In the next() function we can see this loop:

for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
    s.nextch()
}

It skips over spaces, tabs, carriage returns, and (when nlsemi is not set) newlines in the source file. When nlsemi is set, a newline is instead turned into a _Semi token, which is Go's automatic semicolon insertion; that is why the test output later shows ; tokens that do not appear in the source.

As for how it recognizes identifiers, basic literals, and strings, see ident(), number(), and stdString(); I won't paste that code here, since the underlying idea is exactly the deterministic finite automaton introduced in the previous article.
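
As a rough standalone illustration of that DFA idea (a toy recognizer for identifiers and integer literals only, not the compiler's actual ident()/number() code):

package main

import (
    "fmt"
    "unicode"
)

// scanOne starts at position i and keeps consuming characters while they stay
// in the current token class, then reports the token kind, its text, and the
// position after it - the same keep-reading-until-the-class-ends idea as a DFA.
func scanOne(src []rune, i int) (kind, lit string, next int) {
    switch {
    case unicode.IsLetter(src[i]) || src[i] == '_':
        j := i
        for j < len(src) && (unicode.IsLetter(src[j]) || unicode.IsDigit(src[j]) || src[j] == '_') {
            j++
        }
        return "name", string(src[i:j]), j
    case unicode.IsDigit(src[i]):
        j := i
        for j < len(src) && unicode.IsDigit(src[j]) {
            j++
        }
        return "literal", string(src[i:j]), j
    default:
        return "other", string(src[i]), i + 1
    }
}

func main() {
    src := []rune("a1 := 666")
    for i := 0; i < len(src); {
        if src[i] == ' ' {
            i++ // skip white space, just like the loop in next()
            continue
        }
        kind, lit, next := scanOne(src, i)
        fmt.Printf("%-7s => %q\n", kind, lit)
        i = next
    }
}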

Below, starting from the Go compiler's entry point, I have drawn a flow chart of the lexical analysis process to tie the above together.

Reading the source code alone may still leave the lexical analyzer feeling abstract, so let's actually run it, using the test file and the standard library that Go provides, to see how it works.

Test the lexical analysis process

There are two ways to test the lexical analysis: you can directly run the lexical analyzer's test file that ships with Go, or you can use the standard library that Go provides.

Lexical analyzer test file: src/cmd/compile/internal/syntax/scanner_test.go
Lexical analyzer standard library provided by Go: src/go/scanner/scanner.go

Below, I will write a source file of my own and feed it to the lexical analyzer to see how it is parsed and what the result looks like.

Test the lexical analyzer through the test file

The TestScanner method lives in src/cmd/compile/internal/syntax/scanner_test.go. Its source code is as follows (with comments added in the code):

func TestScanner(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping test in short mode")
    }

    filename := *src_ // can be changed via -src flag
    // here you can use the absolute path of any source file you want to parse
    src, err := os.Open("/Users/shulv/studySpace/GolangProject/src/data_structure_algorithm/SourceCode/Token/aa.go")
    if err != nil {
        t.Fatal(err)
    }
    defer src.Close()

    var s scanner
    s.init(src, errh, 0) // initialize the lexical analyzer
    for {
        s.next() // get a token (next() calls nextch() internally, reading characters until a token is matched)
        if s.tok == _EOF {
            break
        }
        if !testing.Verbose() {
            continue
        }
        switch s.tok { // the token that was obtained
        case _Name, _Literal: // identifier or basic literal
            // print the file name, line, column, token, and the source text the token corresponds to
            fmt.Printf("%s:%d:%d: %s => %s\n", filename, s.line, s.col, s.tok, s.lit)
        case _Operator:
            fmt.Printf("%s:%d:%d: %s => %s (prec = %d)\n", filename, s.line, s.col, s.tok, s.op, s.prec)
        default:
            fmt.Printf("%s:%d:%d: %s\n", filename, s.line, s.col, s.tok)
        }
    }
}

The test function first opens your source file and passes its contents to the lexical analyzer's init function. Then, in an endless loop, it keeps calling next() to obtain tokens until the end-of-file token _EOF is encountered, at which point the loop exits.

The content of the file I want to parse for the lexical parser is as follows:

package Token

import "fmt"

func testScanner()  {
    a := 666
    if a == 666 {
        fmt.Println("Learning Scanner")
    }
}

Then run the test method with the following commands (you can print more information if you like; the scanner struct's fields are all there to inspect):

# cd /usr/local/go/src/cmd/compile/internal/syntax
# go test -v -run="TestScanner"

Output:
=== RUN   TestScanner
parser.go:1:1: package
parser.go:1:9: name => Token
parser.go:1:14: ;
parser.go:3:1: import
parser.go:3:8: literal => "fmt"
parser.go:3:13: ;
parser.go:5:1: func
parser.go:5:6: name => testScanner
parser.go:5:17: (
parser.go:5:18: )
parser.go:5:21: {
parser.go:6:2: name => a
parser.go:6:4: :=
parser.go:6:7: literal => 666
parser.go:6:10: ;
parser.go:7:2: if
parser.go:7:5: name => a
parser.go:7:7: op => == (prec = 3)
parser.go:7:10: literal => 666
parser.go:7:14: {
parser.go:8:3: name => fmt
parser.go:8:6: .
parser.go:8:7: name => Println
parser.go:8:14: (
parser.go:8:15: literal => "Learning Scanner"
parser.go:8:33: )
parser.go:8:34: ;
parser.go:9:2: }
parser.go:9:3: ;
parser.go:10:1: }
parser.go:10:2: ;
--- PASS: TestScanner (0.00s)
PASS
ok      cmd/compile/internal/syntax    0.007s

Test the lexical analyzer through the standard library

Another test method is to use the standard library provided by Go. Here I will demonstrate how to use the methods in the standard library to test the lexical analyzer.

You need to write a small piece of code that calls the standard library's methods to perform lexical analysis. An example is as follows:

package Token

import (
    "fmt"
    "go/scanner"
    "go/token"
)

func TestScanner1()  {
    src := []byte("cos(x)+2i*sin(x) //Comment") // the content to parse (you can also use a byte slice of a file's contents)
    // initialize the scanner
    var s scanner.Scanner
    fset := token.NewFileSet() // initialize a file set (explained below)
    file := fset.AddFile("", fset.Base(), len(src)) // add a file to the file set
    s.Init(file, src, nil, scanner.ScanComments) // the mode argument is ScanComments, meaning comments are scanned too; normally comments can be skipped
    // scan
    for {
        pos, tok, lit := s.Scan() // equivalent to the next() function in the compiler's scanner
        if tok == token.EOF {
            break
        }
        fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit) // fset.Position(pos): get the position information
    }
    }
}

Execute the above code, and get the following result:

1:1     IDENT   "cos"
1:4     (       ""
1:5     IDENT   "x"
1:6     )       ""
1:7     +       ""
1:8     IMAG    "2i"
1:10    *       ""
1:11    IDENT   "sin"
1:14    (       ""
1:15    IDENT   "x"
1:16    )       ""
1:18    ;       "\n"
1:18    COMMENT "//Comment"

You will notice that the methods used here are completely different from those in the test file. That is because the standard library implements a separate lexical analyzer rather than reusing the scanner code inside the Go compiler. My understanding is that the compiler's internal code cannot be exposed as a public API; those methods need to remain private.

If you look at the standard library's implementation of the lexical analyzer, you will find it differs from the one inside the Go compiler, but the core ideas are the same (character scanning, token recognition, and so on). The main difference lies in how the files to be parsed are handled. In Go, a package is made up of multiple files, and multiple packages are then linked into an executable, so the set of files making up a package can be regarded as Go's basic compilation unit. The lexical analyzer provided by Go therefore also defines FileSet and File types to describe file sets and individual files:

type FileSet struct {
    mutex sync.RWMutex // protects the file set
    base  int          // base offset for the next file
    files []*File      // list of files in the order added to the set
    last  *File        // cache of last file looked up
}

type File struct {
    set  *FileSet
    name string // file name as provided to AddFile
    base int    // Pos value range for this file is [base...base+size]
    size int    // file size as provided to AddFile

    // lines and infos are protected by mutex
    mutex sync.Mutex
    lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
    infos []lineInfo
}

Their job is essentially to record information about the files being parsed, much like the source struct embedded in the compiler's scanner struct. The difference is that the Go compiler spawns multiple goroutines to compile multiple files concurrently, whereas the standard library keeps the files to be parsed in a file set; you can see that a FileSet holds them in a one-dimensional slice (files).

The following briefly introduces the relationship between FileSet and File, and how the position information of a token is calculated.

FileSet and File

The corresponding relationship between FileSet and File is shown in the figure:

Image source: go-ast-book

The Pos type in the figure represents an index into the underlying array. Each File element in the FileSet corresponds to an interval of that array; the intervals of different files do not overlap, and there may be padding between adjacent files.

Each File mainly consists of three pieces of information: the file name, base and size. base corresponds to the Pos index of the File within the FileSet, so base and base+size define the start and end of the File inside the FileSet's array. Within a File, a position is addressed by its offset, and an offset is converted to a Pos value via offset+File.base. Since Pos is a global offset within the FileSet, you can conversely look up the corresponding File from a Pos, as well as the offset inside that File.

The position of each token produced during lexical analysis is expressed as a Pos, and with a Pos plus its FileSet the corresponding File can easily be found. The line and column numbers are then computed from that File and the offset (in the implementation, File only stores the starting offset of each line and does not keep the original source data). Underneath, Pos is an int with pointer-like semantics, so 0 plays the role of a nil pointer and is defined as NoPos, meaning an invalid Pos.

Source: go-ast-book

It can be seen from the relationship between FileSet and File that the lexical analyzer in the Go standard library uses file sets to parse multiple source files.
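
To make the offset/Pos/Position relationship concrete, here is a small standalone example using the public go/token package (the file names and contents are made up):

package main

import (
    "fmt"
    "go/token"
)

func main() {
    fset := token.NewFileSet()

    // Two files registered in the same set; each occupies its own
    // [base, base+size] interval in the set's global Pos space.
    src1 := "package a\n\nvar x int\n"
    src2 := "package b\n\nfunc f() {}\n"
    f1 := fset.AddFile("a.go", fset.Base(), len(src1))
    f2 := fset.AddFile("b.go", fset.Base(), len(src2))

    // Record the line offsets so line/column numbers can be computed.
    f1.SetLinesForContent([]byte(src1))
    f2.SetLinesForContent([]byte(src2))

    // offset -> Pos is offset + File.base; Pos -> file/line/column via Position.
    pos := f2.Pos(11)                  // offset 11 is where "func" starts in src2
    fmt.Println(fset.Position(pos))    // prints: b.go:3:1
    fmt.Println(fset.File(pos).Name()) // Pos maps back to its File: b.go
}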

Summary

This article started from the Go compiler's entry file and walked step by step through the source-code implementation of lexical analysis in the Go compiler, then actually exercised the lexical analyzer through Go's test file and the standard library scanner. I hope that after reading it you have a clearer picture of Go's lexical analysis process.

The lexical analysis part is relatively simple and the core content is limited. The really difficult parts are syntax analysis and the abstract syntax tree; if you are interested, stay tuned.

