Foreword

In the previous article on syntax analysis, we saw how the Go compiler parses the various declaration types (import, var, const, func, and so on) in a Go source file according to the Go grammar. The syntax analysis phase parses each source file into a File structure and stores the declarations it finds in File.DeclList. The result is a syntax tree with the File structure as the root and ImportDecl, ConstDecl, TypeDecl, VarDecl, FuncDecl, and so on as its child nodes.

First, it helps to be clear about what the abstract syntax tree is for: it is what later stages such as type checking and code style checking operate on. In short, with the abstract syntax tree the compiler can locate any position in the code precisely and run a series of operations and checks on it.

This article covers the construction of the abstract syntax tree. The compiler front end must turn the source program into an intermediate representation that the back end can work with, and the abstract syntax tree is a common tree-shaped intermediate representation. So the main question here is: what does the Go compiler do to turn the syntax tree into an abstract syntax tree?

Abstract Syntax Tree Construction Overview

This section gives an overall picture of the construction process and covers a lot of ground quickly. The implementation details follow in the later sections.

From the syntax analysis stage covered in the previous article, we know that the Go compiler starts multiple goroutines to parse each source file into a syntax tree. The code is in src/cmd/compile/internal/gc/noder.go → parseFiles:

func parseFiles(filenames []string) uint {
    noders := make([]*noder, 0, len(filenames))
    // Limit the number of simultaneously open files.
    sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

    for _, filename := range filenames {
        p := &noder{
            basemap: make(map[*syntax.PosBase]*src.PosBase),
            err:     make(chan syntax.Error),
        }
        noders = append(noders, p)
        // start one goroutine per source file to parse it
        go func(filename string) {
            sem <- struct{}{}
            defer func() { <-sem }()
            defer close(p.err)
            base := syntax.NewFileBase(filename)

            f, err := os.Open(filename)
            if err != nil {
                p.error(syntax.Error{Msg: err.Error()})
                return
            }
            defer f.Close()

            p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches) // errors are tracked via p.error
        }(filename)
    }

    // build each syntax tree into an abstract syntax tree
    var lines uint
    for _, p := range noders {
        for e := range p.err {
            p.yyerrorpos(e.Pos, "%s", e.Msg)
        }

        p.node() // core of the abstract syntax tree construction
        lines += p.file.Lines
        p.file = nil // release memory

        ......
    }

    localpkg.Height = myheight

    return lines
}
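
The concurrency pattern in parseFiles (one goroutine per file, a buffered channel used as a semaphore, results collected in input order) can be tried in isolation. The following is only a sketch of that pattern: job, countLines, and processAll are hypothetical names invented for this illustration, and countLines stands in for the real call to syntax.Parse.

```go
package main

import (
	"fmt"
	"strings"
)

// job is a hypothetical stand-in for *noder: one per input, filled in by a worker.
type job struct {
	lines int
	done  chan struct{} // closed when the worker has finished
}

// countLines is a hypothetical stand-in for syntax.Parse.
func countLines(src string) int {
	return strings.Count(src, "\n")
}

// processAll starts one goroutine per input but lets at most limit of them
// work at the same time, then collects the results in input order.
// limit must be at least 1, otherwise no worker can ever acquire a slot.
func processAll(srcs []string, limit int) int {
	sem := make(chan struct{}, limit)
	jobs := make([]*job, 0, len(srcs))

	for _, src := range srcs {
		j := &job{done: make(chan struct{})}
		jobs = append(jobs, j)
		go func(src string) {
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot
			defer close(j.done)      // deferred last, so it runs first
			j.lines = countLines(src)
		}(src)
	}

	total := 0
	for _, j := range jobs {
		<-j.done // wait for this worker, much like draining p.err
		total += j.lines
	}
	return total
}

func main() {
	srcs := []string{"package a\nvar x = 1\n", "package b\n"}
	fmt.Println(processAll(srcs, 2)) // prints 3
}
```

Waiting on each job in slice order keeps the results deterministic even though the workers finish in any order, which is the same reason parseFiles walks noders in order.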

After parsing the source file into a syntax tree, the Go compiler builds each syntax tree (source file) into an abstract syntax tree. The core code is in the p.node() method:

func (p *noder) node() {
    ......

    xtop = append(xtop, p.decls(p.file.DeclList)...)

    ......
    clearImports()
}

The core of p.node() is the call p.decls(p.file.DeclList), which converts the declarations in the source file into abstract syntax trees one by one. Each import, var, type, const, or func declaration becomes a root node whose children are the parts of that declaration

p.decls(p.file.DeclList) is as follows:

func (p *noder) decls(decls []syntax.Decl) (l []*Node) {
    var cs constState

    for _, decl := range decls {
        p.setlineno(decl)
        switch decl := decl.(type) {
        case *syntax.ImportDecl:
            p.importDecl(decl)

        case *syntax.VarDecl:
            l = append(l, p.varDecl(decl)...)

        case *syntax.ConstDecl:
            l = append(l, p.constDecl(decl, &cs)...)

        case *syntax.TypeDecl:
            l = append(l, p.typeDecl(decl))

        case *syntax.FuncDecl:
            l = append(l, p.funcDecl(decl))

        default:
            panic("unhandled Decl")
        }
    }

    return
}

Overall, this method converts each declaration in the syntax tree into an abstract syntax tree (a Node structure) rooted at that declaration, so the syntax tree ends up as a slice of nodes ([]*Node)
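
The same "type switch over declarations" shape can be reproduced with the standard library. This sketch uses go/parser and go/ast rather than the compiler's internal syntax package, so the types differ (GenDecl covers import, const, type, and var); declKinds and the sample source are made up for the illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const sample = `package demo

import "fmt"

const limit = 10

type pair struct{ x, y int }

var total = limit * 2

func show() { fmt.Println(total) }
`

// declKinds returns one label per top-level declaration, in source order.
func declKinds(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var kinds []string
	for _, decl := range f.Decls {
		switch d := decl.(type) {
		case *ast.GenDecl: // import, const, type, or var
			kinds = append(kinds, d.Tok.String())
		case *ast.FuncDecl:
			kinds = append(kinds, "func "+d.Name.Name)
		default:
			kinds = append(kinds, "unhandled")
		}
	}
	return kinds, nil
}

func main() {
	kinds, err := declKinds(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(kinds) // prints [import const type var func show]
}
```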

Below you can see what this Node structure looks like

type Node struct {
    // Tree structure.
    // Generic recursive walks should follow these fields.
    Left  *Node // left child
    Right *Node // right child
    Ninit Nodes
    Nbody Nodes
    List  Nodes // left-hand list of nodes
    Rlist Nodes // right-hand list of nodes

    // most nodes
    Type *types.Type // type of the node
    Orig *Node // original form, for printing, and tracking copies of ONAMEs

    // func
    Func *Func // function

    // ONAME, OTYPE, OPACK, OLABEL, some OLITERAL
    Name *Name // variable name, type name, package name, etc.

    Sym *types.Sym  // various
    E   interface{} // Opt or Val, see methods below

    // Various. Usually an offset into a struct. For example:
    // - ONAME nodes that refer to local variables use it to identify their stack frame position.
    // - ODOT, ODOTPTR, and ORESULT use it to indicate offset relative to their base address.
    // - OSTRUCTKEY uses it to store the named field's offset.
    // - Named OLITERALs use it to store their ambient iota value.
    // - OINLMARK stores an index into the inlTree data structure.
    // - OCLOSURE uses it to store ambient iota value, if any.
    // Possibly still more uses. If you find any, document them.
    Xoffset int64

    Pos src.XPos

    flags bitset32

    Esc uint16 // EscXXX

    Op  Op // what this node represents
    aux uint8
}

Knowing the fields commented above is basically enough. The key one is Op, which identifies what each node represents. All Ops are defined in src/cmd/compile/internal/gc/syntax.go; their names all start with O, they are all integers, and each Op has its own semantics

const (
    OXXX Op = iota

    // names
    ONAME // var or func name
    // Unnamed arg or return value: f(int, string) (int, error) { etc }
    // Also used for a qualified package identifier that hasn't been resolved yet.
    ONONAME
    OTYPE    // type name
    OPACK    // import
    OLITERAL // literal

    // expressions
    OADD          // Left + Right (addition)
    OSUB          // Left - Right (subtraction)
    OOR           // Left | Right (bitwise or)
    OXOR          // Left ^ Right
    OADDSTR       // +{List} (string addition, list elements are strings)
    OADDR         // &Left
    ......
    // Left = Right or (if Colas=true) Left := Right
    // If Colas, then Ninit includes a DCL node for Left.
    OAS
    // List = Rlist (x, y, z = a, b, c) or (if Colas=true) List := Rlist
    // If Colas, then Ninit includes DCL nodes for List
    OAS2
    OAS2DOTTYPE // List = Right (x, ok = I.(int))
    OAS2FUNC    // List = Right (x, y = f())
  ......
)

For example, when a node's Op is OAS, the node means Left = Right (or Left := Right when Colas is true). When the Op is OAS2, it means a multiple assignment such as x, y, z = a, b, c

Suppose there is a statement a := b + c(6). Its abstract syntax tree has an OAS node at the root, with the name a on the left and the expression b + c(6) on the right

In the end, every declaration is built into an abstract syntax tree of this kind. That is the general picture; the rest of the article looks in detail at how declarations are built into abstract syntax trees step by step.

How the parsing phase parses declarations

To see more directly what declarations look like once they are parsed, we can experiment with the packages in Go's standard library. The previous article did not show concretely what a declaration looks like after parsing, so it is shown here using the standard library.

💡 Tip: As mentioned in the Go lexical analysis article, the standard library's implementation of lexical analysis, syntax analysis, and abstract syntax tree construction differs from the implementation and design inside the Go compiler, but the overall idea is the same

Basic literal parsing

The basic literals are integer, float, complex, character, string, and identifier. From the previous article on Go syntax analysis, we know that the structure of a basic literal in the Go compiler is

BasicLit struct {
        Value string   // the literal text
        Kind  LitKind  // which kind of basic literal (IntLit, FloatLit, ImagLit, RuneLit, StringLit)
        Bad   bool // true means the literal Value has syntax errors
        expr
}

In the standard library, the basic literal structure looks like this

BasicLit struct {
        ValuePos token.Pos   // literal position
        Kind     token.Token // token.INT, token.FLOAT, token.IMAG, token.CHAR, or token.STRING
        Value    string      // literal string; e.g. 42, 0x7f, 3.14, 1e-9, 2.4i, 'a', '\x7f', "foo" or `\m\n\o`
    }

The two are almost the same. This also holds for the other literal and declaration structures mentioned later: the structures in the Go compiler differ from those in the standard library, but their meanings are similar.

Knowing this structure, we can build a basic literal by hand like this

func AstBasicLit()  {
    var basicLit = &ast.BasicLit{
        Kind:  token.INT,
        Value: "666",
    }
    ast.Print(nil, basicLit)
}

// Output
*ast.BasicLit {
    ValuePos: 0
    Kind: INT
    Value: "666"
}

The above constructs a basic literal directly. In theory we could build a complete syntax tree this way, but doing it by hand is tedious, so the standard library provides functions that build syntax trees automatically. For example, to turn the integer 666 into a basic literal structure:

func AstBasicLitCreat()  {
    expr, _ := parser.ParseExpr(`666`)
    ast.Print(nil, expr)
}

// Output
*ast.BasicLit {
    ValuePos: 1
    Kind: INT
    Value: "666"
}
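
The other literal kinds can be checked the same way. This small sketch (litKind is a helper name made up here) parses a few expressions with parser.ParseExpr and reports the Kind of the resulting *ast.BasicLit; an identifier is included to show that it is not a BasicLit in go/ast.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
)

// litKind parses src as an expression and returns the token kind of the
// literal, or "" if parsing fails or the expression is not a basic literal.
func litKind(src string) string {
	expr, err := parser.ParseExpr(src)
	if err != nil {
		return ""
	}
	lit, ok := expr.(*ast.BasicLit)
	if !ok {
		return ""
	}
	return lit.Kind.String()
}

func main() {
	for _, src := range []string{"666", "3.14", "2.4i", "'a'", `"foo"`, "a"} {
		fmt.Printf("%-6s %q\n", src, litKind(src))
	}
}
```

The five literals report INT, FLOAT, IMAG, CHAR, and STRING, while the identifier a yields the empty string because it parses to *ast.Ident.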

Another example is the identifier. Its structure is:

type Ident struct {
    NamePos token.Pos // position
    Name    string    // identifier name
    Obj     *Object   // what the identifier denotes, or extra information
}

An identifier can be constructed as follows

func AstInent()  {
    ast.Print(nil, ast.NewIdent(`a`))
}

// Output
*ast.Ident {
    NamePos: 0
    Name: "a"
}

If the identifier appears in an expression, additional information about the identifier is stored in the Obj field

func AstInent()  {
    expr, _ := parser.ParseExpr(`a`)
    ast.Print(nil, expr)
}

// Output
*ast.Ident {
   NamePos: 1
   Name: "a"
   Obj: *ast.Object {
      Kind: bad
      Name: ""
   }
}

The Kind field of the ast.Object structure describes what kind of entity the identifier denotes

const (
    Bad ObjKind = iota // for error handling
    Pkg                // package
    Con                // constant
    Typ                // type
    Var                // variable
    Fun                // function or method
    Lbl                // label
)

Expression parsing

In go/ast/ast.go in the standard library you can find the structures for the various kinds of expressions. Here are a few of them

// A SelectorExpr node represents an expression followed by a selector.
    SelectorExpr struct {
        X   Expr   // expression
        Sel *Ident // field selector
    }

    // An IndexExpr node represents an expression followed by an index.
    IndexExpr struct {
        X      Expr      // expression
        Lbrack token.Pos // position of "["
        Index  Expr      // index expression
        Rbrack token.Pos // position of "]"
    }

    // A SliceExpr node represents an expression followed by slice indices.
    SliceExpr struct {
        X      Expr      // expression
        Lbrack token.Pos // position of "["
        Low    Expr      // begin of slice range; or nil
        High   Expr      // end of slice range; or nil
        Max    Expr      // maximum capacity of slice; or nil
        Slice3 bool      // true if 3-index slice (2 colons present)
        Rbrack token.Pos // position of "]"
    }

The Go compiler has similar expression structures, defined in src/cmd/compile/internal/syntax/nodes.go

// X.Sel
    SelectorExpr struct {
        X   Expr
        Sel *Name
        expr
    }

    // X[Index]
    IndexExpr struct {
        X     Expr
        Index Expr
        expr
    }

    // X[Index[0] : Index[1] : Index[2]]
    SliceExpr struct {
        X     Expr
        Index [3]Expr
        // Full indicates whether this is a simple or full slice expression.
        // In a valid AST, this is equivalent to Index[2] != nil.
        // TODO(mdempsky): This is only needed to report the "3-index
        // slice of string" error when Index[2] is missing.
        Full bool
        expr
    }

Although the structure definitions differ, the meaning of the expressions is similar. The standard library defines many expression node types, for example

type BadExpr struct{ ... }
type BinaryExpr struct{ ... }
type CallExpr struct{ ... }
type Expr interface{ ... }
type ExprStmt struct{ ... }
type IndexExpr struct{ ... }
type KeyValueExpr struct{ ... }
......

In the Go compiler, the core method of parsing expressions is: src/cmd/compile/internal/gc/noder.go→ expr()

func (p *noder) expr(expr syntax.Expr) *Node {
    p.setlineno(expr)
    switch expr := expr.(type) {
    case nil, *syntax.BadExpr:
        return nil
    case *syntax.Name:
        return p.mkname(expr)
    case *syntax.BasicLit:
        n := nodlit(p.basicLit(expr))
        n.SetDiag(expr.Bad) // avoid follow-on errors if there was a syntax error
        return n
    case *syntax.CompositeLit:
        n := p.nod(expr, OCOMPLIT, nil, nil)
        if expr.Type != nil {
            n.Right = p.expr(expr.Type)
        }
        l := p.exprs(expr.ElemList)
        for i, e := range l {
            l[i] = p.wrapname(expr.ElemList[i], e)
        }
        n.List.Set(l)
        lineno = p.makeXPos(expr.Rbrace)
        return n
    case *syntax.KeyValueExpr:
        // use position of expr.Key rather than of expr (which has position of ':')
        return p.nod(expr.Key, OKEY, p.expr(expr.Key), p.wrapname(expr.Value, p.expr(expr.Value)))
    case *syntax.FuncLit:
        return p.funcLit(expr)
    case *syntax.ParenExpr:
        return p.nod(expr, OPAREN, p.expr(expr.X), nil)
    case *syntax.SelectorExpr:
        // parser.new_dotname
        obj := p.expr(expr.X)
        if obj.Op == OPACK {
            obj.Name.SetUsed(true)
            return importName(obj.Name.Pkg.Lookup(expr.Sel.Value))
        }
        n := nodSym(OXDOT, obj, p.name(expr.Sel))
        n.Pos = p.pos(expr) // lineno may have been changed by p.expr(expr.X)
        return n
    case *syntax.IndexExpr:
        return p.nod(expr, OINDEX, p.expr(expr.X), p.expr(expr.Index))
    
    ......
    }
    panic("unhandled Expr")
}

Let's still use the methods provided in the Go standard library to see what a binary expression looks like after it is parsed

func AstBasicExpr()  {
    expr, _ := parser.ParseExpr(`6+7*8`)
    ast.Print(nil, expr)
}

The top-level structure of a binary expression is BinaryExpr

// A BinaryExpr node represents a binary expression.
    BinaryExpr struct {
        X     Expr        // left operand
        OpPos token.Pos   // position of Op
        Op    token.Token // operator
        Y     Expr        // right operand
    }

Once an expression is parsed into such a structure, different nodes can be created according to Op. As mentioned earlier, each Op has its own semantics
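
Operator precedence is already encoded in how BinaryExpr nodes nest. The sketch below (shape and shapeOf are hypothetical helper names) prints an expression with explicit parentheses around every BinaryExpr, which makes the nesting visible without reading the full ast.Print output.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
)

// shape renders an expression with parentheses around every binary
// expression, so the tree structure is visible on one line.
func shape(e ast.Expr) string {
	switch e := e.(type) {
	case *ast.BinaryExpr:
		return "(" + shape(e.X) + " " + e.Op.String() + " " + shape(e.Y) + ")"
	case *ast.ParenExpr:
		return shape(e.X)
	case *ast.BasicLit:
		return e.Value
	case *ast.Ident:
		return e.Name
	}
	return "?" // expression kinds this sketch does not handle
}

// shapeOf parses src and returns its shape, or the parse error text.
func shapeOf(src string) string {
	expr, err := parser.ParseExpr(src)
	if err != nil {
		return err.Error()
	}
	return shape(expr)
}

func main() {
	fmt.Println(shapeOf("6+7*8")) // prints (6 + (7 * 8))
	fmt.Println(shapeOf("6*7+8")) // prints ((6 * 7) + 8)
}
```

For 6+7*8 the root is the + node and the multiplication is its right operand Y, which is why an evaluator that simply recurses into X and Y gets precedence right for free.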

Expression evaluation

Suppose the binary expression above is to be evaluated

func AstBasicExpr()  {
    expr, _ := parser.ParseExpr(`6+7*8`)
    fmt.Println(Eval(expr))
}

func Eval(exp ast.Expr) float64 {
    switch exp := exp.(type) {
    case *ast.BinaryExpr: // binary expression: evaluate it with EvalBinaryExpr
        return EvalBinaryExpr(exp)
    case *ast.BasicLit: // basic literal
        f, _ := strconv.ParseFloat(exp.Value, 64)
        return f
    }
    return 0
}

func EvalBinaryExpr(exp *ast.BinaryExpr) float64 { // only + and * are implemented here
    switch exp.Op {
    case token.ADD:
        return Eval(exp.X) + Eval(exp.Y)
    case token.MUL:
        return Eval(exp.X) * Eval(exp.Y)
    }
    return 0
}

// Output
62

The key places are commented, so it should be easy to follow
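
One weakness of the evaluator above is that unsupported input silently evaluates to 0. A slightly fuller variant, offered only as a sketch (evalExpr and evalString are names made up here), adds subtraction, division, and parentheses, and returns an error for anything it does not understand.

```go
package main

import (
	"errors"
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// evalExpr evaluates numeric literals, parentheses, and the four basic
// arithmetic operators, and reports everything else as an error.
func evalExpr(e ast.Expr) (float64, error) {
	switch e := e.(type) {
	case *ast.BasicLit:
		if e.Kind != token.INT && e.Kind != token.FLOAT {
			return 0, fmt.Errorf("unsupported literal %s", e.Value)
		}
		return strconv.ParseFloat(e.Value, 64)
	case *ast.ParenExpr:
		return evalExpr(e.X)
	case *ast.BinaryExpr:
		x, err := evalExpr(e.X)
		if err != nil {
			return 0, err
		}
		y, err := evalExpr(e.Y)
		if err != nil {
			return 0, err
		}
		switch e.Op {
		case token.ADD:
			return x + y, nil
		case token.SUB:
			return x - y, nil
		case token.MUL:
			return x * y, nil
		case token.QUO:
			if y == 0 {
				return 0, errors.New("division by zero")
			}
			return x / y, nil
		}
		return 0, fmt.Errorf("unsupported operator %s", e.Op)
	}
	return 0, fmt.Errorf("unsupported expression %T", e)
}

// evalString parses src as a Go expression and evaluates it.
func evalString(src string) (float64, error) {
	expr, err := parser.ParseExpr(src)
	if err != nil {
		return 0, err
	}
	return evalExpr(expr)
}

func main() {
	for _, src := range []string{"6+7*8", "(6+7)*8", "1%2"} {
		v, err := evalString(src)
		fmt.Println(src, "=>", v, err)
	}
}
```

The first two inputs evaluate to 62 and 104; the third parses fine but is rejected with an "unsupported operator" error instead of quietly returning 0.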

Var declaration parsing

First, recall from the previous article on Go syntax analysis that the compiler parses a var declaration into a VarDecl structure. In the standard library, however, var, const, type, and import declarations are all parsed into the GenDecl structure (a general declaration)

//    token.IMPORT  *ImportSpec
    //    token.CONST   *ValueSpec
    //    token.TYPE    *TypeSpec
    //    token.VAR     *ValueSpec
    //
    GenDecl struct {
        Doc    *CommentGroup // associated documentation; or nil
        TokPos token.Pos     // position of Tok
        Tok    token.Token   // IMPORT, CONST, TYPE, VAR
        Lparen token.Pos     // position of '(', if any
        Specs  []Spec
        Rparen token.Pos // position of ')', if any
    }

The Tok field tells you which kind of declaration it is

The following example shows what a var declaration looks like after parsing

const srcVar = `package test
var a = 6+7*8
`

func AstVar()  {
    fset := token.NewFileSet()
    f, err := parser.ParseFile(fset, "hello.go", srcVar, parser.AllErrors)
    if err != nil {
        log.Fatal(err)
    }
    for _, decl := range f.Decls {
        if v, ok := decl.(*ast.GenDecl); ok {
            fmt.Printf("Tok: %v\n", v.Tok)
            for _, spec := range v.Specs {
                ast.Print(nil, spec)
            }
        }
    }
}

First, you can see that its Tok is var, which marks it as a var declaration. Its variable names are stored in an ast.ValueSpec structure, which plays roughly the same role as the VarDecl structure in the Go compiler
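
Instead of printing the whole spec, the interesting parts of a ValueSpec (Names, Type, Values) can be pulled out directly. The following sketch assumes nothing beyond go/ast and go/parser; varInfo and varDecls are names invented for the example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// varInfo summarises one name declared by a var declaration.
type varInfo struct {
	Name     string
	HasType  bool
	HasValue bool
}

// varDecls returns a summary for every name declared by a top-level var
// declaration in src.
func varDecls(src string) ([]varInfo, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var out []varInfo
	for _, decl := range f.Decls {
		gd, ok := decl.(*ast.GenDecl)
		if !ok || gd.Tok != token.VAR {
			continue // skip import, const, type, and func declarations
		}
		for _, spec := range gd.Specs {
			vs, ok := spec.(*ast.ValueSpec)
			if !ok {
				continue
			}
			for _, name := range vs.Names {
				out = append(out, varInfo{
					Name:     name.Name,
					HasType:  vs.Type != nil,
					HasValue: len(vs.Values) > 0,
				})
			}
		}
	}
	return out, nil
}

func main() {
	infos, err := varDecls("package test\nvar a = 6+7*8\nvar b, c int\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(infos) // prints [{a false true} {b true false} {c true false}]
}
```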

At this point you should have a general idea of what basic literals, expressions, and var declarations look like after syntax analysis. As mentioned in the overview, the abstract syntax tree stage converts the declarations in a Go source file into abstract syntax trees one by one: each import, var, type, const, or func declaration becomes a root node whose children are the parts of that declaration. Let's take the var declaration as an example to see how this is done

Abstract Syntax Tree Construction

The construction process is similar for every kind of declaration. The code involved is fairly complicated, so it is not explained line by line here; you can read it yourself in src/cmd/compile/internal/gc/noder.go → decls()

Only the var declaration is used as an example here, to show how it is handled during abstract syntax tree construction

Abstract Syntax Tree Construction for Var Declaration Statement

As mentioned earlier, the core logic of abstract syntax tree construction is in src/cmd/compile/internal/gc/noder.go → decls. When the declaration type is *syntax.VarDecl, it calls p.varDecl(decl) to handle it

func (p *noder) decls(decls []syntax.Decl) (l []*Node) {
    var cs constState

    for _, decl := range decls {
        p.setlineno(decl)
        switch decl := decl.(type) {
        ......
        case *syntax.VarDecl:
            l = append(l, p.varDecl(decl)...)
        ......
        default:
            panic("unhandled Decl")
        }
    }

    return
}

Look directly at the internal implementation of p.varDecl(decl)

func (p *noder) varDecl(decl *syntax.VarDecl) []*Node {
    names := p.declNames(decl.NameList) // handle the variable names
    typ := p.typeExprOrNil(decl.Type) // handle the variable type

    var exprs []*Node
    if decl.Values != nil {
        exprs = p.exprList(decl.Values) // handle the values
    }
    ......
    return variter(names, typ, exprs)
}

Only the core calls in this method are shown. The call chains are fairly deep, so what each call does is summarized below.

Let's first review what the structure that holds the var declaration looks like

// NameList Type
    // NameList Type = Values
    // NameList      = Values
    VarDecl struct {
        Group    *Group // nil means not part of a group
        Pragma   Pragma
        NameList []*Name
        Type     Expr // nil means no type
        Values   Expr // nil means no values
        decl
    }

The core fields are NameList, Type, and Values, and the method above calls one method for each of these three fields.

  1. names := p.declNames(decl.NameList) converts every variable name into a corresponding Node structure. The fields of Node were introduced earlier; the key one is Op, and this method sets the Op of each name to ONAME. It returns a slice of Nodes containing all the variable names declared by the var declaration
  2. p.typeExprOrNil(decl.Type) converts the declared type (such as int, string, or a slice type, as in var a int) into a corresponding Node structure. It is implemented mainly by calling expr(expr syntax.Expr), which contains a large type switch
  3. p.exprList(decl.Values) converts the value part into corresponding Node structures, choosing the conversion according to the kind of each value expression
  4. variter(names, typ, exprs) combines the name part, the type part, and the value or expression part of the var declaration into a Node tree or a slice of Nodes

The first three methods convert each part of the var declaration into Nodes, which mostly comes down to setting the Op of each node, since each Op carries a meaning. The fourth method then splices these nodes into a tree according to those meanings, so that the tree correctly expresses the var declaration

The following is an example to show the abstract syntax tree construction of var declaration

Example showing abstract syntax tree for var declaration

Suppose we have the var declaration below. First its parsed form is shown using the standard library's parser, and then the result of building it into an abstract syntax tree.

const srcVar = `package test
var a = 666+6
`
func AstVar()  {
    fset := token.NewFileSet()
    f, err := parser.ParseFile(fset, "hello.go", srcVar, parser.AllErrors)
    if err != nil {
        log.Fatal(err)
    }
    for _, decl := range f.Decls {
        if v, ok := decl.(*ast.GenDecl); ok {
            fmt.Printf("Tok: %v\n", v.Tok)
            for _, spec := range v.Specs {
                ast.Print(nil, spec)
            }
        }
    }
}

The above gives the result of syntax analysis. The three methods mentioned above are then called to convert the names, type, and values into Node structures as follows:

Names:
ONAME(a)

Values:
OLITERAL(666)
OADD(+)
OLITERAL(6)

The variter(names, typ, exprs) method then combines these Nodes into a tree: an assignment node at the root, the name a on its left, and the OADD expression on its right.

You can inspect the parse result of any code as follows:
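
The whole pipeline for this example can be imitated on top of go/ast. This is a rough sketch under stated assumptions, not the compiler's code: Node here is a tiny hypothetical struct with a string Op, exprNode plays the role of p.expr, and variterSketch only pairs names with values, whereas the real variter also declares the names and handles types and function-local declarations.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Node is a simplified, hypothetical stand-in for the compiler's Node.
type Node struct {
	Op          string // "ONAME", "OLITERAL", "OADD", "OAS", ...
	Text        string
	Left, Right *Node
}

// String renders a leaf as Op(text) and an inner node as Op(left, right).
func (n *Node) String() string {
	if n == nil {
		return "nil"
	}
	if n.Left == nil && n.Right == nil {
		return n.Op + "(" + n.Text + ")"
	}
	return n.Op + "(" + n.Left.String() + ", " + n.Right.String() + ")"
}

var binaryOps = map[token.Token]string{
	token.ADD: "OADD", token.SUB: "OSUB", token.MUL: "OMUL",
}

// exprNode converts one expression into a node, like a tiny p.expr.
func exprNode(e ast.Expr) *Node {
	switch e := e.(type) {
	case *ast.BasicLit:
		return &Node{Op: "OLITERAL", Text: e.Value}
	case *ast.Ident:
		return &Node{Op: "ONAME", Text: e.Name}
	case *ast.BinaryExpr:
		op, ok := binaryOps[e.Op]
		if !ok {
			op = "OXXX" // operator not covered by this sketch
		}
		return &Node{Op: op, Left: exprNode(e.X), Right: exprNode(e.Y)}
	}
	return &Node{Op: "OXXX"}
}

// variterSketch pairs each name with its value in an assignment node.
func variterSketch(names, values []*Node) []*Node {
	var out []*Node
	for i, name := range names {
		var v *Node
		if i < len(values) {
			v = values[i]
		}
		out = append(out, &Node{Op: "OAS", Left: name, Right: v})
	}
	return out
}

// buildVar converts every top-level var declaration in src into nodes.
func buildVar(src string) ([]*Node, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "hello.go", src, 0)
	if err != nil {
		return nil, err
	}
	var out []*Node
	for _, decl := range f.Decls {
		gd, ok := decl.(*ast.GenDecl)
		if !ok || gd.Tok != token.VAR {
			continue
		}
		for _, spec := range gd.Specs {
			vs := spec.(*ast.ValueSpec)
			var names, values []*Node
			for _, n := range vs.Names {
				names = append(names, &Node{Op: "ONAME", Text: n.Name})
			}
			for _, v := range vs.Values {
				values = append(values, exprNode(v))
			}
			out = append(out, variterSketch(names, values)...)
		}
	}
	return out, nil
}

func main() {
	nodes, err := buildVar("package test\nvar a = 666+6\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(nodes[0]) // prints OAS(ONAME(a), OADD(OLITERAL(666), OLITERAL(6)))
}
```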

const src = `your code`
func Parser()  {
    fset := token.NewFileSet() // positions are relative to fset
    f, err := parser.ParseFile(fset, "", src, 0)
    if err != nil {
        panic(err)
    }

    // Print the AST.
    ast.Print(fset, f)
}

Summary

This article first gave an overall picture of what abstract syntax tree construction does, and of what the declarations in a source file look like once they are built into abstract syntax trees.

Then, using the parser in the standard library, it showed what basic literals, expressions, and var declarations look like after parsing. Finally, taking the var declaration as an example, it showed how a var declaration is handled during abstract syntax tree construction.

Lexical analysis, syntax analysis, abstract syntax tree construction, and the type checking to be covered later all involve many implementation details that cannot be presented one by one here. This series aims to give readers a clear outline that you can follow when reading the details yourself. For example, which forms of var declaration are valid and how import can be written can both be found in the compiler's implementation

书旅