In the compilation process, a very important step is syntax analysis (also known as parsing). The parser is responsible for converting the token stream into an abstract syntax tree (AST). This article introduces a parser implementation algorithm, Pratt Parsing, also known as Top Down Operator Precedence Parsing, and implements it in TypeScript.

Pratt Parsing is very simple to implement: in the TypeScript implementation, the core code is less than 40 lines!

Application background

There are generally two ways to implement a parser:

  • Using a parser generator
  • Implementing it by hand

Parser generator

With a parser generator, you describe your grammar in a DSL (such as BNF) and feed the description file to the generator, which outputs code that parses the grammar.

This method is very convenient and sufficient for most needs. But in some cases it is not flexible enough (for example, it cannot provide more useful, contextual error messages), its performance is not good enough, or the generated code is too long. Also, when the grammar has to describe the operator precedence and associativity of expressions, it can become very complex and difficult to read, as in this example from Wikipedia:

expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary

You need to create a rule for each precedence level, which makes the grammar for expressions very complicated.

So sometimes you need the second way: manual implementation.

Manual implementation

Recursive descent algorithm

A common way to implement a parser by hand is the recursive descent algorithm. Recursive descent is good at parsing statements, because language designers intentionally place the identifier of the statement type at the beginning of the statement, as in if (expression) ... and while (expression) ... . Thanks to this, once the parser has identified the type of a statement from its beginning, it knows which structures need to be parsed next and can call the corresponding parsing functions in turn, so the implementation is very simple.

However, since recursive descent needs to understand the code structure top-down, it has a hard time with expressions. When the parser reads the beginning of an expression, it cannot know which kind of expression it is in. This is because the operator is often in the middle (or even at the end) of the expression, such as + for addition and () for function calls. To parse expressions top-down, you need to treat each operator precedence as a level, write a parsing function for each level, and handle associativity manually, so the parsing functions become more numerous and more complex.

For example, in the Wikipedia example, expression handles addition and subtraction, term handles multiplication and division, and the former calls the latter (multiplication and division sit at a lower level). It is easy to see that the more precedence levels there are, the more complex the code becomes and the deeper the recursive calls go. For example, even if the input string is simply 1 , this parser needs to call the following chain of parsing functions: program -> block -> statement -> expression -> term -> factor . The last 2 layers of calls should have been avoidable, because the input does not contain addition, subtraction, multiplication or division at all!
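
To make the "one function per precedence level" structure concrete, here is a minimal sketch of a recursive descent expression parser. It is not the Wikipedia code: the names parseWithRecursiveDescent, Ast and calls are hypothetical, and it covers only the expression, term and factor levels. The calls array records which parsing functions were entered.

```typescript
// Hypothetical sketch: one parsing function per precedence level.
type Ast = number | { op: string; left: Ast; right: Ast };

function parseWithRecursiveDescent(tokens: string[]): { ast: Ast; calls: string[] } {
  let pos = 0;
  const calls: string[] = []; // records every parsing function that is entered

  // expression ::= term ( ('+' | '-') term )*
  function parseExpression(): Ast {
    calls.push("expression");
    let left = parseTerm();
    for (let op = tokens[pos]; op === "+" || op === "-"; op = tokens[pos]) {
      pos++; // skip the operator
      left = { op, left, right: parseTerm() };
    }
    return left;
  }

  // term ::= factor ( ('*' | '/') factor )*
  function parseTerm(): Ast {
    calls.push("term");
    let left = parseFactor();
    for (let op = tokens[pos]; op === "*" || op === "/"; op = tokens[pos]) {
      pos++; // skip the operator
      left = { op, left, right: parseFactor() };
    }
    return left;
  }

  // factor ::= NUMBER | '(' expression ')'
  function parseFactor(): Ast {
    calls.push("factor");
    const token: string | undefined = tokens[pos++];
    if (token === undefined) throw new Error("unexpected end of input");
    if (token === "(") {
      const inner = parseExpression();
      if (tokens[pos++] !== ")") throw new Error("expected ')'");
      return inner;
    }
    const value = Number(token);
    if (Number.isNaN(value)) throw new Error(`unexpected token ${token}`);
    return value;
  }

  const ast = parseExpression();
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return { ast, calls };
}

// Even for the single token 1, every level is visited:
console.log(parseWithRecursiveDescent(["1"]).calls.join(" -> "));
// prints "expression -> term -> factor"
```

Adding a new precedence level in this style means adding another function to the chain, and every parse then passes through it.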

Therefore, when implementing a parser by hand, the parsing of expressions is generally handed over to another algorithm to avoid this disadvantage of recursive descent. Pratt Parsing is one such algorithm that is good at parsing expressions.

Pratt Parsing

Pratt Parsing, also known as Top Down Operator Precedence Parsing, is a very ingenious algorithm. It is simple to implement, performs well, and is easy to customize and extend. It is especially good at parsing expressions, including handling operator precedence and associativity.

Algorithm introduction

Concept introduction

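Pratt Parsing divides tokens into 2 types:

  • prefix: a token that can begin an expression, such as the minus sign of a negative number. Building its expression node does not require knowing the expression on its left.
  • infix: a token that needs the expression on its left to be parsed first, such as the subtraction operator.

For example, the same - token produces two different nodes: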

// The minus-sign prefix of a negative number
// It does not need to know the expression on its left
{
  type: "unary",
  operator: "-",
  body: rightExpression,
}
// The subtraction operator
// The expression on its left must be parsed first (giving leftExpression) before the subtraction node can be built
{
  type: "binary",
  operator: "-",
  left: leftExpression,
  right: rightExpression,
}

Note that although - can be both a prefix and an infix, when you read the input from left to right you can always determine immediately whether the - you encounter should be treated as a prefix or an infix, so there is no need to worry about confusion (e.g. -1-2 ). This will become clearer once you understand the algorithm below.

Code explanation

The core implementation of the Pratt Parsing algorithm is the parseExp function :

/*  1 */ function parseExp(ctxPrecedence: number): Node {
/*  2 */   let prefixToken = scanner.consume();
/*  3 */   if (!prefixToken) throw new Error(`expect token but found none`);
/*  4 */ 
/*  5 */   // because our scanner is so naive,
/*  6 */   // we treat all non-operator tokens as values (e.g. numbers)
/*  7 */   const prefixParselet =
/*  8 */     prefixParselets[prefixToken] ?? prefixParselets.__value;
/*  9 */   let left: Node = prefixParselet.handle(prefixToken, parser);
/* 10 */
/* 11 */   while (true) {
/* 12 */     const infixToken = scanner.peek();
/* 13 */     if (!infixToken) break;
/* 14 */     const infixParselet = infixParselets[infixToken];
/* 15 */     if (!infixParselet) break;
/* 16 */     if (infixParselet.precedence <= ctxPrecedence) break;
/* 17 */     scanner.consume();
/* 18 */     left = infixParselet.handle(left, infixToken, parser);
/* 19 */   }
/* 20 */   return left;
/* 21 */ }

Below we explain, line by line, how this algorithm works.

Lines 2 to 10: Parse the prefix

First, parseExp consumes a token from the token stream. This token must be treated as a prefix (for example, if - is encountered here, it is understood as a prefix).

Note that consume removes the next token from the stream and returns it, while peek returns the next token without removing it.
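
As an illustration, a scanner offering these two operations could look like the sketch below. The createScanner name and the pre-split token array are assumptions made for this example; the scanner in the actual implementation may be structured differently.

```typescript
// Hypothetical minimal scanner over an already tokenized input.
function createScanner(tokens: string[]) {
  let pos = 0; // index of the next unread token
  return {
    // peek: return the next token without advancing
    peek(): string | undefined {
      return tokens[pos];
    },
    // consume: return the next token and advance past it
    consume(): string | undefined {
      return pos < tokens.length ? tokens[pos++] : undefined;
    },
  };
}
```

Calling peek repeatedly keeps returning the same token, whereas each consume moves on to the next one. Both return undefined once the input is exhausted, which is what the `if (!prefixToken)` and `if (!infixToken)` checks in parseExp rely on.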

On line 7, we look up the expression builder (prefixParselet) corresponding to this prefix and, on line 9, call it. The role of a prefixParselet is to construct an expression node centered on this prefix.

Let's start with a simple case and assume the first token is 123 . It triggers the default prefixParselet ( prefixParselets.__value ), which directly returns a value node:

{
  type: "value",
  value: "123",
}

This is the value assigned to left on line 9 (the expression node constructed so far).

In more complex cases, the prefixParselet recursively calls parseExp . For example, the prefixParselet for the minus sign - is registered like this:

// The precedence of the minus-sign prefix is set to 150; its purpose is explained later
prefixParselets["-"] = {
  handle(token, parser) {
    const body = parser.parseExp(150);
    return {
      type: "unary",
      operator: "-",
      body,
    };
  },
};

It will recursively call parseExp and parse the expression node on the right as its own body.

Note that it doesn't care what the expression on its left is, which is the fundamental characteristic of prefix.

Here, the argument 150 passed in the recursive call parseExp(150) can be understood as the binding strength between the prefix and the sub-expression on its right. For example, when parsing -1+2 , the body obtained by the prefix - calling parseExp(150) is 1 rather than 1+2 , and this is thanks to the argument 150. The precise mechanism of precedence is described later.

Lines 11 to 19: Parse the infix

After getting the expression node for the prefix, we enter a while loop, which is responsible for parsing the infix operations that follow. For example, in -1 + 2 + 3 + 4 , the three plus signs are all parsed in this loop.

The loop first peeks at the next token in the token stream, treats it as an infix, looks up its expression builder (infixParselet), and calls infixParselet.handle to get a new expression node. Note that infixParselet.handle is called with the current left , because an infix needs the expression node on its left to construct itself. The new expression node is then assigned to left again, so left keeps accumulating into a larger tree of nodes.

For example, the infixParselet for - is registered like this:

// The precedence of addition and subtraction is defined as 120
infixParselets["-"] = {
  precedence: 120,
  handle(left, token, parser) {
    const right = parser.parseExp(120);
    return {
      type: "binary",
      operator: "-",
      left,
      right,
    };
  },
};

Like a prefixParselet, it recursively calls parseExp to parse the expression node on its right. The differences are that it also exposes a readable precedence property, and that it uses the left parameter when building the expression node.

Moving on, understanding the three checks on lines 13 to 16 is the key to understanding the entire algorithm.

The first check, if (!infixToken) break; , is easy to understand: the end of the input has been reached, so parsing naturally ends.

The second check, if (!infixParselet) break; , is also fairly easy to understand: a token that is not an infix operator has been encountered. This may be due to a syntax error in the input, or to a token such as ) or ; , and the expression node parsed so far needs to be returned to the caller to deal with.

The third check, if (infixParselet.precedence <= ctxPrecedence) break; , is the core of the whole algorithm. The ctxPrecedence parameter of parseExp mentioned above exists for this line. Its function is to restrict this parseExp call to parsing only infix operators whose precedence is greater than ctxPrecedence . If the precedence of the infix encountered is less than or equal to ctxPrecedence , parsing stops and the current result is returned to the caller, which then processes the subsequent tokens. In the initial call ctxPrecedence is 0 , which means that all operations are parsed until the end of the input (or an unrecognized operator) is reached.

For example, in the earlier example -1+2 , the prefixParselet of the prefix - recursively calls parseExp(150) . Inside that recursive call ctxPrecedence is 150, which is greater than the precedence of the infix + (120), so the recursive call stops when it encounters + . This makes the prefix - bind to 1 instead of 1+2 , and the correct result (-(1))+2 is obtained.

An argument of the same kind is passed when an infixParselet recursively calls parseExp.

You can think of a prefixParselet or infixParselet that recursively calls parseExp as using a "magnet" to attract the tokens that follow, with the argument ctxPrecedence representing the "attraction" of that magnet. A subsequent infix is "sucked in" only when it binds tightly enough to the token on its left (that is, when infixParselet.precedence is large enough). Otherwise, the infix is "separated" from the token on its left: the token on its left takes part in building the expression node in this parseExp call, but the infix does not.

Algorithm Summary

To sum up, Pratt Parsing is a combination of looping and recursion. The execution structure of parseExp is roughly as follows:

  • Consume a token as the prefix, call its prefixParselet, and get left (the expression node constructed so far)

    • The prefixParselet may recursively call parseExp to parse the parts it needs and build its expression node
  • while loop

    • Peek at the next token as an infix, and continue only if its precedence is high enough. Otherwise, break out of the loop
    • Consume the infix token, call its infixParselet, and pass left to it

      • The infixParselet recursively calls parseExp to parse the parts it needs and build its expression node
    • Get the new left
  • return left
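
The structure above can be put together into a small self-contained program. This is a sketch under a few assumptions rather than the article's actual code: the helper names parse, show and binary and the type name ExprNode are hypothetical, the tokenizer is a naive regular expression, and the parselets reach parseExp through a closure instead of receiving a parser argument. The precedence values 120, 130 and 150 are the ones used in this article.

```typescript
// Node shapes follow the article; ExprNode, parse, show and binary are hypothetical names.
type ExprNode =
  | { type: "value"; value: string }
  | { type: "unary"; operator: string; body: ExprNode }
  | { type: "binary"; operator: string; left: ExprNode; right: ExprNode };
type PrefixParselet = { handle(token: string): ExprNode };
type InfixParselet = { precedence: number; handle(left: ExprNode, token: string): ExprNode };

function parse(input: string): ExprNode {
  // Naive tokenizer: a run of digits, or any single non-space character.
  const tokens: string[] = input.match(/\d+|\S/g) ?? [];
  let pos = 0;
  const peek = (): string | undefined => tokens[pos];
  const consume = (): string | undefined => tokens[pos++];

  // Left-associative binary operator: recurse with its own precedence.
  const binary = (operator: string, precedence: number): InfixParselet => ({
    precedence,
    handle: (left) => ({ type: "binary", operator, left, right: parseExp(precedence) }),
  });
  const prefixParselets: Record<string, PrefixParselet> = {
    "-": { handle: () => ({ type: "unary", operator: "-", body: parseExp(150) }) },
  };
  const infixParselets: Record<string, InfixParselet> = {
    "+": binary("+", 120),
    "-": binary("-", 120),
    "*": binary("*", 130),
    "/": binary("/", 130),
  };

  function parseExp(ctxPrecedence: number): ExprNode {
    const prefixToken = consume();
    if (prefixToken === undefined) throw new Error("expected a token but found none");
    // Any token without a prefix parselet is treated as a plain value.
    const prefixParselet = prefixParselets[prefixToken];
    let left: ExprNode = prefixParselet
      ? prefixParselet.handle(prefixToken)
      : { type: "value", value: prefixToken };

    while (true) {
      const infixToken = peek();
      if (infixToken === undefined) break; // end of input
      const infixParselet = infixParselets[infixToken];
      if (!infixParselet) break; // not an infix operator
      if (infixParselet.precedence <= ctxPrecedence) break; // binds too weakly
      consume();
      left = infixParselet.handle(left, infixToken);
    }
    return left;
  }

  const result = parseExp(0);
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return result;
}

// Render the tree with explicit parentheses so the grouping is visible.
function show(node: ExprNode): string {
  switch (node.type) {
    case "value":
      return node.value;
    case "unary":
      return `(${node.operator}${show(node.body)})`;
    case "binary":
      return `(${show(node.left)} ${node.operator} ${show(node.right)})`;
  }
}

console.log(show(parse("-1+2"))); // prints "((-1) + 2)"
console.log(show(parse("-1-2"))); // prints "((-1) - 2)"
```

The second call shows the same - token being handled once by the prefix table and once by the infix table, purely because of where parseExp is in its control flow when the token is read.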

Now you should be able to understand the earlier claim that, when reading the input from left to right, you can immediately determine whether the - you encounter should be treated as a prefix or an infix, without worrying about confusion (such as -1-2 ): before reading the next token, the algorithm already knows whether that token should be treated as a prefix or as an infix!

The subtlety of Pratt Parsing is that after seeing the first atomic expression, it can directly construct the corresponding node without needing to know where that node sits in a higher-level expression structure. If further scanning reveals that the expression on the left belongs to an infix, it is handed over to that infix's handler to construct a higher-level expression. In other words, Pratt Parsing builds from the leaf nodes of the expression tree and places them in the appropriate context (a higher-level expression structure) based on the results of subsequent scanning. That is the root reason why it is so good at handling expressions.

Contrast this with the recursive descent algorithm mentioned earlier, which requires a top-down understanding of the expression structure: program -> block -> statement -> expression -> term -> factor .

Example execution process

Now, let's use 1 + 2 * 3 - 4 as an example to see how the Pratt Parsing algorithm runs:

  • First define the precedence of each infix (i.e. infixParselet.precedence ): for example, addition and subtraction are 120, and multiplication and division are 130 (the "binding strength" of multiplication and division is higher)
  • Initially call parseExp(0) , i.e. ctxPrecedence=0

    • Consume the token 1 , call its prefixParselet, get the expression node 1 , and assign it to left
    • Enter the while loop and peek at + ; its infixParselet has precedence 120, which is greater than ctxPrecedence, so this infix is "sucked in" as well
    • Consume + and call the infixParselet.handle of + . At this point left is 1

      • The infixParselet.handle of + recursively calls parser.parseExp(120) , i.e. ctxPrecedence=120
      • Consume the token 2 , call its prefixParselet, get the expression node 2 , and assign it to left
      • Enter the while loop and peek at * ; its infixParselet has precedence 130, which is greater than ctxPrecedence, so this infix is "sucked in" as well
      • Consume * and call the infixParselet.handle of * . At this point left is 2

        • The infixParselet.handle of * recursively calls parser.parseExp(130) , i.e. ctxPrecedence=130
        • Consume the token 3 , call its prefixParselet, get the expression node 3 , and assign it to left
        • Enter the while loop and peek at - ; its infixParselet has precedence 120, which is not greater than ctxPrecedence, so this infix is not sucked in and the while loop ends
        • parser.parseExp(130) returns 3
      • The infixParselet.handle of * returns 2 * 3 (combining left with the return value of parser.parseExp ), and the result is assigned to left
      • Continue the while loop and peek at - ; its infixParselet has precedence 120, which is not greater than ctxPrecedence, so this infix is not sucked in and the while loop ends
      • parser.parseExp(120) returns the subexpression 2 * 3
    • The infixParselet.handle of + returns 1+(2*3) (combining left with the return value of parser.parseExp ), and the result is assigned to left
    • Continue the while loop and peek at - ; its infixParselet has precedence 120, which is greater than ctxPrecedence, so this infix is "sucked in" as well
    • Consume - and call the infixParselet.handle of - . At this point left is 1+(2*3)
    • In the same way as before, the infixParselet.handle of - returns (1+(2*3))-4 (combining left with the return value of parser.parseExp ), and the result is assigned to left
    • The while loop continues, but there are no more tokens, so it exits the loop and returns left
  • parseExp(0) returns (1+(2*3))-4

How to deal with associativity

The associativity of an operator determines, when several consecutive operators with the same precedence appear in an expression, whether the operator on the left binds first (left-associative) or the operator on the right binds first (right-associative).

According to the algorithm described above, 1+1+1+1 is left-associative, that is, it is parsed as ((1+1)+1)+1 , which is what we expect.

However, some operators are right-associative, such as the assignment symbol = (e.g. a = b = 1 should be parsed as a = (b = 1) ) and the exponentiation symbol ^ (e.g. a^b^c should be parsed as a^(b^c) ).

Here we use ^ as the exponentiation symbol instead of ** as in JavaScript. This avoids having one operator that is a prefix of another operator, which triggers a defect in the current implementation: on encountering the first character of ** , it is eagerly recognized as multiplication. This bug is actually easy to fix. Would you like to try raising a PR?
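
One possible direction for such a fix is longest-match tokenization: try longer operators before shorter ones. The sketch below is only an illustration of that idea. The tokenize function, the OPERATORS list and the character class used for values are all assumptions, and none of it is taken from the repository.

```typescript
// Hypothetical operator list; "**" shares its first character with "*".
const OPERATORS = ["*", "**", "+", "-", "/", "(", ")"];
// Try longer operators first so "**" wins over "*".
const byLongestFirst = [...OPERATORS].sort((a, b) => b.length - a.length);

function tokenize(input: string): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < input.length) {
    const rest = input.slice(i);
    const space = /^\s+/.exec(rest);
    if (space) {
      i += space[0].length; // skip whitespace
      continue;
    }
    const operator = byLongestFirst.find((op) => rest.startsWith(op));
    if (operator) {
      tokens.push(operator);
      i += operator.length;
      continue;
    }
    // Anything else is read as a value (number or identifier).
    const word = /^[A-Za-z0-9_.]+/.exec(rest);
    if (!word) throw new Error(`unexpected character ${rest[0]} at position ${i}`);
    tokens.push(word[0]);
    i += word[0].length;
  }
  return tokens;
}

console.log(tokenize("2**3*4")); // prints [ '2', '**', '3', '*', '4' ]
```

With tokens produced this way, ** could be registered as an infix of its own without being mistaken for two multiplications.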

How can such right-associative operators be implemented? It requires changing only one line: in the infixParselet, pass a slightly smaller ctxPrecedence when recursively calling parseExp . Here is our utility function for registering an infix:

function helpCreateInfixOperator(
  infix: string,
  precedence: number,
  associateRight2Left: boolean = false
) {
  infixParselets[infix] = {
    precedence,
    handle(left, token, { parseExp }) {
      const right = parseExp(associateRight2Left ? precedence - 1 : precedence);
      return {
        type: "binary",
        operator: infix,
        left,
        right,
      };
    },
  };
}

In this way, the "attraction" of the recursive parseExp call is slightly weaker. When it encounters an operator of the same precedence, that operator still passes the precedence check, so the operator on the right binds more tightly and is "sucked in" as well (instead of being separated).
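
The effect of passing precedence - 1 can be checked with a compact, table-driven sketch. It handles only values and infix operators. The names operators, parseInfixOnly and group are hypothetical, and the precedence values 10 and 140 are assumptions chosen for the example (only 120 appears in this article).

```typescript
// A tree is either a value token or a tuple [operator, left, right].
type Tree = string | [string, Tree, Tree];

// Hypothetical operator table; 10 and 140 are assumed precedence values.
const operators: Record<string, { precedence: number; rightAssoc: boolean }> = {
  "=": { precedence: 10, rightAssoc: true },
  "-": { precedence: 120, rightAssoc: false },
  "^": { precedence: 140, rightAssoc: true },
};

function parseInfixOnly(tokens: string[]): Tree {
  let pos = 0;
  function parseExp(ctxPrecedence: number): Tree {
    const first: string | undefined = tokens[pos++];
    if (first === undefined) throw new Error("expected a token but found none");
    let left: Tree = first;
    while (true) {
      const token: string | undefined = tokens[pos];
      if (token === undefined) break;
      const op = operators[token];
      if (!op || op.precedence <= ctxPrecedence) break;
      pos++;
      // Right-associative: recurse with a slightly weaker context, so an
      // operator of the same precedence further right is parsed inside the recursion.
      const right = parseExp(op.rightAssoc ? op.precedence - 1 : op.precedence);
      left = [token, left, right];
    }
    return left;
  }
  return parseExp(0);
}

// Render a tree with explicit parentheses.
const group = (t: Tree): string =>
  typeof t === "string" ? t : `(${group(t[1])}${t[0]}${group(t[2])})`;

console.log(group(parseInfixOnly(["a", "^", "b", "^", "c"]))); // prints "(a^(b^c))"
console.log(group(parseInfixOnly(["1", "-", "2", "-", "3"]))); // prints "((1-2)-3)"
```

For ^ the inner call runs with ctxPrecedence 139, so the second ^ (140) is consumed inside the recursion. For - the inner call runs with 120, so the second - (120) fails the check and is left to the outer loop.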

Complete implementation

The full implementation is available in the GitHub repository. It includes tests (100% coverage) and more operator implementations (such as parentheses, function calls, the conditional operator ... ? ... : ... , the right-associative power operator ^ , etc.).

csRyan