Writing a Parser by Hand: Pratt Parsing, with Simple Code and Powerful Features

In the compilation process, a very important step is syntax analysis (also known as parsing). The parser is responsible for converting the token stream into an abstract syntax tree (AST). This article introduces a parser implementation algorithm: Pratt Parsing, also known as Top Down Operator Precedence Parsing.

Pratt Parsing is very simple to implement. Take a look at the TypeScript implementation: the core code is less than 40 lines!

Application background

There are generally two ways to implement a parser:

  • Using a parser generator
  • Implementing it by hand

Parser generator

With a parser generator, you describe your grammar in a DSL (such as BNF) and feed the description file to the generator, which outputs code that parses the grammar.

This method is very convenient and sufficient for most needs. But in some cases it is not flexible enough (for example, it cannot provide more useful, context-aware error messages), its performance is not good enough, or the generated code is too long. Also, when describing the operator precedence and associativity of expressions, the grammar description can become very complex and hard to read, as in this Wikipedia example:

expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary

You need to create a rule for each precedence level, which makes the grammar description of expressions very complicated.

So sometimes you need to use the second way: manual implementation.

Manual implementation

Recursive descent algorithm

A common way to implement a parser by hand is the recursive descent algorithm. Recursive descent is good at parsing statements, because language designers intentionally place the identifier of the statement type at the beginning of the statement, as in if (expression) ... and while (expression) ... . Thanks to this, once the parser has identified the statement type from its first token, it knows which structures need to be parsed in order and can simply call the corresponding parsing functions one after another, so the implementation is very simple.

However, because the recursive descent algorithm needs to understand the code structure top-down, it has great difficulty with expressions. When the parser reads the beginning of an expression, it cannot know which kind of expression it is in, because the operator is often in the middle (or even at the end) of the expression, such as + for addition and () for function calls. To parse expressions top-down, you need to treat each operator precedence as a level, write a parsing function for each level, and handle associativity manually, so the parsing functions become numerous and complex.

For example, in the Wikipedia example, expression handles addition and subtraction, term handles multiplication and division, and the former calls the latter (term sits one level lower). It is easy to see that the more precedence levels there are, the more complex the code becomes and the deeper the recursive calls go. Even if the input string is simply 1, this parser needs to recursively call the following parsing functions: program -> block -> statement -> expression -> term -> factor. The last two levels of calls should have been avoidable, because the input contains no addition, subtraction, multiplication, or division at all!
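To make the "one function per precedence level" cost concrete, here is a minimal sketch of a recursive descent expression parser. It is only an illustration: it evaluates the expression instead of building an AST, the function name parseWithRecursiveDescent and the calls log are hypothetical, and it covers only the expression, term and factor levels.

```typescript
// Hypothetical mini recursive descent parser: one function per precedence level.
// It evaluates directly and records which parsing functions were entered.
function parseWithRecursiveDescent(input: string): { value: number; calls: string[] } {
  const tokens: string[] = input.match(/\d+|[-+*/()]/g) ?? [];
  const calls: string[] = [];
  let pos = 0;

  // expression ::= term ( ('+' | '-') term )*
  function expression(): number {
    calls.push("expression");
    let value = term();
    while (tokens[pos] === "+" || tokens[pos] === "-") {
      const op = tokens[pos++];
      const rhs = term();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  // term ::= factor ( ('*' | '/') factor )*
  function term(): number {
    calls.push("term");
    let value = factor();
    while (tokens[pos] === "*" || tokens[pos] === "/") {
      const op = tokens[pos++];
      const rhs = factor();
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  // factor ::= NUMBER | '(' expression ')'
  function factor(): number {
    calls.push("factor");
    const token = tokens[pos++];
    if (token === undefined) throw new Error("unexpected end of input");
    if (token === "(") {
      const value = expression();
      if (tokens[pos++] !== ")") throw new Error("expected )");
      return value;
    }
    return Number(token);
  }

  const value = expression();
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return { value, calls };
}
```

Even for the input 1, the calls log contains expression, term and factor in that order, and every additional precedence level would add one more function to this chain.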

Therefore, when implementing a parser by hand, expression parsing is generally handed over to another algorithm to avoid this weakness of recursive descent. Pratt Parsing is one such algorithm that is good at parsing expressions.

Pratt Parsing

Pratt Parsing, also known as Top Down Operator Precedence Parsing, is a very ingenious algorithm. It is simple to implement, performs well, and is easy to customize and extend. It is especially good at parsing expressions and at handling operator precedence and associativity.

Algorithm introduction

Concept introduction

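Pratt Parsing divides tokens into 2 types, prefix and infix. The minus sign - illustrates both: as a prefix it builds its node without looking at the expression on its left, while as an infix it needs the already parsed expression on its left: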

// Prefix: the minus sign of a negative number.
// It does not need to know the expression on its left.
{
  type: "unary",
  operator: "-",
  body: rightExpression,
}
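// Infix: the subtraction operator.
// The expression on its left must be parsed first (yielding leftExpression) before the subtraction node can be built.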
{
  type: "binary",
  operator: "-",
  left: leftExpression,
  right: rightExpression,
}
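The node shapes above can be written down as a TypeScript union. This is a sketch under assumptions: the names AstNode and renderNode are hypothetical (the article's own code calls the type Node), and the value leaf is the one that appears later in the article.

```typescript
// Assumed node shapes, inferred from the object literals in the article.
interface ValueNode { type: "value"; value: string }
interface UnaryNode { type: "unary"; operator: string; body: AstNode }
interface BinaryNode { type: "binary"; operator: string; left: AstNode; right: AstNode }
type AstNode = ValueNode | UnaryNode | BinaryNode;

// Renders a tree with explicit parentheses so that the grouping is visible.
function renderNode(node: AstNode): string {
  if (node.type === "value") return node.value;
  if (node.type === "unary") return `(${node.operator}${renderNode(node.body)})`;
  return `(${renderNode(node.left)}${node.operator}${renderNode(node.right)})`;
}

// -1 as a prefix node, then (-1) - 2 as an infix node that reuses it as `left`.
const minusOne: AstNode = { type: "unary", operator: "-", body: { type: "value", value: "1" } };
const difference: AstNode = {
  type: "binary",
  operator: "-",
  left: minusOne,
  right: { type: "value", value: "2" },
};
```

Here renderNode(difference) yields ((-1)-2): the same character - appears once as a unary node and once as a binary node.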

Note that although - can be both a prefix and an infix, when you read the input from left to right you can always tell immediately whether the - you encounter should be treated as a prefix or an infix, so there is no risk of confusion (e.g. -1-2). This will become clear once you understand the algorithm below.

Code explanation

The core of the Pratt Parsing algorithm is the parseExp function:

/*  1 */ function parseExp(ctxPrecedence: number): Node {
/*  2 */   let prefixToken = scanner.consume();
/*  3 */   if (!prefixToken) throw new Error(`expect token but found none`);
/*  4 */ 
/*  5 */   // because our scanner is so naive,
/*  6 */   // we treat all non-operator tokens as values (e.g. numbers)
/*  7 */   const prefixParselet =
/*  8 */     prefixParselets[prefixToken] ?? prefixParselets.__value;
/*  9 */   let left: Node = prefixParselet.handle(prefixToken, parser);
/* 10 */
/* 11 */   while (true) {
/* 12 */     const infixToken = scanner.peek();
/* 13 */     if (!infixToken) break;
/* 14 */     const infixParselet = infixParselets[infixToken];
/* 15 */     if (!infixParselet) break;
/* 16 */     if (infixParselet.precedence <= ctxPrecedence) break;
/* 17 */     scanner.consume();
/* 18 */     left = infixParselet.handle(left, infixToken, parser);
/* 19 */   }
/* 20 */   return left;
/* 21 */ }

Below we explain, line by line, how this algorithm works.

Lines 2 to 10: Parse the prefix

First, the function eats a token from the token stream. This token must be treated as a prefix (for example, if - is encountered here, it is understood as a prefix).

Note: consume eats the next token (removes it from the stream and returns it), while peek only glances at it (returns it without removing it).
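The difference between the two operations can be sketched with a tiny scanner. The class name NaiveScanner and its tokenization rule are assumptions made for illustration; the scanner in the article's repository may be written differently.

```typescript
// A hypothetical scanner that mirrors the article's scanner.consume() and scanner.peek().
class NaiveScanner {
  private readonly tokens: string[];
  private position = 0;

  constructor(input: string) {
    // Tokens are numbers, identifiers, or any single non-space character.
    this.tokens = input.match(/\d+|[A-Za-z_]\w*|\S/g) ?? [];
  }

  // Returns the next token without advancing; undefined at the end of input.
  peek(): string | undefined {
    return this.tokens[this.position];
  }

  // Returns the next token and advances past it; undefined at the end of input.
  consume(): string | undefined {
    const token = this.tokens[this.position];
    if (token !== undefined) this.position++;
    return token;
  }
}
```

Calling peek twice in a row returns the same token, while each consume moves on to the next one. That is why parseExp can peek at an infix, decide not to handle it, and leave it in place for its caller.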

On lines 7 to 9, we find the expression builder (prefixParselet) corresponding to this prefix and call it. The role of a prefixParselet is to construct the expression node centered on this prefix.

Let's start with a simple case: suppose the first token is 123. It triggers the default prefixParselet (prefixParselets.__value), which directly returns a value node:

{
  type: "value",
  value: "123",
}

This is the value assigned to left on line 9 (the expression node built so far).

In more complex cases, the prefixParselet recursively calls parseExp. For example, the prefixParselet of the minus sign - is registered like this:

// The precedence of the minus prefix is set to 150; its role is explained later
prefixParselets["-"] = {
  handle(token, parser) {
    const body = parser.parseExp(150);
    return {
      type: "unary",
      operator: "-",
      body,
    };
  },
};

It will recursively call parseExp and parse the expression node on the right as its own body.

Note that it doesn't care what the expression on its left is, which is the fundamental characteristic of prefix.

Here, the argument 150 passed in the recursive call parseExp(150) can be understood as the binding strength of the prefix to the sub-expression on its right. For example, when parsing -1+2, the body that the prefix - obtains from parseExp(150) is 1 rather than 1+2, thanks to the argument 150. The precedence mechanism is described in detail later.

Lines 11 to 19: Parse the infix

After obtaining the expression node of the prefix, we enter a while loop, which is responsible for parsing the subsequent infix operations. For example, in -1 + 2 + 3 + 4, the three plus signs are all parsed in this loop.

The loop first peeks at a token from the token stream, treats it as an infix, finds its corresponding expression builder (infixParselet), and calls infixParselet.handle to get a new expression node. Note that the current left is passed to the infixParselet, because an infix needs the expression node on its left to construct itself. The new expression node is then assigned to left again, so left keeps growing into a larger tree of nodes.

For example, the infixParselet of - is registered like this:

// The precedence of addition and subtraction is defined as 120
infixParselets["-"] = {
  precedence: 120,
  handle(left, token, parser) {
    const right = parser.parseExp(120);
    return {
      type: "binary",
      operator: "-",
      left,
      right,
    };
  },
};

Similar to prefixParselet, it also recursively calls parseExp to parse the expression node on the right. The difference is that it also has a readable precedence property itself, and that it uses the left parameter when building the expression node.

Moving on: understanding the three checks on lines 13 to 16 is the key to understanding the whole algorithm.

The first check, if (!infixToken) break;, is easy to understand: the end of the input has been reached, so parsing naturally ends.

The second check, if (!infixParselet) break;, is also fairly easy to understand: a token that is not an infix operator has been encountered, either because of a syntax error in the input or because the token is ) or ; . In that case the expression node parsed so far is returned to the caller to deal with.

The third check, if (infixParselet.precedence <= ctxPrecedence) break;, is the core of the whole algorithm. The ctxPrecedence parameter of parseExp mentioned above exists for this line. Its purpose is to restrict this parseExp call to parsing only infix operators whose precedence is greater than ctxPrecedence. If the precedence of the infix encountered is less than or equal to ctxPrecedence, parsing stops and the current result is returned to the caller, which then handles the remaining tokens. The initial value of ctxPrecedence is 0, which means that all operations are parsed until the end of the input (or an unrecognized operator) is reached.

For example, in the earlier -1+2 example, the prefixParselet of the prefix - recursively calls parseExp(150). Inside that recursive call, ctxPrecedence is 150, which is greater than the precedence of the infix + (120), so the recursive call stops when it encounters + . This makes the prefix - bind to 1 rather than to 1+2, which yields the correct result (-(1))+2.

infixParselets pass this parameter in the same way when they recursively call parseExp.

You can picture a prefixParselet or infixParselet that recursively calls parseExp as using a "magnet" to attract the tokens that follow, with the argument ctxPrecedence representing the "attraction" of the magnet. A subsequent infix is "sucked in" only if it binds tightly enough to the token on its left (that is, its infixParselet.precedence is large enough). Otherwise the infix is "separated" from the token on its left: that token takes part in building the expression node in this parseExp call, but the infix does not.
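The pieces explained so far can be assembled into one small runnable sketch. It follows the structure of the article's parseExp, but several details are assumptions made to keep it self-contained: the names AstNode, parseExpression and renderAst, the regex tokenizer, closures instead of a parser object, and the value 130 for multiplication and division.

```typescript
type AstNode =
  | { type: "value"; value: string }
  | { type: "unary"; operator: string; body: AstNode }
  | { type: "binary"; operator: string; left: AstNode; right: AstNode };

interface PrefixParselet { handle(token: string): AstNode }
interface InfixParselet { precedence: number; handle(left: AstNode, token: string): AstNode }

function parseExpression(input: string): AstNode {
  const tokens: string[] = input.match(/\d+|\S/g) ?? [];
  let pos = 0;

  // Default prefix parselet: every non-operator token becomes a value node.
  const valueParselet: PrefixParselet = {
    handle: (token) => ({ type: "value", value: token }),
  };
  const prefixParselets: { [token: string]: PrefixParselet | undefined } = {
    "-": { handle: () => ({ type: "unary", operator: "-", body: parseExp(150) }) },
  };
  // Left-associative binary operator: recurse with its own precedence.
  const binary = (operator: string, precedence: number): InfixParselet => ({
    precedence,
    handle: (left) => ({ type: "binary", operator, left, right: parseExp(precedence) }),
  });
  const infixParselets: { [token: string]: InfixParselet | undefined } = {
    "+": binary("+", 120),
    "-": binary("-", 120),
    "*": binary("*", 130),
    "/": binary("/", 130),
  };

  function parseExp(ctxPrecedence: number): AstNode {
    const prefixToken = tokens[pos++];
    if (prefixToken === undefined) throw new Error("expect token but found none");
    let left = (prefixParselets[prefixToken] ?? valueParselet).handle(prefixToken);
    while (true) {
      const infixToken = tokens[pos];
      if (infixToken === undefined) break;                    // end of input
      const infixParselet = infixParselets[infixToken];
      if (!infixParselet) break;                              // not an infix operator
      if (infixParselet.precedence <= ctxPrecedence) break;   // binds too weakly
      pos++;
      left = infixParselet.handle(left, infixToken);
    }
    return left;
  }

  return parseExp(0);
}

// Renders the tree with explicit parentheses to make the grouping visible.
function renderAst(node: AstNode): string {
  if (node.type === "value") return node.value;
  if (node.type === "unary") return `(${node.operator}${renderAst(node.body)})`;
  return `(${renderAst(node.left)}${node.operator}${renderAst(node.right)})`;
}
```

With this sketch, renderAst(parseExpression("-1+2")) gives ((-1)+2): the recursive parseExp(150) started by the prefix - stops at + because 120 <= 150.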

Algorithm Summary

To sum up, Pratt Parsing is an algorithm that combines looping and recursion. The execution structure of parseExp is roughly as follows:

  • Eat a token as the prefix, call its prefixParselet, and get left (the expression node built so far)

    • The prefixParselet may recursively call parseExp to parse the parts it needs and build its expression node
  • while loop

    • Peek at the next token as an infix and continue only if its precedence is high enough; otherwise, break out of the loop
    • Eat the infix token, call its infixParselet, and pass left to it

      • The infixParselet recursively calls parseExp to parse the parts it needs and builds its expression node
    • Get the new left
  • return left

Now you should be able to understand the earlier claim that, reading the input from left to right, you can immediately tell whether a - should be treated as a prefix or an infix, with no risk of confusion (such as in -1-2): before reading the next token, the algorithm already knows whether that token is expected to be a prefix or an infix!

The subtlety of Pratt Parsing is that, after seeing the first atomic expression, it can construct the corresponding node right away, without needing to know where that node sits in a higher-level expression structure. If further scanning reveals that the expression on the left belongs to an infix, it is handed to that infix's handler to build a higher-level expression. In other words, Pratt Parsing builds from the leaf nodes of the expression tree and places them into the appropriate context (a higher-level expression structure) based on what it scans next. That is the fundamental reason why it is so good at handling expressions.

Contrast this with the recursive descent algorithm mentioned earlier, which requires a top-down understanding of the expression structure: program -> block -> statement -> expression -> term -> factor .

Example execution process

Now let's use 1 + 2 * 3 - 4 as an example to see how the Pratt Parsing algorithm runs:

  • First define the precedence of each infix (i.e. infixParselet.precedence): for example, 120 for addition and subtraction and 130 for multiplication and division (multiplication and division have a higher "binding strength")
  • Initially call parseExp(0), i.e. ctxPrecedence=0

    • Eat the token 1, call its prefixParselet, get the expression node 1, and assign it to left
    • Enter the while loop and peek at + ; its infixParselet has precedence 120, which is greater than ctxPrecedence, so this infix is also "sucked in"
    • Eat + and call the infixParselet.handle of + ; at this point left is 1

      • The infixParselet.handle of + recursively calls parser.parseExp(120), i.e. ctxPrecedence=120
      • Eat the token 2, call its prefixParselet, get the expression node 2, and assign it to left
      • Enter the while loop and peek at * ; its infixParselet has precedence 130, which is greater than ctxPrecedence, so this infix is also "sucked in"
      • Eat * and call the infixParselet.handle of * ; at this point left is 2

        • The infixParselet.handle of * recursively calls parser.parseExp(130), i.e. ctxPrecedence=130
        • Eat the token 3, call its prefixParselet, get the expression node 3, and assign it to left
        • Enter the while loop and peek at - ; its infixParselet has precedence 120, which is not greater than ctxPrecedence, so this infix is not sucked in and the while loop ends
        • parser.parseExp(130) returns 3
      • The infixParselet.handle of * returns 2 * 3 (built from left and the return value of parser.parseExp), which is assigned to left
      • Continue the while loop and peek at - ; its infixParselet has precedence 120, which is not greater than ctxPrecedence, so this infix is not sucked in and the while loop ends
      • parser.parseExp(120) returns the subexpression 2 * 3
    • The infixParselet.handle of + returns 1+(2*3) (built from left and the return value of parser.parseExp), which is assigned to left
    • Continue the while loop and peek at - ; its infixParselet has precedence 120, which is greater than ctxPrecedence, so this infix is also "sucked in"
    • Eat - and call the infixParselet.handle of - ; at this point left is 1+(2*3)
    • In the same way as before, the infixParselet.handle of - returns (1+(2*3))-4 (built from left and the return value of parser.parseExp), which is assigned to left
    • The while loop continues, finds that no tokens remain, exits, and returns left
  • parseExp(0) returns (1+(2*3))-4
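The walkthrough above can be reproduced mechanically with a small tracing sketch. It is deliberately simplified and not the article's implementation: it has no prefix operators, it builds a parenthesized string instead of AST nodes, and the names PRECEDENCE and traceParse are hypothetical.

```typescript
// Assumed precedence table, matching the values used in the walkthrough.
const PRECEDENCE: { [op: string]: number | undefined } = { "+": 120, "-": 120, "*": 130, "/": 130 };

function traceParse(input: string): { result: string; log: string[] } {
  const tokens: string[] = input.match(/\d+|\S/g) ?? [];
  const log: string[] = [];
  let pos = 0;

  function parseExp(ctxPrecedence: number, depth: number): string {
    const indent = new Array(depth + 1).join("  "); // two spaces per recursion level
    log.push(`${indent}parseExp(${ctxPrecedence})`);
    const first = tokens[pos++];
    if (first === undefined) throw new Error("expect token but found none");
    let left = first; // every non-operator token is treated as a value
    while (true) {
      const infixToken = tokens[pos];
      if (infixToken === undefined) break;
      const precedence = PRECEDENCE[infixToken];
      if (precedence === undefined) break;
      if (precedence <= ctxPrecedence) {
        log.push(`${indent}  stop at ${infixToken} (${precedence} <= ${ctxPrecedence})`);
        break;
      }
      pos++; // eat the infix token
      const right = parseExp(precedence, depth + 1);
      left = `(${left}${infixToken}${right})`;
    }
    log.push(`${indent}returns ${left}`);
    return left;
  }

  return { result: parseExp(0, 0), log };
}

const traced = traceParse("1 + 2 * 3 - 4");
console.log(traced.log.join("\n"));
console.log(traced.result);
```

For this input the result is ((1+(2*3))-4), and the indented log shows parseExp(130) nested inside parseExp(120) nested inside parseExp(0), with both inner calls stopping at the - token.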

How to deal with associativity

An operator's associativity determines, when an expression contains several consecutive operators of the same precedence, whether the operator on the left binds first (left-associative) or the operator on the right binds first (right-associative).

According to the algorithm described above, 1+1+1+1 is left-associative, that is, it resolves to ((1+1)+1)+1 , which is what we expected.

However, some operators are right-associative, such as the assignment symbol = (e.g. a = b = 1 should be parsed as a = (b = 1)) and the exponentiation symbol ^ (e.g. a^b^c should be parsed as a^(b^c)).

Here we use ^ as the exponentiation symbol instead of ** as in JavaScript, to avoid having one operator that is a prefix of another operator, which triggers a defect in the current implementation: on encountering **, the first character is eagerly recognized as multiplication. This bug is actually easy to fix; would you like to try submitting a PR?
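One possible direction for such a fix is to make the tokenizer prefer the longest operator. The sketch below is only an assumption about how that could look: the function tokenizeWithPowerOperator is hypothetical, and the scanner in the repository may be structured differently.

```typescript
// Hypothetical tokenizer: multi-character operators are matched before single characters.
function tokenizeWithPowerOperator(input: string): string[] {
  // Regex alternatives are tried left to right, so "\*\*" must come before the
  // single-character fallback "\S"; otherwise "**" would split into "*" and "*".
  return input.match(/\d+|[A-Za-z_]\w*|\*\*|\S/g) ?? [];
}

console.log(tokenizeWithPowerOperator("a ** b * 2")); // [ 'a', '**', 'b', '*', '2' ]
```

With ** arriving as a single token, it could then be registered as an infix in the same way as ^ .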

How do we implement right-associative operators? The answer takes only one line: in the infixParselet, pass a slightly smaller ctxPrecedence when recursively calling parseExp. Here is our utility function for registering an infix:

function helpCreateInfixOperator(
  infix: string,
  precedence: number,
  associateRight2Left: boolean = false
) {
  infixParselets[infix] = {
    precedence,
    handle(left, token, { parseExp }) {
      const right = parseExp(associateRight2Left ? precedence - 1 : precedence);
      return {
        type: "binary",
        operator: infix,
        left,
        right,
      };
    },
  };
}

This way, the "suction" of the recursive parseExp call is weaker. When it encounters an operator of the same precedence, that operator on the right binds more tightly and is therefore also "sucked in" (rather than separated).
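The effect of the precedence - 1 trick can be checked with a small self-contained sketch. It is not the repository's code: it builds a parenthesized string instead of AST nodes, the names INFIX_RULES, lookupRule and groupExpression are hypothetical, and the precedence values 10 for = and 140 for ^ are assumptions chosen only for this demonstration.

```typescript
interface InfixRule { precedence: number; rightAssociative: boolean }

// Assumed operator table; only the 120 for + comes from the article.
const INFIX_RULES: { [operator: string]: InfixRule | undefined } = {
  "=": { precedence: 10, rightAssociative: true },
  "+": { precedence: 120, rightAssociative: false },
  "^": { precedence: 140, rightAssociative: true },
};

// Own-property lookup, so identifiers such as "constructor" are not mistaken for operators.
function lookupRule(operator: string): InfixRule | undefined {
  return Object.prototype.hasOwnProperty.call(INFIX_RULES, operator)
    ? INFIX_RULES[operator]
    : undefined;
}

function groupExpression(input: string): string {
  const tokens: string[] = input.match(/\d+|[A-Za-z_]\w*|\S/g) ?? [];
  let pos = 0;

  function parseExp(ctxPrecedence: number): string {
    const first = tokens[pos++];
    if (first === undefined) throw new Error("expect token but found none");
    let left = first;
    while (true) {
      const operator = tokens[pos];
      if (operator === undefined) break;
      const rule = lookupRule(operator);
      if (rule === undefined) break;
      if (rule.precedence <= ctxPrecedence) break;
      pos++;
      // Right-associative operators recurse with precedence - 1, so an operator
      // of the same precedence further right is still accepted by the inner call.
      const right = parseExp(rule.rightAssociative ? rule.precedence - 1 : rule.precedence);
      left = `(${left}${operator}${right})`;
    }
    return left;
  }

  return parseExp(0);
}
```

Under these assumptions, groupExpression("1+1+1+1") gives (((1+1)+1)+1), while groupExpression("a^b^c") gives (a^(b^c)) and groupExpression("a=b=1") gives (a=(b=1)).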

Complete implementation

The full implementation is available in the GitHub repository. It includes tests (100% coverage) and more operator implementations (such as parentheses, function calls, the conditional operator ...?...:..., the right-associative power operator ^, and so on).
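As an illustration of how one of these extra operators can fit into the same scheme, here is a sketch of parentheses handled as a prefix. It is an assumption about one possible approach, not the repository's code: the names BINARY_PRECEDENCE and parseWithParens are hypothetical, and the result is a parenthesized string rather than AST nodes.

```typescript
// Assumed precedence table, reusing the values from the article.
const BINARY_PRECEDENCE: { [op: string]: number | undefined } = { "+": 120, "-": 120, "*": 130, "/": 130 };

function parseWithParens(input: string): string {
  const tokens: string[] = input.match(/\d+|\S/g) ?? [];
  let pos = 0;

  function parsePrefix(token: string): string {
    if (token === "(") {
      // Restart at precedence 0 so the group can contain any expression.
      // The inner loop stops at ")" because ")" is not an infix operator.
      const inner = parseExp(0);
      if (tokens[pos++] !== ")") throw new Error("expect )");
      return inner;
    }
    if (token === "-") return `(-${parseExp(150)})`;
    return token; // naive: every other token is treated as a value
  }

  function parseExp(ctxPrecedence: number): string {
    const prefixToken = tokens[pos++];
    if (prefixToken === undefined) throw new Error("expect token but found none");
    let left = parsePrefix(prefixToken);
    while (true) {
      const infixToken = tokens[pos];
      if (infixToken === undefined) break;
      const precedence = BINARY_PRECEDENCE[infixToken];
      if (precedence === undefined) break;
      if (precedence <= ctxPrecedence) break;
      pos++;
      left = `(${left}${infixToken}${parseExp(precedence)})`;
    }
    return left;
  }

  const result = parseExp(0);
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return result;
}
```

In this sketch, parseWithParens("(1+2)*3") gives ((1+2)*3), and an unclosed group such as 1*(2+3 is rejected with an error.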
