In the compilation process, a very important step is syntax analysis (also known as parsing). The parser is responsible for converting the token stream into an abstract syntax tree (AST). This article introduces one algorithm for implementing a parser: Pratt Parsing, also known as Top Down Operator Precedence Parsing.
Pratt Parsing is very simple to implement. You can take a look at the TypeScript implementation: the core code is less than 40 lines!
Application background
There are generally two ways to implement a parser:
- Use a parser generator
- Implement it by hand
Parser generator
Describe your grammar in a DSL (such as BNF) and feed the description file to a parser generator, which outputs code that parses the grammar.
This method is very convenient and sufficient for most needs. But in some cases it is not flexible enough (for example, it cannot provide more useful, contextual error messages), its performance is not good enough, and the generated code is long. Also, when describing the operator precedence and associativity of expressions, the grammar description can become very complex and difficult to read, as in this Wikipedia example:
expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary
You need to create a rule for each level of precedence, which makes the grammar description of expressions very complicated.
So sometimes you need the second way: manual implementation.
Manual implementation
Recursive descent algorithm
A common way to implement a parser by hand is the recursive descent algorithm. Recursive descent is good at parsing statements, because language designers intentionally put the identifying keyword at the beginning of each statement, as in if (expression) ... and while (expression) .... Thanks to this, once the parser has identified the statement type from its first token, it knows which structures to parse next and can call the corresponding parsing functions in turn, so the implementation is very simple.
However, since recursive descent needs to understand the code structure top-down, it has a hard time with expressions. When the parser reads the beginning of an expression, it cannot know what kind of expression it is in, because the operator is often in the middle (or even at the end) of the expression, such as + for addition and () for function calls. To parse expressions top-down, you need to treat each operator precedence as a level, write a parsing function for each level, and handle associativity manually, so the parsing functions become more and more complex.
Therefore, when implementing a parser by hand, expression parsing is generally handed over to another algorithm to avoid this weakness of recursive descent. Pratt Parsing is one such algorithm that is good at parsing expressions.
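To make the cost of the recursive descent approach concrete, here is a minimal sketch of an expression parser for the grammar shown earlier, with one function per precedence level. The names parseWithRecursiveDescent, binaryLevel and AstNode are made up for this illustration and are not taken from any repository; tokens are assumed to be pre-split strings.

```typescript
// Hypothetical node shape used only for this sketch.
type AstNode =
  | { type: "value"; value: string }
  | { type: "unary"; operator: string; body: AstNode }
  | { type: "binary"; operator: string; left: AstNode; right: AstNode };

function parseWithRecursiveDescent(tokens: string[]): AstNode {
  let pos = 0;
  const peek = (): string | undefined => tokens[pos];
  const consume = (): string | undefined => tokens[pos++];

  // Shared loop for `operand (op operand)*`, built left-associatively.
  function binaryLevel(ops: string[], operand: () => AstNode): AstNode {
    let left = operand();
    let op = peek();
    while (op !== undefined && ops.indexOf(op) !== -1) {
      consume();
      left = { type: "binary", operator: op, left, right: operand() };
      op = peek();
    }
    return left;
  }

  // One function per precedence level, lowest precedence first.
  function parseEquality(): AstNode {
    return binaryLevel(["==", "!="], parseAdditive);
  }
  function parseAdditive(): AstNode {
    return binaryLevel(["+", "-"], parseMultiplicative);
  }
  function parseMultiplicative(): AstNode {
    return binaryLevel(["*", "/"], parsePrimary);
  }
  function parsePrimary(): AstNode {
    const token = consume();
    if (token === undefined) throw new Error("unexpected end of input");
    if (token === "-") {
      return { type: "unary", operator: "-", body: parsePrimary() };
    }
    if (token === "(") {
      const inner = parseEquality();
      if (consume() !== ")") throw new Error("expected )");
      return inner;
    }
    return { type: "value", value: token };
  }

  const root = parseEquality();
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return root;
}
```

Even in this small sketch, adding a new precedence level means writing another function and rewiring the chain between its neighbours, which is the complexity the paragraph above refers to.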
Pratt Parsing
Pratt Parsing, also known as Top Down Operator Precedence Parsing, is a very ingenious algorithm. It is simple to implement, performs well, and is easy to customize and extend. It is especially good at parsing expressions and at handling operator precedence and associativity.
Algorithm introduction
Concept introduction
Pratt Parsing divides tokens into 2 types:
- prefix (the formal term is nud). A token is a "prefix" if it can appear at the very beginning of an expression, for example 123, (, or the - of a negative number. When constructing an expression node centered on such a token, you do not need to know the expression to its left. Prefix tokens build expression nodes like this:
// The minus-sign prefix of a negative number
// It does not need to know the expression to its left
{
type: "unary",
operator: "-",
body: rightExpression,
}
- infix (the formal term is led). A token is an "infix" if it must know its left-hand subexpression when constructing an expression node. This means an infix cannot appear at the beginning of any expression. Examples are the addition, subtraction, multiplication and division operators. They build expression nodes like this:
// The subtraction operator
// Its left-hand expression must be parsed first to obtain leftExpression before the subtraction node can be built
{
type: "binary",
operator: "-",
left: leftExpression,
right: rightExpression,
}
Note that although - can be both a prefix and an infix, when you read the input from left to right you can always tell immediately whether the - you encounter is a prefix or an infix, so there is no need to worry about confusion (e.g. -1-2). This will become clearer once you understand the algorithm below.
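The two node shapes above, plus the value node used later in the article, can be written down as a TypeScript union type. This is only a sketch: the type is named AstNode here (the article's code calls it Node, which can clash with the DOM's global Node type in some setups), and the show helper is a hypothetical addition for printing trees with explicit parentheses.

```typescript
// Node shapes inferred from the object literals in this article.
type AstNode =
  | { type: "value"; value: string }
  | { type: "unary"; operator: string; body: AstNode }
  | { type: "binary"; operator: string; left: AstNode; right: AstNode };

// Hypothetical helper: render a tree with explicit parentheses.
function show(node: AstNode): string {
  switch (node.type) {
    case "value":
      return node.value;
    case "unary":
      return `(${node.operator}${show(node.body)})`;
    case "binary":
      return `(${show(node.left)}${node.operator}${show(node.right)})`;
  }
}

// The tree for -1-2: the first "-" is a prefix, the second an infix.
const minusExample: AstNode = {
  type: "binary",
  operator: "-",
  left: { type: "unary", operator: "-", body: { type: "value", value: "1" } },
  right: { type: "value", value: "2" },
};

console.log(show(minusExample)); // prints "((-1)-2)"
```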
Code explanation
The core implementation of the Pratt Parsing algorithm is the parseExp function:
/* 1 */ function parseExp(ctxPrecedence: number): Node {
/* 2 */ let prefixToken = scanner.consume();
/* 3 */ if (!prefixToken) throw new Error(`expect token but found none`);
/* 4 */
/* 5 */ // because our scanner is so naive,
/* 6 */ // we treat all non-operator tokens as values (e.g. numbers)
/* 7 */ const prefixParselet =
/* 8 */ prefixParselets[prefixToken] ?? prefixParselets.__value;
/* 9 */ let left: Node = prefixParselet.handle(prefixToken, parser);
/* 10 */
/* 11 */ while (true) {
/* 12 */ const infixToken = scanner.peek();
/* 13 */ if (!infixToken) break;
/* 14 */ const infixParselet = infixParselets[infixToken];
/* 15 */ if (!infixParselet) break;
/* 16 */ if (infixParselet.precedence <= ctxPrecedence) break;
/* 17 */ scanner.consume();
/* 18 */ left = infixParselet.handle(left, infixToken, parser);
/* 19 */ }
/* 20 */ return left;
/* 21 */ }
Below we explain, line by line, how this algorithm works.
Lines 2 to 10: Parse the prefix
First, the function consumes a token from the token stream. This token must be treated as a prefix (for example, if - is encountered here, it is the prefix minus).
Note that consume means "eat" a token (advance past it), while peek means "glance" at the next token without advancing.
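The difference between consume and peek can be sketched with a tiny scanner. This is an assumption about how such a scanner might look, not the article's actual scanner: tokens are plain strings produced by a regular expression, and the class name Scanner is chosen only for this illustration.

```typescript
// Hypothetical scanner: tokens are plain strings.
class Scanner {
  private readonly tokens: string[];
  private pos = 0;

  constructor(source: string) {
    // A run of letters, digits, "_" or "." becomes one token; any other
    // non-space character becomes a single-character token.
    this.tokens = source.match(/[A-Za-z0-9_.]+|\S/g) ?? [];
  }

  // Glance at the next token without advancing.
  peek(): string | undefined {
    return this.tokens[this.pos];
  }

  // Eat the next token: return it and advance past it.
  consume(): string | undefined {
    return this.tokens[this.pos++];
  }
}

const demoScanner = new Scanner("-1 + 23");
console.log(demoScanner.peek());    // "-" (still in the stream)
console.log(demoScanner.consume()); // "-" (now eaten)
console.log(demoScanner.consume()); // "1"
```

Both methods return undefined once the input is exhausted, which matches the checks for a missing token in parseExp.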
On lines 7 to 9, we find the expression builder (prefixParselet) corresponding to this prefix and call it. The role of a prefixParselet is to construct an expression node centered on this prefix.
Let's start with a simple case and assume the first token is 123. It triggers the default prefixParselet (prefixParselets.__value), which directly returns a value node:
{
type: "value",
value: "123",
}
This is the value assigned to left on line 9 (the expression node constructed so far).
In more complex cases, the prefixParselet recursively calls parseExp. For example, the prefixParselet for the minus sign - is registered like this:
// The precedence of the minus prefix is set to 150; its role is explained later
prefixParselets["-"] = {
handle(token, parser) {
const body = parser.parseExp(150);
return {
type: "unary",
operator: "-",
body,
};
},
};
It recursively calls parseExp and uses the expression node parsed on its right as its own body.
Note that it does not care what the expression to its left is, which is the fundamental characteristic of a prefix.
Here, the argument 150 passed in the recursive call parseExp(150) can be understood as the prefix's binding strength toward the sub-expression on its right. For example, when parsing -1+2, the body that the prefix - obtains by calling parseExp(150) is 1 rather than 1+2, and this is due to the argument 150. The precedence mechanism is described in detail later.
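The effect of the argument 150 can be seen by making it adjustable. The sketch below is a compact stand-in for the article's parser, under several assumptions: nodes are represented as parenthesized strings instead of objects, only + and - are registered as infixes (precedence 120), and the function name parseWithPrefixStrength and its prefixStrength parameter are hypothetical.

```typescript
// Compact sketch: the "tree" is a parenthesized string.
function parseWithPrefixStrength(source: string, prefixStrength: number): string {
  const tokens: string[] = source.match(/[0-9]+|\S/g) ?? [];
  const infixPrecedence = new Map<string, number>([
    ["+", 120],
    ["-", 120],
  ]);
  let pos = 0;

  function parseExp(ctxPrecedence: number): string {
    const token: string | undefined = tokens[pos++];
    if (token === undefined) throw new Error("expect token but found none");
    // In prefix position, "-" is the unary minus; it parses its body
    // with the adjustable binding strength.
    let left = token === "-" ? `(-${parseExp(prefixStrength)})` : token;
    while (true) {
      const infix: string | undefined = tokens[pos];
      if (infix === undefined) break;
      const precedence = infixPrecedence.get(infix);
      if (precedence === undefined || precedence <= ctxPrecedence) break;
      pos++;
      left = `(${left}${infix}${parseExp(precedence)})`;
    }
    return left;
  }

  return parseExp(0);
}

console.log(parseWithPrefixStrength("-1+2", 150)); // prints "((-1)+2)"
console.log(parseWithPrefixStrength("-1+2", 100)); // prints "(-(1+2))"
```

With a strength of 150 the recursive call stops before +, so the minus binds only to 1. With a strength below 120 the recursive call would swallow + as well, which is not the usual meaning of -1+2.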
Lines 11 to 19: Parse the infix
After obtaining the expression node for the prefix, we enter a while loop that parses the infix operations that follow. For example, in -1 + 2 + 3 + 4, the three plus signs are all parsed in this loop.
The loop first peeks at a token from the token stream, treats it as an infix, finds the corresponding expression builder (infixParselet), and calls infixParselet.handle to get a new expression node. Note that the infixParselet is called with the current left, because an infix needs the expression node to its left to construct itself. The new expression node is assigned to left again, so left keeps accumulating into a larger tree of nodes.
For example, the infixParselet for - is registered like this:
// The precedence of addition and subtraction is defined as 120
infixParselets["-"] = {
precedence: 120,
handle(left, token, parser) {
const right = parser.parseExp(120);
return {
type: "binary",
operator: "-",
left,
right,
};
},
};
Similar to a prefixParselet, it also recursively calls parseExp to parse the expression node on its right. The differences are that it has a readable precedence property of its own, and that it uses the left parameter when building the expression node.
Moving on: understanding the three checks on lines 13 to 16 is the key to understanding the whole algorithm.
The first check, if (!infixToken) break;, is easy to understand: the end of the input has been reached, so parsing naturally ends.
The second check, if (!infixParselet) break;, is also fairly easy to understand: a token that is not an infix operator has been encountered. This may be due to a syntax error in the input, or the token may be ) or ;, in which case the expression node parsed so far needs to be returned to the caller for handling.
The third check, if (infixParselet.precedence <= ctxPrecedence) break;, is the core of the whole algorithm. The ctxPrecedence parameter of parseExp mentioned above exists for this line. Its function is to restrict this parseExp call to parsing only infix operators whose precedence is greater than ctxPrecedence. If the infix encountered has a precedence less than or equal to ctxPrecedence, parsing stops and the current result is returned to the caller, which then processes the subsequent tokens. The initial value of ctxPrecedence is 0, which means that all operations should be parsed until the end of the input (or an unrecognized operator) is reached.
For example, in the earlier -1+2 example, the prefixParselet of the prefix - recursively calls parseExp(150). In that recursive call, ctxPrecedence is 150, which is greater than 120, the precedence of the infix +, so the recursive call stops as soon as it encounters +. The prefix - therefore binds to 1 rather than 1+2, and the correct result (-(1))+2 is obtained.
An infixParselet also passes this parameter when it recursively calls parseExp.
You can think of a prefixParselet or infixParselet that recursively calls parseExp as using a "magnet" to attract the tokens that follow, with the recursive argument ctxPrecedence representing the "attraction" of the magnet. A subsequent infix is "pulled in" only when it binds tightly enough to the token on its left (its infixParselet.precedence is large enough). Otherwise the infix is "separated" from the token on its left: that token takes part in building the expression node in this parseExp call, but the infix does not.
Algorithm Summary
To sum up, Pratt Parsing is a combination of looping and recursion. The execution structure of parseExp is roughly as follows:
- Consume a token as the prefix, call its prefixParselet, and get left (the expression node constructed so far)
  - The prefixParselet recursively calls parseExp to parse the parts it needs and builds an expression node
- while loop
  - Peek at a token as the infix; continue only if its precedence is high enough, otherwise break out of the loop
  - Consume the infix token, call its infixParselet, and pass left to it
    - The infixParselet recursively calls parseExp to parse the parts it needs and builds an expression node
  - Get the new left
- return left
Now you should be able to understand the earlier statement that "when you read the input from left to right, you can always tell immediately whether the - you encounter is a prefix or an infix, without worrying about confusion (such as -1-2)": before reading the next token, the algorithm already knows whether that token should be treated as a prefix or an infix!
The subtlety of Pratt Parsing is that after seeing the first atomic expression, it can construct the corresponding node directly, without needing to know where it sits in a higher-level expression structure. If further scanning reveals that the expression on the left belongs to an infix, it is handed to that infix's handler to construct a higher-level expression. In other words, Pratt Parsing builds from the leaf nodes of the expression tree and places them in the appropriate context (a higher-level expression structure) based on the results of subsequent scanning. That is the fundamental reason why it is so good at handling expressions.
Contrast this with the recursive descent algorithm mentioned earlier, which requires a top-down understanding of the expression structure: program -> block -> statement -> expression -> term -> factor.
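Putting the pieces of this section together, the following is a self-contained sketch of the algorithm that can be run as is. It follows the parseExp function shown earlier, but several details are assumptions made for this illustration: the node type is named AstNode instead of Node, parselets are stored in Map objects rather than plain objects, valueParselet stands in for prefixParselets.__value, the tokenizer is a simple regular expression, and the registerInfix, parse and show helpers are hypothetical.

```typescript
type AstNode =
  | { type: "value"; value: string }
  | { type: "unary"; operator: string; body: AstNode }
  | { type: "binary"; operator: string; left: AstNode; right: AstNode };

interface Parser {
  parseExp(ctxPrecedence: number): AstNode;
}
interface PrefixParselet {
  handle(token: string, parser: Parser): AstNode;
}
interface InfixParselet {
  precedence: number;
  handle(left: AstNode, token: string, parser: Parser): AstNode;
}

const prefixParselets = new Map<string, PrefixParselet>();
const infixParselets = new Map<string, InfixParselet>();

// Fallback for any token without a registered prefix parselet.
const valueParselet: PrefixParselet = {
  handle: (token) => ({ type: "value", value: token }),
};

// Prefix minus: binds to its right-hand side with strength 150.
prefixParselets.set("-", {
  handle: (token, parser) => ({
    type: "unary",
    operator: token,
    body: parser.parseExp(150),
  }),
});

// Left-associative binary operator with the given precedence.
function registerInfix(operator: string, precedence: number): void {
  infixParselets.set(operator, {
    precedence,
    handle: (left, token, parser) => ({
      type: "binary",
      operator: token,
      left,
      right: parser.parseExp(precedence),
    }),
  });
}
for (const op of ["+", "-"]) registerInfix(op, 120);
for (const op of ["*", "/"]) registerInfix(op, 130);

function parse(source: string): AstNode {
  const tokens: string[] = source.match(/[A-Za-z0-9_.]+|\S/g) ?? [];
  let pos = 0;
  const parser: Parser = {
    parseExp(ctxPrecedence) {
      // Prefix part: the first token is always treated as a prefix.
      const prefixToken: string | undefined = tokens[pos++];
      if (prefixToken === undefined) throw new Error("expect token but found none");
      const prefixParselet = prefixParselets.get(prefixToken) ?? valueParselet;
      let left = prefixParselet.handle(prefixToken, parser);
      // Infix part: keep absorbing infixes that bind tightly enough.
      while (true) {
        const infixToken: string | undefined = tokens[pos];
        if (infixToken === undefined) break;
        const infixParselet = infixParselets.get(infixToken);
        if (!infixParselet) break;
        if (infixParselet.precedence <= ctxPrecedence) break;
        pos++;
        left = infixParselet.handle(left, infixToken, parser);
      }
      return left;
    },
  };
  return parser.parseExp(0);
}

// Render a tree with explicit parentheses.
function show(node: AstNode): string {
  if (node.type === "value") return node.value;
  if (node.type === "unary") return `(${node.operator}${show(node.body)})`;
  return `(${show(node.left)}${node.operator}${show(node.right)})`;
}

console.log(show(parse("-1+2")));          // prints "((-1)+2)"
console.log(show(parse("1 + 2 * 3 - 4"))); // prints "((1+(2*3))-4)"
```

This sketch silently stops at the first token it cannot treat as an infix; a fuller implementation would report leftover tokens as an error.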
Example execution process
Now, using 1 + 2 * 3 - 4 as an example, let's see how the Pratt Parsing algorithm works:
- First define the precedence of each infix (i.e. infixParselet.precedence): for example, 120 for addition and subtraction and 130 for multiplication and division (multiplication and division have a higher "binding strength")
- Initially call parseExp(0), i.e. ctxPrecedence=0
  - Consume the token 1, call its prefixParselet, get the expression node 1, and assign it to left
  - Enter the while loop and peek at +. Its infixParselet has a precedence of 120, greater than ctxPrecedence, so this infix is "pulled in" as well
  - Consume +, and call the infixParselet.handle of +. At this point left is 1
    - The infixParselet.handle of + recursively calls parser.parseExp(120), i.e. ctxPrecedence=120
      - Consume the token 2, call its prefixParselet, get the expression node 2, and assign it to left
      - Enter the while loop and peek at *. Its infixParselet has a precedence of 130, greater than ctxPrecedence, so this infix is "pulled in" as well
      - Consume *, and call the infixParselet.handle of *. At this point left is 2
        - The infixParselet.handle of * recursively calls parser.parseExp(130), i.e. ctxPrecedence=130
          - Consume the token 3, call its prefixParselet, get the expression node 3, and assign it to left
          - Enter the while loop and peek at -. Its infixParselet has a precedence of 120, not greater than ctxPrecedence, so this infix is not pulled in and the while loop ends
          - parser.parseExp(130) returns 3
        - The infixParselet.handle of * returns 2 * 3 (combining left with the return value of parser.parseExp), which is assigned to left
      - Continue the while loop and peek at -. Its infixParselet has a precedence of 120, not greater than ctxPrecedence, so this infix is not pulled in and the while loop ends
      - parser.parseExp(120) returns the subexpression 2 * 3
    - The infixParselet.handle of + returns 1+(2*3) (combining left with the return value of parser.parseExp), which is assigned to left
  - Continue the while loop and peek at -. Its infixParselet has a precedence of 120, greater than ctxPrecedence, so this infix is "pulled in" as well
  - Consume -, and call the infixParselet.handle of -. At this point left is 1+(2*3)
    - In the same way as before, the infixParselet.handle of - returns (1+(2*3))-4 (combining left with the return value of parser.parseExp), which is assigned to left
  - The while loop continues, but there are no more tokens, so it exits the while loop and returns left
- parseExp(0) returns (1+(2*3))-4
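The walkthrough above can be reproduced mechanically by logging every parseExp call and its return value. The sketch below is an illustration under stated assumptions rather than the article's code: nodes are represented as parenthesized strings, only the four binary operators are registered (no prefix operators), and the names traceParse and tracePrecedence are hypothetical.

```typescript
// Infix precedences used in the walkthrough.
const tracePrecedence = new Map<string, number>([
  ["+", 120],
  ["-", 120],
  ["*", 130],
  ["/", 130],
]);

function traceParse(source: string): { result: string; log: string[] } {
  const tokens: string[] = source.match(/[0-9]+|\S/g) ?? [];
  const log: string[] = [];
  let pos = 0;

  function parseExp(ctxPrecedence: number, depth: number): string {
    const indent = "  ".repeat(depth);
    log.push(`${indent}parseExp(${ctxPrecedence})`);
    const first: string | undefined = tokens[pos++];
    if (first === undefined) throw new Error("expect token but found none");
    let left = first; // every non-operator token is treated as a value
    while (true) {
      const infix: string | undefined = tokens[pos];
      if (infix === undefined) break;
      const precedence = tracePrecedence.get(infix);
      if (precedence === undefined) break;
      if (precedence <= ctxPrecedence) break;
      pos++;
      left = `(${left}${infix}${parseExp(precedence, depth + 1)})`;
    }
    log.push(`${indent}returns ${left}`);
    return left;
  }

  return { result: parseExp(0, 0), log };
}

console.log(traceParse("1 + 2 * 3 - 4").log.join("\n"));
// prints:
// parseExp(0)
//   parseExp(120)
//     parseExp(130)
//     returns 3
//   returns (2*3)
//   parseExp(120)
//   returns 4
// returns ((1+(2*3))-4)
```

The indentation mirrors the nesting of the recursive calls: the call for * sits inside the call for +, while the call for - is made directly by the outermost parseExp(0).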
How to deal with associativity
An operator's associativity determines, when an expression contains several consecutive operators of the same precedence, whether the operator on the left binds first (left-associative) or the one on the right does (right-associative).
According to the algorithm described above, 1+1+1+1 is left-associative, i.e. it is parsed as ((1+1)+1)+1, which is what we expect.
However, some operators are right-associative, such as the assignment operator = (e.g. a = b = 1 should be parsed as a = (b = 1)) and the exponentiation operator ^ (e.g. a^b^c should be parsed as a^(b^c)).
Here we use ^ as the exponentiation operator instead of ** as in JavaScript, to avoid having one operator that is a prefix of another. The current implementation has a defect in that case: the first character of ** is eagerly recognized as multiplication. This bug is actually quite easy to fix; would you like to try submitting a PR?
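One common way to avoid this kind of problem is longest-match tokenization: try multi-character operators before single-character ones. The sketch below only illustrates that idea; it is an assumption about a possible fix, not a description of how the repository's scanner works or would be patched, and the function name tokenize is hypothetical.

```typescript
// Hypothetical longest-match tokenizer: "**" is listed before the
// catch-all single-character alternative, so it wins over "*".
function tokenize(source: string): string[] {
  return source.match(/\d+|[A-Za-z_]\w*|\*\*|\S/g) ?? [];
}

console.log(tokenize("2**3*4")); // prints [ '2', '**', '3', '*', '4' ]
```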
How do we implement such a right-associative operator? The answer takes only one line: in the infixParselet, pass a slightly smaller ctxPrecedence when recursively calling parseExp. Here is our utility function for registering an infix:
function helpCreateInfixOperator(
infix: string,
precedence: number,
associateRight2Left: boolean = false
) {
infixParselets[infix] = {
precedence,
handle(left, token, { parseExp }) {
const right = parseExp(associateRight2Left ? precedence - 1 : precedence);
return {
type: "binary",
operator: infix,
left,
right,
};
},
};
}
In this way, the "attraction" of the recursive parseExp call is slightly weaker. When it encounters an operator of the same precedence, the operator on the right binds more tightly, so it is "pulled in" as well (rather than separated).
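The effect of passing precedence - 1 can be checked with a small self-contained sketch. It uses the same rule as helpCreateInfixOperator, but with assumptions made for brevity: nodes are represented as parenthesized strings, the operator table and the name parseToString are hypothetical, and the precedence values chosen for = (10) and ^ (140) are illustrative rather than taken from the repository.

```typescript
interface OperatorInfo {
  precedence: number;
  associateRight2Left: boolean;
}

// Illustrative operator table; the values for "=" and "^" are assumptions.
const operators = new Map<string, OperatorInfo>([
  ["=", { precedence: 10, associateRight2Left: true }],
  ["+", { precedence: 120, associateRight2Left: false }],
  ["-", { precedence: 120, associateRight2Left: false }],
  ["*", { precedence: 130, associateRight2Left: false }],
  ["^", { precedence: 140, associateRight2Left: true }],
]);

function parseToString(source: string): string {
  const tokens: string[] = source.match(/[A-Za-z0-9_]+|\S/g) ?? [];
  let pos = 0;

  function parseExp(ctxPrecedence: number): string {
    const first: string | undefined = tokens[pos++];
    if (first === undefined) throw new Error("expect token but found none");
    let left = first;
    while (true) {
      const infix: string | undefined = tokens[pos];
      if (infix === undefined) break;
      const info = operators.get(infix);
      if (info === undefined || info.precedence <= ctxPrecedence) break;
      pos++;
      // Right-associative operators recurse with a slightly weaker
      // context, so an operator of the same precedence is absorbed
      // by the recursive call instead of ending it.
      const right = parseExp(
        info.associateRight2Left ? info.precedence - 1 : info.precedence
      );
      left = `(${left}${infix}${right})`;
    }
    return left;
  }

  return parseExp(0);
}

console.log(parseToString("1-2-3"));     // prints "((1-2)-3)"
console.log(parseToString("a^b^c"));     // prints "(a^(b^c))"
console.log(parseToString("a = b = 1")); // prints "(a=(b=1))"
```

Subtracting 1 works here because the precedence values are integers spaced far apart, so precedence - 1 never collides with another operator's level.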
Complete implementation
The full implementation is available in a GitHub repository. It contains tests (100% coverage) and more operator implementations (such as parentheses, function calls, the conditional operator ...?...:..., the right-associative power operator ^, etc.).
References
- How Desmos uses Pratt Parsers: this article walks the reader through deriving the Pratt algorithm from scratch and explains the trade-offs they made when choosing Pratt Parsing.
- Pratt Parsers: Expression Parsing Made Easy: also a good introductory article that takes the reader through the derivation of Pratt's algorithm.
- Arrow functions break JavaScript parsers: raises an interesting question: how are JavaScript arrow functions such as (arg1=...)=>{...} parsed? It might be harder than you think!