1. Background

Big data was first written into the government work report in 2014 and has now been developing for seven years. Its data types have expanded from transaction data to interaction data and sensor data, and data volumes have reached the petabyte (PB) level.

At this scale, acquiring, storing, managing, and analyzing data is beyond the capabilities of traditional database software. Against this background, a variety of big data tools have emerged to meet the needs of different business scenarios: Hive, Spark, Presto, Kylin, and Druid in the Hadoop ecosystem, and ClickHouse and Elasticsearch outside it, among others.

These tools have different characteristics and application scenarios, but the interface they expose is similar: every one of them supports SQL. Each implements its own SQL dialect according to its scenario and characteristics, which means each open source project has to implement SQL parsing itself. In this context, the parser generator ANTLR, born in 1989, has entered a golden age.

2. Introduction

ANTLR is an open source parser generator with a history of more than 30 years, a project that has stood the test of time. Turning source code into something a machine can execute basically takes three stages: writing, compiling, and executing.

The compilation stage requires lexical and syntactic analysis. ANTLR focuses on exactly this problem: from a grammar it generates a lexer and a parser that analyze the source text and produce a parse tree. ANTLR can parse almost all mainstream programming languages; the antlr/grammars-v4 repository contains grammars for dozens of them, such as Java, C, Python, and SQL. We rarely need to extend a programming language, so in most cases this language support is used for learning and research, or inside development tools (NetBeans, IntelliJ) to verify syntax and format code.

For SQL, ANTLR is applied more broadly and deeply, because Hive, Presto, SparkSQL, and similar projects need to customize how SQL is executed, for example to implement a distributed query engine or features specific to particular big data scenarios.

3. Implementing the four arithmetic operations with ANTLR4

Currently we mainly use ANTLR4. The book "The Definitive ANTLR 4 Reference" introduces various interesting applications of ANTLR4, for example: implementing a calculator that supports the four arithmetic operations; parsing and extracting formatted text such as JSON; converting JSON to XML; and extracting interfaces from Java source code. This section takes the four-operation calculator as an example to introduce the basic use of ANTLR4 and to pave the way for parsing SQL with ANTLR4 later. Supporting numeric operations is, after all, a basic ability that every programming language must have.

3.1 Hand-coded implementation

Without ANTLR4, how would we implement the four arithmetic operations? One approach is based on stacks. For example, ignoring exception handling, a simple hand-written implementation looks like this:

package org.example.calc;
 
import java.util.*;
 
public class CalcByHand {
    // Define the operators in two priority groups; * and / have the higher priority
    public static Set<String> opSet1 = new HashSet<>();
    public static Set<String> opSet2 = new HashSet<>();
    static{
        opSet1.add("+");
        opSet1.add("-");
        opSet2.add("*");
        opSet2.add("/");
    }
    public static void main(String[] args) {
        String exp="1+3*4";
        // Split the expression into tokens around each operator
        String[] tokens = exp.split("(?<=[-+*/])|(?=[-+*/])");
 
        Stack<String> opStack = new Stack<>();
        Stack<String> numStack = new Stack<>();
        int proi=1;
        // Push each token onto the operator stack or the operand stack according to its type
        for(String token: tokens){
            token = token.trim();
 
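            if(opSet1.contains(token)){
                // Reduce a pending + or - first so that operators of equal priority evaluate left to right
                if(!opStack.isEmpty()){
                    calcExp(opStack,numStack);
                }
                opStack.push(token);
                proi=1;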
            }else if(opSet2.contains(token)){
                proi=2;
                opStack.push(token);
            }else{
                numStack.push(token);
                // If the operator before this operand has high priority, evaluate now and push the result
                if(proi==2){
                    calcExp(opStack,numStack);
                }
            }
        }
 
        while (!opStack.isEmpty()){
            calcExp(opStack,numStack);
        }
        String finalVal = numStack.pop();
        System.out.println(finalVal);
    }
     
    private static void calcExp(Stack<String> opStack, Stack<String> numStack) {
        double right=Double.valueOf(numStack.pop());
        double left = Double.valueOf(numStack.pop());
        String op = opStack.pop();
        String val;
        switch (op){
            case "+":
                 val =String.valueOf(left+right);
                break;
            case "-":
                 val =String.valueOf(left-right);
                break;
            case "*":
                val =String.valueOf(left*right);
                break;
            case "/":
                val =String.valueOf(left/right);
                break;
            default:
                throw new UnsupportedOperationException("unsupported");
        }
        numStack.push(val);
    }
}

The amount of code is small: it relies on the stack data structure, and operator precedence has to be controlled by hand. This implementation supports neither bracketed expressions nor assignment. Next, look at the implementation using ANTLR4.
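As an aside, bracket support can also be added by hand, at the cost of more code. The following is a minimal sketch and not part of the original example: a recursive-descent evaluator in which the class name ParenCalc and its method names are assumptions made for illustration. It handles + - * / and parentheses, but still no assignment.

```java
// Hypothetical class: a hand-written recursive-descent evaluator.
class ParenCalc {
    private final String src;
    private int pos = 0;

    private ParenCalc(String src) {
        this.src = src.replaceAll("\\s+", ""); // drop whitespace up front
    }

    /** Evaluates an expression made of numbers, + - * / and parentheses. */
    static double eval(String expression) {
        ParenCalc p = new ParenCalc(expression);
        double value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double value = term();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            double right = term();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            char op = src.charAt(pos++);
            double right = factor();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // factor := NUMBER | '(' expr ')'
    private double factor() {
        if (pos < src.length() && peek() == '(') {
            pos++; // consume '('
            double value = expr();
            if (pos >= src.length() || peek() != ')') {
                throw new IllegalArgumentException("missing ')'");
            }
            pos++; // consume ')'
            return value;
        }
        int start = pos;
        while (pos < src.length() && (Character.isDigit(peek()) || peek() == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }

    private char peek() {
        return src.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(eval("1+3*4"));   // 13.0
        System.out.println(eval("(1+3)*4")); // 16.0
        System.out.println(eval("10-4-3"));  // 3.0
    }
}
```

Each grammar level (expr, term, factor) becomes one method, so precedence and brackets are encoded in the call structure rather than in an explicit priority flag. Every new language feature still means more hand-written parsing code.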

3.2 Implementation based on ANTLR4

The basic process of programming with ANTLR4 is fixed and usually consists of three steps:

  • Based on the requirements, write the grammar rules of the custom language according to the ANTLR4 rule syntax and save them in a file with the suffix g4.
  • Use the ANTLR4 tool to process the g4 file and generate the lexer code, the parser code, and the token vocabulary files.
  • Write code that extends the generated Visitor class or implements the Listener interface, and develop your own business logic.

Following this process, we walk through the details using an existing example.

Step 1: Define a grammar file according to the ANTLR4 rule syntax. The file name has the suffix g4. For example, the grammar file for the calculator is named LabeledExpr.g4. Its contents are as follows:

grammar LabeledExpr; // rename to distinguish from Expr.g4
 
prog:   stat+ ;
 
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
 
expr:   expr op=('*'|'/') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;
 
MUL :   '*' ; // assigns token name to '*' used above in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   '-' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace

(Note: this example comes from "The Definitive ANTLR 4 Reference".)

A brief interpretation of the LabeledExpr.g4 file follows. ANTLR4 rules are written in a notation close to regular expressions. The rules are read top-down, and each statement ending with a semicolon is one rule. The first line, grammar LabeledExpr;, says that the grammar is named LabeledExpr; this name must match the file name. Java has a similar convention: the class name matches the name of the class file.

The rule prog says that a prog consists of one or more stat.

The rule stat has three alternatives: a blank line, an expression expr, and an assignment ID '=' expr.

The rule expr has five alternatives: multiplication and division, addition and subtraction, integer, ID, and bracketed expression. This is clearly a recursive definition.

The last definitions are the basic elements from which the compound rules are built. For example, ID: [a-zA-Z]+ means that an ID is a string of one or more uppercase or lowercase English letters, and INT: [0-9]+ means that an INT is one or more digits between 0 and 9. This definition is not strict; a stricter one would also limit the length.

With an understanding of regular expressions, the g4 grammar rules of ANTLR4 are relatively easy to read.

When defining ANTLR4 rules, one situation needs attention: a string may match several rules at the same time, as with the following two rules:

ID: [a-zA-Z]+;

FROM: 'from';

The string "from" clearly satisfies both rules. ANTLR4 resolves such a tie by definition order: when two lexer rules match the same text, the rule defined first wins. Here ID is defined before FROM, so the string from is matched as an ID. For it to be recognized as a keyword, the FROM rule would have to be defined before ID.
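The tie-breaking behavior can be imitated with a toy model. The sketch below is not ANTLR's implementation; the class RuleOrderDemo and the method firstToken are hypothetical, and java.util.regex patterns stand in for lexer rules. It picks the rule with the longest match at the start of the input and, on a tie, keeps the rule that was defined first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of lexer tie-breaking; not ANTLR's actual algorithm.
class RuleOrderDemo {
    /** Returns the name of the rule that wins at the start of the input, or null if none matches. */
    static String firstToken(Map<String, String> rules, String input) {
        String winner = null;
        int longest = 0;
        // LinkedHashMap iterates in insertion order, i.e. definition order.
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' means an equally long later match does not replace the winner.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        Map<String, String> idFirst = new LinkedHashMap<>();
        idFirst.put("ID", "[a-zA-Z]+");
        idFirst.put("FROM", "from");

        Map<String, String> fromFirst = new LinkedHashMap<>();
        fromFirst.put("FROM", "from");
        fromFirst.put("ID", "[a-zA-Z]+");

        System.out.println(firstToken(idFirst, "from"));      // ID
        System.out.println(firstToken(fromFirst, "from"));    // FROM
        System.out.println(firstToken(fromFirst, "fromage")); // ID (longer match wins)
    }
}
```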

In fact, once the g4 file is written, ANTLR4 has already done half of the work for us: it generates the overall framework and interfaces, and the remaining development consists of implementing those interfaces or extending the generated base classes. The generated syntax tree can be processed in two ways: Visitor mode and Listener mode.

3.2.1 Using Visitor Mode

Step 2: Use the ANTLR4 tool to process the g4 file and generate code. That is, the ANTLR tool reads the g4 file and automatically generates the base code for us. The process diagram is as follows:

The command line is as follows:

antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4

After the command is executed, the generated files are as follows:

$ tree .
.
├── LabeledExpr.g4
├── LabeledExpr.tokens
├── LabeledExprBaseVisitor.java
├── LabeledExprLexer.java
├── LabeledExprLexer.tokens
├── LabeledExprParser.java
└── LabeledExprVisitor.java

First, develop the entry class Calc.java. The Calc class is the entry point of the whole program. The core code that calls the lexer and parser classes generated by ANTLR4 is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);

Next, define a class that extends LabeledExprBaseVisitor. The methods to override are as follows:

As can be seen from the figure, the generated methods correspond to the labels in the rule definitions. For example, visitAddSub corresponds to the AddSub label and visitId to the id label, and so on. The code that implements addition and subtraction is as follows:

/** expr op=('+'|'-') expr */
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));  // get value of left subexpression
    int right = visit(ctx.expr(1)); // get value of right subexpression
    if ( ctx.op.getType() == LabeledExprParser.ADD ) return left + right;
    return left - right; // must be SUB
}

Quite intuitive. Once the code is written, run the main function of Calc, type an expression in the interactive command line, press Enter, and then press Ctrl+D to see the result. For example, entering 1+3*4 prints 13.

3.2.2 Using Listener Mode

Similarly, we can use the Listener mode to implement the four arithmetic operations. The command line is as follows:

antlr4 -package org.example.calc -listener .\LabeledExpr.g4

Executing this command also generates framework code for us, on top of which we develop the entry class and the interface implementation class. First, develop the entry class Calc.java. The Calc class is the entry point of the whole program. The code that calls the lexer and parser classes generated by ANTLR4 is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(new EvalListener(), tree);

As can be seen, the logic for generating the ParseTree is exactly the same. The Listener implementation is slightly more complicated and also needs a stack, but only a single operand stack, and operator precedence no longer has to be controlled by hand. Take AddSub as an example:

@Override
public void exitAddSub(LabeledExprParser.AddSubContext ctx) {
    // The right operand was pushed last, so it is popped first
    Double right = numStack.pop();
    Double left = numStack.pop();
    Double result;
    if (ctx.op.getType() == LabeledExprParser.ADD) {
        result = left + right;
    } else {
        result = left - right;
    }
    numStack.push(result);
}

The operands are taken directly from the stack, the operation is performed, and the result is pushed back.
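The pop order matters for subtraction and division: the right operand is pushed last, so it comes off the stack first. The following stdlib-only sketch illustrates this. The class PostOrderEval and its event strings are hypothetical stand-ins for the exit callbacks a tree walker would fire, which arrive in post-order (operands before their operator).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a listener. Each string plays the role of one
// exit callback, delivered in the post-order a tree walker would produce.
class PostOrderEval {
    static double eval(String... events) {
        Deque<Double> numStack = new ArrayDeque<>();
        for (String event : events) {
            if (event.length() == 1 && "+-*/".contains(event)) {
                double right = numStack.pop(); // pushed last, popped first
                double left = numStack.pop();
                numStack.push(apply(event.charAt(0), left, right));
            } else {
                numStack.push(Double.parseDouble(event)); // a number leaf
            }
        }
        return numStack.pop();
    }

    private static double apply(char op, double left, double right) {
        switch (op) {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            default:  return left / right;
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("10", "4", "-"));          // 10-4  -> 6.0
        System.out.println(eval("1", "3", "4", "*", "+")); // 1+3*4 -> 13.0
    }
}
```

Popping in the opposite order would still give correct results for + and *, which is why the mistake is easy to miss until a subtraction or division is tested.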

3.2.3 Summary

Regarding the difference between Listener mode and Visitor mode, there is a clear explanation in the book "The Definitive ANTLR 4 Reference":

Listener mode:

Visitor mode:

  • In Listener mode the walker object drives the traversal, so the code does not need to consider the parent-child relationships in the syntax tree. In Visitor mode the code itself must visit the child nodes; if a child node is skipped, its entire subtree is never visited.
  • Listener methods have no return value, whereas Visitor methods can return a value of any type.
  • The access stack in Listener mode is explicit and clear, whereas Visitor mode relies on the method call stack; a faulty implementation may cause a StackOverflowError.
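The first two differences can be seen side by side on a hand-built tree. This is a minimal sketch under stated assumptions: the class VisitorVsListener and all of its nested types are hypothetical miniatures, not the classes ANTLR generates, and the tree only contains numbers and additions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical miniature tree with only numbers and additions.
// ANTLR generates much richer context, visitor, and listener types.
class VisitorVsListener {
    interface Visitor<T> { T visitNum(Num n); T visitAdd(Add a); }
    interface Listener { void exitNum(Num n); void exitAdd(Add a); }

    static abstract class Node {
        abstract <T> T accept(Visitor<T> visitor);
    }
    static final class Num extends Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override <T> T accept(Visitor<T> visitor) { return visitor.visitNum(this); }
    }
    static final class Add extends Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        @Override <T> T accept(Visitor<T> visitor) { return visitor.visitAdd(this); }
    }

    // Visitor style: methods return values and visit the children explicitly.
    static int evalWithVisitor(Node tree) {
        return tree.accept(new Visitor<Integer>() {
            @Override public Integer visitNum(Num n) { return n.value; }
            @Override public Integer visitAdd(Add a) {
                return a.left.accept(this) + a.right.accept(this);
            }
        });
    }

    // Listener style: callbacks return nothing, so results travel on a stack.
    static int evalWithListener(Node tree) {
        Deque<Integer> stack = new ArrayDeque<>();
        walk(tree, new Listener() {
            @Override public void exitNum(Num n) { stack.push(n.value); }
            @Override public void exitAdd(Add a) {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(left + right);
            }
        });
        return stack.pop();
    }

    // The walker, not the listener, drives the traversal: children first, then the exit callback.
    static void walk(Node node, Listener listener) {
        if (node instanceof Add) {
            Add add = (Add) node;
            walk(add.left, listener);
            walk(add.right, listener);
            listener.exitAdd(add);
        } else {
            listener.exitNum((Num) node);
        }
    }

    static Node sample() { // 1 + (2 + 3)
        return new Add(new Num(1), new Add(new Num(2), new Num(3)));
    }

    public static void main(String[] args) {
        System.out.println(evalWithVisitor(sample()));  // 6
        System.out.println(evalWithListener(sample())); // 6
    }
}
```

If visitAdd forgot to call accept on a.right, that subtree would silently be skipped; the listener version cannot make that mistake because the walker visits every node.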

Through this simple example, we used ANTLR4 to implement a simple calculator and learned the ANTLR4 application process, the definition of g4 grammar files, and the Visitor and Listener modes. With ANTLR4 we generated a ParseTree, traversed it in Visitor mode and in Listener mode, and implemented the four arithmetic operations.

These examples show that without ANTLR4 we could write the algorithm ourselves to achieve the same function. With ANTLR, however, we do not need to care about how the expression string is parsed and can focus on the business logic, which saves a great deal of effort.

More importantly, ANTLR4 offers a more general abstraction than a hand-written implementation. It rises to the level of methodology, because it solves a whole class of problems rather than a single one. Compared with hard-coding a solution, ANTLR is like calculus compared with an ordinary area formula in mathematics.

4. Developing a SQL parser with reference to the Presto source code

The previous section introduced the four arithmetic operations with ANTLR4 in order to show how ANTLR4 is applied. Now the real purpose is revealed: to study how Presto uses ANTLR4 to parse SQL statements.

Supporting the complete SQL syntax is a huge undertaking. Presto contains a complete SqlBase.g4 file that defines all the SQL syntax Presto supports, covering both DDL and DML. The file is relatively large and is not well suited to studying one specific detail.

To explore the process of SQL parsing and understand the logic behind SQL execution, I chose to run a coding experiment of my own after a brief reading of the relevant documents. The small goal is to implement a SQL parser, use it to support the syntax select field from table, and query the specified fields from a local csv data source.

4.1 Trim the SqlBase.g4 file

Following the same process as for the four arithmetic operations, first define the SqlBase.g4 file. With the Presto source code as a reference, we do not need to write the grammar ourselves; we only need to trim Presto's g4 file. The trimmed content is as follows:

grammar SqlBase;
 
tokens {
    DELIMITER
}
 
singleStatement
    : statement EOF
    ;
 
statement
    : query                                                            #statementDefault
    ;
 
query
    :  queryNoWith
    ;
 
queryNoWith:
      queryTerm
    ;
 
queryTerm
    : queryPrimary                                                             #queryTermDefault
    ;
 
queryPrimary
    : querySpecification                   #queryPrimaryDefault
    ;
 
querySpecification
    : SELECT  selectItem (',' selectItem)*
      (FROM relation (',' relation)*)?
    ;
 
selectItem
    : expression  #selectSingle
    ;
 
relation
    :  sampledRelation                             #relationDefault
    ;
 
expression
    : booleanExpression
    ;
 
booleanExpression
    : valueExpression             #predicated
    ;
 
valueExpression
    : primaryExpression                                                                 #valueExpressionDefault
    ;
 
primaryExpression
    : identifier                                                                          #columnReference
    ;
 
sampledRelation
    : aliasedRelation
    ;
 
aliasedRelation
    : relationPrimary
    ;
 
relationPrimary
    : qualifiedName                                                   #tableName
    ;
 
qualifiedName
    : identifier ('.' identifier)*
    ;
 
identifier
    : IDENTIFIER             #unquotedIdentifier
    ;
 
SELECT: 'SELECT';
FROM: 'FROM';
 
fragment DIGIT
    : [0-9]
    ;
 
fragment LETTER
    : [A-Z]
    ;
 
IDENTIFIER
    : (LETTER | '_') (LETTER | DIGIT | '_' | '@' | ':')*
    ;
 
WS
    : [ \r\n\t]+ -> channel(HIDDEN)
    ;
 
// Catch-all for anything we can't recognize.
// We use this to be able to ignore and recover all the text
// when splitting statements with DelimiterLexer
UNRECOGNIZED
    : .
    ;

Compared with the 700-plus lines of rules in the Presto source code, we have cut the file to about one tenth of its size. The core rule of this file is: SELECT selectItem (',' selectItem)* (FROM relation (',' relation)*)?
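To make the shape of this core rule concrete, here is a hand-rolled recognizer that follows it one element at a time. It is only a sketch: the class MiniSelectParser is hypothetical, it matches keywords case-insensitively (an assumption of this sketch), it treats a relation as a single identifier without dots, and it is no substitute for the parser that ANTLR generates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-rolled recognizer that mirrors the shape of the core rule.
class MiniSelectParser {
    final List<String> selectItems = new ArrayList<>();
    final List<String> relations = new ArrayList<>();

    private final List<String> tokens;
    private int pos = 0;

    MiniSelectParser(String sql) {
        // Surround commas with spaces, then split on whitespace.
        tokens = Arrays.asList(sql.trim().replace(",", " , ").split("\\s+"));
        querySpecification();
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + tokens.get(pos));
        }
    }

    // querySpecification : SELECT selectItem (',' selectItem)* (FROM relation (',' relation)*)? ;
    private void querySpecification() {
        expect("SELECT");
        selectItems.add(identifier());
        while (accept(",")) {                // (',' selectItem)*
            selectItems.add(identifier());
        }
        if (accept("FROM")) {                // (FROM ...)?
            relations.add(identifier());
            while (accept(",")) {            // (',' relation)*
                relations.add(identifier());
            }
        }
    }

    // Consumes the next token if it equals the given text, ignoring case.
    private boolean accept(String text) {
        if (pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(text)) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(String text) {
        if (!accept(text)) {
            throw new IllegalArgumentException("expected " + text);
        }
    }

    private String identifier() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("identifier expected at end of input");
        }
        String token = tokens.get(pos);
        boolean keyword = token.equalsIgnoreCase("SELECT") || token.equalsIgnoreCase("FROM");
        if (keyword || !token.matches("[A-Za-z_][A-Za-z0-9_@:]*")) {
            throw new IllegalArgumentException("identifier expected but found: " + token);
        }
        pos++;
        return token;
    }

    public static void main(String[] args) {
        MiniSelectParser p = new MiniSelectParser("select City, State from cities");
        System.out.println(p.selectItems + " from " + p.relations);
    }
}
```

The `*` and `?` in the rule map directly onto the while loops and the if statement, which is essentially what a generated parser does for every rule in the grammar.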

Understanding the g4 file also gives a clearer picture of how a query is composed. For example, the most common data source for a query is a data table, but in the SQL grammar the queried table is abstracted into a relation.

This relation may come from a specific data table, a subquery, a JOIN, a sample of data, or an unnest of an expression. In the field of big data, this generalization greatly facilitates data processing.

For example, using unnest syntax to parse complex types of data, the SQL is as follows:

Although this SQL is more complicated, understanding the g4 file makes its structure clear. Back to the SqlBase.g4 file: we again use the antlr4 command to process the g4 file and generate code:

antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4

This generates the basic framework code. The next step is to implement the business logic ourselves.

4.2 Traverse the syntax tree to encapsulate SQL structure information

Next, define the node types of the syntax tree based on the SQL grammar, as shown in the following figure.

This class diagram makes the basic elements of the SQL syntax clearly visible.

Then implement our own parsing class AstBuilder based on the Visitor mode (to keep things simple, it is again trimmed from the Presto source code). Take the code that handles the querySpecification rule as an example:
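As a rough textual stand-in for such a node hierarchy, the following sketch declares simplified node types and builds a tree by hand. It is an illustration under assumptions: the type names mirror those used in this article, but the fields, constructors, and inheritance shown here are simplified guesses, and the wrapper class SqlAstSketch is hypothetical. The real Presto classes carry source locations and many more members.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified, hypothetical node types; not the actual Presto classes.
class SqlAstSketch {
    static abstract class Node {}
    static abstract class Statement extends Node {}
    static abstract class Relation extends Node {}
    static abstract class QueryBody extends Relation {}
    static abstract class Expression extends Node {}
    static abstract class SelectItem extends Node {}

    static final class Identifier extends Expression {
        final String value;
        Identifier(String value) { this.value = value; }
    }
    static final class SingleColumn extends SelectItem {
        final Expression expression;
        SingleColumn(Expression expression) { this.expression = expression; }
    }
    static final class Select extends Node {
        final List<SelectItem> selectItems;
        Select(List<SelectItem> selectItems) { this.selectItems = selectItems; }
    }
    static final class Table extends QueryBody {
        final String name;
        Table(String name) { this.name = name; }
    }
    static final class QuerySpecification extends QueryBody {
        final Select select;
        final Optional<Relation> from; // FROM is optional in the grammar
        QuerySpecification(Select select, Optional<Relation> from) {
            this.select = select;
            this.from = from;
        }
    }
    static final class Query extends Statement {
        final QueryBody queryBody;
        Query(QueryBody queryBody) { this.queryBody = queryBody; }
    }

    // Hand-built tree for: SELECT City FROM cities
    static Query sample() {
        SelectItem city = new SingleColumn(new Identifier("City"));
        Select select = new Select(Arrays.asList(city));
        return new Query(new QuerySpecification(select, Optional.<Relation>of(new Table("cities"))));
    }

    public static void main(String[] args) {
        QuerySpecification spec = (QuerySpecification) sample().queryBody;
        Table table = (Table) spec.from.get();
        SingleColumn column = (SingleColumn) spec.select.selectItems.get(0);
        System.out.println(((Identifier) column.expression).value + " from " + table.name);
    }
}
```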

@Override
public Node visitQuerySpecification(SqlBaseParser.QuerySpecificationContext context)
{
    Optional<Relation> from = Optional.empty();
    List<SelectItem> selectItems = visit(context.selectItem(), SelectItem.class);
 
    List<Relation> relations = visit(context.relation(), Relation.class);
    if (!relations.isEmpty()) {
        // synthesize implicit join nodes
        Iterator<Relation> iterator = relations.iterator();
        Relation relation = iterator.next();
 
        from = Optional.of(relation);
    }
 
    return new QuerySpecification(
            getLocation(context),
            new Select(getLocation(context.SELECT()), false, selectItems),
            from);
}

With this code, we have parsed out the data source and the fields of the query and encapsulated them in a QuerySpecification object.

4.3 Using the Statement object to query data

From the earlier example of the four arithmetic operations, we know that ANTLR parses the statement entered by the user into a ParseTree, and that developers implement the relevant interfaces to process the ParseTree themselves. Presto parses the input SQL statement into a ParseTree, traverses the ParseTree, and finally produces a Statement object. The core code is as follows:

SqlParser sqlParser = new SqlParser();
Statement statement = sqlParser.createStatement(sql);

How do we use the Statement object? Combined with the class diagram above, we can see that:

  • A Statement of type Query has a QueryBody attribute.
  • A QueryBody of type QuerySpecification has a select attribute and a from attribute.

From this structure we can obtain everything needed to implement the select query:

  • Get the target table (Table) from the from attribute. By convention, the table name is the same as the csv file name.
  • Get the target fields (SelectItem) from the select attribute. By convention, the first line of the csv file is the header row.

The whole business process is now clear. After the SQL statement has been parsed into a Statement object, follow these steps:

  • s1: Get the table and fields to query.
  • s2: Locate the data file by the table name and read the data in it.
  • s3: Format the field names and print them to the command line.
  • s4: Format the field contents and print them to the command line.

To keep the logic simple, the code only covers the main path and contains no exception handling.

/**
 * Get the table name and field names to query
 */
QuerySpecification specification = (QuerySpecification) query.getQueryBody();
Table table= (Table) specification.getFrom().get();
List<SelectItem> selectItems = specification.getSelect().getSelectItems();
List<String> fieldNames = Lists.newArrayList();
for(SelectItem item:selectItems){
    SingleColumn column = (SingleColumn) item;
    fieldNames.add(((Identifier)column.getExpression()).getValue());
}
 
/**
 * Determine the data source file from the table name
 */
String fileLoc = String.format("./data/%s.csv",table.getName());
 
/**
 * Read the specified fields from the csv file
 */
Reader in = new FileReader(fileLoc);
Iterable<CSVRecord> records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
List<Row> rowList = Lists.newArrayList();
for(CSVRecord record:records){
    Row row = new Row();
    for(String field:fieldNames){
        row.addColumn(record.get(field));
    }
    rowList.add(row);
}
 
/**
 * Format the output and print it to the console
 */
int width=30;
String format = fieldNames.stream().map(s-> "%-"+width+"s").collect(Collectors.joining("|"));
System.out.println( "|"+String.format(format, fieldNames.toArray())+"|");
 
int flagCnt = width*fieldNames.size()+fieldNames.size();
String rowDelimiter = String.join("", Collections.nCopies(flagCnt, "-"));
System.out.println(rowDelimiter);
for(Row row:rowList){
    System.out.println( "|"+String.format(format, row.getColumnList().toArray())+"|");
}

The code is for demonstration purposes only and does not yet handle error cases, such as a queried field that does not exist or field names in the csv file that do not meet the requirements.

4.4 Results
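One of those error cases, a queried field that does not exist, could be handled roughly as follows. This is a stdlib-only sketch rather than the article's code: the class CsvSelectSketch is hypothetical, the csv content is passed in as a list of lines instead of being read from a file, and the naive split on commas does not handle commas inside quoted values, which a real csv library does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Stdlib-only sketch of steps s2 to s4 with a check for unknown fields.
class CsvSelectSketch {
    /** Returns the formatted output lines: header, separator, then one line per record. */
    static List<String> select(List<String> csvLines, List<String> fieldNames, int width) {
        List<String> header = splitLine(csvLines.get(0)); // first line is the header row
        List<Integer> indexes = new ArrayList<>();
        for (String field : fieldNames) {
            int idx = header.indexOf(field);
            if (idx < 0) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            indexes.add(idx);
        }

        // One left-aligned, fixed-width column per field, e.g. "|%-12s|%-12s|"
        String format = fieldNames.stream()
                .map(f -> "%-" + width + "s")
                .collect(Collectors.joining("|", "|", "|"));

        List<String> out = new ArrayList<>();
        out.add(String.format(format, fieldNames.toArray()));
        int lineLength = width * fieldNames.size() + fieldNames.size() + 1;
        out.add(String.join("", Collections.nCopies(lineLength, "-")));
        for (String line : csvLines.subList(1, csvLines.size())) {
            List<String> cells = splitLine(line);
            Object[] values = indexes.stream().map(cells::get).toArray();
            out.add(String.format(format, values));
        }
        return out;
    }

    // Naive split: trims each cell and strips one pair of surrounding double quotes.
    private static List<String> splitLine(String line) {
        return Arrays.stream(line.split(","))
                .map(cell -> cell.trim().replaceAll("^\"|\"$", ""))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
                "\"LatD\",\"City\",\"State\"",
                "   41, \"Youngstown\", OH",
                "   42, \"Yankton\", SD");
        select(csv, Arrays.asList("City", "State"), 12).forEach(System.out::println);
    }
}
```

Resolving the field names to column indexes before reading any record means an unknown field is reported once, up front, instead of failing in the middle of the output.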

In our project data directory, store the following csv files:

The sample data of cities.csv file is as follows:

"LatD","LatM","LatS","NS","LonD","LonM","LonS","EW","City","State"
   41,    5,   59, "N",     80,   39,    0, "W", "Youngstown", OH
   42,   52,   48, "N",     97,   23,   23, "W", "Yankton", SD
   46,   35,   59, "N",    120,   30,   36, "W", "Yakima", WA
   42,   16,   12, "N",     71,   48,    0, "W", "Worcester", MA

Run the code to query the data, using SQL statements to specify which fields to read from the csv file. The results of the SQL queries look like this:

SQL example 1: select City, City from cities

SQL example 2: select name, age from employee

This section described how to trim the g4 rule file from the Presto source code and then, based on ANTLR4, use SQL statements to query data from a csv file. A coding experiment built on a trimmed copy of the Presto source code is a useful way to study how a SQL engine is implemented and to understand the Presto source code.

5. Summary

Using the four arithmetic operations and SQL queries over csv data as examples, this article described the ideas and process of applying ANTLR4 in project development. The relevant code is available on GitHub. Understanding how to use ANTLR4 helps in understanding the definition rules and execution process of SQL, and in writing efficient SQL statements in business development. It also helps in understanding compilation principles, defining your own DSL, and abstracting business logic. What is learned on paper always feels shallow; to truly understand something, you have to practice it yourself. Studying source code in the way described in this article is also a pleasure.

References

1. "The Definitive ANTLR 4 Reference"

2. Presto official documentation

3. "ANTLR 4 Concise Tutorial"

4. Calc class source code

5. EvalVisitor class source code

6. Presto source code

Author: vivo Internet Development Team-Shuai Guangying
