Databend SQL Planner New Design

foreword

In order to support complex SQL queries and provide a better user experience, we have done a large-scale refactoring of Databend's SQL planner in recent months. At present, the refactoring is coming to an end, and interested friends can modify the Session settings of Databend by modifying

 SET enable_planner_v2 = 1

to enable the new planner for early access.

Feature Highlights

A more friendly query experience

Whether you are a data analyst or a developer, you will always encounter various errors when writing SQL queries. Especially in the case of complex SQL queries, troubleshooting and reporting errors has become a nightmare for many people (I have modified MySQL queries with dozens of JOIN clauses, and have since hated MySQL's error messages).

In order to improve the user experience in this regard, we have introduced strict semantic checking in the new Planner, so that most errors can be intercepted at the query compilation stage. At the same time, in order to facilitate users to locate the wrong location, we also introduced a new error prompt algorithm.

When your SQL query uses the wrong syntax (such as misspelled keywords, or omitted certain clauses), Databend will provide you with hints:

When your SQL query has a semantic error (such as using a non-existing Column, or Column is ambiguous), Databend will also point you to the location of the error:

You can still get a better experience when writing complex queries:

Support JOIN query and correlated subquery

In the new SQL planner, we support JOIN query (INNER JOIN, OUTER JOIN, CROSS JOIN) and correlated sub-query, and provide Hash Join algorithm to execute JOIN query.

Documentation on JOIN queries has been published at https://databend.rs/doc/reference/sql/query-syntax/dml-join , you can check the documentation to understand how JOIN queries are used in Databend.

In OLAP query, JOIN is a very important part. In both traditional star schema and snowflake schema , we need to join dimension table with fact table through JOIN query to generate the result report.

TPCH Benchmark is a set of OLAP query benchmarks developed by the TPC committee to evaluate the OLAP capabilities of database systems. It contains 8 tables, namely:

Lineitem: Product item
Orders: order information
Customer: customer information
Part: Parts information
Supplier: Supplier information
Partsupp: Parts and suppliers relationship table
Nation: country information
Region: region information

There are 22 complex queries in TPCH, corresponding to different business needs. Here is an example of the Q9 query. Its purpose is to calculate the profit of the specified year and region , which includes a large number of JOIN calculations. In the new Planner, we can already support this query:

Correlated sub-queries are also an important part of SQL, and complex query logic can be easily expressed through correlated sub-queries. An example is Q4 of TPCH, which is used to calculate the delivery of orders of various priority levels over a period of time. The EXISTS correlated subquery is used to filter overdue orders:

Currently, Databend only supports simple execution of correlated subqueries, and related query optimization work is still in progress, so stay tuned.

new architecture

In the new SQL planner, we have redesigned the process of SQL parsing to support more complex semantic analysis and SQL optimization. In the new SQL planner, after a SQL statement is sent to the databend-query server through the client, it will be processed by different components in the order shown in the following figure, and finally the query result will be returned to the client:

After receiving the SQL query, the Parser component parses it. In this step, if a syntax error is encountered, the error information will be directly returned to the client, and the AST (abstract syntax tree) corresponding to the query will be generated if the parsing is successful.

Parser

In order to provide richer syntax analysis functions and a better development experience, we have developed a set of DSL (Domain Specific Language) nom-rule based on nom Parser combinator, and rewritten SQL Parser based on this framework.

Under this framework, we can define the syntax of a Statement very easily. Taking the CREATE TABLE statement as an example, we can use DSL to describe it simply as:

 CREATE ~ TABLE ~ #identifier ~ "(" ~ (#column_def)+ ~ ")" ~ ";"

The elegant syntax greatly improves the fun of writing Parser, and interested friends are welcome to try it.

Binder

After the AST is successfully parsed by Parser, we will perform semantic analysis on it through Binder and generate an initial Logical Plan. During this process, we perform different types of semantic analysis:

Name resolution: Check the validity of the variables referenced in the SQL query by querying the relevant Table and Column object information in the Databend Catalog. And bind legal variables to corresponding objects for subsequent analysis.
Type check: Check the validity of the expression according to the information obtained in the name resolution, and find a suitable return type for the expression.
Subquery unnesting: Extract the subquery in the expression and translate it into relational algebra
Grouping check: For queries with aggregate calculations, analyze whether non-aggregate columns are referenced outside of aggregate functions

Through semantic analysis, we can eliminate the vast majority of semantic errors and return them to the user during the compilation phase to provide the best error troubleshooting experience.

Optimizer

After getting the initial Logical Plan, the optimizer will rewrite and optimize it, and finally generate an executable Physical Plan.

In the new Planner, we introduced a set of Transformation Rule-based optimizer frameworks (Volcano/Cascades). An independent Rule can be realized by defining a Pattern of relational algebra subtree structure and related Transform logic.

Take a simple Predicate Push Down as an example:

We just need to define the Pattern of the input Plan:

 impl RulePushDownFilterProject {
    pub fn new() -> Self {
        Self {
            id: RuleID::PushDownFilterProject,
            // Filter
            //  \
            //   Project
            //    \
            //     *
            pattern: SExpr::create_unary(
                Pattern {
                    plan_type: RelOp::Filter,
                },
                SExpr::create_unary(
                    Pattern {
                        plan_type: RelOp::Project,
                    },
                    SExpr::create_leaf(
                        Pattern {
                            plan_type: RelOp::Pattern,
                        },
                    ),
                ),
            ),
        }
    }
}

And implement a function that does the conversion:

 impl RulePushDownFilterProject {
    pub fn apply(&self, s_expr: SExpr) -> Result<SExpr> {
        let filter = s_expr.plan().into();
        let project = s_expr.child(0).plan().into();
        let result = SExpr::create_unary(
            project, 
            SExpr::create_unary(
                filter,
                s_expr.child(0).child(0)
            )
        );
        
        Ok(result)
    }
}

Interpreter

After the Physical Plan is generated by the Optimizer, we will translate it into an executable Pipeline and hand it over to the Processor execution framework of Databend for calculation. At this point, the work of Planner has come to an end, and I believe that readers also have a preliminary understanding of the architecture of the new Planner. For more technical details, please pay attention to our follow-up articles.

future plan

Building a SQL Planner from scratch is a very challenging thing, but through redesign and development, we can find the most suitable architecture and functionality for the system itself. In the coming period, we will continue to improve and consolidate the new SQL Planner, and the functions will focus on:

Cost-based Optimization (CBO, cost-based optimization)
Distributed query optimization
More optimization rules

Currently, the migration of the new SQL planner is nearing completion, and you can track the progress through this issue. It is expected that all the migration work will be completed in July, when we will release the announcement of the version update, so stay tuned.

About Databend

Databend is an open source, flexible, low-cost new data warehouse that can also perform real-time analysis based on object storage. Looking forward to your attention, we will explore cloud-native data warehouse solutions together and build a new generation of open source Data Cloud.

Databend Documentation: https://databend.rs/
Twitter: https://twitter.com/Datafuse_Labs
Slack: https://datafusecloud.slack.com/
Wechat: Databend
GitHub: https://github.com/datafuselabs/databend

The article was first published on the public account: Databend

Databend SQL Planner New Design

foreword

Feature Highlights

A more friendly query experience

Support JOIN query and correlated subquery

new architecture

Parser

Binder

Optimizer

Interpreter

future plan

About Databend

databend

引用和评论

如何在 DataGrip 中连接 Databend

MySQL慢查询日志：性能优化的终极指南

做到真正0丢失、0重复：Apache SeaTunnel 实现万亿级数据一致性全解密

Devin 发布 DeepWiki，2 星的项目直接装出万星的气场

好用的开源埋点方案-ClkLog埋点用户分析系统

DNS服务器地址大全

实战分享：DolphinScheduler 中 Shell 任务环境变量最佳配置方式

Databend SQL Planner New Design

foreword

Feature Highlights

A more friendly query experience

Support JOIN query and correlated subquery

new architecture

Parser

Binder

Optimizer

Interpreter

future plan

About Databend

databend

引用和评论

如何在 DataGrip 中 连接 Databend

MySQL慢查询日志：性能优化的终极指南

做到真正0丢失、0重复：Apache SeaTunnel 实现万亿级数据一致性全解密

Devin 发布 DeepWiki，2 星的项目直接装出万星的气场

好用的开源埋点方案-ClkLog埋点用户分析系统

DNS服务器地址大全

实战分享：DolphinScheduler 中 Shell 任务环境变量最佳配置方式

如何在 DataGrip 中连接 Databend