Author: vivo Internet Technology - Shuai Guangying

In "Exploring the Presto SQL Engine (1) - Using Antlr Skillfully", we introduced the basic usage of Antlr and showed how Antlr4 can be used to parse SQL and query CSV data, giving a deeper understanding of the SQL syntax supported by the Presto query engine and its implementation ideas.

This article is the second in the series. It sorts out the principles of Join and the implementation of Join algorithms in Presto. By combining theory and practice, and building on an understanding of the principles, you can gain a deeper appreciation of the engineering techniques behind the Join algorithm in OLAP scenarios, such as the volcano model, columnar storage, and batch processing.

1. Background

When using a database in business development, there is usually a rule that limits how many tables a join may involve. For example, the Alibaba Java development manual contains the following provision:

[Mandatory] Joining more than three tables is prohibited. The data types of the fields being joined must be absolutely consistent; when multiple tables are joined in a query, ensure that the joined fields have indexes. Note: Pay attention to table indexes and SQL performance even for a two-table join.

In building a big data warehouse, although we have star schemas and snowflake schemas, most of what is finally delivered to the business is wide tables.

So there is a contradiction in how the business uses the database: we need Join to provide flexible association operations, but we must try to avoid the performance problems caused by joining many tables or large tables. Why is this?

2. The basic principle of Join

The semantics provided by Join in the database are very rich. A brief summary is as follows:

(Figure: a summary of the Join types and their semantics)

To understand how Join is implemented, Cross Join, the so-called Cartesian product, is the best entry point. The Cartesian product of two sets is easy to understand: enumerate every combination of one element from each set. In a database, the set corresponds to all the rows (tuples) in a table, and an element of the set corresponds to a single row (tuple). With that, the algorithm for Cross Join practically writes itself.

A sample implementation is as follows:

// Tuple is assumed to be a simple wrapper around a list of column values whose
// merge() method appends another tuple's values; newArrayList comes from Guava.
import static com.google.common.collect.Lists.newArrayList;

List<Tuple> r = newArrayList(
        new Tuple(newArrayList(1, "a")),
        new Tuple(newArrayList(2, "b")));

List<Tuple> s = newArrayList(
        new Tuple(newArrayList(3, "c")),
        new Tuple(newArrayList(4, "d")));

int cnt = 0;
for (Tuple ri : r) {
    for (Tuple si : s) {
        // Every row of r is combined with every row of s: |r| * |s| output rows.
        Tuple c = new Tuple().merge(ri).merge(si);
        System.out.println(++cnt + ": " + c);
    }
}
/**
 * out:
 1: [1, a, 3, c]
 2: [1, a, 4, d]
 3: [2, b, 3, c]
 4: [2, b, 4, d]
 */

As you can see, the implementation logic is very simple: just two nested for loops.

2.1 Nested Loop Join Algorithm

On this basis, the first algorithm for implementing Inner Join follows naturally, with a very straightforward name: Nested Loop Join. The key points are as follows:

(Figure: Nested Loop Join. Source: Join Processing in Relational Databases)

Among them, the theta operator can be: =, !=, <, >, ≤, ≥.

Compared with the Cartesian product implementation, only one layer of if-condition checking is added inside the loops, to keep the combinations that satisfy the join condition.
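To make this concrete, here is a minimal, self-contained sketch of Nested Loop Inner Join (illustrative Java, not Presto code); rows are modeled as List<Object> and the theta condition is assumed to be equality on the first column:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedLoopJoinDemo {
    // Helper that builds one row (tuple) from its column values.
    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    public static void main(String[] args) {
        // r(id, name) and s(id, city); the join condition is r.id = s.id
        List<List<Object>> r = List.of(row(1, "a"), row(2, "b"));
        List<List<Object>> s = List.of(row(2, "x"), row(3, "y"));

        for (List<Object> ri : r) {
            for (List<Object> si : s) {
                // The only change from the Cartesian product: an if on the join condition.
                if (ri.get(0).equals(si.get(0))) {
                    List<Object> joined = new ArrayList<>(ri);
                    joined.addAll(si);
                    System.out.println(joined); // prints [2, b, 2, x]
                }
            }
        }
    }
}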

For the Nested Loop algorithm, the most critical issue is execution efficiency. If the two tables involved in the Join have 10,000 and 100,000 rows respectively, then the number of comparisons is 10,000 × 100,000 = 1 billion. In the big data era, a single table often holds billions of rows, so with the Nested Loop Join algorithm the number of comparisons becomes astronomical. Nested Loop Join is therefore basically a last resort. Within the Nested Loop framework, common optimizations are as follows:

  • Let the small table drive the large table, that is, use the set with the larger amount of data as the inner loop of the for loops.
  • Process one block of data at a time rather than one record at a time. This is the so-called Block Nested Loop Join: by working block by block, it reduces the number of I/Os and improves the cache hit rate (see the sketch after this list).
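A rough illustration of the block-at-a-time idea (a toy in-memory sketch, not Presto code; BLOCK_SIZE and the sublists stand in for real disk pages):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockNestedLoopJoinDemo {
    static final int BLOCK_SIZE = 1024; // illustrative block size

    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    // Equi-join on the first column, reading r one block at a time
    // so that s is scanned once per block of r instead of once per row of r.
    static List<List<Object>> blockNestedLoopJoin(List<List<Object>> r, List<List<Object>> s) {
        List<List<Object>> result = new ArrayList<>();
        for (int start = 0; start < r.size(); start += BLOCK_SIZE) {
            // In a real engine this block would be one page read from disk.
            List<List<Object>> block = r.subList(start, Math.min(start + BLOCK_SIZE, r.size()));
            for (List<Object> si : s) {
                for (List<Object> ri : block) {
                    if (ri.get(0).equals(si.get(0))) {
                        List<Object> joined = new ArrayList<>(ri);
                        joined.addAll(si);
                        result.add(joined);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> r = List.of(row(1, "a"), row(2, "b"));
        List<List<Object>> s = List.of(row(2, "x"), row(1, "y"));
        blockNestedLoopJoin(r, s).forEach(System.out::println);
    }
}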

It is worth mentioning that although the idea of Nested Loop Join is very simple, it has natural distributed and parallel capabilities, which is an important reason why Nested Loop Join implementations are still kept in various NoSQL databases. Even though serial execution on a single machine is slow, if the work can be parallelized, the problem can be solved by adding machines.

2.2 Sort Merge Join Algorithm

From the previous analysis we know that the key problem of the Nested Loop Join algorithm is that there are too many comparisons; the complexity is O(m*n), so any breakthrough must target exactly this point. If the elements of the sets are ordered, many meaningless comparisons are avoided and the number of comparisons drops dramatically. This leads to the second implementation, Sort Merge Join, described below:

(Figure: Sort Merge Join. Source: Join Processing in Relational Databases)

The Join operation is accelerated by splitting it into two stages: Sort and Merge. The Sort stage can be prepared in advance and reused. This way of thinking is very friendly to relational databases such as MySQL, and it also explains why the Alibaba development manual requires an index on the joined fields: the index keeps the data ordered. The time complexity of the algorithm is the sorting cost O(m*log(m) + n*log(n)) plus the merging cost O(m+n). When an index already guarantees that the data is ordered, the sorting cost disappears and the overall complexity is O(m+n).
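A minimal sketch of the Sort/Merge split (illustrative Java only, not Presto's implementation), assuming an equi-join on an integer key in the first column:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortMergeJoinDemo {
    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    static List<List<Object>> sortMergeJoin(List<List<Object>> r, List<List<Object>> s) {
        // Sort phase: O(m*log(m) + n*log(n)); skippable if an index already keeps the data ordered.
        Comparator<List<Object>> byKey = Comparator.comparingInt(t -> (Integer) t.get(0));
        List<List<Object>> rs = new ArrayList<>(r);
        List<List<Object>> ss = new ArrayList<>(s);
        rs.sort(byKey);
        ss.sort(byKey);

        // Merge phase: O(m + n) when join keys are mostly distinct.
        List<List<Object>> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < rs.size() && j < ss.size()) {
            int left = (Integer) rs.get(i).get(0);
            int right = (Integer) ss.get(j).get(0);
            if (left < right) {
                i++;
            } else if (left > right) {
                j++;
            } else {
                // Emit every pair that shares this key, then advance both sides past it.
                int jStart = j;
                while (i < rs.size() && (Integer) rs.get(i).get(0) == left) {
                    for (j = jStart; j < ss.size() && (Integer) ss.get(j).get(0) == left; j++) {
                        List<Object> joined = new ArrayList<>(rs.get(i));
                        joined.addAll(ss.get(j));
                        result.add(joined);
                    }
                    i++;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> r = List.of(row(2, "b"), row(1, "a"), row(2, "c"));
        List<List<Object>> s = List.of(row(2, "x"), row(3, "y"));
        sortMergeJoin(r, s).forEach(System.out::println); // [2, b, 2, x] and [2, c, 2, x]
    }
}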

2.3 Hash Join Algorithm

The idea behind Sort Merge Join has certain limitations in practice: its strength is also its weakness. For Hadoop-based data warehouses, keeping the stored data ordered has a large impact on performance, and with massive data the cost of maintaining indexes is significant. Moreover, indexes depend on the usage scenario; it is impossible to build an index for every field. When a large table is joined with a small table, for example a user table (large) with today's order table (small), or a fact table (large) with a dimension table (small), we can trade space for time. Recall that among the basic data structures, the tree and the hash table are the two magic weapons of data processing: one keeps data ordered to support range lookups, the other uses a hash function to achieve precise point lookups.

Against this background, the idea of implementing Join by building a hash table on the small table arises naturally.

(Figure: Hash Join. Source: Join Processing in Relational Databases)

Even if the hash table built from one table consumes too much memory on a single machine, the data can be partitioned by hash value, which directly gives the algorithm distributed capabilities. For these reasons, the Join algorithm in Presto is usually Hash Join, whose time complexity is O(m+n).
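The build/probe structure of Hash Join can be sketched as follows (illustrative Java only, not Presto's implementation): the small table is loaded into a HashMap keyed by the join column, and the large table is streamed through it; partitioning both inputs by the same hash function is what later makes the algorithm easy to distribute.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinDemo {
    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    // Equi-join on the first column. Build a hash table on the small (build) side,
    // then probe it with every row of the large (probe) side: O(m + n) overall.
    static List<List<Object>> hashJoin(List<List<Object>> build, List<List<Object>> probe) {
        // Build phase
        Map<Object, List<List<Object>>> hashTable = new HashMap<>();
        for (List<Object> b : build) {
            hashTable.computeIfAbsent(b.get(0), k -> new ArrayList<>()).add(b);
        }
        // Probe phase
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> p : probe) {
            for (List<Object> b : hashTable.getOrDefault(p.get(0), List.of())) {
                List<Object> joined = new ArrayList<>(p);
                joined.addAll(b);
                result.add(joined);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> dimension = List.of(row(1, "electronics"), row(2, "books")); // small table
        List<List<Object>> fact = List.of(row(1, 100L), row(2, 200L), row(1, 300L));    // large table
        hashJoin(dimension, fact).forEach(System.out::println);
    }
}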

Studying the relevant materials shows that the principles of the Join algorithms are quite simple; sorting and hashing are the most basic topics in data structures. Having understood the basic idea of Join, how do we put it into practice? After all, talk is cheap. Before implementing Join in a project, some preliminary knowledge is required. Generally speaking, the core algorithm is the jewel in the crown, but the jewel alone is not enough; the crown is needed as its base.

3. Engineering Prerequisites for Join

3.1 SQL Processing Architecture - Volcano Model

Before implementing the Join algorithm, you need to understand the basic architecture a database uses to process data; only with that understanding can the Join algorithm be placed in the right position. The previous article in this series discussed parsing SQL statements with Antlr, where we saw that the operations supported by SQL syntax are very rich: scanning tables (TableScan), filtering data (Filter), sorting (Order), limiting (Limit), field computation (Project), aggregation (Group), association (Join), and so on. Supporting all of these requires a parallelizable and extensible architecture.

In 1994, Goetz Graefe proposed an architectural design in the paper "Volcano-An Extensible and Parallel Query Evaluation System": the famous volcano model, also known as the iterator model. The volcano model actually covers two parts, the file system and query processing; here we focus on the design of query processing. The architecture diagram is as follows:

(Figure: the volcano model query processing architecture. Source: "Balancing vectorized execution with bandwidth-optimized storage")

The key ideas:

Separation of duties: each operation is separated into its own Operator, and Operators follow the open-next-close iterator pattern.

For example, for the SQL:

 SELECT Id, Name, Age, (Age - 30) * 50 AS Bonus
FROM People
WHERE Age > 30

this corresponds to three Operators: Scan, Select, and Project, with data flowing between them through the next() function. The theory maps directly onto Presto; for example, the purpose of several commonly used Operators in Presto can be guessed from their names alone:

(Figure: commonly used Operator classes in Presto)

Dynamic assembly: Operators are assembled dynamically based on the parsed SQL statement, and multiple Operators form a pipeline.

For example: the two operators print and predicate form a pipeline:

(Figure: the print and predicate operators forming a pipeline. Source: "Volcano-An Extensible and Parallel Query Evaluation System")
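To make the open-next-close contract concrete, here is a toy, row-at-a-time sketch of the pipeline for the People query above. The interface and class names are hypothetical, not Presto's, and, as described next, Presto actually processes one Page at a time rather than one row at a time:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class VolcanoModelDemo {
    // The open-next-close iterator contract shared by every operator.
    interface Operator {
        void open();
        List<Object> next(); // returns one row, or null when the input is exhausted
        void close();
    }

    // Leaf operator: scans an in-memory "People" table with columns (Id, Name, Age).
    static class ScanOperator implements Operator {
        private final List<List<Object>> table;
        private Iterator<List<Object>> it;
        ScanOperator(List<List<Object>> table) { this.table = table; }
        public void open() { it = table.iterator(); }
        public List<Object> next() { return it.hasNext() ? it.next() : null; }
        public void close() { it = null; }
    }

    // Select operator: WHERE Age > 30, pulling rows from its child via next().
    static class SelectOperator implements Operator {
        private final Operator child;
        SelectOperator(Operator child) { this.child = child; }
        public void open() { child.open(); }
        public List<Object> next() {
            for (List<Object> r = child.next(); r != null; r = child.next()) {
                if ((Integer) r.get(2) > 30) {
                    return r;
                }
            }
            return null;
        }
        public void close() { child.close(); }
    }

    // Project operator: SELECT Id, Name, Age, (Age - 30) * 50 AS Bonus.
    static class ProjectOperator implements Operator {
        private final Operator child;
        ProjectOperator(Operator child) { this.child = child; }
        public void open() { child.open(); }
        public List<Object> next() {
            List<Object> r = child.next();
            if (r == null) {
                return null;
            }
            int age = (Integer) r.get(2);
            return Arrays.asList(r.get(0), r.get(1), age, (age - 30) * 50);
        }
        public void close() { child.close(); }
    }

    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    public static void main(String[] args) {
        List<List<Object>> people = List.of(row(1, "Ann", 25), row(2, "Bob", 40));
        // Operators are assembled dynamically into a pipeline: Project -> Select -> Scan.
        Operator root = new ProjectOperator(new SelectOperator(new ScanOperator(people)));
        root.open();
        for (List<Object> r = root.next(); r != null; r = root.next()) {
            System.out.println(r); // [2, Bob, 40, 500]
        }
        root.close();
    }
}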

On the basis of the volcano model, Presto absorbs other ideas from the database field and optimizes the basic model, mainly in the following ways:

  1. Operators process one Page at a time instead of one row (also called a tuple) at a time.
  2. Pages are stored in a columnar layout; that is, the values of the same column are packed into a Block.

Batch processing combined with columnar storage lays the foundation for vectorized computing. This is also the direction of optimization in the database field.

3.2 Batch Processing and Columnar Storage

When studying the Presto source code, Page/Block can be seen almost everywhere, so understanding the ideas behind Page and Block is the basis for understanding how Presto is implemented. Books and documents explain the Page/Block concepts, but because they are mixed with other concepts, they are not easy to grasp at first.

In the author's view, Type, Block, and Page are easier to understand when put together. To use a database we usually need to define a table with field names and field types. In a traditional DBMS, data is stored row by row, typically with the following structure:

(Figure: row-oriented storage layout. Source: "Database System Implementation")

But OLAP scenarios usually do not need to read all fields, and columnar storage was derived for exactly such scenarios. That gives the structure below:

(Figure: the columnar Page/Block layout. Source: "Presto Technology Insider")

That is, each field corresponds to a Block, and a slice across multiple Blocks at the same position is one record, the so-called row, called a tuple in some papers. The comparison makes it clear that in Presto a Page is a typical columnar-storage implementation. Consequently, each Type in Presto is associated with a corresponding Block implementation; for example, the bigint type corresponds to LongArrayBlockBuilder and the varchar type corresponds to VariableWidthBlock.

After understanding the principle, it becomes very simple to operate Page/Block. The simple demo code is as follows:

 import com.facebook.presto.common.Page;
import com.facebook.presto.common.PageBuilder;
import com.facebook.presto.common.block.Block;
import com.facebook.presto.common.block.BlockBuilder;
import com.facebook.presto.common.type.BigintType;
import com.facebook.presto.common.type.Type;
import com.facebook.presto.common.type.VarcharType;
import com.google.common.collect.Lists;
import io.airlift.slice.Slice;
 
import java.util.List;
 
import static io.airlift.slice.Slices.utf8Slice;
 
/**
 * PageBlockDemo
 *
 * @version 1.0
 * @since 2021/6/22 19:26
 */
public class PageBlockDemo {
 
    private static Page buildPage(List<Type> types,List<Object[]> dataSet){
        PageBuilder pageBuilder = new PageBuilder(types);
        // Pack the row-oriented data into a Page
        for(Object[] row:dataSet){
            // Declare the start of a new row (position)
            pageBuilder.declarePosition();
            for (int column = 0; column < types.size(); column++) {
                BlockBuilder out =  pageBuilder.getBlockBuilder(column);
 
                Object colVal = row[column];
                if(colVal == null){
                    out.appendNull();
                }else{
                    Type type = types.get(column);
                    Class<?> javaType = type.getJavaType();
                    if(javaType == long.class){
                        type.writeLong(out,(long)colVal);
                    }else if(javaType == Slice.class){
                        type.writeSlice(out, utf8Slice((String)colVal));
                    }else{
                        throw new UnsupportedOperationException("not implemented");
                    }
                }
            }
        }
        // Build the Page
        Page page = pageBuilder.build();
        pageBuilder.reset();
        return page;
    }
 
    private static void readColumn(List<Type> types,Page page){
        // Read each column (Block) from the Page
        for(int column=0;column<types.size();column++){
            Block block = page.getBlock(column);
            Type type = types.get(column);
            Class<?> javaType = type.getJavaType();
 
            System.out.print("column["+type.getDisplayName()+"]>>");
            List<Object> colList = Lists.newArrayList();
            for(int pos=0;pos<block.getPositionCount();pos++){
                if(javaType == long.class){
                    colList.add(block.getLong(pos));
                }else if(javaType == Slice.class){
                    colList.add(block.getSlice(pos,0,block.getSliceLength(pos)).toStringUtf8());
                }else{
                    throw new UnsupportedOperationException("not implemented");
                }
            }
            System.out.println(colList);
        }
    }
 
    public static void main(String[] args) {
        /**
         * Assume two fields: one of type bigint and one of type varchar
         */
        List<Type> types = Lists.newArrayList(BigintType.BIGINT, VarcharType.VARCHAR);
 
        // Row-oriented source data
        List<Object[]> dataSet = Lists.newArrayList(
                new Object[]{1L,"aa"},
                new Object[]{2L,"ba"},
                new Object[]{3L,"cc"},
                new Object[]{4L,"dd"});
 
        Page page = buildPage(types, dataSet);
 
        readColumn(types,page);
 
    }
}
// Output:
//column[bigint]>>[1, 2, 3, 4]
//column[varchar]>>[aa, ba, cc, dd]

Encapsulating data into Pages and passing them between operators avoids the cost of serializing and deserializing objects on the one hand, and reduces function-call overhead compared with passing one tuple at a time on the other. This is similar to how containerized freight reduces transportation costs.

4. Engineering Practice of Join Algorithm

After understanding the core algorithms and the infrastructure of Join, and combining them with the SQL expression parsing and where-condition filtering implemented with Antlr in the previous article, we already have the basic conditions for implementing Join. Next, I will briefly describe how the Join algorithm is put into practice. Since the purpose of this article is to study the algorithm implementation process rather than to deliver a complete Join feature, we first consider only an equi-join between two tables on a single field.

First of all, the syntax must support Join. The key points of the Antlr-based grammar definition are as follows:

 querySpecification
    : SELECT  selectItem (',' selectItem)*
      (FROM relation (',' relation)*)?
      (WHERE where=booleanExpression)?
    ;
 
selectItem
    : expression  #selectSingle
    ;
 
relation
    : left=relation
      (
        joinType JOIN rightRelation=relation joinCriteria
      )                                           #joinRelation
    | sampledRelation                             #relationDefault
    ;
 
joinType
    : INNER?
    ;
 
joinCriteria
    : ON booleanExpression
    ;

The above grammar definition breaks the key elements of Join down very clearly: the left table, the join type, the JOIN keyword, the right table, and the join condition. A typical simple join statement looks like this (using Presto's tpch data source):

 select t2.custkey, t2.phone, t1.orderkey from orders t1 inner join customer t2 on t1.custkey=t2.custkey limit 10;

Mapping the grammar onto the SQL statement, the following details must be considered when implementing the Join algorithm:

  • Validate the SQL statement to ensure it conforms to the grammar.
  • Sort out the correspondence between table aliases and fields, so that every queried field maps to a table and the field types in the Join condition match.
  • The choice of Join algorithm: Hash Join, Nested Loop Join, or Sort Merge Join?
  • Which table is the build table and which is the probe table?
  • How is the join condition evaluated?
  • How are the Operators of the whole query assembled to produce the final result?

Let's review the key flow of SQL execution:

(Figure: SQL execution flow. Source: Query Execution Flow Architecture (SQL Server))

Based on this process, the questions above are essentially answered:

  • Parser: with Antlr, SQL syntax validation can be implemented.
  • Binding: generate the AST from the SQL statement and use metadata to check the mapping between fields and tables and the field types of the Join condition.
  • Planner: generate the query plan from the AST.
  • Executor: generate the corresponding Operators from the query plan and execute them.

Take the Nested Loop Join algorithm as an example to understand Presto's implementation. In Presto, Nested Loop Join is actually split into two stages: a combination stage and a filtering stage. The join operator itself is only responsible for producing the Cartesian product of the two tables' data. The core code is as follows:

// NestedLoopPageBuilder implements the logic of computing the Cartesian product of two Pages.
// RunLengthEncodedBlock is used here to replicate a single value: a typical Cartesian product
// needs to copy one column value from a single row into many rows.
@Override
public Page next()
{
    if (!hasNext()) {
        throw new NoSuchElementException();
    }
 
    if (noColumnShortcutResult >= 0) {
        rowIndex = maxRowIndex;
        return new Page(noColumnShortcutResult);
    }
 
    rowIndex++;
 
    // Create an array of blocks for all columns in both pages.
    Block[] blocks = new Block[numberOfProbeColumns + numberOfBuildColumns];
 
    // Make sure we always put the probe data on the left and build data on the right.
    int indexForRleBlocks = buildPageLarger ? 0 : numberOfProbeColumns;
    int indexForPageBlocks = buildPageLarger ? numberOfProbeColumns : 0;
 
    // For the page with less rows, create RLE blocks and add them to the blocks array
    for (int i = 0; i < smallPage.getChannelCount(); i++) {
        Block block = smallPage.getBlock(i).getSingleValueBlock(rowIndex);
        blocks[indexForRleBlocks] = new RunLengthEncodedBlock(block, largePage.getPositionCount());
        indexForRleBlocks++;
    }
 
    // Put the page with more rows in the blocks array
    for (int i = 0; i < largePage.getChannelCount(); i++) {
        blocks[indexForPageBlocks + i] = largePage.getBlock(i);
    }
 
    return new Page(largePage.getPositionCount(), blocks);
}

5. Summary

This article has briefly summarized the basic Join algorithms and the basic framework in which they are implemented in Presto, and used the Nested Loop Join algorithm as an example to demonstrate the key points of the implementation in Presto. Compared with the textbook description of the algorithm, Presto's engineering implementation is very different: it supports the full Join semantics and achieves distributed execution, which involves thinking at both the architecture level and the performance level and is well worth studying. As far as the Join algorithm is concerned, there are still many points to explore, such as join order selection for multi-table joins, optimizations for joins between large and small tables, optimization of the Semi Join algorithm, and data skew in Join; there is still a long way to go. We will continue to analyze and explore these topics in subsequent articles of this series.

6. References

  1. Presto source code
  2. Join Processing in Relational Databases
  3. Volcano-An Extensible and Parallel Query Evaluation System
