This article introduces how TiFlash implements the commonly used Join and Aggregation operators, covering query plan generation, the compilation stage, and the execution stage, so that readers can gain a preliminary understanding of TiFlash operators.
Video: https://www.bilibili.com/video/BV1tt4y1875T
Operator Summary
Before reading this article, it is recommended to read the previous post in this series, Overview of the Computing Layer, to gain a basic understanding of the TiFlash computing layer and the MPP framework.
In a database system, operators are where the main logic of SQL is executed. A piece of SQL will be parsed into an operator tree (query plan) by the parser, then optimized by the optimizer, and then handed over to the corresponding executor for execution, as shown in the following figure.
The main contents of this article include:
- How TiDB Generates and Optimizes MPP Operators and Query Plans
- Compilation and execution of the Join operator in TiFlash ("compilation" refers to the process of turning the execution plan fragments sent by TiDB-server into executable structures, the same below)
- Compilation and execution of the Aggregation operator in TiFlash
Build a query plan
Some background knowledge:
- Logical plan and physical plan: roughly speaking, a logical plan describes what an operator does, while a physical plan describes how it does it. For example, "read data from table a and table b, then join them" describes a logical plan, while "do a shuffle hash join in TiFlash" describes a physical plan. For more information, please refer to the TiDB source code reading series of articles.
- MPP: Massively Parallel Processing, generally used to describe parallel computation in which data can be exchanged between nodes. In the current version of TiDB (6.1.0, the same below), MPP computation happens only on TiFlash nodes. Recommended reading: Source code interpretation - TiFlash computing layer overview. MPP is a concept at the physical plan level.
MPP query plan
In TiDB, you can prefix a SQL statement with explain to view its query plan. As shown in the figure below, the plan is a tree composed of physical operators. See the TiDB execution plan overview to learn more.
What is unique about an MPP query plan is that it contains additional ExchangeSender and ExchangeReceiver operators for data exchange.
A pattern like the following appears in the execution plan, indicating that data transmission and exchange happen at this point.
...
|_ExchangeReceiver_xx
|_ ExchangeSender_xx
…
Each ExchangeSender has an ExchangeType that identifies the kind of data transfer it performs (a routing sketch follows this list):
- HashPartition: partitions data by hash value and then distributes it to the upstream nodes.
- Broadcast: makes copies of its own data and broadcasts them to all upstream nodes.
- PassThrough: transmits all of its own data to a designated node. Here the receiver can be a TiFlash node (ExchangeReceiver) or a TiDB-server node (TableReader); the latter means the MPP computation is complete and the data is being returned to TiDB-server.
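To make the three types concrete, here is a minimal, hypothetical C++ sketch of how a sender could route rows for each ExchangeType; the Row and SendFn types are illustrative stand-ins, not TiFlash's actual MPP tunnel interfaces.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified illustration (not actual TiFlash code) of routing by ExchangeType.
enum class ExchangeType { HashPartition, Broadcast, PassThrough };

struct Row { int64_t partition_key; std::string payload; };

// send() stands in for the real network transport to a given receiver.
using SendFn = std::function<void(size_t receiver_idx, const Row &)>;

void dispatch(ExchangeType type, const std::vector<Row> & rows,
              size_t receiver_count, const SendFn & send)
{
    switch (type)
    {
    case ExchangeType::HashPartition:
        // Each row goes to exactly one receiver, chosen by the hash of its partition key.
        for (const auto & row : rows)
            send(std::hash<int64_t>{}(row.partition_key) % receiver_count, row);
        break;
    case ExchangeType::Broadcast:
        // Every receiver gets a full copy of the data.
        for (size_t i = 0; i < receiver_count; ++i)
            for (const auto & row : rows)
                send(i, row);
        break;
    case ExchangeType::PassThrough:
        // All data goes to the single designated receiver (TiFlash or TiDB-server).
        for (const auto & row : rows)
            send(0, row);
        break;
    }
}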
In the query plan above there are three ExchangeSenders, with ids 19, 13 and 17. ExchangeSender_13 and ExchangeSender_17 shuffle the data they read to all nodes according to hash values for the join, while ExchangeSender_19 returns the joined data to the TiDB-server node.
Add Exchange
During the plan exploration process of the optimizer, there are two places where Exchange operators are inserted into the query plan tree:
- One is after the MPP plan exploration finishes, when the plan connects to TiDB's TableReader; the exchange type is PassThrough. The source code is in the function
func (t *mppTask) convertToRootTaskImpl
- The other is when, during exploration, the MPP plan finds that the property of the current operator (here mainly the partition property) does not meet the requirement of the upper layer. For example, the upper layer requires the data to be partitioned by the hash value of column a, but the lower operator cannot satisfy this, so a group of Exchange operators is inserted:
func (t *mppTask) enforceExchanger(prop *property.PhysicalProperty) *mppTask {
    if !t.needEnforceExchanger(prop) {
        return t
    }
    return t.copy().(*mppTask).enforceExchangerImpl(prop)
}

// t.partTp is the partition type the current operator already provides;
// prop is the partition type required by the parent operator.
func (t *mppTask) needEnforceExchanger(prop *property.PhysicalProperty) bool {
    switch prop.MPPPartitionTp {
    case property.AnyType:
        return false
    case property.BroadcastType:
        return true
    case property.SinglePartitionType:
        return t.partTp != property.SinglePartitionType
    default:
        if t.partTp != property.HashType {
            return true
        }
        if len(prop.MPPPartitionCols) != len(t.hashCols) {
            return true
        }
        for i, col := range prop.MPPPartitionCols {
            if !col.Equal(t.hashCols[i]) {
                return true
            }
        }
        return false
    }
}
The requirements on the partition property (MPPPartitionTp) are as follows:
- AnyType: no requirement on the lower operator, so no exchange needs to be added;
- BroadcastType: used for broadcast join; requires the lower node to copy its data and broadcast it to all nodes, so a broadcast exchange must be added;
- SinglePartitionType: requires the lower nodes to gather their data onto the same node; if the data is already on a single node, no exchange is needed;
- HashType: requires the lower nodes to partition the data by the hash values of specific columns; if the data is already partitioned as required, no exchange is needed.
While the optimizer explores the query plan, each operator imposes property requirements on the layer below it and must also satisfy the properties required by the layer above; whenever the properties of the two layers cannot be matched, an exchange operator is inserted to redistribute the data. Relying on these properties, exchange operators can always be inserted exactly where they are needed.
MPP algorithm
Whether to use an MPP algorithm is decided when the TiDB optimizer generates the physical plan, that is, in the CBO (Cost-Based Optimization) stage. The optimizer traverses all available plan paths, including plans with and without MPP algorithms, estimates their costs, and selects the query plan with the smallest total cost.
For the current TiDB repo code, there are four locations that can trigger the generation of MPP plans, corresponding to the four operators of join, agg, window function, and projection:
- func (p *LogicalJoin) tryToGetMppHashJoin
- func (la *LogicalAggregation) tryToGetMppHashAggs
- func (lw *LogicalWindow) tryToGetMppWindows
- func (p *LogicalProjection) exhaustPhysicalPlans
Only the representative join and agg operators are described here; the same ideas apply to the other operators.
Join
Currently TiDB supports two MPP Join algorithms:
- Shuffle Hash Join: the data of both tables is partitioned by the hash of the join key and shuffled to every node, and then a hash join is performed, as in the query plan shown in the previous section.
- Broadcast Join: the small table is broadcast to every node where the large table resides, and then a hash join is performed, as shown in the figure below.
The tryToGetMppHashJoin function gives the property requirements for the sub-operator when constructing the join operator:
if useBCJ { // broadcastJoin
    …
    childrenProps[buildside] = {MPPPartitionTp: BroadcastType}
    childrenProps[1-buildside] = {MPPPartitionTp: AnyType}
    …
} else { // shuffle hash join
    …
    childrenProps[0] = {MPPPartitionTp: HashType, key: leftKeys}
    childrenProps[1] = {MPPPartitionTp: HashType, key: rightKeys}
    …
}
As shown in the code, broadcast join requires the build side (here, the small table to be broadcast) to provide the BroadcastType property and places no requirement on the large-table side, while shuffle hash join requires both sides to provide the HashType partition property, with the left keys and right keys as the respective partition columns.
Aggregation
Currently tryToGetMppHashAggs may generate three kinds of MPP aggregation plans:
1. One-stage agg: the data is first partitioned by the group by key and then aggregated.
2. Two-stage agg: a first-stage aggregation is performed on the local node, then the data is partitioned by the group by key and aggregated again (the partial results are combined with sum).
3. Scalar agg: the special case with no partition (group by) columns; the first-stage aggregation is performed on the local node, and the partial results are then gathered onto a single node for the second-stage aggregation.
The difference between one-stage and two-stage agg is whether a pre-aggregation is done on the local node first; the optimizer chooses between them based on the SQL and cost estimation. When there are many repeated values, the two-stage agg greatly reduces the amount of data before network transmission and therefore saves a lot of network cost; when there are few repeated values, the pre-aggregation does not reduce the data much and only wastes extra CPU and memory, so one-stage agg is the better choice.
Here is a small thought question: what partition property does each of these three aggs require from the operators below it, and what property is satisfied after the aggregation completes?
The answer is:
One-stage agg requires a hash partition and satisfies a hash partition when it finishes; two-stage agg has no requirement and satisfies a hash partition when it finishes; scalar agg has no requirement and satisfies SinglePartition when it finishes.
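To make the one-stage/two-stage trade-off concrete, here is a small self-contained sketch (not TiFlash code) of a two-stage count: each node pre-aggregates locally, so at most one row per distinct group key crosses the network, and the second stage combines the partial counts with a sum.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

using PartialAgg = std::unordered_map<int64_t, int64_t>; // group key -> partial count

// Stage 1: runs on every node over its local rows.
PartialAgg localCount(const std::vector<int64_t> & group_keys)
{
    PartialAgg partial;
    for (int64_t k : group_keys)
        ++partial[k];
    return partial;
}

// Stage 2: runs after shuffling partial results by group key;
// the partial counts are combined with sum, not count.
PartialAgg mergeCounts(const std::vector<PartialAgg> & partials)
{
    PartialAgg merged;
    for (const auto & partial : partials)
        for (const auto & [key, cnt] : partial)
            merged[key] += cnt;
    return merged;
}

int main()
{
    // Two nodes, many repeated keys: each node ships at most 2 rows instead of 5.
    auto node1 = localCount({1, 1, 1, 2, 2});
    auto node2 = localCount({1, 2, 2, 2, 2});
    for (const auto & [key, cnt] : mergeCounts({node1, node2}))
        std::cout << "key=" << key << " count=" << cnt << '\n';
}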
Compilation and execution
After the execution plan is built, TiDB-server sends the DAG (a fragment of the execution plan) to the corresponding TiFlash nodes. On a TiFlash node, these plan fragments must first be parsed; this process is called "compiling". The result of compilation is a set of BlockInputStreams, the executable structures in TiFlash. The final step is to execute these BlockInputStreams.
The figure below is an example of a BlockInputStream DAG. Every BlockInputStream has three methods: readPrefix, read and readSuffix, analogous to the open, next and close calls of other volcano-model executors.
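The pull loop below is a minimal sketch of this volcano-style interface, assuming a simplified stand-in for TiFlash's IBlockInputStream; an empty Block signals the end of the stream.
#include <cstdint>
#include <memory>
#include <vector>

struct Block
{
    std::vector<int64_t> column;                              // real Blocks hold many typed columns
    explicit operator bool() const { return !column.empty(); } // empty block = end of stream
};

struct BlockInputStream
{
    virtual ~BlockInputStream() = default;
    virtual void readPrefix() {}  // open: prepare resources, start children
    virtual Block read() = 0;     // next: return one batch of rows
    virtual void readSuffix() {}  // close: release resources
};
using BlockInputStreamPtr = std::shared_ptr<BlockInputStream>;

// The root of the DAG is driven like this: batches are pulled from the top,
// and each stream in turn pulls from its children inside read().
void drive(const BlockInputStreamPtr & root)
{
    root->readPrefix();
    while (Block block = root->read())
    {
        // consume the block: hand it to an ExchangeSender, return it to the client, etc.
        (void)block;
    }
    root->readSuffix();
}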
The figure below comes from "The thread model of the TiFlash executor", a Zhihu column (zhihu.com). For more information on the execution model, refer to that article or the TiFlash Overview; it will not be repeated here.
Compile and execute Join
The TiDB-server node splits the query plan into plan fragments (tasks) at the Exchange boundaries and sends each fragment to TiFlash nodes as a DAG. For example, the query plan shown in the figure below is split into the three red boxes.
The BlockInputStreams generated by the TiFlash nodes after compilation are as follows; they can be seen in the debug log:
task 1
ExchangeSender
Expression: <final projection>
Expression: <projection after push down filter>
Filter: <push down filter>
DeltaMergeSegmentThread
task 2
ExchangeSender
Expression: <final projection>
Expression: <projection after push down filter>
Filter: <push down filter>
DeltaMergeSegmentThread
task 3
CreatingSets
Union: <for join>
HashJoinBuildBlockInputStream x 20: <join build, build_side_root_executor_id = ExchangeReceiver_15>, join_kind = Inner
Expression: <append join key and join filters for build side>
Expression: <final projection>
Squashing: <squashing after exchange receiver>
TiRemoteBlockInputStream(ExchangeReceiver): schema: {<exchange_receiver_0, Nullable(Int32)>, <exchange_receiver_1, Nullable(Int32)>}
Union: <for mpp>
ExchangeSender x 20
Expression: <final projection>
Expression: <remove useless column after join>
HashJoinProbe: <join probe, join_executor_id = HashJoin_34>
Expression: <final projection>
Squashing: <squashing after exchange receiver>
TiRemoteBlockInputStream(ExchangeReceiver): schema: {<exchange_receiver_0, Nullable(Int32)>, <exchange_receiver_1, Nullable(Int32)>}
Task 1 and task 2 read data from the storage layer and, after some simple processing, send it to the ExchangeSender. In task 3, three BlockInputStreams are worth paying attention to: CreatingSets, HashJoinBuild, and HashJoinProbe.
CreatingSetsBlockInputStream
It accepts one data BlockInputStream representing the join probe side and several Subqueries representing the join builds. CreatingSets starts these Subqueries concurrently and waits for them to finish before starting the data InputStream. The two figures below show the call stacks of the readPrefix and read functions of CreatingSets.
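A minimal sketch of that behaviour, with the build subqueries modelled as plain callables rather than TiFlash's real Subquery objects, might look like this:
#include <functional>
#include <thread>
#include <vector>

void createAllHashTables(const std::vector<std::function<void()>> & build_subqueries)
{
    std::vector<std::thread> workers;
    workers.reserve(build_subqueries.size());
    for (const auto & build : build_subqueries)
        workers.emplace_back(build); // each subquery fills one hash table
    for (auto & w : workers)
        w.join();                    // block until every hash table is ready
    // Only now may the probe-side BlockInputStream be read.
}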
Why might CreatingSets create multiple hash tables at the same time? Because in a multi-table join, the same plan fragment may be followed by multiple join probes, as shown in the following plan:
task:4
CreatingSets
Union x 2: <for join>
HashJoinBuildBlockInputStream x 20: <join build, build_side_root_executor_id = ExchangeReceiver_22>, join_kind = Left
Expression: <append join key and join filters for build side>
Expression: <final projection>
Squashing: <squashing after exchange receiver>
TiRemoteBlockInputStream(ExchangeReceiver): schema: {<exchange_receiver_0, Nullable(Int32)>, <exchange_receiver_1, Nullable(Int32)>}
Union: <for mpp>
ExchangeSender x 20
Expression: <final projection>
Expression: <remove useless column after join>
HashJoinProbe: <join probe, join_executor_id = HashJoin_50>
Expression: <final projection>
Expression: <remove useless column after join>
HashJoinProbe: <join probe, join_executor_id = HashJoin_14>
Expression: <final projection>
Squashing: <squashing after exchange receiver>
TiRemoteBlockInputStream(ExchangeReceiver): schema: {<exchange_receiver_0, Nullable(Int32)>, <exchange_receiver_1, Nullable(Int32)>}
Join Build
Note that "join" here refers only to the hash join itself and has nothing to do with network communication or MPP-level algorithms.
The join code is in dbms/src/Interpreters/Join.cpp; we use the following join of two tables as an example:
left_table l join right_table r
on l.join_key=r.join_key
where l.b>=r.c
By default, the right table is used as the build side and the left table as the probe side. The values of the hash table are stored using chaining:
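A minimal sketch of this build phase under the same assumptions (simplified types, right table as build side) could look like the following; rows sharing a join key are chained under one hash-table entry.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BuildRow { int64_t join_key; int64_t c; };                     // right table: (join_key, c)
using HashTable = std::unordered_map<int64_t, std::vector<BuildRow>>; // key -> chain of rows

HashTable buildHashTable(const std::vector<BuildRow> & right_rows)
{
    HashTable table;
    for (const auto & row : right_rows)
        table[row.join_key].push_back(row); // append to the chain for this key
    return table;
}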
Join Probe
This section mainly describes the flow of the JoinBlockImpl function:
1. The block initially contains the columns of the left table. Create added_columns, the right-table columns to be appended to the block, and create the corresponding filter and replicate_offsets arrays: they record whether and how many rows have been matched so far, and are later used to filter out unmatched rows or to replicate rows that match more than one right-table row.
2. Look up the hash table row by row, call addFound or addNotFound according to the result, and fill in added_columns, the filter and replicate_offsets.
From the filling process it can be seen that replicate_offsets[i] is the cumulative number of matched right-table rows up to and including the i-th left-table row, so replicate_offsets[i] - replicate_offsets[i-1] is the number of right-table rows matched by the i-th left-table row.
3. added_columns is spliced directly onto the block; at this point the column lengths of the block are temporarily inconsistent.
4. According to the filter and replicate_offsets, rows of the original left table are replicated or filtered out so that the lengths match again.
5. Finally, the other condition is evaluated on the block, producing the join result (a sketch of steps 2-4 follows this list).
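The following self-contained sketch of steps 2-4 for an inner join shows how the filter and replicate_offsets arrays could be filled and then used to replicate or drop left-table rows; the types are simplified stand-ins for TiFlash's Block and Join classes, not the real ones.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Chain = std::vector<int64_t>;                   // matched right-table values for one key
using HashTable = std::unordered_map<int64_t, Chain>; // join_key -> chain of right rows

struct ProbeResult
{
    std::vector<int64_t> left_keys;  // left rows, replicated once per match
    std::vector<int64_t> right_vals; // matched right-table column (added_columns)
};

ProbeResult probe(const std::vector<int64_t> & left_keys, const HashTable & table)
{
    size_t rows = left_keys.size();
    std::vector<uint8_t> filter(rows, 0);           // 1 = at least one match
    std::vector<size_t> replicate_offsets(rows, 0); // cumulative number of matches
    std::vector<int64_t> added_column;              // right-table values, in match order

    for (size_t i = 0; i < rows; ++i)
    {
        size_t prev = (i == 0) ? 0 : replicate_offsets[i - 1];
        auto it = table.find(left_keys[i]);
        if (it != table.end()) // addFound: append every chained right row
        {
            for (int64_t v : it->second)
                added_column.push_back(v);
            filter[i] = 1;
            replicate_offsets[i] = prev + it->second.size();
        }
        else                   // addNotFound: for an inner join this row produces no output
        {
            replicate_offsets[i] = prev;
        }
    }

    // Replicate or drop left rows so they line up with added_column:
    // row i appears replicate_offsets[i] - replicate_offsets[i-1] times.
    // Other join types use the filter differently to handle unmatched rows.
    ProbeResult result;
    for (size_t i = 0; i < rows; ++i)
    {
        if (!filter[i])
            continue;
        size_t prev = (i == 0) ? 0 : replicate_offsets[i - 1];
        for (size_t k = prev; k < replicate_offsets[i]; ++k)
            result.left_keys.push_back(left_keys[i]);
    }
    result.right_vals = std::move(added_column);
    return result;
}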
The description above covers the normal "all" join case, which needs to return data from both the left and right tables. The opposite is the "any" join, i.e. semi-join: only the left-table data needs to be returned, so the auxiliary replicate_offsets array is not needed. Interested readers can read the code themselves, still in dbms/src/Interpreters/Join.cpp.
Aggregation compilation and execution
Again, let us take a query plan and the corresponding BlockInputStreams as an example:
task:1
ExchangeSender
Expression: <final projection>
Expression: <before order and select>
Aggregating
Concat
Expression: <before aggregation>
Expression: <projection>
Expression: <before projection>
Expression: <final projection>
DeltaMergeSegmentThread
task:2
Union: <for mpp>
ExchangeSender x 20
Expression: <final projection>
Expression: <projection>
Expression: <before projection>
Expression: <final projection>
SharedQuery: <restore concurrency>
ParallelAggregating, max_threads: 20, final: true
Expression x 20: <before aggregation>
Squashing: <squashing after exchange receiver>
TiRemoteBlockInputStream(ExchangeReceiver): schema: {<exchange_receiver_0, Int64>, <exchange_receiver_1, Nullable(Int64)>}
As the query plan shows, this is a two-stage agg. The first stage corresponds to task 1, where the aggregation is performed by the Aggregating BlockInputStream; the second stage corresponds to task 2, where the aggregation is performed by ParallelAggregating. Both tasks transfer data over the network through Exchange.
When aggregation is compiled, the parallelism provided by the current pipeline is checked: if it is 1, AggregatingBlockInputStream is used for single-threaded execution; if it is greater than 1, ParallelAggregatingBlockInputStream is used for parallel execution.
DAGQueryBlockInterpreter::executeAggregation() {
    if (pipeline.streams.size() > 1) {
        ParallelAggregatingBlockInputStream
    } else {
        AggregatingBlockInputStream
    }
}
The call stack of AggregatingBlockInputStream is as follows:
ParallelAggregatingBlockInputStream works in two phases internally (these two phases are an internal execution concept on a single node; they are not the same thing as the two stages in the query plan). In the partial phase, N threads each build their own hash table; in the merge phase, the N hash tables are merged and a single stream is output. The call stack is as follows:
If the result is empty, the executeOnBlock method is called once to generate default data, similar to how count() returns 0 when there is no input.
Both execution paths use the Aggregator's executeOnBlock and mergeAndConvertToBlocks methods; their call stacks are shown in the figures. The former is where the aggregate functions actually run, calling each function's add method to accumulate data values; the latter mainly merges the hash tables that ParallelAggregating built in parallel.
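The sketch below illustrates the partial/merge idea under simplified assumptions (a plain count per group key stands in for the real aggregate states handled by Aggregator): N threads each build a private hash table over their slice of the input, and the tables are merged afterwards.
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

using HashTable = std::unordered_map<int64_t, int64_t>; // group key -> aggregate state (a count here)

HashTable parallelAggregate(const std::vector<std::vector<int64_t>> & per_thread_input)
{
    size_t n = per_thread_input.size();
    std::vector<HashTable> partial(n);

    // Partial phase: N threads, N independent hash tables, no locking needed.
    std::vector<std::thread> workers;
    for (size_t i = 0; i < n; ++i)
        workers.emplace_back([&, i] {
            for (int64_t key : per_thread_input[i])
                ++partial[i][key];
        });
    for (auto & w : workers)
        w.join();

    // Merge phase: combine the N hash tables into a single result.
    HashTable merged;
    for (const auto & table : partial)
        for (const auto & [key, state] : table)
            merged[key] += state;
    return merged;
}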