头图
版本日期备注
1.02024.2.18文章首发
1.12024.6.30根据部分同学的疑问补充内容
1.22024.8.15优化部分内容
本文的的源码分析全部基于TiDB6.5来做分析。

1.引子

如果让你做一个分布式数据库的优化器,面对以下的SQL,你会想到什么好的方法去执行他们呢?

  • SELECT id, name FROM person WHERE age >= 18 or height > 180 limit 100;:从条件上看,我们看到条件其实是二选一的: age >= 18 or height > 180。基于这种情况,我们肯定会去选择有索引的数据,如果都有索引or都没有,那么肯定选择扫描行数最少的数据。如果有一些算子在里面的话,则额外需要考虑数据的就近原则——有些算子在部分架构下可以充分利用MPP的能力,而有些则不行。
  • SELECT orders.order_id, customers.customer_name, orders.order_date FROM orders LEFT JOIN customers ON orders.customer_id=customers.customer_id;分布式数据库中的join,最优的方式就是小表广播到大表数据所在的地方。那么首先我们得知道谁是小表,谁是大表。

2.统计信息收集

根据前面的两个例子,我们可以发现——如果我们需要找到SQL对应的最佳计划,我们会需要一些表的元数据,或者说是统计信息。从常规的角度来说,以下统计信息是必须收集的:

  • 表的总行数
  • 每列数据的平均大小
  • 每列数据的基数:即NDV(distinct value)
  • 列的NULL值个数

如果是事务型的(行式存储),那么还要考虑索引平均长度、值的分布等等。

如果是分析型的(列式存储),那么还需要考虑相关列的最大值、最小值等等。

而统计方式的收集,会有多种方式。主要是考虑资源和准确性之间的Trade off。常见的有:

  • TopN:相关数据出现次数前 n 的值。
  • 直方图:用于描述数据分布图的工具。按照数据的值大小进行分桶,并用一些简单的数据来描述每个桶,比如落在桶里的值的个数。
  • 2D直方图:由于直方图无法反应列之间的关联,可以用2D直方图(联合分布)做到,但空间占用也比较多。
  • Count-Min Sketch:类似哈希表加上计算器的实现。可以用很少的数据来描述全体数据的特性。
  • HyperLogLog:一种估算海量数据基数的方法。很多情况下,统计并不需要那么精确。工程方面要在使用资源和准确性里找平衡。因此有人提出用HLL,这是一种占用资源少,但能给出较为准确的近似结果的算法。

TiDB收集的统计信息见:https://docs.pingcap.com/zh/tidb/v6.5/statistics#%E7%9B%B4%E6...

3.代价的评估

一个SQL真正的物理执行计划可能是有多个的。在以统计信息为基础之上,我们还需要加入相应的权重,举个例子:

  1. 如果能够在内存中计算完成,就不用去反复发起本地IO
  2. 如果能在本地IO中完成,就不要去发起网络请求

等等...

这在TiDB的代码中也有对应的默认值。

DefOptCPUFactor                                = 3.0
DefOptCopCPUFactor                             = 3.0
DefOptTiFlashConcurrencyFactor                 = 24.0
DefOptNetworkFactor                            = 1.0
DefOptScanFactor                               = 1.5
DefOptDescScanFactor                           = 3.0
DefOptSeekFactor                               = 20.0
DefOptMemoryFactor                             = 0.001
DefOptDiskFactor                               = 1.5
DefOptConcurrencyFactor                        = 3.0
var defaultVer2Factors = costVer2Factors{
    TiDBTemp:      costVer2Factor{"tidb_temp_table_factor", 0.00},
    TiKVScan:      costVer2Factor{"tikv_scan_factor", 40.70},
    TiKVDescScan:  costVer2Factor{"tikv_desc_scan_factor", 61.05},
    TiFlashScan:   costVer2Factor{"tiflash_scan_factor", 11.60},
    TiDBCPU:       costVer2Factor{"tidb_cpu_factor", 49.90},
    TiKVCPU:       costVer2Factor{"tikv_cpu_factor", 49.90},
    TiFlashCPU:    costVer2Factor{"tiflash_cpu_factor", 2.40},
    TiDB2KVNet:    costVer2Factor{"tidb_kv_net_factor", 3.96},
    TiDB2FlashNet: costVer2Factor{"tidb_flash_net_factor", 2.20},
    TiFlashMPPNet: costVer2Factor{"tiflash_mpp_net_factor", 1.00},
    TiDBMem:       costVer2Factor{"tidb_mem_factor", 0.20},
    TiKVMem:       costVer2Factor{"tikv_mem_factor", 0.20},
    TiFlashMem:    costVer2Factor{"tiflash_mem_factor", 0.05},
    TiDBDisk:      costVer2Factor{"tidb_disk_factor", 200.00},
    TiDBRequest:   costVer2Factor{"tidb_request_factor", 6000000.00},
}

4.执行计划枚举与择优

当我们可以评估出物理执行计划的代价时,将会枚举所有可以用上物理执行计划,并在其中选择代价最小的物理执行计划。一般枚举分为两个流派:

  1. 自底向上:代表有System R。
  2. 自顶向下:代表有Cascade。

自底向上没法解决一个问题。当上层对下层的输出结果顺序感兴趣时,那就不能只能从底层的视角来寻找局部最优。

举个例子,多表Join。一开始两个表Join可能HashJoin代价很低,但是Join下一个表时,用MergeJoin从整体来看代价更低。从这个case来看,自底向上来做计划取优并不合适。

所以有了Cascade:

  1. 搜索方案是自顶向下的。这意味着它可以避免局部最优而导致全局不优。从Operator Tree 自顶向下遍历时,可以做一系列变换:
    • Implementation:逻辑算子可以转换成物理算子;例如Join转换成NestLoop或者HashJoin等
    • Exploration:逻辑算子可以做等价变换;例如交换Inner Join的两个子节点,即可枚举Join顺序

图片来自于:Cascades Optimizer——https://zhuanlan.zhihu.com/p/73545345

  1. 它是基于Volcano模型演进而来的。
  2. 用面向对象的方式进行建模,编写规则时不需要关心搜索过程。相比传统优化器中面向过程去一条条的编写,的确是很大的改进。

5.TiDB的实现

大致的代码调用链为:

-- session/session.go

\-- ExecuteStmt //SQL执行的入口

|-- executor/compiler.go

\-- Compile //将SQL变成可执行的计划

|--planner/planner/optmize.go

\-- Optimize //优化的入口

\-- optimize //这里有两个入口。一种是新的优化器入口,一种是老的优化器入口。根据配置来选择。最终都会返回最优的物理执行计划。

    |-- planner/cascades/optmize.go

        \--FindBestPlan 见5.1
    
        \-- onPhasePreprocessing //见5.3

        \-- implGroup

        |--planner/core/optmizer.go //见5.4

            \-- DoOptimize

            \-- physicalOptimize

            |--planner/core/find_best_task.go

            \-- findBestTask

            \-- enumeratePhysicalPlans4Task

            \-- compareTaskCost
    
            \-- getTaskPlanCost
    
            |-- planner/core/plan_cost_ver2.go

            \-- getPlanCost

5.1 逻辑优化

核心入口为:


// FindBestPlan is the optimization entrance of the cascades planner. The
// optimization is composed of 3 phases: preprocessing, exploration and implementation.
//
// ------------------------------------------------------------------------------
// Phase 1: Preprocessing
// ------------------------------------------------------------------------------
//
// The target of this phase is to preprocess the plan tree by some heuristic
// rules which should always be beneficial, for example Column Pruning.
//
// ------------------------------------------------------------------------------
// Phase 2: Exploration
// ------------------------------------------------------------------------------
//
// The target of this phase is to explore all the logically equivalent
// expressions by exploring all the equivalent group expressions of each group.
//
// At the very beginning, there is only one group expression in a Group. After
// applying some transformation rules on certain expressions of the Group, all
// the equivalent expressions are found and stored in the Group. This procedure
// can be regarded as searching for a weak connected component in a directed
// graph, where nodes are expressions and directed edges are the transformation
// rules.
//
// ------------------------------------------------------------------------------
// Phase 3: Implementation
// ------------------------------------------------------------------------------
//
// The target of this phase is to search the best physical plan for a Group
// which satisfies a certain required physical property.
//
// In this phase, we need to enumerate all the applicable implementation rules
// for each expression in each group under the required physical property. A
// memo structure is used for a group to reduce the repeated search on the same
// required physical property.
func (opt *Optimizer) FindBestPlan(sctx sessionctx.Context, logical plannercore.LogicalPlan) (p plannercore.PhysicalPlan, cost float64, err error) {
    logical, err = opt.onPhasePreprocessing(sctx, logical)
    if err != nil {
        return nil, 0, err
    }
    rootGroup := memo.Convert2Group(logical)
    err = opt.onPhaseExploration(sctx, rootGroup)
    if err != nil {
        return nil, 0, err
    }
    p, cost, err = opt.onPhaseImplementation(sctx, rootGroup)
    if err != nil {
        return nil, 0, err
    }
    err = p.ResolveIndices()
    return p, cost, err
}

注释+代码很干净,这里一共做三件事

  1. onPhasePreprocessing:注释很实在,说for example Column Pruning,结果里面真的只做了列裁剪。
  2. onPhaseExploration:探索所有逻辑等价存在的可能
  3. onPhaseImplementation:根据代价寻找最优的物理执行计划

这块官网的博客写的非常好,我就不重复了:https://cn.pingcap.com/blog/tidb-cascades-planner/

5.2 统计信息收集

这块代码主要集中在stats.go里,里面的核心数据结构是stats_info.go里的StatsInfo。调用链大致为:

|-- planner/cascades/optimizer.go

\--fillGroupStats

|-- planner/core/stats.go

\--DeriveStats
type LogicalPlan interface {
    Plan
    //......忽略一些代码
    // DeriveStats derives statistic info for current plan node given child stats.
    // We need selfSchema, childSchema here because it makes this method can be used in
    // cascades planner, where LogicalPlan might not record its children or schema.
    DeriveStats(childStats []*property.StatsInfo, selfSchema *expression.Schema, childSchema []*expression.Schema, colGroups [][]*expression.Column) (*property.StatsInfo, error)
    //......忽略一些代码
}

有很多结构体实现了这个方法:

  1. 收集统计信息主要是analyze.go#Next方法中调用的#analyzeWorker。

5.3 新版本 物理执行计划的选择

代码入口为:

// implGroup finds the best Implementation which satisfies the required
// physical property for a Group. The best Implementation should have the
// lowest cost among all the applicable Implementations.
//
// g:            the Group to be implemented.
// reqPhysProp: the required physical property.
// costLimit:   the maximum cost of all the Implementations.
func (opt *Optimizer) implGroup(g *memo.Group, reqPhysProp *property.PhysicalProperty, costLimit float64) (memo.Implementation, error) {
    groupImpl := g.GetImpl(reqPhysProp)
    if groupImpl != nil {
        if groupImpl.GetCost() <= costLimit {
            return groupImpl, nil
        }
        return nil, nil
    }
    // Handle implementation rules for each equivalent GroupExpr.
    var childImpls []memo.Implementation
    err := opt.fillGroupStats(g)
    if err != nil {
        return nil, err
    }
    outCount := math.Min(g.Prop.Stats.RowCount, reqPhysProp.ExpectedCnt)
    for elem := g.Equivalents.Front(); elem != nil; elem = elem.Next() {
        curExpr := elem.Value.(*memo.GroupExpr)
        impls, err := opt.implGroupExpr(curExpr, reqPhysProp)
        if err != nil {
            return nil, err
        }
        for _, impl := range impls {
            childImpls = childImpls[:0]
            for i, childGroup := range curExpr.Children {
                childImpl, err := opt.implGroup(childGroup, impl.GetPlan().GetChildReqProps(i), impl.GetCostLimit(costLimit, childImpls...))
                if err != nil {
                    return nil, err
                }
                if childImpl == nil {
                    impl.SetCost(math.MaxFloat64)
                    break
                }
                childImpls = append(childImpls, childImpl)
            }
            if impl.GetCost() == math.MaxFloat64 {
                continue
            }
            implCost := impl.CalcCost(outCount, childImpls...)
            if implCost > costLimit {
                continue
            }
            if groupImpl == nil || groupImpl.GetCost() > implCost {
                groupImpl = impl.AttachChildren(childImpls...)
                costLimit = implCost
            }
        }
    }
    // Handle enforcer rules for required physical property.
    for _, rule := range GetEnforcerRules(g, reqPhysProp) {
        newReqPhysProp := rule.NewProperty(reqPhysProp)
        enforceCost := rule.GetEnforceCost(g)
        childImpl, err := opt.implGroup(g, newReqPhysProp, costLimit-enforceCost)
        if err != nil {
            return nil, err
        }
        if childImpl == nil {
            continue
        }
        impl := rule.OnEnforce(reqPhysProp, childImpl)
        implCost := enforceCost + childImpl.GetCost()
        impl.SetCost(implCost)
        if groupImpl == nil || groupImpl.GetCost() > implCost {
            groupImpl = impl
            costLimit = implCost
        }
    }
    if groupImpl == nil || groupImpl.GetCost() == math.MaxFloat64 {
        return nil, nil
    }
    g.InsertImpl(reqPhysProp, groupImpl)
    return groupImpl, nil
}

这里个函数会找出潜在符合条件的物理执行计划,并不断的搜索。但它有一个剪枝逻辑——会记录当前最小的cost,如果一个执行计划自上向下搜索时,如果超过了这个cost,则直接返回。这就是在第3节提到的自顶向下的优化。

接下来我们看一下memo.Implementation的定义:


package memo

import (
    plannercore "github.com/pingcap/tidb/planner/core"
)

// Implementation defines the interface for cost of physical plan.
type Implementation interface {
    CalcCost(outCount float64, children ...Implementation) float64
    SetCost(cost float64)
    GetCost() float64
    GetPlan() plannercore.PhysicalPlan

    // AttachChildren is used to attach children implementations and returns it self.
    AttachChildren(children ...Implementation) Implementation

    // GetCostLimit gets the costLimit for implementing the next childGroup.
    GetCostLimit(costLimit float64, children ...Implementation) float64
}

其中CalcCost方法就是用来计算物理执行计划的代价。一共有这么多结构体实现了它:

5.3.1 代价的评估

我们以开头的例子,讲解代价的评估。

select代价计算方式
// CalcCost calculates the cost of the table scan Implementation.
func (impl *TableScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
    ts := impl.plan.(*plannercore.PhysicalTableScan)
    width := impl.tblColHists.GetTableAvgRowSize(impl.plan.SCtx(), impl.tblCols, kv.TiKV, true)
    sessVars := ts.SCtx().GetSessionVars()
    impl.cost = outCount * sessVars.GetScanFactor(ts.Table) * width
    if ts.Desc {
        impl.cost = outCount * sessVars.GetDescScanFactor(ts.Table) * width
    }
    return impl.cost
}

// GetScanFactor returns the session variable scanFactor
// returns 0 when tbl is a temporary table.
func (s *SessionVars) GetScanFactor(tbl *model.TableInfo) float64 {
    if tbl != nil {
        if tbl.TempTableType != model.TempTableNone {
            return 0
        }
    }
    return s.scanFactor 
}


// CalcCost implements Implementation interface.
func (impl *IndexScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
    is := impl.plan.(*plannercore.PhysicalIndexScan)
    sessVars := is.SCtx().GetSessionVars()
    rowSize := impl.tblColHists.GetIndexAvgRowSize(is.SCtx(), is.Schema().Columns, is.Index.Unique)
    cost := outCount * rowSize * sessVars.GetScanFactor(is.Table)
    if is.Desc {
        cost = outCount * rowSize * sessVars.GetDescScanFactor(is.Table)
    }
    cost += float64(len(is.Ranges)) * sessVars.GetSeekFactor(is.Table)
    impl.cost = cost
    return impl.cost
}

这里我们以全表扫描表和命中索引的代码为例子。其中默认的scanFactor是1.5。这里可以看到indexScan和tableScan少了一个因数:width。这个变量代表了所需列的平均大小。这么看基本上是indexScan最优了。

这里Desc的判断其实是个针对CPU流水线的优化。这里假设常规都是order by asc,这样就可以提升流水线的效率。
join代价计算方式
// CalcCost implements Implementation CalcCost interface.
func (impl *HashJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
    hashJoin := impl.plan.(*plannercore.PhysicalHashJoin)
    // The children here are only used to calculate the cost.
    hashJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
    selfCost := hashJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), false, 0, nil)
    impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
    return impl.cost
}

// CalcCost implements Implementation CalcCost interface.
func (impl *MergeJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
    mergeJoin := impl.plan.(*plannercore.PhysicalMergeJoin)
    // The children here are only used to calculate the cost.
    mergeJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
    selfCost := mergeJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), 0)
    impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
    return impl.cost
}

具体的计算都在plan_cost_v1.go里:


// GetCost computes cost of hash join operator itself.
func (p *PhysicalHashJoin) GetCost(lCnt, rCnt float64, isMPP bool, costFlag uint64, op *physicalOptimizeOp) float64 {
    buildCnt, probeCnt := lCnt, rCnt
    build := p.children[0]
    // Taking the right as the inner for right join or using the outer to build a hash table.
    if (p.InnerChildIdx == 1 && !p.UseOuterToBuild) || (p.InnerChildIdx == 0 && p.UseOuterToBuild) {
        buildCnt, probeCnt = rCnt, lCnt
        build = p.children[1]
    }
    sessVars := p.ctx.GetSessionVars()
    oomUseTmpStorage := variable.EnableTmpStorageOnOOM.Load()
    memQuota := sessVars.MemTracker.GetBytesLimit() // sessVars.MemQuotaQuery && hint
    rowSize := getAvgRowSize(build.statsInfo(), build.Schema().Columns)
    spill := oomUseTmpStorage && memQuota > 0 && rowSize*buildCnt > float64(memQuota) && p.storeTp != kv.TiFlash
    // Cost of building hash table.
    cpuFactor := sessVars.GetCPUFactor()
    diskFactor := sessVars.GetDiskFactor()
    memoryFactor := sessVars.GetMemoryFactor()
    concurrencyFactor := sessVars.GetConcurrencyFactor()

    cpuCost := buildCnt * cpuFactor
    memoryCost := buildCnt * memoryFactor
    diskCost := buildCnt * diskFactor * rowSize
    // Number of matched row pairs regarding the equal join conditions.
    helper := &fullJoinRowCountHelper{
        sctx:            p.SCtx(),
        cartesian:       false,
        leftProfile:     p.children[0].statsInfo(),
        rightProfile:    p.children[1].statsInfo(),
        leftJoinKeys:    p.LeftJoinKeys,
        rightJoinKeys:   p.RightJoinKeys,
        leftSchema:      p.children[0].Schema(),
        rightSchema:     p.children[1].Schema(),
        leftNAJoinKeys:  p.LeftNAJoinKeys,
        rightNAJoinKeys: p.RightNAJoinKeys,
    }
    numPairs := helper.estimate()
    // For semi-join class, if `OtherConditions` is empty, we already know
    // the join results after querying hash table, otherwise, we have to
    // evaluate those resulted row pairs after querying hash table; if we
    // find one pair satisfying the `OtherConditions`, we then know the
    // join result for this given outer row, otherwise we have to iterate
    // to the end of those pairs; since we have no idea about when we can
    // terminate the iteration, we assume that we need to iterate half of
    // those pairs in average.
    if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
        p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
        if len(p.OtherConditions) > 0 {
            numPairs *= 0.5
        } else {
            numPairs = 0
        }
    }
    if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
        numPairs = getOperatorActRows(p)
    }
    // Cost of querying hash table is cheap actually, so we just compute the cost of
    // evaluating `OtherConditions` and joining row pairs.
    probeCost := numPairs * cpuFactor
    probeDiskCost := numPairs * diskFactor * rowSize
    // Cost of evaluating outer filter.
    if len(p.LeftConditions)+len(p.RightConditions) > 0 {
        // Input outer count for the above compution should be adjusted by SelectionFactor.
        probeCost *= SelectionFactor
        probeDiskCost *= SelectionFactor
        probeCost += probeCnt * cpuFactor
    }
    diskCost += probeDiskCost
    probeCost /= float64(p.Concurrency)
    // Cost of additional concurrent goroutines.
    cpuCost += probeCost + float64(p.Concurrency+1)*concurrencyFactor
    // Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
    if p.UseOuterToBuild {
        if spill {
            // It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
            cpuCost += buildCnt * cpuFactor
        } else {
            cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
        }
        diskCost += buildCnt * diskFactor * rowSize
    }

    if spill {
        memoryCost *= float64(memQuota) / (rowSize * buildCnt)
    } else {
        diskCost = 0
    }
    if op != nil {
        setPhysicalHashJoinCostDetail(p, op, spill, buildCnt, probeCnt, cpuFactor, rowSize, numPairs,
            cpuCost, probeCost, memoryCost, diskCost, probeDiskCost,
            diskFactor, memoryFactor, concurrencyFactor,
            memQuota)
    }
    return cpuCost + memoryCost + diskCost
}


// GetCost computes cost of merge join operator itself.
func (p *PhysicalMergeJoin) GetCost(lCnt, rCnt float64, costFlag uint64) float64 {
    outerCnt := lCnt
    innerCnt := rCnt
    innerKeys := p.RightJoinKeys
    innerSchema := p.children[1].Schema()
    innerStats := p.children[1].statsInfo()
    if p.JoinType == RightOuterJoin {
        outerCnt = rCnt
        innerCnt = lCnt
        innerKeys = p.LeftJoinKeys
        innerSchema = p.children[0].Schema()
        innerStats = p.children[0].statsInfo()
    }
    helper := &fullJoinRowCountHelper{
        sctx:          p.SCtx(),
        cartesian:     false,
        leftProfile:   p.children[0].statsInfo(),
        rightProfile:  p.children[1].statsInfo(),
        leftJoinKeys:  p.LeftJoinKeys,
        rightJoinKeys: p.RightJoinKeys,
        leftSchema:    p.children[0].Schema(),
        rightSchema:   p.children[1].Schema(),
    }
    numPairs := helper.estimate()
    if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
        p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
        if len(p.OtherConditions) > 0 {
            numPairs *= 0.5
        } else {
            numPairs = 0
        }
    }
    if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
        numPairs = getOperatorActRows(p)
    }
    sessVars := p.ctx.GetSessionVars()
    probeCost := numPairs * sessVars.GetCPUFactor()
    // Cost of evaluating outer filters.
    var cpuCost float64
    if len(p.LeftConditions)+len(p.RightConditions) > 0 {
        probeCost *= SelectionFactor
        cpuCost += outerCnt * sessVars.GetCPUFactor()
    }
    cpuCost += probeCost
    // For merge join, only one group of rows with same join key(not null) are cached,
    // we compute average memory cost using estimated group size.
    NDV, _ := getColsNDVWithMatchedLen(innerKeys, innerSchema, innerStats)
    memoryCost := (innerCnt / NDV) * sessVars.GetMemoryFactor()
    return cpuCost + memoryCost
}

HashJoin要考虑到内存不够的情况,因此在计算到数据不够时,会将对应的数据压入硬盘。但它对数据的顺序并无要求,因此可以并发的去做。见:

    // Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
    if p.UseOuterToBuild {
        if spill {
            // It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
            cpuCost += buildCnt * cpuFactor
        } else {
            cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
        }
        diskCost += buildCnt * diskFactor * rowSize
    }

而MergeJoin的代价显然会更小,但能够选则到这个计划也有较高的要求:当两个关联表要 Join 的字段需要按排好的顺序读取时,适用 Merge Join 算法。

5.4 老版本 物理执行计划的选择

5.4.1 代价的评估

这块代码主要是在plan_cost_ver1.goplan_cost_ver2.go。v2对代价公式进行了更精确的回归校准,调整了部分代价公式,比此前版本的代价公式更加准确。代码上也更为简洁:v2只暴露出了一个公共方法,内部通过不同的类型做转发。

// GetPlanCost returns the cost of this plan.
func GetPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
    return getPlanCost(p, taskType, option)
}

func getPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
    if p.SCtx().GetSessionVars().CostModelVersion == modelVer2 {
        planCost, err := p.getPlanCostVer2(taskType, option)
        return planCost.cost, err
    }
    return p.getPlanCostVer1(taskType, option)
}

根据不同的PhysicalPlan类型,会找到不同绑定方法:

v1的部分方法展示:

select代价计算方式
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = child-cost + filter-cost
func (p *PhysicalSelection) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    inputRows := getCardinality(p.children[0], option.CostFlag)
    cpuFactor := getTaskCPUFactorVer2(p, taskType)

    filterCost := filterCostVer2(option, inputRows, p.Conditions, cpuFactor)

    childCost, err := p.children[0].getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }

    p.planCostVer2 = sumCostVer2(filterCost, childCost)
    p.planCostInit = true
    return p.planCostVer2, nil
}

这部分代码简单易读。代价就是子查询的代价+筛选的代价。

那么问题来了,中索引的和不中索引的代价应该是不一样的。这里没有体现出来啊。仔细看childCost, err := p.children[0].getPlanCostVer2(taskType, option),这里是会去获取子物理执行计划的代价。

// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
func (p *PointGetPlan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    if p.accessCols == nil { // from fast plan code path
        p.planCostVer2 = zeroCostVer2
        p.planCostInit = true
        return zeroCostVer2, nil
    }
    rowSize := getAvgRowSize(p.stats, p.schema.Columns)
    netFactor := getTaskNetFactorVer2(p, taskType)

    p.planCostVer2 = netCostVer2(option, 1, rowSize, netFactor)
    p.planCostInit = true
    return p.planCostVer2, nil
}

func netCostVer2(option *PlanCostOption, rows, rowSize float64, netFactor costVer2Factor) costVer2 {
    return newCostVer2(option, netFactor,
        rows*rowSize*netFactor.Value,
        func() string { return fmt.Sprintf("net(%v*rowsize(%v)*%v)", rows, rowSize, netFactor) })
}



// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = rows * log2(row-size) * scan-factor
// log2(row-size) is from experiments.
func (p *PhysicalTableScan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    rows := getCardinality(p, option.CostFlag)
    var rowSize float64
    if p.StoreType == kv.TiKV {
        rowSize = getAvgRowSize(p.stats, p.tblCols) // consider all columns if TiKV
    } else { // TiFlash
        rowSize = getAvgRowSize(p.stats, p.schema.Columns)
    }
    rowSize = math.Max(rowSize, 2.0)
    scanFactor := getTaskScanFactorVer2(p, p.StoreType, taskType)

    p.planCostVer2 = scanCostVer2(option, rows, rowSize, scanFactor)

    // give TiFlash a start-up cost to let the optimizer prefers to use TiKV to process small table scans.
    if p.StoreType == kv.TiFlash {
        p.planCostVer2 = sumCostVer2(p.planCostVer2, scanCostVer2(option, 10000, rowSize, scanFactor))
    }

    p.planCostInit = true
    return p.planCostVer2, nil
}

func scanCostVer2(option *PlanCostOption, rows, rowSize float64, scanFactor costVer2Factor) costVer2 {
    if rowSize < 1 {
        rowSize = 1
    }
    return newCostVer2(option, scanFactor,
        // rows * log(row-size) * scanFactor, log2 from experiments
        rows*math.Log2(rowSize)*scanFactor.Value,
        func() string { return fmt.Sprintf("scan(%v*logrowsize(%v)*%v)", rows, rowSize, scanFactor) })
}

scanFactor的代价默认是40.7,netFactor的代价默认是3.96。结合代码来看,命中索引的代价更优。

join代价计算方式
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = build-child-cost + build-filter-cost +
// (probe-cost + probe-filter-cost) / concurrency
// probe-cost = probe-child-cost * build-rows / batchRatio
func (p *PhysicalIndexJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 0)
}

func (p *PhysicalIndexHashJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 1)
}

func (p *PhysicalIndexMergeJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 2)
}

func (p *PhysicalIndexJoin) getIndexJoinCostVer2(taskType property.TaskType, option *PlanCostOption, indexJoinType int) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    build, probe := p.children[1-p.InnerChildIdx], p.children[p.InnerChildIdx]
    buildRows := getCardinality(build, option.CostFlag)
    buildRowSize := getAvgRowSize(build.Stats(), build.Schema().Columns)
    probeRowsOne := getCardinality(probe, option.CostFlag)
    probeRowsTot := probeRowsOne * buildRows
    probeRowSize := getAvgRowSize(probe.Stats(), probe.Schema().Columns)
    buildFilters, probeFilters := p.LeftConditions, p.RightConditions
    probeConcurrency := float64(p.ctx.GetSessionVars().IndexLookupJoinConcurrency())
    cpuFactor := getTaskCPUFactorVer2(p, taskType)
    memFactor := getTaskMemFactorVer2(p, taskType)
    requestFactor := getTaskRequestFactorVer2(p, taskType)

    buildFilterCost := filterCostVer2(option, buildRows, buildFilters, cpuFactor)
    buildChildCost, err := build.getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }
    buildTaskCost := newCostVer2(option, cpuFactor,
        buildRows*10*cpuFactor.Value,
        func() string { return fmt.Sprintf("cpu(%v*10*%v)", buildRows, cpuFactor) })
    startCost := newCostVer2(option, cpuFactor,
        10*3*cpuFactor.Value,
        func() string { return fmt.Sprintf("cpu(10*3*%v)", cpuFactor) })

    probeFilterCost := filterCostVer2(option, probeRowsTot, probeFilters, cpuFactor)
    probeChildCost, err := probe.getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }

    var hashTableCost costVer2
    switch indexJoinType {
    case 1: // IndexHashJoin
        hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
    case 2: // IndexMergeJoin
        hashTableCost = newZeroCostVer2(traceCost(option))
    default: // IndexJoin
        hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
    }

    // IndexJoin executes a batch of rows at a time, so the actual cost of this part should be
    //  `innerCostPerBatch * numberOfBatches` instead of `innerCostPerRow * numberOfOuterRow`.
    // Use an empirical value batchRatio to handle this now.
    // TODO: remove this empirical value.
    batchRatio := 6.0
    probeCost := divCostVer2(mulCostVer2(probeChildCost, buildRows), batchRatio)

    // Double Read Cost
    doubleReadCost := newZeroCostVer2(traceCost(option))
    if p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate > 0 {
        batchSize := float64(p.ctx.GetSessionVars().IndexJoinBatchSize)
        taskPerBatch := 1024.0 // TODO: remove this magic number
        doubleReadTasks := buildRows / batchSize * taskPerBatch
        doubleReadCost = doubleReadCostVer2(option, doubleReadTasks, requestFactor)
        doubleReadCost = mulCostVer2(doubleReadCost, p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate)
    }

    p.planCostVer2 = sumCostVer2(startCost, buildChildCost, buildFilterCost, buildTaskCost, divCostVer2(sumCostVer2(doubleReadCost, probeCost, probeFilterCost, hashTableCost), probeConcurrency))
    p.planCostInit = true
    return p.planCostVer2, nil
}

关键在于:

    switch indexJoinType {
    case 1: // IndexHashJoin
        hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
    case 2: // IndexMergeJoin
        hashTableCost = newZeroCostVer2(traceCost(option))
    default: // IndexJoin
        hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
    }

对应方法:

func hashBuildCostVer2(option *PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costVer2Factor) costVer2 {
    // TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
    hashKeyCost := newCostVer2(option, cpuFactor,
        buildRows*nKeys*cpuFactor.Value,
        func() string { return fmt.Sprintf("hashkey(%v*%v*%v)", buildRows, nKeys, cpuFactor) })
    hashMemCost := newCostVer2(option, memFactor,
        buildRows*buildRowSize*memFactor.Value,
        func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
    hashBuildCost := newCostVer2(option, cpuFactor,
        buildRows*cpuFactor.Value,
        func() string { return fmt.Sprintf("hashbuild(%v*%v)", buildRows, cpuFactor) })
    return sumCostVer2(hashKeyCost, hashMemCost, hashBuildCost)
}

func newZeroCostVer2(trace bool) (ret costVer2) {
    if trace {
        ret.trace = &costTrace{make(map[string]float64), ""}
    }
    return
}

简单的看一下代码,我们可以发现,从大多数的场景来看,按照代价从小到大来排,这几种Join是MergeJoin<HashJoin<IndexJoin。

5.4.2执行计划枚举与择优

总得来说这块代码较为简单,本质就是枚举所有符合条件的物理执行计划,并挑选出代价最小的执行计划,故不再列举代码。有兴趣的同学可以根据以下大纲自行翻阅:

|--planner/core/find_best_task.go

\-- findBestTask

\-- enumeratePhysicalPlans4Task

\-- compareTaskCost

\-- getTaskPlanCost

|-- planner/core/plan_cost_ver2.go

\-- getPlanCost

6.其他

6.1 参考与引用的文章

6.2 知识补充:code generation && vectorized execution

数据库引擎执行器中非常出名的两种优化方式,code generation和 vectorized execution。

code generation主要是根据上下文来生成一整段优化过的代码,这与那种嵌套大量if...else、for循环、虚方法的代码完全相反,完全面向性能考虑。

vectorized execution基于拉模型。相比于一次拉一个tuple来说,它的批量拉取减少了多次拉取的开销,同时还可以使用到SIMD。基于这种场景,vectorized execution的优化更加适用于列式数据库。

6.其他

6.1 参考与引用的文章

6.2 知识补充:code generation && vectorized execution

数据库引擎执行器中非常出名的两种优化方式,code generation和 vectorized execution。

code generation主要是根据上下文来生成一整段优化过的代码,这与那种嵌套大量if...else、for循环、虚方法的代码完全相反,完全面向性能考虑。

vectorized execution基于拉模型。相比于一次拉一个tuple来说,它的批量拉取减少了多次拉取的开销,同时还可以使用到SIMD。基于这种场景,vectorized execution的优化更加适用于列式数据库。

6.3 FAQ

Q1:统计信息是如何影响执行计划的?

以常见的统计信息为例:

  1. 表的行数(Rows) :这个信息帮助优化器评估扫描表的成本。如果一个表的行数很少,全表扫描可能比使用索引更为高效。相反,如果表非常大,优化器可能倾向于使用索引来减少需要读取的数据量。
  2. 平均行长度(Average Row Length) :用于估算读取数据所需的I/O成本,尤其是在考虑使用全表扫描时。较大的平均行长度意味着每次读取时会从磁盘获取更少的行,可能促使优化器选择其他策略。
  3. 不同值的数量(Distinct Values or NDV) : 决定是否使用索引以及索引的效率。如果一列的NDV很低,意味着索引选择性不高,可能不会使用索引;如果NDV很高,索引更有可能被选用来快速定位到特定的行。
  4. 最大值/最小值(Max Value/Min Value) : 这些边界值可以辅助优化器估计查询范围,特别是在确定查询是否能有效利用索引范围扫描时。
  5. 高频值的直方图(Histograms)影响: 直方图提供了数据分布的更精细视图,对处理偏斜数据特别有用。优化器可以利用这些信息来更好地估计查询过滤条件将匹配多少行,从而影响JOIN顺序、索引选择等。
  6. 索引的唯一性:如果索引的键值高度唯一(如主键),优化器更倾向于使用该索引来避免表扫描,因为索引查找可以直接定位到一行或多行数据。
  7. 索引的叶节点层数或高度:索引的高度决定了访问索引的深度,进而影响查询成本。较浅的索引(较少层数)通常意味着更快的访问速度。
  8. 索引的基数(Index Selectivity) :基数越高,表示索引区分度越高,越有利于过滤数据。这直接影响到是否使用索引以及预期的检索速度。
  9. 空间使用情况:虽然不是直接用于执行计划,但空间使用情况可以影响数据库的整体性能,比如碎片化严重的表或索引可能会影响I/O效率,从而间接影响执行计划的选择。
  10. 分区表统计信息:对于分区表,每个分区的统计信息帮助优化器决定是否需要访问所有分区或仅访问满足条件的分区,这可以显著减少查询处理的数据量。

Q2:当数据表有3个索引,查询的SQL都可以命中这3个索引,优化器会怎么选择?

参考上Q1回答中提到的索引的唯一性、索引的叶节点层数或高度、索引的基数(Index Selectivity),优化器选择索引时主要会根据这些代价来计算成本。当然,如果扫描的数据过多(比如占很多NDV),那么优化器不会选择索引,直接全表扫描,避免回表的代价。

Q3:当TiDB进行大量数据变更后,有些原本能命中索引的SQL不会命中了,这是为什么?

参考Q2的答案。


泊浮目
4.9k 声望1.3k 粉丝