Nebula Graph source code interpretation series｜Vol.03 Planner implementation

In the last article, we mentioned that the Validator converts the abstract syntax tree (AST) generated by the Parser into an execution plan. This time, let's talk about how the execution plan is generated.

Overview

Planner is the execution plan generator. Based on the semantically valid query syntax tree checked by the Validator, it generates an unoptimized execution plan that the Executor could run. That plan is later handed to the Optimizer, which produces an optimized execution plan, and the optimized plan is finally handed to the Executor for execution. An execution plan consists of a series of nodes (PlanNode).

Source directory structure

src/planner
├── CMakeLists.txt
├── match/
├── ngql/
├── plan/
├── Planner.cpp
├── Planner.h
├── PlannersRegister.cpp
├── PlannersRegister.h
├── SequentialPlanner.cpp
├── SequentialPlanner.h
└── test

Among them, Planner.h defines the data structure of SubPlan and several interfaces of planner.

struct SubPlan {
    // root and tail of a subplan.
    PlanNode*   root{nullptr};
    PlanNode*   tail{nullptr};
};

PlannersRegister is responsible for registering the available planners. Nebula Graph currently registers SequentialPlanner, PathPlanner, LookupPlanner, GoPlanner, and MatchPlanner.

SequentialPlanner handles SequentialSentences, a compound statement made up of multiple Sentences separated by semicolons. Each statement may in turn be a GO / LOOKUP / MATCH statement, so SequentialPlanner generates multiple plans by calling the planners of those statements and connects them end to end with Validator::appendPlan.
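
To make "connecting plans end to end" concrete, here is a minimal sketch. ToyNode, ToySubPlan and chain are hypothetical stand-ins invented for illustration; they are not the real PlanNode, SubPlan or Validator::appendPlan, and the sketch assumes every node has at most one dependency.

```cpp
#include <string>
#include <vector>

// Hypothetical toy node: only a name and a single dependency.
struct ToyNode {
    std::string name;
    ToyNode* dep{nullptr};  // the node that must run before this one
};

// Mirrors the idea of SubPlan: root runs last, tail runs first.
struct ToySubPlan {
    ToyNode* root{nullptr};
    ToyNode* tail{nullptr};
};

// Chains sub-plans in statement order: the tail of each later plan is made
// to depend on the root of the plan before it, so the combined plan again
// has exactly one root and one tail.
ToySubPlan chain(const std::vector<ToySubPlan>& plans) {
    ToySubPlan result;
    for (const auto& p : plans) {
        if (p.root == nullptr || p.tail == nullptr) {
            continue;  // skip empty plans
        }
        if (result.root == nullptr) {
            result = p;
        } else {
            p.tail->dep = result.root;
            result.root = p.root;
        }
    }
    return result;
}
```

With two statements, the combined plan keeps the tail of the first statement and the root of the second.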

The match directory defines the planners for openCypher statements and clauses (such as MATCH, UNWIND, WITH, RETURN, WHERE, ORDER BY, SKIP, LIMIT), together with the strategies for connecting SubPlans. SegmentsConnector picks the appropriate connection strategy (AddInput, addDependency, innerJoinSegments, etc.) according to the relationship between SubPlans and connects them end to end into a complete plan.

src/planner/match
├── AddDependencyStrategy.cpp
├── AddDependencyStrategy.h
├── AddInputStrategy.cpp
├── AddInputStrategy.h
├── CartesianProductStrategy.cpp
├── CartesianProductStrategy.h
├── CypherClausePlanner.h
├── EdgeIndexSeek.h
├── Expand.cpp
├── Expand.h
├── InnerJoinStrategy.cpp
├── InnerJoinStrategy.h
├── LabelIndexSeek.cpp
├── LabelIndexSeek.h
├── LeftOuterJoinStrategy.h
├── MatchClausePlanner.cpp
├── MatchClausePlanner.h
├── MatchPlanner.cpp
├── MatchPlanner.h
├── MatchSolver.cpp
├── MatchSolver.h
├── OrderByClausePlanner.cpp
├── OrderByClausePlanner.h
├── PaginationPlanner.cpp
├── PaginationPlanner.h
├── PropIndexSeek.cpp
├── PropIndexSeek.h
├── ReturnClausePlanner.cpp
├── ReturnClausePlanner.h
├── SegmentsConnector.cpp
├── SegmentsConnector.h
├── SegmentsConnectStrategy.h
├── StartVidFinder.cpp
├── StartVidFinder.h
├── UnionStrategy.h
├── UnwindClausePlanner.cpp
├── UnwindClausePlanner.h
├── VertexIdSeek.cpp
├── VertexIdSeek.h
├── WhereClausePlanner.cpp
├── WhereClausePlanner.h
├── WithClausePlanner.cpp
├── WithClausePlanner.h
├── YieldClausePlanner.cpp
└── YieldClausePlanner.h

The ngql directory defines the planners for nGQL statements (such as GO, LOOKUP, FIND PATH).

src/planner/ngql
├── GoPlanner.cpp
├── GoPlanner.h
├── LookupPlanner.cpp
├── LookupPlanner.h
├── PathPlanner.cpp
└── PathPlanner.h

The plan directory defines 7 categories, with a total of more than 100 Plan Nodes.

src/planner/plan
├── Admin.cpp
├── Admin.h
├── Algo.cpp
├── Algo.h
├── ExecutionPlan.cpp
├── ExecutionPlan.h
├── Logic.cpp
├── Logic.h
├── Maintain.cpp
├── Maintain.h
├── Mutate.cpp
├── Mutate.h
├── PlanNode.cpp
├── PlanNode.h
├── Query.cpp
├── Query.h
└── Scan.h

Description of the node categories:

Admin: nodes related to database administration
Algo: algorithm-related nodes, such as paths and subgraphs
Logic: control-flow nodes, such as loops and binary selection
Maintain: schema-related nodes
Mutate: DML-related nodes
Query: nodes related to query computation
Scan: nodes related to index scans

Each PlanNode produces a corresponding executor in the Executor stage, and each executor is responsible for one specific function.
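
The one-node-to-one-executor mapping can be pictured as a small factory. The sketch below is an assumption-laden illustration: ToyKind, ToyPlanNode, ToyExecutor and makeExecutor are invented names, and the real Executor stage is considerably more involved.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified node kinds; the real plan has far more.
enum class ToyKind { kStart, kGetNeighbors, kFilter };

struct ToyPlanNode {
    ToyKind kind;
};

// Base class for the toy executors: each one does exactly one job.
struct ToyExecutor {
    virtual ~ToyExecutor() = default;
    virtual std::string name() const = 0;
};

struct ToyStartExecutor : ToyExecutor {
    std::string name() const override { return "StartExecutor"; }
};

struct ToyGetNeighborsExecutor : ToyExecutor {
    std::string name() const override { return "GetNeighborsExecutor"; }
};

struct ToyFilterExecutor : ToyExecutor {
    std::string name() const override { return "FilterExecutor"; }
};

// Picks the executor that matches the kind of the plan node.
std::unique_ptr<ToyExecutor> makeExecutor(const ToyPlanNode& node) {
    switch (node.kind) {
        case ToyKind::kStart:
            return std::make_unique<ToyStartExecutor>();
        case ToyKind::kGetNeighbors:
            return std::make_unique<ToyGetNeighborsExecutor>();
        case ToyKind::kFilter:
            return std::make_unique<ToyFilterExecutor>();
    }
    return nullptr;  // unknown kind
}
```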

For example, the GetNeighbors node:

static GetNeighbors* make(QueryContext* qctx,
                              PlanNode* input,
                              GraphSpaceID space,
                              Expression* src,
                              std::vector<EdgeType> edgeTypes,
                              Direction edgeDirection,
                              std::unique_ptr<std::vector<VertexProp>>&& vertexProps,
                              std::unique_ptr<std::vector<EdgeProp>>&& edgeProps,
                              std::unique_ptr<std::vector<StatProp>>&& statProps,
                              std::unique_ptr<std::vector<Expr>>&& exprs,
                              bool dedup = false,
                              bool random = false,
                              std::vector<storage::cpp2::OrderBy> orderBy = {},
                              int64_t limit = -1,
                              std::string filter = "")

GetNeighbors is a semantic wrapper around the edge key-value data in the storage layer: given a source vertex and an edge type, it finds the destination vertices of those edges. While looking up the edges, GetNeighbors can also fetch edge properties (edgeProps). Because outgoing edges are stored in the same partition as their source vertex, the properties of the source vertex (vertexProps) are easy to fetch as well.
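
The semantics can be sketched with an in-memory map standing in for the storage layer. Everything here (ToyStorage, ToyVertexRow, ToyEdge, NeighborRow, the age and likeness properties, and the getNeighbors function) is hypothetical and only illustrates why source-vertex properties come almost for free when the edges are read.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Vid = std::string;

struct ToyEdge {
    std::string type;
    Vid dst;
    int64_t likeness;  // example edge property
};

// A source vertex and its outgoing edges are kept together, imitating
// the "same partition" layout described above.
struct ToyVertexRow {
    int64_t age;  // example vertex property
    std::vector<ToyEdge> outEdges;
};

using ToyStorage = std::map<Vid, ToyVertexRow>;

struct NeighborRow {
    Vid src;
    int64_t srcAge;    // vertex property of the source
    Vid dst;
    int64_t likeness;  // edge property
};

// For each source vertex, return one row per outgoing edge of the given type.
std::vector<NeighborRow> getNeighbors(const ToyStorage& storage,
                                      const std::vector<Vid>& srcs,
                                      const std::string& edgeType) {
    std::vector<NeighborRow> rows;
    for (const auto& src : srcs) {
        auto it = storage.find(src);
        if (it == storage.end()) {
            continue;  // unknown vertex: no neighbors
        }
        for (const auto& e : it->second.outEdges) {
            if (e.type == edgeType) {
                rows.push_back(NeighborRow{src, it->second.age, e.dst, e.likeness});
            }
        }
    }
    return rows;
}
```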

Aggregate node:

static Aggregate* make(QueryContext* qctx,
                               PlanNode* input, 
                               std::vector<Expression*>&& groupKeys = {},
                               std::vector<Expression*>&& groupItems = {})

The Aggregate node performs aggregate computation: it groups its input by groupKeys and evaluates the aggregate expressions in groupItems over each group.
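
As a rough illustration of grouping by a key and aggregating within each group, the sketch below hard-codes one group key (team) and two aggregates (a count and a sum). ToyRow, ToyGroup and aggregate are hypothetical names; the real node evaluates arbitrary Expression objects.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ToyRow {
    std::string team;  // plays the role of the group key
    int64_t age;       // value being aggregated
};

struct ToyGroup {
    int64_t count{0};
    int64_t sumAge{0};
};

// Groups rows by team and computes COUNT(*) and SUM(age) per group.
std::map<std::string, ToyGroup> aggregate(const std::vector<ToyRow>& rows) {
    std::map<std::string, ToyGroup> groups;
    for (const auto& row : rows) {
        ToyGroup& group = groups[row.team];  // creates the group on first use
        group.count += 1;
        group.sumAge += row.age;
    }
    return groups;
}
```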

Loop node:

static Loop* make(QueryContext* qctx,
                      PlanNode* input,
                      PlanNode* body = nullptr,
                      Expression* condition = nullptr);

Loop is a loop node. It keeps executing the PlanNode fragment between body and the nearest Start node until the condition evaluates to false.
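
Stripped of scheduling details, the control flow of such a node is a plain while loop. In this sketch the condition and the body are modelled as callables; runLoop is a made-up helper, not part of Nebula Graph.

```cpp
#include <functional>

// Executes body repeatedly for as long as condition() evaluates to true.
// Returns how many times the body ran.
int runLoop(const std::function<bool()>& condition,
            const std::function<void()>& body) {
    int iterations = 0;
    while (condition()) {
        body();
        ++iterations;
    }
    return iterations;
}
```

A condition such as "fewer than N steps taken so far" combined with a body that takes one step gives exactly N iterations.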

InnerJoin node:

static InnerJoin* make(QueryContext* qctx,
                           PlanNode* input,
                           std::pair<std::string, int64_t> leftVar,
                           std::pair<std::string, int64_t> rightVar,
                           std::vector<Expression*> hashKeys = {},
                           std::vector<Expression*> probeKeys = {})

The InnerJoin node performs an inner join on two tables (Table, DataSet); leftVar and rightVar refer to the two tables respectively.
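
The parameter names hashKeys and probeKeys suggest a hash join: build a hash table on one input, then probe it with the other. The sketch below shows that technique on string tables with a single key column per side. It is an assumption about the general approach, not a copy of the actual executor; ToyRow, ToyTable and innerJoin are illustrative names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ToyRow = std::vector<std::string>;
using ToyTable = std::vector<ToyRow>;

// Inner hash join: rows are matched when left[leftKeyCol] == right[rightKeyCol].
// Each output row is the left row followed by the right row.
ToyTable innerJoin(const ToyTable& left, std::size_t leftKeyCol,
                   const ToyTable& right, std::size_t rightKeyCol) {
    // Build phase: index every left row by its key.
    std::unordered_multimap<std::string, const ToyRow*> hashTable;
    for (const auto& row : left) {
        hashTable.emplace(row.at(leftKeyCol), &row);
    }
    // Probe phase: look up each right row's key in the hash table.
    ToyTable out;
    for (const auto& row : right) {
        auto range = hashTable.equal_range(row.at(rightKeyCol));
        for (auto it = range.first; it != range.second; ++it) {
            ToyRow joined = *it->second;
            joined.insert(joined.end(), row.begin(), row.end());
            out.push_back(std::move(joined));
        }
    }
    return out;
}
```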

Entry function

The planner entry function is Validator::toPlan:

Status Validator::toPlan() {
    auto* astCtx = getAstContext();
    if (astCtx != nullptr) {
        astCtx->space = space_;
    }
    auto subPlanStatus = Planner::toPlan(astCtx);
    NG_RETURN_IF_ERROR(subPlanStatus);
    auto subPlan = std::move(subPlanStatus).value();
    root_ = subPlan.root;
    tail_ = subPlan.tail;
    VLOG(1) << "root: " << root_->kind() << " tail: " << tail_->kind();
    return Status::OK();
}

Specific steps

1. Call getAstContext()

First, getAstContext() is called to obtain the AST context that the Validator has checked and rewritten. These context data structures are defined in src/context.

src/context/ast
├── AstContext.h
├── CypherAstContext.h
└── QueryAstContext.h

struct AstContext {
    QueryContext*   qctx; // context of each query request
    Sentence*       sentence; // AST of the query statement
    SpaceInfo       space; // current space
};

The ast context of openCypher related syntax is defined in CypherAstContext, and the ast context of nGQL related syntax is defined in QueryAstContext.

2. Call Planner::toPlan(astCtx)

Then Planner::toPlan(astCtx) is called. Based on the AST context, it looks up the planner registered for the statement in PlannerMap and uses it to generate the corresponding execution plan.
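
A registry lookup of this kind might look like the sketch below. The names ToySentenceKind, ToyPlanner, ToyRegistration, ToyPlannerMap, registerPlanners and toPlan are all invented for illustration, and plans are reduced to strings; the sketch only assumes that each statement kind maps to a list of candidate planners that are tried in order.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

enum class ToySentenceKind { kGo, kLookup, kMatch };

struct ToyPlanner {
    virtual ~ToyPlanner() = default;
    virtual std::string transform() const = 0;  // a "plan" is just a string here
};

struct ToyGoPlanner : ToyPlanner {
    std::string transform() const override { return "plan built by GoPlanner"; }
};

struct ToyMatchPlanner : ToyPlanner {
    std::string transform() const override { return "plan built by MatchPlanner"; }
};

// One registry entry: a predicate plus a factory for the planner.
struct ToyRegistration {
    std::function<bool(ToySentenceKind)> match;
    std::function<std::unique_ptr<ToyPlanner>()> instantiate;
};

using ToyPlannerMap = std::map<ToySentenceKind, std::vector<ToyRegistration>>;

ToyPlannerMap registerPlanners() {
    ToyPlannerMap planners;
    planners[ToySentenceKind::kGo].push_back(ToyRegistration{
        [](ToySentenceKind k) { return k == ToySentenceKind::kGo; },
        [] { return std::unique_ptr<ToyPlanner>(new ToyGoPlanner()); }});
    planners[ToySentenceKind::kMatch].push_back(ToyRegistration{
        [](ToySentenceKind k) { return k == ToySentenceKind::kMatch; },
        [] { return std::unique_ptr<ToyPlanner>(new ToyMatchPlanner()); }});
    return planners;
}

// Finds the first registered planner that accepts the statement and runs it.
std::string toPlan(const ToyPlannerMap& planners, ToySentenceKind kind) {
    auto it = planners.find(kind);
    if (it == planners.end()) {
        return "error: no planner registered";
    }
    for (const auto& entry : it->second) {
        if (entry.match(kind)) {
            return entry.instantiate()->transform();
        }
    }
    return "error: no planner matched";
}
```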

Each plan consists of a series of PlanNodes. There are two kinds of relationship between PlanNodes: execution dependency and data dependency.

Execution dependency: In terms of execution order, a plan is a directed acyclic graph, and the dependencies between nodes are fixed when the plan is generated. In the execution phase, the executor generates an operator for each node and starts scheduling from the root node. When it finds that a node depends on other nodes, it recursively schedules those nodes first, until it reaches a node without any dependency (the Start node), and begins execution there. After a node finishes, the nodes that depend on it are executed in turn, until the root node is reached.
Data dependency: A node's data dependency is usually the same as its execution dependency, that is, its input is the output of the node scheduled just before it. Some nodes, such as InnerJoin, take multiple inputs, so an input may be the output of a node several steps away.

(Figure: solid lines are execution dependencies, dashed lines are data dependencies)
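
The scheduling order implied by execution dependencies is a depth-first walk from the root that runs a node only after everything it depends on has run. Here is a small single-threaded sketch of that idea; ToyNode, schedule and executionOrder are hypothetical, and the real scheduler is certainly more elaborate than a recursive function.

```cpp
#include <set>
#include <string>
#include <vector>

struct ToyNode {
    std::string name;
    std::vector<const ToyNode*> deps;  // nodes that must finish first
};

// Recursively schedules the dependencies of a node, then the node itself.
void schedule(const ToyNode* node,
              std::set<const ToyNode*>& visited,
              std::vector<std::string>& order) {
    if (!visited.insert(node).second) {
        return;  // already scheduled through another path in the DAG
    }
    for (const ToyNode* dep : node->deps) {
        schedule(dep, visited, order);
    }
    order.push_back(node->name);  // "execute" the node
}

// Starts from the root and returns the order in which nodes execute.
std::vector<std::string> executionOrder(const ToyNode* root) {
    std::set<const ToyNode*> visited;
    std::vector<std::string> order;
    schedule(root, visited, order);
    return order;
}
```

Scheduling starts at the root, but the node without dependencies is the first one to actually execute.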

Example

Let's take MatchPlanner as an example to see how an execution plan is generated:

Statement:

MATCH (v:player)-[:like*2..4]-(v2:player)\
WITH v, v2.age AS age ORDER BY age WHERE age > 18\
RETURN id(v), age

After the statement is checked and rewritten by MatchValidator, a tree composed of context will be output.

Each Clause and SubClause corresponds to a context:

enum class CypherClauseKind : uint8_t {
    kMatch,
    kUnwind,
    kWith,
    kWhere,
    kReturn,
    kOrderBy,
    kPagination,
    kYield,
};

struct CypherClauseContextBase : AstContext {
    explicit CypherClauseContextBase(CypherClauseKind k) : kind(k) {}
    virtual ~CypherClauseContextBase() = default;

    const CypherClauseKind  kind;
};

struct MatchClauseContext final : CypherClauseContextBase {
    MatchClauseContext() : CypherClauseContextBase(CypherClauseKind::kMatch) {}

    std::vector<NodeInfo>                       nodeInfos; // vertex information involved in the pattern
    std::vector<EdgeInfo>                       edgeInfos; // edge information involved in the pattern
    PathBuildExpression*                        pathBuild{nullptr}; // expression that builds the path
    std::unique_ptr<WhereClauseContext>         where; // filter SubClause
    std::unordered_map<std::string, AliasType>* aliasesUsed{nullptr}; // input alias information
    std::unordered_map<std::string, AliasType>  aliasesGenerated; // generated alias information
};
...

Then:

1. Find the statement's planner

Find the planner for the statement. The statement type is Match, so the planner found in PlannersMap is MatchPlanner.

2. Generate plan

Call the MatchPlanner::transform method to generate a plan:

StatusOr<SubPlan> MatchPlanner::transform(AstContext* astCtx) {
    if (astCtx->sentence->kind() != Sentence::Kind::kMatch) {
        return Status::Error("Only MATCH is accepted for match planner.");
    }
    auto* matchCtx = static_cast<MatchAstContext*>(astCtx);

    std::vector<SubPlan> subplans;
    for (auto& clauseCtx : matchCtx->clauses) {
        switch (clauseCtx->kind) {
            case CypherClauseKind::kMatch: {
                auto subplan = std::make_unique<MatchClausePlanner>()->transform(clauseCtx.get());
                NG_RETURN_IF_ERROR(subplan);
                subplans.emplace_back(std::move(subplan).value());
                break;
            }
            case CypherClauseKind::kUnwind: {
                auto subplan = std::make_unique<UnwindClausePlanner>()->transform(clauseCtx.get());
                NG_RETURN_IF_ERROR(subplan);
                auto& unwind = subplan.value().root;
                std::vector<std::string> inputCols;
                if (!subplans.empty()) {
                    auto input = subplans.back().root;
                    auto cols = input->colNames();
                    for (auto col : cols) {
                        inputCols.emplace_back(col);
                    }
                }
                inputCols.emplace_back(unwind->colNames().front());
                unwind->setColNames(inputCols);
                subplans.emplace_back(std::move(subplan).value());
                break;
            }
            case CypherClauseKind::kWith: {
                auto subplan = std::make_unique<WithClausePlanner>()->transform(clauseCtx.get());
                NG_RETURN_IF_ERROR(subplan);
                subplans.emplace_back(std::move(subplan).value());
                break;
            }
            case CypherClauseKind::kReturn: {
                auto subplan = std::make_unique<ReturnClausePlanner>()->transform(clauseCtx.get());
                NG_RETURN_IF_ERROR(subplan);
                subplans.emplace_back(std::move(subplan).value());
                break;
            }
            default: { return Status::Error("Unsupported clause."); }
        }
    }

    auto finalPlan = connectSegments(astCtx, subplans, matchCtx->clauses);
    NG_RETURN_IF_ERROR(finalPlan);
    return std::move(finalPlan).value();
}

A MATCH statement may be made up of several MATCH / UNWIND / WITH / RETURN clauses, so transform calls the corresponding ClausePlanner for each clause according to its type to generate a SubPlan, and SegmentsConnector finally connects the SubPlans using the various connection strategies.

In our example statement, the first clause is the MATCH clause, MATCH (v:player)-[:like*2..4]-(v2:player), so MatchClausePlanner::transform is called:

StatusOr<SubPlan> MatchClausePlanner::transform(CypherClauseContextBase* clauseCtx) {
    if (clauseCtx->kind != CypherClauseKind::kMatch) {
        return Status::Error("Not a valid context for MatchClausePlanner.");
    }

    auto* matchClauseCtx = static_cast<MatchClauseContext*>(clauseCtx);
    auto& nodeInfos = matchClauseCtx->nodeInfos;
    auto& edgeInfos = matchClauseCtx->edgeInfos;
    SubPlan matchClausePlan;
    size_t startIndex = 0;
    bool startFromEdge = false;

    NG_RETURN_IF_ERROR(findStarts(matchClauseCtx, startFromEdge, startIndex, matchClausePlan));
    NG_RETURN_IF_ERROR(
        expand(nodeInfos, edgeInfos, matchClauseCtx, startFromEdge, startIndex, matchClausePlan));
    NG_RETURN_IF_ERROR(projectColumnsBySymbols(matchClauseCtx, startIndex, matchClausePlan));
    NG_RETURN_IF_ERROR(appendFilterPlan(matchClauseCtx, matchClausePlan));
    return matchClausePlan;
}

The transform method is divided into the following steps:

Looking for a starting point for expansion:

There are currently three strategies for finding a starting point, which are registered in startVidFinders by the planner:

// MATCH(n) WHERE id(n) = value RETURN n
startVidFinders.emplace_back(&VertexIdSeek::make);

// MATCH(n:Tag{prop:value}) RETURN n
// MATCH(n:Tag) WHERE n.prop = value RETURN n
startVidFinders.emplace_back(&PropIndexSeek::make);

// seek by tag or edge(index)
// MATCH(n: tag) RETURN n
// MATCH(s)-[:edge]->(e) RETURN e
startVidFinders.emplace_back(&LabelIndexSeek::make);

Of the three strategies, VertexIdSeek is the best because it pins down the exact starting VIDs; PropIndexSeek comes second and is converted into an IndexScan with a property filter; LabelIndexSeek is converted into a plain IndexScan.

For each start-finding strategy, the findStarts function traverses all node information in the match pattern until it finds a node that can serve as the starting point, and then generates the Plan Nodes that look up that starting point.
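
The search order (strategies in the outer loop, pattern nodes in the inner loop) can be sketched as follows. ToyNodeInfo, ToyFinder, StartChoice, makeFinders and findStart are hypothetical, and the three boolean flags are a deliberate oversimplification of what the real finders inspect.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Simplified facts about one node of the match pattern.
struct ToyNodeInfo {
    bool hasVidFilter{false};   // e.g. WHERE id(n) == value
    bool hasPropFilter{false};  // e.g. (n:Tag{prop: value})
    bool hasLabel{false};       // e.g. (n:Tag)
};

struct ToyFinder {
    std::string name;
    std::function<bool(const ToyNodeInfo&)> match;
};

struct StartChoice {
    std::string finder;
    std::size_t startIndex;
};

// Finders in priority order, most selective first.
std::vector<ToyFinder> makeFinders() {
    std::vector<ToyFinder> finders;
    finders.push_back(ToyFinder{"VertexIdSeek", [](const ToyNodeInfo& n) { return n.hasVidFilter; }});
    finders.push_back(ToyFinder{"PropIndexSeek", [](const ToyNodeInfo& n) { return n.hasPropFilter; }});
    finders.push_back(ToyFinder{"LabelIndexSeek", [](const ToyNodeInfo& n) { return n.hasLabel; }});
    return finders;
}

// Tries each finder against every node; the first hit wins.
std::optional<StartChoice> findStart(const std::vector<ToyFinder>& finders,
                                     const std::vector<ToyNodeInfo>& nodes) {
    for (const auto& finder : finders) {
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            if (finder.match(nodes[i])) {
                return StartChoice{finder.name, i};
            }
        }
    }
    return std::nullopt;  // no usable starting point
}
```

Because the strategy loop is the outer one, a later node that qualifies for a better strategy is preferred over an earlier node that only qualifies for a weaker one.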

The example statement uses the LabelIndexSeek strategy, and the starting point is v. An IndexScan node is generated, using the index on the player tag.

According to the starting point and match pattern, multi-step expansion:

The match pattern of the example statement is (v:player)-[:like*2..4]-(v2:player): starting from v, expand two to four steps along like edges, with an end point that carries the player tag.

Do expansion first:

Status Expand::doExpand(const NodeInfo& node, const EdgeInfo& edge, SubPlan* plan) {
    NG_RETURN_IF_ERROR(expandSteps(node, edge, plan));
    NG_RETURN_IF_ERROR(filterDatasetByPathLength(edge, plan->root, plan));
    return Status::OK();
}

Multi-step expansion generates a Loop node. The loop body is expandStep, which expands one step from a given starting point; expanding one step requires a GetNeighbors node. The end points of each step become the starting points of the next step, and the loop continues until it reaches the maximum number of steps specified in the pattern.

For the M-th step, take the end points of the length M-1 paths obtained so far as the starting points, expand one step outward, and from the result build a path of length 1 consisting of the edge's source vertex and the edge itself. Then InnerJoin this length-1 path with the previous length M-1 paths to obtain a set of paths of length M.

Next, filter this set of paths to remove any path that contains a duplicate edge (openCypher path expansion does not allow repeated edges), and output the end points of the paths as the starting points of the next step. The next step repeats the same procedure, until the maximum number of steps specified in the pattern is reached.

After the loop, a UnionAllVersionVar node is generated to combine the paths of length 1 to M built in each iteration of the loop body. The filterDatasetByPathLength() function then generates a Filter node that removes paths shorter than the minimum number of steps specified in the match pattern.

The resulting path looks like (v)-like-()-e-(v)-? and still lacks the properties of the end point of the last step. So a GetVertices node is also generated, and the end points it fetches are InnerJoined with the M-step paths. The result is the set of paths that satisfy the match pattern.

The principle of match multi-step expansion will be explained in more detail in the article Variable Length Pattern Match.

// Build Start node from first step
SubPlan loopBodyPlan;
PlanNode* startNode = StartNode::make(matchCtx_->qctx);
startNode->setOutputVar(firstStep->outputVar());
startNode->setColNames(firstStep->colNames());
loopBodyPlan.tail = startNode;
loopBodyPlan.root = startNode;

// Construct loop body
NG_RETURN_IF_ERROR(expandStep(edge,
                              startNode,                // dep
                              startNode->outputVar(),   // inputVar
                              nullptr,
                              &loopBodyPlan));

NG_RETURN_IF_ERROR(collectData(startNode,           // left join node
                               loopBodyPlan.root,   // right join node
                               &firstStep,          // passThrough
                               &subplan));
// Union node
auto body = subplan.root;

// Loop condition
auto condition = buildExpandCondition(body->outputVar(), startIndex, maxHop);

// Create loop
auto* loop = Loop::make(matchCtx_->qctx, firstStep, body, condition);

// Unionize the results of each expansion which are stored in the firstStep node
auto uResNode = UnionAllVersionVar::make(matchCtx_->qctx, loop);
uResNode->setInputVar(firstStep->outputVar());
uResNode->setColNames({kPathStr});

subplan.root = uResNode;
plan->root = subplan.root;
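
To see the whole expansion in one place, here is a self-contained toy version of the algorithm. It is a sketch under several assumptions: the graph is an in-memory adjacency list with at most one edge between two vertices, a path is just a list of vertex IDs, and the InnerJoin of each step is replaced by appending the new vertex directly to the shorter path. Vid, Edge, Path, Graph, normalize, usesEdge and expand are all illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Vid = std::string;
using Edge = std::pair<Vid, Vid>;               // undirected edge, smaller id first
using Path = std::vector<Vid>;                  // v0, v1, ..., vk
using Graph = std::map<Vid, std::vector<Vid>>;  // adjacency list, both directions

Edge normalize(const Vid& a, const Vid& b) {
    return a < b ? Edge{a, b} : Edge{b, a};
}

// True if the path already walks over the given edge.
bool usesEdge(const Path& path, const Edge& edge) {
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        if (normalize(path[i], path[i + 1]) == edge) {
            return true;
        }
    }
    return false;
}

std::vector<Path> expand(const Graph& graph, const std::vector<Vid>& starts,
                         std::size_t minHop, std::size_t maxHop) {
    std::vector<Path> frontier;  // paths of length M-1
    for (const auto& start : starts) {
        frontier.push_back(Path{start});
    }
    std::vector<Path> all;  // union of the paths of length 1..maxHop
    for (std::size_t step = 1; step <= maxHop && !frontier.empty(); ++step) {
        std::vector<Path> next;  // paths of length M
        for (const auto& path : frontier) {
            auto it = graph.find(path.back());
            if (it == graph.end()) {
                continue;
            }
            for (const auto& dst : it->second) {
                if (usesEdge(path, normalize(path.back(), dst))) {
                    continue;  // repeated edges are not allowed
                }
                Path longer = path;
                longer.push_back(dst);
                next.push_back(std::move(longer));
            }
        }
        all.insert(all.end(), next.begin(), next.end());
        frontier = std::move(next);  // end points feed the next step
    }
    // Drop the paths that are shorter than the minimum hop count.
    all.erase(std::remove_if(all.begin(), all.end(),
                             [minHop](const Path& p) { return p.size() - 1 < minHop; }),
              all.end());
    return all;
}
```

On the chain a-b-c-d with minHop 2 and maxHop 3, starting from a, the candidate a-b-a is rejected because it reuses the edge a-b, and the one-hop path a-b is dropped by the final length filter.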

Output table, determine the column name of the table:

Use all the named symbols appearing in the match pattern as table column names to generate a table for use in subsequent clauses. This will generate a Project node.

The second clause is the WITH clause, so WithClausePlanner::transform is called to generate its SubPlan:

WITH v, v2.age AS age ORDER BY age WHERE age > 18

The WITH clause first yields v and v2.age as a table, then uses age as the sort item to sort, and then filters the sorted table.

The YIELD part generates a Project node, the ORDER BY part generates a Sort node, and the WHERE part generates a Filter node.
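
The three nodes form a small pipeline, which can be imitated on an in-memory table. This sketch hard-codes the example clause (project v and v2.age, sort by age, keep age > 18); ToyPlayer, MatchRow, WithRow and withClause are illustrative names and the vertex v is reduced to its ID.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct ToyPlayer {
    std::string id;
    int64_t age;
};

struct MatchRow {  // one row produced by the MATCH clause
    std::string v;
    ToyPlayer v2;
};

struct WithRow {  // one row produced by the WITH clause
    std::string v;
    int64_t age;
};

std::vector<WithRow> withClause(const std::vector<MatchRow>& input) {
    // Project: WITH v, v2.age AS age
    std::vector<WithRow> rows;
    for (const auto& row : input) {
        rows.push_back(WithRow{row.v, row.v2.age});
    }
    // Sort: ORDER BY age
    std::stable_sort(rows.begin(), rows.end(),
                     [](const WithRow& a, const WithRow& b) { return a.age < b.age; });
    // Filter: WHERE age > 18
    rows.erase(std::remove_if(rows.begin(), rows.end(),
                              [](const WithRow& r) { return !(r.age > 18); }),
               rows.end());
    return rows;
}
```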

The third clause is the RETURN clause, which generates a Project node.

RETURN id(v), age

The complete execution plan for the example statement is as follows:

(Figure: complete execution plan of the example statement)

The above is the introduction of this article.
