Analysis of the design and practice of Nebula Graph subgraphs

This article was first published on the Nebula Graph official account NebulaGraphCommunity. Follow it to see graph database technical practice from large companies.


Preface

In the previous Query Engine source code analysis, we introduced the main changes to Query Engine and its general structure:

[Figure: architecture changes]

From that article you can roughly understand how a user sends a query statement through the client, and how Query Engine parses the statement, builds an abstract syntax tree from it, validates the abstract syntax tree, and generates an execution plan. This article continues with what happens inside Query Engine, using the subgraph module newly added in 2.0 as the example, and focuses on how the execution plan is generated, to help you understand the source code better.

Definition of subgraph

A subgraph is a graph whose node set and edge set are subsets of the node set and edge set of some other graph, respectively. Intuitively, we start from the starting points specified by the user and expand step by step along the specified edges until the number of steps set by the user is reached, then return all the points and edges encountered during the expansion.
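The intuitive definition above can be sketched as a bounded breadth-first expansion. This is only an illustration on an in-memory edge list, not Nebula Graph code: the function name get_subgraph, the toy edges, and the choice to follow out-edges only are all assumptions made for the example.

```python
from collections import defaultdict


def get_subgraph(edges, start_vids, step_count=1):
    """Return (vertices, edges) reached within step_count hops of start_vids.

    edges is an iterable of (src, dst) pairs. For brevity this sketch
    follows out-edges only.
    """
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    visited = set(start_vids)    # every vertex seen so far
    collected_edges = set()      # every edge seen so far
    frontier = set(start_vids)   # vertices to expand in the current step

    for _ in range(step_count):
        next_frontier = set()
        for src in frontier:
            for dst in out_edges[src]:
                collected_edges.add((src, dst))
                if dst not in visited:
                    visited.add(dst)
                    next_frontier.add(dst)
        frontier = next_frontier

    return visited, collected_edges


# Hypothetical toy graph, not the article's data set.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("D", "E")]
vertices, sub_edges = get_subgraph(toy_edges, ["A"], step_count=2)
print(sorted(vertices))   # ['A', 'B', 'C', 'D']
print(sorted(sub_edges))  # [('A', 'B'), ('A', 'C'), ('B', 'D')]
```

With step_count=2, vertex E is not reached because it is three hops away from A.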

The syntax of subgraphs

GET SUBGRAPH [<step_count> STEPS] FROM {<vid>, <vid>...} [IN <edge_type>, <edge_type>...]
[OUT <edge_type>, <edge_type>...] [BOTH <edge_type>, <edge_type>...]

step_count: Specifies the number of hops from the starting point; the subgraph from hop 0 to hop step_count is returned. Must be a non-negative integer. The default value is 1.
vid: Specifies the ID of the starting point.
edge_type: Specifies the edge type. You can use IN, OUT and BOTH to specify the direction of the edge type relative to the starting point. The default is BOTH.
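To make the optional clauses concrete, here is a small sketch that assembles a statement from its parts. The helper build_get_subgraph is hypothetical and not part of any Nebula Graph client; it assumes string vids written in single quotes and does no escaping.

```python
def build_get_subgraph(vids, step_count=None, in_edges=(), out_edges=(), both_edges=()):
    """Assemble a GET SUBGRAPH statement (hypothetical helper, no escaping)."""
    if not vids:
        raise ValueError("at least one starting vid is required")
    if step_count is not None and (not isinstance(step_count, int) or step_count < 0):
        raise ValueError("step_count must be a non-negative integer")

    parts = ["GET SUBGRAPH"]
    if step_count is not None:           # omitted: the default of 1 applies
        parts.append(f"{step_count} STEPS")
    parts.append("FROM " + ", ".join(f"'{vid}'" for vid in vids))
    for keyword, edge_types in (("IN", in_edges), ("OUT", out_edges), ("BOTH", both_edges)):
        if edge_types:                   # each direction clause is optional
            parts.append(f"{keyword} " + ", ".join(edge_types))
    return " ".join(parts)


statement = build_get_subgraph(["Tim Duncan"], step_count=2, in_edges=["like", "serve"])
print(statement)  # GET SUBGRAPH 2 STEPS FROM 'Tim Duncan' IN like, serve
```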

Implementation of subgraph

When Query Engine receives a GET SUBGRAPH command, the Parser module (implemented with flex and bison) extracts the required content from the query statement according to the rules already written (the get_subgraph_sentence rule in parser.yy) and generates an abstract syntax tree, as shown below:

[Figure: abstract syntax tree of the GET SUBGRAPH statement]

Next comes the Validate stage, which verifies the generated abstract syntax tree. Its purpose is to check whether the user's input is legal (refer to the Query Engine article). After validation passes, the content of the syntax tree is extracted and used to generate an execution plan.

So how is this execution plan generated? Different database vendors may generate different execution plans for the same functionality, but the principle is the same: it depends on how the database's own operators interact with the query layer and the storage layer, because every query statement ultimately has to fetch data from the storage layer. In Nebula Graph, Query Engine and the storage layer interact through RPC (fbthrift); the interfaces are defined under the interface directory in the common repository. Two interfaces are critical to understand: getNeighbors and getProps.

The fbthrift definition of the getNeighbors request is as follows:

struct GetNeighborsRequest {    
    1: common.GraphSpaceID                      space_id,
    2: list<binary>                             column_names,
    3: map<common.PartitionID, list<common.Row>>
        (cpp.template = "std::unordered_map")   parts,
    4: TraverseSpec                             traverse_spec
}

For the detailed definition of each variable in this structure, please refer to https://github.com/vesoft-inc/nebula-common/blob/master/src/common/interface/storage.thrift , which has detailed comments.
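As a rough illustration of how such a request is shaped, the sketch below mirrors the four fields in Python and groups the starting points by partition. Everything beyond the field names is an assumption: the crc32-based partition_of rule is a stand-in for the real placement rule, the "_vid" column name is chosen for the example, and traverse_spec is treated as an opaque placeholder.

```python
import zlib
from dataclasses import dataclass


@dataclass
class GetNeighborsRequest:
    """Python mirror of the four thrift fields, for illustration only."""
    space_id: int
    column_names: list
    parts: dict          # partition id -> list of rows
    traverse_spec: dict  # opaque placeholder in this sketch


def partition_of(vid, num_parts):
    # Hypothetical placement rule: a deterministic hash mapped to 1..num_parts.
    return zlib.crc32(vid.encode("utf-8")) % num_parts + 1


def build_request(space_id, vids, num_parts, edge_types):
    parts = {}
    for vid in vids:
        # Each starting point becomes one row in the partition that owns it.
        parts.setdefault(partition_of(vid, num_parts), []).append([vid])
    return GetNeighborsRequest(
        space_id=space_id,
        column_names=["_vid"],                         # assumed column name
        parts=parts,
        traverse_spec={"edge_types": list(edge_types)},
    )


request = build_request(1, ["Tim Duncan", "Manu Ginobili"], num_parts=4, edge_types=["like"])
total_rows = sum(len(rows) for rows in request.parts.values())
print(total_rows)  # 2
```

Grouping by partition up front means each storage node only receives the starting points it owns.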

Its main function is as follows: Query Engine passes in the starting points and the edge types to expand, following the structure defined above. The storage layer then finds each starting point, reads the properties of the point and the properties of its edges, and assembles them into a table that is returned to Query Engine. For the format of the returned table, refer to the definition of GetNeighborsResponse in https://github.com/vesoft-inc/nebula-common/blob/master/src/common/interface/storage.thrift. Query Engine can then extract the content it needs from this table.

For example, in the basketball data set, suppose the starting points are Tim Duncan and Manu Ginobili and we expand in both directions along the like edge to get the four properties $^.player.name, like._dst, $$.player.name and like.likeness. The returned data looks roughly as follows:

[Figure: returned data]

Table 1

Because the expansion is bidirectional, the fourth column, +like, represents the out edges, and the fifth column, -like, represents the in edges.
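The +like / -like convention can be illustrated with a small sketch that turns such columns back into directed edges. The table layout here is a simplified assumption and does not reproduce Table 1 or the real GetNeighborsResponse format: each direction column is assumed to hold a list of (neighbor, properties) pairs, and the vertices A, B and C are made up.

```python
def extract_edges(column_names, rows):
    """Turn '+type' / '-type' columns into (src, type, dst, props) tuples."""
    edges = []
    for row in rows:
        record = dict(zip(column_names, row))
        vid = record["_vid"]
        for name, cell in record.items():
            if name.startswith("+"):      # out edge: vid -> neighbor
                for neighbor, props in cell:
                    edges.append((vid, name[1:], neighbor, props))
            elif name.startswith("-"):    # in edge: neighbor -> vid
                for neighbor, props in cell:
                    edges.append((neighbor, name[1:], vid, props))
    return edges


# Hypothetical table with two starting points, A and B.
columns = ["_vid", "name", "+like", "-like"]
rows = [
    ["A", "a", [("B", {"likeness": 90})], [("C", {"likeness": 80})]],
    ["B", "b", [], [("A", {"likeness": 90})]],
]
edges = extract_edges(columns, rows)
unique_edges = {(src, etype, dst) for src, etype, dst, _ in edges}
print(len(edges))            # 3
print(sorted(unique_edges))  # [('A', 'like', 'B'), ('C', 'like', 'A')]
```

In this toy table the edge from A to B shows up twice, once as an out edge of A and once as an in edge of B, so a consumer of a bidirectional expansion would want to deduplicate.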

In the storage layer of Nebula Graph, edges are stored together with their starting point, so all the properties of the starting point and of its out edges can be obtained through the getNeighbors interface. If you want the properties of the destination point during expansion, however, you need the getProps interface. The same interface is also called when a fetch statement only needs the properties of a certain point or edge. See the definition of GetPropRequest in https://github.com/vesoft-inc/nebula-common/blob/master/src/common/interface/storage.thrift

Execution plan

With the interfaces above understood, we can move on to the execution plan. The operators needed are start, getNeighbor, subgraph, loop, and dataCollect.

start operator: Equivalent to a leaf node in the execution plan; it does nothing. Its purpose is to tell the scheduler that there are no further operators to depend on, and it can be understood as the termination condition in a recursive algorithm.
loop operator: Equivalent to the while statement in C. It has three members: depend, condition and loopBody. depend is used with multi-statement queries and PIPE and is not covered here for now; condition is the termination condition; loopBody is the body of the while loop.
subgraph operator: Extracts the _dst (destination point) property from the result of the getNeighbor operator, filters out the destination points that have already been visited (to avoid fetching the same data from the storage layer again), and uses the rest as the input of the next getNeighbor expansion.
dataCollect operator: At the end, collects the point and edge properties obtained during expansion and assembles them into vertex and edge types.
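The while-like behaviour of the loop operator can be sketched in a few lines. This is a simplified model, not the Nebula Graph executor: the class shape, the callables used for condition and loop_body, and the step counter are assumptions made for illustration.

```python
class Loop:
    """Minimal model of a loop operator with depend, condition and loop body."""

    def __init__(self, condition, loop_body, depend=None):
        self.depend = depend        # unused here, as in the single-statement case
        self.condition = condition  # callable returning True while the loop should continue
        self.loop_body = loop_body  # callable run once per iteration

    def execute(self):
        iterations = 0
        while self.condition():
            self.loop_body()
            iterations += 1
        return iterations


state = {"step": 0, "trace": []}


def expand_once():
    # Stand-in for the operators that would run inside the loop body.
    state["step"] += 1
    state["trace"].append(f"expand step {state['step']}")


loop = Loop(condition=lambda: state["step"] < 3, loop_body=expand_once)
iterations = loop.execute()
print(iterations)      # 3
print(state["trace"])  # ['expand step 1', 'expand step 2', 'expand step 3']
```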

For details of each operator, please refer to the source code at https://github.com/vesoft-inc/nebula-graph/tree/master/src/executor .
Let's take Figure 1 as an example to see how a subgraph is built.

[Figure: building the subgraph]
Figure 1

The one-step expansion case

When starting from point A and expanding along the like edge, it is easy to get all the points and edges within one step. Only two operators are needed, getNeighbor and dataCollect. The execution plan is shown in the figure below:

[Figure: the one-step expansion case]

The multi-step expansion case

The one-step case is actually a special case of the multi-step case, so the two can be handled by the same plan. When starting from point A and expanding three steps along the like edge, with the existing operators we can extract the destination points after each getNeighbor expansion and then use those destination points as the starting points of the next getNeighbor call. Looping twice in this way is enough (the termination condition of the loop operator is set on the current number of steps), so the execution plan is as shown in the figure below:

[Figure: the multi-step expansion case]

Input and output

In general, the input of each operator is the output of the operator it depends on, so the input and output of each operator can be determined directly from the dependencies in the execution plan. In some cases this does not hold. For subgraph, in the multi-step case, the input of each getNeighbor run should be the destination points of the edges from the previous expansion, that is, the output of the subgraph operator. This is inconsistent with the dependencies in the execution plan above, so the input and output of each operator have to be set explicitly. As introduced for Query Engine 2.0, the inputs and outputs of all operators are stored in a hash table whose values are vectors, as shown in the following ResultMap table:

ResultMap

The starting point is stored in ResultMap["StartVid"]
The input of getNeighbor operator is ResultMap["StartVid"], and the output is stored in ResultMap["GN_1"]
The input of the subgraph operator is ResultMap["GN_1"], and the output is stored in ResultMap["StartVid"]
The loop operator does not generate data and is used as a logic loop, so there is no need to set input and output
The input of the dataCollect operator is ResultMap["GN_1"], and the output is stored in ResultMap["DATACOLLECT_2"]

With this setup, the getNeighbor operator appends each result to the end of the vector in ResultMap["GN_1"]. The subgraph operator then takes the value at the end of the vector in ResultMap["GN_1"], computes the starting points for the next expansion, and stores them in ResultMap["StartVid"].

When the first step is expanded, the result of ResultMap is as follows:

For convenience of display, the result shows only the _dst property. In practice it also carries all the properties of the edge and all the properties of the starting point, similar to Table 1.

The subgraph operator receives "GN_1" as input, extracts _dst, and puts the result into "StartVid". When the second step is expanded, the result of ResultMap is as follows:

ResultMap

When the third step is expanded, the result of ResultMap is as follows:

ResultMap

Finally, the dataCollect operator takes all the points and edges encountered during the expansion out of ResultMap["GN_1"], assembles them into the final result, and returns it to the user.
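The data flow through ResultMap described above can be simulated on a toy graph. This is a sketch under several assumptions rather than the real executor: the adjacency map is invented, each getNeighbor result is reduced to (src, dst) pairs, new starting points are appended to ResultMap["StartVid"] and the latest entry is read, and the loop simply runs once per step, which may not match how the real plan counts iterations.

```python
from collections import defaultdict


def run_subgraph_plan(adjacency, start_vids, steps):
    result_map = defaultdict(list)           # variable name -> vector of results
    result_map["StartVid"].append(list(start_vids))
    visited = set(start_vids)

    def get_neighbor():
        # Input: latest StartVid entry. Output: appended to the GN_1 vector.
        srcs = result_map["StartVid"][-1]
        rows = [(src, dst) for src in srcs for dst in adjacency.get(src, [])]
        result_map["GN_1"].append(rows)

    def subgraph():
        # Input: last GN_1 entry. Output: unvisited _dst values as the next StartVid.
        next_vids = []
        for _, dst in result_map["GN_1"][-1]:
            if dst not in visited:
                visited.add(dst)
                next_vids.append(dst)
        result_map["StartVid"].append(next_vids)

    def data_collect():
        # Input: every GN_1 entry. Output: one (vertices, edges) pair.
        vertices, edges = set(start_vids), set()
        for rows in result_map["GN_1"]:
            for src, dst in rows:
                vertices.update((src, dst))
                edges.add((src, dst))
        result_map["DATACOLLECT_2"].append((vertices, edges))

    step = 0
    while step < steps:                       # plays the role of the loop operator
        get_neighbor()
        subgraph()
        step += 1
    data_collect()
    return result_map


# Hypothetical graph; it is not the graph in Figure 1.
toy_graph = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": ["E"], "E": []}
result_map = run_subgraph_plan(toy_graph, ["A"], steps=2)
vertices, edges = result_map["DATACOLLECT_2"][-1]
print(len(result_map["GN_1"]))     # 2
print(result_map["StartVid"][-1])  # ['D']
print(sorted(vertices))            # ['A', 'B', 'C', 'D']
```

Filtering visited points in subgraph() is what keeps A from being expanded a second time when the edge from B back to A is encountered.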

Example

Let's run a subgraph query to see the concrete structure of the execution plan in Nebula Graph. Open nebula-console, switch the space to basketball, and enter EXPLAIN format="dot" GET SUBGRAPH 2 STEPS FROM 'Tim Duncan' IN like, serve. nebula-console then generates data in dot format. Open the Graphviz Online website, paste the generated dot data, and you can see the following structure:

[Figure: dot result]

Here, the Start_0 operator is what the depend member of the loop operator points to. Since this query has no multi-statement or PIPE, it does nothing.

That concludes this walkthrough of subgraphs. If you run into problems using subgraphs or other parts of Nebula Graph, please come to the forum to discuss them with us: https://discuss.nebula-graph.com.cn/

