Nebula Graph Source Code Explained Series | Vol.06 Implementing Variable-Length Patterns in MATCH

Contents

  • Problem analysis

    • Fixed-length pattern
    • Combining variable-length patterns
  • Implementation plan

    • Expanding one step
    • Expanding multiple steps
    • Saving paths
    • Stitching variable-length patterns
  • Summary

As the core of the openCypher language, MATCH lets users express the relationships stored in a graph database through a concise pattern syntax. Variable-length patterns are a common way to describe paths within a pattern, and supporting them is the first step toward making Nebula compatible with openCypher MATCH.

As earlier articles in this series explained, a Nebula execution plan is composed of physical operators, each responsible for one piece of computation logic. The implementation of MATCH relies on operators introduced in those articles, such as GetNeighbors, GetVertices, Join, Project, Filter, and Loop. Unlike the tree-shaped plans of a relational database, a Nebula execution plan is a graph that can contain cycles. How to turn a variable-length pattern in MATCH into a Nebula physical plan is the core problem the Planner has to solve. The rest of this article outlines how Nebula approaches it.

Problem analysis

Fixed-length pattern

Fixed-length patterns are also a commonly used query form in MATCH statements. If a fixed-length pattern is understood as a variable-length pattern that expands outward by exactly one step, that is, as a special case of the latter, then fixed-length and variable-length patterns can share a single implementation, as shown below:

// Fixed-length pattern
MATCH (v)-[e]-(v2)
// Variable-length pattern
MATCH (v)-[e*1..1]-(v2)

The two forms differ in the type of the variable e: in the fixed-length form e denotes a single edge, while in the variable-length form e denotes a list of edges of length 1.
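
One way to picture this unification is a small Python sketch in which every edge pattern carries a hop range, and a fixed-length edge is simply the range 1..1. The `EdgePattern` class and the `parse_edge` helper are hypothetical names used for illustration only; they are not taken from the Nebula code base, and the sketch handles only the `[e]` and `[e*m..n]` forms.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgePattern:
    alias: str        # variable name, e.g. "e"
    min_hop: int      # lower bound m of *m..n
    max_hop: int      # upper bound n of *m..n
    binds_list: bool  # True if the alias denotes a list of edges


_EDGE_RE = re.compile(r"\[(\w+)(?:\*(\d+)\.\.(\d+))?\]")


def parse_edge(text: str) -> EdgePattern:
    """Parse "[e]" or "[e*m..n]"; other openCypher range forms are not handled."""
    match = _EDGE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported edge pattern: {text!r}")
    alias, lo, hi = match.groups()
    if lo is None:
        # Fixed-length edge: same expansion as *1..1, but the alias is one edge.
        return EdgePattern(alias, 1, 1, binds_list=False)
    return EdgePattern(alias, int(lo), int(hi), binds_list=True)


fixed = parse_edge("[e]")
variable = parse_edge("[e*1..1]")
```

Both `fixed` and `variable` describe the same one-hop expansion; only the `binds_list` flag records whether `e` should be exposed as an edge or as an edge list.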

Combining variable-length patterns

openCypher's MATCH syntax allows patterns to be combined flexibly to express complex paths. For example, one variable-length pattern can be followed directly by another:

MATCH (v)-[e*1..3]-(v2)-[ee*2..4]-(v3)

This chaining can go on: very complex paths can be built from different arrangements of variable-length and fixed-length patterns. We therefore need a plan-generation scheme that can be applied recursively across the whole pattern. The following factors must be taken into account:

  1. The paths of a later variable-length pattern depend on all the paths produced by the preceding variable-length patterns;
  2. The result denoted by every symbol (variable) that follows a variable-length pattern is itself "variable";
  3. Before each outward expansion step, the starting vertices must be de-duplicated.

Notice that once a plan can be generated for the ()-[:like*m..n]- part of a pattern, every subsequent combination follows the same recipe and can be generated iteratively, as shown below:

()-[:like*m..n]- ()-[:like*k..l]- ()
 \____________/   \____________/   \_/
    Pattern1         Pattern2       Pattern3

Implementation plan

Let's analyze how the ()-[:like*m..n]- part of the pattern above is converted into a Nebula physical execution plan. The pattern describes expanding outward by m to n steps, and in Nebula a single expansion step is performed by the GetNeighbors operator. To expand multiple steps, GetNeighbors is invoked again on the result of the previous step, and the vertex and edge data obtained at each step are connected end to end into a path. Although the user ultimately wants only the paths of m to n steps, execution still has to expand from step 1 through step n, and the paths produced at each step must be saved, either for output or as input to the next step. At the end, only the paths whose length is between m and n steps are kept.
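
The overall idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Nebula's implementation: the graph is an in-memory adjacency dictionary named `GRAPH`, edges are followed in the outgoing direction only, a path is a tuple of vertex ids, and `expand_paths` is a hypothetical helper that requires 1 <= m <= n.

```python
from typing import Dict, List, Tuple

# Toy directed graph: vertex -> destination vertices over "like" edges.
GRAPH: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

Path = Tuple[str, ...]


def expand_paths(starts: List[str], m: int, n: int) -> List[Path]:
    """Expand from step 1 to step n, save every step, return paths of m..n hops."""
    by_step: Dict[int, List[Path]] = {}  # hop count -> paths with that many hops
    frontier: List[Path] = [(v,) for v in starts]
    for step in range(1, n + 1):
        # One hop: extend every path in the frontier by each outgoing edge.
        frontier = [p + (dst,) for p in frontier for dst in GRAPH.get(p[-1], [])]
        by_step[step] = frontier
    # Only now discard the steps below m.
    return [p for step in range(m, n + 1) for p in by_step[step]]


paths = expand_paths(["a"], m=2, n=3)
```

Even though only paths of two or three hops are requested, the loop still has to produce the one-hop paths first, because they are the input of the second step.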

Expanding one step

Let's first look at the plan for expanding one step. Nebula stores a vertex together with its outgoing edges, so fetching a starting vertex and its outgoing edges does not cross partitions. The destination vertex of an edge, however, generally lives in a different partition, so its properties must be fetched separately through the GetVertices interface. In addition, the starting vertices should be de-duplicated before expanding, to avoid scanning storage repeatedly. The one-step execution plan is shown in the figure below:

[Figure: execution plan for expanding one step]
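
The order of operations in a one-step expansion can be sketched as follows. The functions `get_neighbors` and `get_vertices` are toy stand-ins that only borrow the operator names; their signatures, the `STORAGE` dictionary, and the row layout are assumptions made for this example and do not reflect Nebula's real interfaces.

```python
from typing import Dict, List, Tuple

# Toy storage: a vertex's properties sit next to its outgoing edges.
STORAGE: Dict[str, dict] = {
    "a": {"props": {"name": "Ann"}, "out": ["b", "c"]},
    "b": {"props": {"name": "Bob"}, "out": ["c"]},
    "c": {"props": {"name": "Cid"}, "out": []},
}


def dedup(vids: List[str]) -> List[str]:
    """Drop repeated vertex ids while keeping first-seen order."""
    return list(dict.fromkeys(vids))


def get_neighbors(vids: List[str]) -> List[Tuple[str, dict, str]]:
    """Toy GetNeighbors: rows of (src, src_props, dst)."""
    return [(v, STORAGE[v]["props"], d) for v in vids for d in STORAGE[v]["out"]]


def get_vertices(vids: List[str]) -> Dict[str, dict]:
    """Toy GetVertices: properties by vertex id."""
    return {v: STORAGE[v]["props"] for v in vids}


def expand_one_step(starts: List[str]) -> List[Tuple[dict, dict]]:
    # De-duplicate first so each start vertex is scanned only once.
    rows = get_neighbors(dedup(starts))
    # Destination properties are not part of the rows; fetch them separately.
    dst_props = get_vertices(dedup([dst for _, _, dst in rows]))
    return [(src_props, dst_props[dst]) for _, src_props, dst in rows]


result = expand_one_step(["a", "a", "b"])
```

Here the duplicate start vertex "a" is scanned once, and the properties of "b" and "c" are fetched in a single separate lookup.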

Expanding multiple steps

Expanding multiple steps simply repeats the process above. Note, however, that GetNeighbors also returns the properties of the starting vertex of each edge, so when the next step is expanded, one GetVertices operation can be omitted. The execution plan for a two-step expansion becomes:

[Figure: execution plan for expanding two steps]
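
The saving can be made visible with a toy model that counts operator invocations. As before, `get_neighbors` and `get_vertices` are hypothetical stand-ins; in particular, the assumption that the toy `get_neighbors` returns one entry per requested vertex, including its properties, is a simplification made for this sketch.

```python
from typing import Dict, List, Tuple

STORAGE: Dict[str, dict] = {
    "a": {"props": {"name": "Ann"}, "out": ["b"]},
    "b": {"props": {"name": "Bob"}, "out": ["c"]},
    "c": {"props": {"name": "Cid"}, "out": []},
}

calls = {"GetNeighbors": 0, "GetVertices": 0}  # counts toy operator invocations


def get_neighbors(vids: List[str]) -> Dict[str, Tuple[dict, List[str]]]:
    """Toy GetNeighbors: per source vertex, its properties and destinations."""
    calls["GetNeighbors"] += 1
    return {v: (STORAGE[v]["props"], STORAGE[v]["out"]) for v in vids}


def get_vertices(vids: List[str]) -> Dict[str, dict]:
    """Toy GetVertices: properties by vertex id."""
    calls["GetVertices"] += 1
    return {v: STORAGE[v]["props"] for v in vids}


def expand_two_steps(starts: List[str]) -> Dict[str, dict]:
    """Collect the properties of every vertex touched by a two-step expansion."""
    props: Dict[str, dict] = {}
    frontier = list(dict.fromkeys(starts))
    for _ in range(2):
        rows = get_neighbors(frontier)
        # Source properties arrive with the edges, so vertices reached in the
        # previous step need no separate GetVertices call.
        props.update({v: p for v, (p, _) in rows.items()})
        frontier = list(dict.fromkeys(d for _, dsts in rows.values() for d in dsts))
    # Only the vertices reached by the last step are still missing properties.
    props.update(get_vertices(frontier))
    return props


props = expand_two_steps(["a"])
```

In this model two expansion steps need two GetNeighbors calls but only one GetVertices call, for the vertices reached by the final step.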

Saving paths

Because the paths from every expansion step may need to be returned at the end, all paths produced during the expansion must be saved. Connecting the paths of two consecutive steps is done with the Join operator. Also, because e in ()-[e:like*m..n]- denotes a list of data (a list of edges), the paths from each step must be merged into one result set with a union. The execution plan evolves further into:

[Figure: execution plan with the paths of each step joined and merged by union]
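
The roles of the join and the union can be illustrated with a minimal Python sketch. The `join` and `union` functions below are toy versions written for this example, a path is modeled as a tuple of (src, dst) edges, and the join is run against the full toy edge list rather than against the output of a real GetNeighbors call.

```python
from typing import List, Tuple

Edge = Tuple[str, str]   # (src, dst)
Path = Tuple[Edge, ...]  # edge list, like the value of e in [e:like*m..n]

EDGES: List[Edge] = [("a", "b"), ("b", "c"), ("b", "d")]


def join(paths: List[Path], edges: List[Edge]) -> List[Path]:
    """Toy join: append each edge whose source equals the end of a path."""
    return [p + (e,) for p in paths for e in edges if p[-1][1] == e[0]]


def union(*path_sets: List[Path]) -> List[Path]:
    """Toy union: concatenate the result sets of all steps."""
    return [p for paths in path_sets for p in paths]


step1: List[Path] = [(e,) for e in EDGES if e[0] == "a"]  # one-hop paths from "a"
step2 = join(step1, EDGES)                                # two-hop paths
all_paths = union(step1, step2)
```

The join key is the end vertex of the path so far, and the union keeps the shorter paths alongside the longer ones so that every length from 1 to n remains available.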

Stitching variable-length patterns

The process above generates the plan for ()-[e:like*m..n]-. When several such patterns are stitched together, the same process is simply iterated. Before moving on to the next pattern, however, the results of the plan above must be filtered: we want only the paths of m to n steps, whereas the data set contains every path from step 1 to step n. A simple filter on path length is enough. After stitching the variable-length patterns, the plan becomes:

[Figure: execution plan after stitching the variable-length patterns]
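
Filtering and stitching can be sketched together in Python. The helpers `expand_up_to`, `match_segment`, and `stitch` are hypothetical names, the graph is a toy adjacency dictionary with outgoing edges only, and a path is a tuple of vertex ids, so its hop count is its length minus one.

```python
from typing import Dict, List, Tuple

GRAPH: Dict[str, List[str]] = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
Path = Tuple[str, ...]  # sequence of vertex ids


def expand_up_to(starts: List[str], n: int) -> List[Path]:
    """All paths of 1..n hops from the de-duplicated start vertices."""
    frontier: List[Path] = [(v,) for v in dict.fromkeys(starts)]
    found: List[Path] = []
    for _ in range(n):
        frontier = [p + (d,) for p in frontier for d in GRAPH.get(p[-1], [])]
        found.extend(frontier)
    return found


def match_segment(starts: List[str], m: int, n: int) -> List[Path]:
    """Expand 1..n hops, then keep only paths with at least m hops."""
    return [p for p in expand_up_to(starts, n) if len(p) - 1 >= m]


def stitch(first: List[Path], m: int, n: int) -> List[Tuple[Path, Path]]:
    """Continue every path of the first segment with a second m..n segment."""
    second = match_segment([p[-1] for p in first], m, n)
    # Join on: end vertex of the first segment == start vertex of the second.
    return [(p1, p2) for p1 in first for p2 in second if p1[-1] == p2[0]]


# Roughly (v)-[e*1..2]->(v2)-[ee*1..1]->(v3), starting from vertex "a".
first = match_segment(["a"], 1, 2)
pairs = stitch(first, 1, 1)
```

The length filter runs before the second segment starts, so only end vertices of valid first-segment paths become starting points for the next expansion.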

Through this step-by-step decomposition we arrive at the execution plan for the original MATCH statement. As you can see, it takes considerable effort to translate a complex pattern into the underlying expansion interfaces. The plan above can of course be optimized, for example by wrapping the multi-step expansion in the Loop operator so that the one-step expansion sub-plan is reused; we will not go into the details here. Interested readers can refer to the Nebula source code for the implementation.

Summary

The process above demonstrates how the execution plan for a MATCH statement with a variable-length pattern is generated. At this point you may be wondering why such basic path expansion produces so complex an execution plan in Nebula. Neo4j does the same work with only a few operators, so why does it turn into a cumbersome DAG here?

The underlying reason is that Nebula's operators sit close to the low-level interfaces and lack semantic abstractions for higher-level graph operations. When operator granularity is too fine, upper layers such as the optimizer have to deal with too many details. The execution operators will be reorganized in future work to gradually complete the MATCH functionality and improve performance.

"The Complete Guide to the Open Source Distributed Graph Database Nebula Graph", also known as the Nebula Little Book, documents in detail the key concepts and practical usage of graph databases in general and of Nebula Graph in particular. Read it here: https://docs.nebula-graph.com.cn/site/pdf/NebulaGraph-book.pdf

NebulaGraph