GraphX Graph Computing in Practice: Extracting Specific Subgraphs with Pattern Matching

This article was first published on the Nebula Graph Community public account

Foreword

Nebula Graph itself provides high-performance OLTP queries that handle a wide range of real-time query scenarios well. It also provides the nebula-algorithm library, built on Spark GraphX, to support real-time graph algorithms. Kudos to Nebula for that!

In practice, however, I found that Nebula's support falls short in some OLAP scenarios, particularly when you want to do pattern matching analysis.

By pattern matching I mean extracting, from a large graph, the subgraphs that satisfy specific rules.

As a simple example, suppose you want to perform a second-degree (two-hop) expansion from every vertex, filter the results by some logic, and keep only the subgraphs that meet the requirements. Such a task is not easy to implement with nebula-algorithm.

Of course, the example above could be handled by writing nGQL statements, but Nebula's strength lies in OLTP scenarios, that is, queries that start from specific vertices. For computations over the full graph, neither its computing architecture nor its memory capacity is a good fit. So, to fill this gap (pattern matching), I use Spark GraphX here to meet the OLAP computing requirements.

Introduction to GraphX

GraphX is the distributed graph computing engine of the Spark ecosystem. It provides many graph computing interfaces that make graph operations convenient. I will not cover the basics of GraphX here and will focus on the idea behind implementing pattern matching.

The implementation relies mainly on one important API, the Pregel API. It follows the BSP (Bulk Synchronous Parallel) computing model, in which a computation is carried out as a series of supersteps.

The definition alone is not easy to grasp, so I will go straight to its implementation in GraphX and how it is used.

How Pregel works

The source code is defined as follows:

  def pregel[A: ClassTag](
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {
    Pregel(graph, initialMsg, maxIterations, activeDirection)(vprog, sendMsg, mergeMsg)
  }

The relevant parameters have the following meanings:

initialMsg: the initial message; vprog is called on every vertex with initialMsg to initialize it;
maxIterations: the maximum number of iterations;
activeDirection: controls on which edges sendMsg runs, relative to the active vertices; only triplets that satisfy the direction requirement take part in the next iteration;
vprog: the function that updates vertex data; after a vertex receives a message, it runs this logic to compute the vertex's new attribute;
sendMsg: sends messages between vertices; it takes a triplet that satisfies the activeDirection condition and sends messages to a VertexId, which can be the src vertex or the dst vertex;
mergeMsg: when several messages arrive for the same VertexId, they are merged into one so that vprog can process them.

The definitions alone still do not make the logic very clear, so here is how a Pregel computation iterates:

1. For a graph object, only active vertices take part in the next iteration; a vertex is active if it received a message A in the previous round;
2. First, all vertices are initialized by calling vprog once on each of them with initialMsg as the message, so that every vertex is active;
3. Next, the graph is split into triplets, each made up of a src vertex, an edge, and a dst vertex; only triplets that touch an active vertex in the direction given by activeDirection are kept;
4. sendMsg runs on each of these triplets and sends messages of type A to a VertexId. Because the return value is an Iterator, one call can send to both src and dst; returning Iterator.empty means no message is sent;
5. A vertex may receive several messages, so mergeMsg is called to merge them into a single A;
6. After merging, vprog is called to update the vertex data, which completes one iteration;
7. Steps 3 to 6 repeat until maxIterations iterations have run or no vertex is active, at which point the Pregel computation is complete.
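To make the steps above concrete, here is a minimal single-machine sketch of the loop in plain Scala. It is not GraphX's implementation: MiniPregel, its Edge class, and the simplified sendMsg signature (an edge plus the values of both endpoints) are hypothetical stand-ins, activeDirection is fixed to the behaviour of EdgeDirection.Either, and every edge is assumed to connect vertices that exist in the vertex map.

```scala
object MiniPregel {
  // Plain-Scala stand-in for a GraphX edge (hypothetical, for illustration only).
  final case class Edge[ED](src: Long, dst: Long, attr: ED)

  def run[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[Edge[ED]],
      initialMsg: A,
      maxIterations: Int)(
      vprog: (Long, VD, A) => VD,
      sendMsg: (Edge[ED], VD, VD) => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {
    // Initialization: call vprog once per vertex with initialMsg; all vertices start active.
    var state = vertices.map { case (id, vd) => id -> vprog(id, vd, initialMsg) }
    var active: Set[Long] = state.keySet
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      // Keep only triplets that touch an active vertex (EdgeDirection.Either semantics).
      val triplets = edges.filter(e => active(e.src) || active(e.dst))
      // Run sendMsg on every remaining triplet and collect the (target, message) pairs.
      val sent = triplets.flatMap(e => sendMsg(e, state(e.src), state(e.dst)))
      // Merge all messages addressed to the same vertex into a single message.
      val inbox: Map[Long, A] =
        sent.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(mergeMsg) }
      // Update the vertices that received a message; they are the next active set.
      state = state ++ inbox.map { case (id, msg) => id -> vprog(id, state(id), msg) }
      active = inbox.keySet
      i += 1
    }
    state
  }
}
```

For example, if vprog and mergeMsg both take the minimum and sendMsg forwards the smaller endpoint value to the other endpoint, the chain 1 -> 2 -> 3 converges to the label 1 on all three vertices, and the loop stops on its own once no message is sent.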

Pattern matching ideas

Knowing how Pregel computes, pattern matching follows from the idea of iteration: keep aggregating edge information onto vertices, and control the message-sending logic during the iteration so that only paths of a specific pattern are built.

We can define a message as a collection of paths. The message a vertex sends is its own path collection with an edge e appended to every path, which realizes path traversal. From the point of view of a single vertex, this is essentially a breadth-first traversal.

Taking the second-degree query as an example again, see the following:

[Figure: example graph with two edges, A-E1->B and B-E2->C]

First, every vertex is initialized: its attribute is an empty path set, represented as a two-dimensional array, and all vertices become active.

Then, in the first iteration, there are two triplets, A-E1->B and B-E2->C, so the result of this iteration is easy to derive: A: [], B: [[E1]], C: [[E2]].

Now the second iteration. There is one restriction here: a path that has already been sent is not sent again, that is, we check whether E has already been received, to prevent repeated sending. So in the second iteration only the triplet B-E2->C is effective: E2 is appended to every path in B's set and the result is sent to C, which merges the paths. The result is: A: [], B: [[E1]], C: [[E2], [E1,E2]].

At this point, the set on vertex C contains both edges E1 and E2, which is exactly the result of the 2-degree traversal from vertex A.
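The walkthrough above can be sketched with the GraphX Pregel API roughly as follows. This is a minimal sketch rather than the author's code: the object name TwoHopPaths, the representation of a path as a list of edge names, and the vertex IDs 1, 2, 3 standing for A, B, C are all assumptions made for illustration. It needs Spark with GraphX on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, VertexId}

object TwoHopPaths {
  type Path  = List[String] // edge names in traversal order
  type Paths = List[Path]   // the "two-dimensional array" stored on each vertex

  // vprog: add newly received paths to the vertex's path set.
  def vprog(id: VertexId, attr: Paths, msg: Paths): Paths = (attr ++ msg).distinct

  // sendMsg: offer dst the edge itself plus every src path extended by this edge,
  // skipping paths that dst already holds so that nothing is sent twice.
  def sendMsg(t: EdgeTriplet[Paths, String]): Iterator[(VertexId, Paths)] = {
    val candidates: Paths = List(t.attr) :: t.srcAttr.map(path => path :+ t.attr)
    val fresh = candidates.filterNot(path => t.dstAttr.contains(path))
    if (fresh.isEmpty) Iterator.empty else Iterator((t.dstId, fresh))
  }

  // mergeMsg: union of the path sets addressed to the same vertex.
  def mergeMsg(a: Paths, b: Paths): Paths = (a ++ b).distinct

  // Builds the A-E1->B, B-E2->C graph (A=1, B=2, C=3) and runs two iterations.
  def run(sc: SparkContext): Map[VertexId, Paths] = {
    val vertices = sc.parallelize(Seq[(VertexId, Paths)]((1L, Nil), (2L, Nil), (3L, Nil)))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "E1"), Edge(2L, 3L, "E2")))
    Graph(vertices, edges)
      .pregel[Paths](Nil, maxIterations = 2)(vprog, sendMsg, mergeMsg)
      .vertices
      .collect()
      .toMap
  }
}
```

Run against a local SparkContext, vertex 3 (C) should end up holding [[E2], [E1,E2]], matching the result derived by hand. On graphs with cycles this sendMsg keeps producing longer paths, so maxIterations is what bounds the traversal depth.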

The A, B, C example only illustrates the idea; the core logic is to pass edges along in order to traverse paths. In practice, each vertex receives paths that originate from many vertices, so the results on each vertex need to be filtered and then grouped by head vertex. The following example shows how:

[Figure: example graph with vertices A, B, C, D, F, G, H and edges E1 to E6]

In this example, the required result is the 2-degree path subgraphs of A and G. I will skip the per-iteration results and list the attributes of vertices C and F directly: C: [[E2],[E6],[E1,E2],[E5,E6]], F: [[E4],[E3,E4]]. Vertices H, B and D also hold paths, but the results we want are clearly on C and F.

So the results exist but are scattered. How do we combine them? We can use the source vertex of the first edge of each path as the key, since every path is built in order during the iteration. This key is in fact the target vertex: for example, E1 and E3 start at A, and E5 starts at G. We attach the key to each path to form key:path, filter out paths with fewer than 2 edges, and then group by key to obtain the subgraph paths of each target vertex. That gives us the 2-degree edges of A and G.
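The grouping step can be sketched in plain Scala on the collected vertex attributes. The names GroupByHead, PathEdge and subgraphsByHead are hypothetical, and so is the idea of storing both endpoints on every edge of a path; the text only states where E1, E3 and E5 start, so the remaining endpoints used in the example data would be assumptions as well.

```scala
object GroupByHead {
  // Hypothetical edge record: the name plus both endpoints, so a path knows where it starts.
  final case class PathEdge(name: String, src: String, dst: String)
  type Path = List[PathEdge]

  // vertexPaths maps each vertex to the paths accumulated on it during the iteration.
  // Returns the paths with at least minEdges edges, grouped by the source of their first edge.
  def subgraphsByHead(
      vertexPaths: Map[String, List[Path]],
      minEdges: Int): Map[String, List[Path]] = {
    require(minEdges >= 1, "minEdges must be at least 1 so that every kept path has a head")
    vertexPaths.values.flatten.toList
      .filter(_.length >= minEdges)   // drop paths that are too short, e.g. [E2]
      .groupBy(_.head.src)            // key = source vertex of the first edge
  }
}
```

Called with the attributes of C and F listed above and minEdges = 2, this yields two groups: key A with the paths [E1,E2] and [E3,E4], and key G with the path [E5,E6]. In a Spark job the same filter and group would typically run on the vertex RDD rather than on a collected Map.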

Extending the idea

The 2-degree expansion example is relatively simple. Real business scenarios vary a lot and the graph structure is more complex, which raises questions such as:

How to traverse vertices with different labels
How to traverse edges of different types
How to handle cycles
Whether edges are directed or undirected
How to handle multiple edges between the same pair of vertices
...

And so on. The core idea stays the same: implement a breadth-first traversal on top of Pregel and accumulate edges into path information. The main logic lives in the sendMsg method, which decides whether or not to send a message and so determines where paths go, in order to meet the business requirements of pattern matching. Each iteration accumulates one more layer of path information, so the number of iterations equals the traversal depth. When the iteration finishes, every vertex holds some results, which may be intermediate or final. Typically they are grouped by a chosen key (usually the head vertex) and filtered with business logic (such as path length) to obtain subgraphs of the specified structure, which can then be used for business analysis.

In addition, simple pattern matching such as second-degree expansion can also be done with GraphFrames. Its operators resemble Spark SQL, so results are easy to obtain and the code is much simpler. However, its documentation is sparse and it lacks the flexibility of GraphX's operators, so I do not recommend it for complex patterns. Take a look if you are interested.
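As one hedged illustration of that control, the decision of whether sendMsg should extend a path with the next edge can be isolated in a small predicate. The PathEdge record, the canExtend name, and the specific rules (a depth limit, a set of allowed edge types, and no revisited vertex, which breaks cycles) are assumptions chosen for illustration; real rules depend on the pattern the business needs.

```scala
object ExtendRules {
  // Hypothetical edge record; id distinguishes parallel edges between the same two vertices.
  final case class PathEdge(id: String, src: Long, dst: Long, kind: String)

  // Decides whether `next` may be appended to `path` (edges in traversal order).
  def canExtend(
      path: List[PathEdge],
      next: PathEdge,
      allowedKinds: Set[String],
      maxDepth: Int): Boolean = {
    // Every vertex already on the path, plus the vertex we are extending from.
    val visited: Set[Long] = path.flatMap(e => List(e.src, e.dst)).toSet + next.src
    path.length < maxDepth &&                     // stay within the wanted depth
    allowedKinds.contains(next.kind) &&           // only follow the wanted edge types
    path.lastOption.forall(_.dst == next.src) &&  // the new edge must continue the path
    !visited.contains(next.dst)                   // never re-enter a vertex: no cycles, no self-loops
  }
}
```

Inside a GraphX sendMsg, a check of this kind would be applied to every path on the source vertex before the extended path is put into the outgoing message, so that unwanted paths are never sent at all.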

Summary

Using GraphX's Pregel API for breadth-first traversal to implement pattern matching has the following advantages:

GraphX offers a variety of graph operators for processing graph data flexibly;
With paths as Pregel messages, the structure of the pattern subgraph can be controlled flexibly; in theory, a pattern of any structure can be extracted;
It supports full-graph pattern matching over large amounts of data, making up for Nebula's shortcomings in OLAP;
It integrates seamlessly with the big data ecosystem, so the results are easy to analyze and use.

Although this method achieves pattern matching, it also has quite a few drawbacks, for example:

The message in each iteration is a collection of paths, and it grows as the iterations proceed, which leads to large JOINs and high memory usage. This can be mitigated by filtering out unnecessary messages before they are sent;
The number of iterations is limited, since too many will exhaust memory, but in typical business scenarios more than 10 layers are rare;
Vertex IDs are usually Strings, so a mapping table has to be built in advance and the IDs converted back after the computation, which causes a lot of shuffles during the computation.
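The ID mapping in the last point can be sketched as follows. This is a minimal sketch, assuming the edges arrive as (srcId, dstId, edgeName) triples of Strings; StringIdGraph and build are hypothetical names. It builds the mapping table with zipWithUniqueId and translates the edge endpoints with two joins, which is where the extra shuffles come from. Keeping the original String as the vertex attribute is one way to avoid another join when converting back.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object StringIdGraph {
  // rawEdges: (srcId, dstId, edgeName) with String vertex IDs.
  def build(rawEdges: RDD[(String, String, String)]): Graph[String, String] = {
    // Mapping table: every distinct String ID gets a unique Long.
    // Cached because it is reused three times and must stay identical across uses.
    val ids: RDD[(String, VertexId)] = rawEdges
      .flatMap { case (s, d, _) => Seq(s, d) }
      .distinct()
      .zipWithUniqueId()
      .cache()

    // Two joins (two shuffles): first translate src, then translate dst.
    val edges: RDD[Edge[String]] = rawEdges
      .map { case (s, d, name) => (s, (d, name)) }
      .join(ids)
      .map { case (_, ((d, name), srcId)) => (d, (srcId, name)) }
      .join(ids)
      .map { case (_, ((srcId, name), dstId)) => Edge(srcId, dstId, name) }

    // The vertex attribute is the original String ID, so results can be read back directly.
    Graph(ids.map(_.swap), edges)
  }
}
```

After a Pregel run on such a graph, paths still refer to Long IDs wherever they store vertex IDs, so any ID kept inside a message needs the same mapping applied in reverse.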

If you have a better solution to the problems above, or know another computing engine that handles them better, please do get in touch and share your advice!

Finally, although GraphX is hard to get started with and its computation depends heavily on memory, it is still an excellent graph computing framework. In particular, being distributed, it can process large amounts of data, and Spark integrates well with the big data ecosystem. With the official nebula-spark-connector for reading and writing Nebula data, it is very convenient to use.

That is all I have to share. Everyone is welcome to exchange better ideas!

I am Fanfan, a big data development engineer currently working on graph products and dedicated to putting large-scale graph data to use in business. Recently I used GraphX to develop pattern matching for several business requirements, and I am sharing some of my ideas here.

