This article was first published on the Nebula Graph Community WeChat official account

Building Graph Learning Capabilities on Nebula Graph

Readers who often browse technical articles may have noticed that, below or beside the article they are reading, several related articles on the same topic or by the same author are waiting for them. Online shoppers may have discovered that once they buy or view a product, the "you may also like" recommendations closely resemble the item they just looked at. Many of these everyday experiences are powered by a technology called graph learning.

Graph learning, as the name suggests, is machine learning on graphs. In the words of Yang Xin, leader of the graph learning interest team behind this project, graph learning is a deep learning method that uses the structure and attributes of a graph to generate a vector representation of a vertex, an edge, or the entire graph.

In their Nebula Hackathon 2021 entry, the graph learning interest team "proudly" declared that Nebula Graph should have the ability to support graph learning. In the rest of this article, you will see how they achieved this goal.

A quick lesson on graph learning

Before getting into the project's implementation, project leader Yang Xin first gave everyone a short illustrated lesson on graph learning.

(Figure: the graph learning process)

Taking the vector representation of vertices as an example, the graph learning process works as follows: the first step is to sample each vertex's neighbors in the graph to obtain the neighbors' topology and attributes; the second step is to aggregate the neighbor vertices with a custom aggregation function; the last step is to derive the vector representation of each vertex in the graph from the aggregated information.
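To make the three steps concrete, here is a minimal, self-contained C++ sketch of one round of this sample-aggregate-encode loop, using uniform neighbor sampling and element-wise mean aggregation. Both the aggregation choice and all names are illustrative assumptions, not the team's actual code:

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

using VertexId = int64_t;
using Embedding = std::vector<float>;

// Step 1: uniformly sample up to `k` neighbors of `v` (with replacement).
// A weighted sampler would use the alias table described later in this
// article. Assumes `v` has at least one neighbor.
std::vector<VertexId> SampleNeighbors(
    const std::unordered_map<VertexId, std::vector<VertexId>>& adj,
    VertexId v, size_t k, std::mt19937& rng) {
  std::vector<VertexId> out;
  const auto& nbrs = adj.at(v);
  std::uniform_int_distribution<size_t> pick(0, nbrs.size() - 1);
  for (size_t i = 0; i < k; ++i) out.push_back(nbrs[pick(rng)]);
  return out;
}

// Steps 2 and 3: aggregate the sampled neighbors' features by element-wise
// mean, producing a new vector representation for the vertex.
Embedding MeanAggregate(
    const std::unordered_map<VertexId, Embedding>& features,
    const std::vector<VertexId>& sampled, size_t dim) {
  Embedding agg(dim, 0.0f);
  for (VertexId n : sampled)
    for (size_t d = 0; d < dim; ++d) agg[d] += features.at(n)[d];
  for (size_t d = 0; d < dim; ++d) agg[d] /= sampled.size();
  return agg;  // a real model would feed this through a learned transform
}
```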

In Yang Xin's words, the quality of a graph learning algorithm is judged by how richly the generated vectors capture the vertices' attributes and topological structure.

The beginning of the story

When choosing a graph learning framework, the graph learning interest team had its own set of selection criteria: first, it should be an open source component, which makes secondary development convenient; second, it should support distributed training; and finally, the framework should be loosely coupled with the underlying graph database and easy to customize. After in-depth research, they ruled out DGL (link: https://www.dgl.ai/) because of its tight coupling with its underlying graph store, and PyG (link: https://pytorch-geometric.readthedocs.io/en/latest/) because it does not support distributed training. They finally adopted Euler (GitHub: https://github.com/alibaba/euler). At Nebula Hackathon 2021, the team's focus was to replace Euler's native graph database with Nebula Graph, allowing community users to try graph learning capabilities on Nebula Graph at low cost.

Design ideas

(Figure: three-layer architecture of the solution)

In terms of design, the architecture is divided into three layers: the bottom layer is the Nebula Graph database; the middle layer is the graph sampling operator layer, which gives the upper-layer Euler graph algorithms the ability to sample graph data. The graph sampling operators were the focus of the team's work: they rewrote 22 TensorFlow sampling operators and implemented 6 nGQL sampling syntaxes.

A heavy development workload

Take the design of the global weighted sampling syntax as an example:

(Figure: global weighted sampling, steps 1-4)

Global sampling is implemented here by combining offline pre-computation with real-time computation push-down. The main process is (steps 1-4 in the figure above):

  1. graphd submits an asynchronous build task to metad;
  2. metad dispatches the task to each storaged node for computation;
  3. each storaged node reports its computed results to metad for aggregation;
  4. metad synchronizes the results to graphd via heartbeat.

For context: during graph learning training, data sampling is an extremely high-frequency operation, so sampling performance has a large impact on total training time.

Here, Yang Xin and his teammates chose the alias sampling algorithm as the core of the global weighted sampling syntax. Its basic idea is to pre-compute a sampling table over all candidate data according to their weights. Each row of the table contains a row number (numbered from 0), vid1, a sampling probability (prob), and vid2, where vid1 and vid2 are the vids of vertices that can be sampled. To draw a sample, first pick a row of the table uniformly at random, then generate a random value p between 0 and 1: if p < prob, the sample drawn is vid1; otherwise it is vid2.
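A minimal C++ sketch of that lookup is shown below. The struct layout and names are assumptions for illustration; Nebula's actual table lives in RocksDB, as described later:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One row of the alias table: sample vid1 with probability `prob`,
// otherwise fall back to the alias vid2.
struct AliasRow {
  int64_t vid1;
  int64_t vid2;
  double prob;
};

// O(1) weighted sampling: one uniform row pick plus one uniform [0,1) draw.
int64_t AliasSample(const std::vector<AliasRow>& table, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> row_dist(0, table.size() - 1);
  std::uniform_real_distribution<double> p_dist(0.0, 1.0);
  const AliasRow& row = table[row_dist(rng)];
  return p_dist(rng) < row.prob ? row.vid1 : row.vid2;
}
```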

Since graph learning currently trains on static graph data, the alias table can be generated offline by pre-computation, and real-time sampling then reduces to two random draws against the pre-computed table. The per-sample time complexity drops from O(n) to O(1), which greatly improves sampling efficiency.

The following is the concrete implementation of global weighted sampling on Nebula Graph; it closely resembles Nebula Graph's index rebuilding logic.

  1. The Graph service calls the Meta service to start the asynchronous sample-table build task, and supports querying the status of that asynchronous task.
  2. The Meta service controls the execution of the asynchronous task: it allocates the data to be processed by each Storage node (determined by the partition leader distribution), asynchronously triggers each Storage node to compute its part of the sampling table, records the task's execution status, and records global information (counts and weight sums of points and edges). The global information is cached in the Graph service via the heartbeat logic.
  3. The Storage service generates the sample table for the data it is responsible for. Computing the sampling table is the core logic of the whole process: it requires numbering all points and edges and classifying them by weight. Obviously all of this data cannot be held in memory, so a dedicated sampling-statistics RocksDB instance was added to the Storage service to store the sampling table, the total numbers of points and edges, their weights, and intermediate variables. Take computing the sampling table for one type of point as an example (a construction sketch follows the four steps below):

The first step is to traverse the whole graph to count the weights and the number of points of this type. At the same time, to prepare for generating the sampling table, each point's sequence number (the order in which it was read), vid, and weight are stored in a (B1) structure. The key layout is type + tagid + sid, where type marks the data as B1 type, tagid is the point's tag, and sid is the point's sequence number; this is the same as the key structure of the sampling table (a small sketch of this layout follows).
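For illustration, a key in that layout might be encoded like this in C++. The field widths are assumptions; big-endian byte order is used so RocksDB's lexicographic key order matches numeric sid order:

```cpp
#include <cstdint>
#include <string>

// Illustrative key layout: type + tagid + sid, matching the (B1)/(B2)/(B3)
// description above. B1/B2/B3 rows share this layout and differ only in
// the `type` byte.
std::string EncodeSampleKey(uint8_t type, uint32_t tagid, uint64_t sid) {
  std::string key;
  key.push_back(static_cast<char>(type));
  for (int shift = 24; shift >= 0; shift -= 8)   // tagid, big-endian
    key.push_back(static_cast<char>((tagid >> shift) & 0xFF));
  for (int shift = 56; shift >= 0; shift -= 8)   // sid, big-endian
    key.push_back(static_cast<char>((sid >> shift) & 0xFF));
  return key;
}
```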

The second step is to classify the data. Based on their normalized weights, all points are split into two groups, with the reciprocal of the total number of points as the boundary, and stored in (B2) and (B3) structures respectively. Compared with the (B1) structure, only the type value in the key differs.

The third step is to traverse (B2) and (B3) and, following the alias sampling algorithm, compute the sampling probability prob of each point. Each point is identified here by its sid, meaning what is actually computed is the sampling probability of the sid-th point; the probability value is then written back into the (B1) structure as part of the sampling table.

The fourth step is to delete the intermediate (B2) and (B3) structures to free disk space.
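The four steps above map onto the standard alias-table construction. Below is an in-memory C++ sketch of that construction; the real implementation streams the (B1)/(B2)/(B3) structures through RocksDB rather than holding vectors in memory, and all names here are illustrative:

```cpp
#include <cstdint>
#include <vector>

struct AliasRow { int64_t vid1; int64_t vid2; double prob; };

// Build an alias table from parallel (vid, weight) arrays. The `small` and
// `large` worklists play the role of the (B2)/(B3) intermediates; the
// resulting table is the (B1) structure.
std::vector<AliasRow> BuildAliasTable(const std::vector<int64_t>& vids,
                                      const std::vector<double>& weights) {
  const size_t n = vids.size();
  double total = 0;
  for (double w : weights) total += w;

  // Scale each weight so the average is 1; the boundary between the two
  // classes is weight/total == 1/n, as in step two above.
  std::vector<double> scaled(n);
  std::vector<size_t> small, large;
  for (size_t i = 0; i < n; ++i) {
    scaled[i] = weights[i] * n / total;
    (scaled[i] < 1.0 ? small : large).push_back(i);
  }

  // Rows start as "always sample vid1"; prob is refined below (step three).
  std::vector<AliasRow> table(n);
  for (size_t i = 0; i < n; ++i) table[i] = {vids[i], vids[i], 1.0};

  // Pair each underfull row with an overfull one.
  while (!small.empty() && !large.empty()) {
    size_t s = small.back(); small.pop_back();
    size_t l = large.back(); large.pop_back();
    table[s] = {vids[s], vids[l], scaled[s]};
    scaled[l] -= 1.0 - scaled[s];
    (scaled[l] < 1.0 ? small : large).push_back(l);
  }
  return table;  // step four: the intermediates are discarded
}
```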

Real-time sampling against the table is relatively simple. Once pre-computation of the sampling table finishes, the point and edge weight statistics are reported to the Meta service for storage, and the Graph service is notified via heartbeat to cache the data locally. From these statistics, the weight ratio of the points and edges held by each Storage service can be computed; the number of samples requested by the user is allocated to the Storage services in proportion, each Storage service draws the required number of samples with two random-number operations per sample and returns the data, and the results are finally aggregated on the Graph service, completing one weighted sampling.
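A rough sketch of how the Graph service might split a user-requested sample count across Storage services in proportion to their cached weight sums. This is a simplification with assumed names; the real statistics and remainder handling may differ:

```cpp
#include <cstdint>
#include <vector>

// Split `count` samples across storage nodes proportionally to each node's
// total sampling weight; rounding leftovers are assigned to node 0 here.
std::vector<int64_t> AllocateSamples(const std::vector<double>& node_weights,
                                     int64_t count) {
  double total = 0;
  for (double w : node_weights) total += w;
  std::vector<int64_t> alloc(node_weights.size());
  int64_t assigned = 0;
  for (size_t i = 0; i < node_weights.size(); ++i) {
    alloc[i] = static_cast<int64_t>(count * node_weights[i] / total);
    assigned += alloc[i];
  }
  alloc[0] += count - assigned;  // hand the rounding remainder to node 0
  return alloc;
}
```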

In addition, to improve the robustness of the asynchronous task, features such as failure retry, duplicate-task limiting, cleanup of dirty data when rebuilding the sampling table, and sampling-table export were also implemented.

Reflecting on the problems encountered while designing the project and rewriting the other operators, team leader Yang Xin said that because Nebula Graph stores data on disk, replacing Euler's native in-memory graph database with Nebula Graph was bound to lower sampling efficiency. The primary problem to solve, therefore, was keeping the training time of the modified Euler as close to the original as possible, and at the very least ensuring the sampling performance of the two was not orders of magnitude apart.

The graph learning algorithms in native Euler are implemented in Python. They invoke C++ sampling operators through TensorFlow's OP mechanism, and the sampling operators in turn call the in-memory graph database directly to fetch data, completing one round of data sampling.

The team's initial plan was to implement the sampling operators in Python, so that one round of data sampling became: the graph learning algorithm directly calls the sampling operator, and the operator fetches data by executing the sampling syntax through Nebula Graph's Python client. This gave the modified system a clearer processing flow, but testing showed its training time was far longer than native Euler's: each sampling operation during training took dozens of times as long as in native Euler, which was clearly unacceptable. By profiling the execution time of each stage, they found that the total data sampling time greatly exceeded the execution time of the sampling syntax itself, and concluded that the Python client was spending most of its time on deserialization. Their first optimization was therefore to replace the serialization component of the native Nebula Graph Python client with fbthrift fastproto. This doubled performance and did reduce the overall training time, but the result still could not meet their requirements.

They then turned to a second scheme: rewriting the sampling operators in C++ and executing the sampling syntax through Nebula Graph's C++ client inside the operators. Compared with the first scheme, this required more adaptation work: rewriting the operators mainly meant adapting the input and output of the C++ client, and the C++ client itself was also modified to fit the operators' calling convention.
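For readers unfamiliar with TensorFlow's OP mechanism, a skeleton of such a C++ sampling operator might look like the following (TF 1.x-style API). The op name, its shape, and the stubbed Nebula client call are assumptions for illustration, not the team's actual code:

```cpp
#include <algorithm>
#include <vector>

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

// Hypothetical op: draw `count` vertex ids via Nebula's sampling syntax.
REGISTER_OP("NebulaSampleNode")
    .Input("count: int64")
    .Output("node_ids: int64")
    .SetShapeFn([](shape_inference::InferenceContext* c) {
      c->set_output(0, c->Vector(shape_inference::InferenceContext::kUnknownDim));
      return Status::OK();
    });

class NebulaSampleNodeOp : public OpKernel {
 public:
  explicit NebulaSampleNodeOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const int64 count = ctx->input(0).scalar<int64>()();

    // Assumption: a Nebula C++ client session would execute the sampling
    // syntax here and the vids would be parsed from the result set.
    // Stubbed with zeros to keep the sketch self-contained.
    std::vector<int64> vids(count, 0);

    Tensor* out = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(
        0, TensorShape({static_cast<int64>(vids.size())}), &out));
    std::copy(vids.begin(), vids.end(), out->flat<int64>().data());
  }
};

REGISTER_KERNEL_BUILDER(Name("NebulaSampleNode").Device(DEVICE_CPU),
                        NebulaSampleNodeOp);
```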

This became their final solution. Although the modified Euler's training time is still higher than native Euler's, the gap is kept within a factor of two, which met expectations.

Interesting Hackathon

Asked which projects in this Hackathon caught the graph learning interest team's attention, team leader Yang Xin said he paid most attention to the Siberian Tigers team's project on speeding up Nebula's deep (multi-hop) queries; its optimization ideas were very instructive for them.

In actual use, the business side served by the graph learning interest team has many multi-hop query requirements, including GO queries, FIND PATH queries, and so on. This is supposed to be Nebula Graph's strength, and in most scenarios it is, but when a query hits vertices with very large out-degrees, performance deteriorates sharply. The main reason is that under the current Nebula Graph architecture, a multi-hop query can only compute the next hop after the result set of the previous hop has been aggregated in the Graph service; multi-hop computation cannot be pushed down at all. The Siberian Tigers' project gave them an idea for solving exactly this problem.

Attachment: the graph learning interest team's project design document: https://docs.qq.com/doc/DVnZQUVBzd1dmS0lS

Want to exchange ideas on graph database technology? To join the Nebula exchange group, please fill in your Nebula business card first, and the Nebula assistant will pull you into the group~
