This article was first published on the Nebula Graph Community WeChat official account.

Graph Computing on nLive: Nebula's Graph Computing Practice

In the Graph Computing on nLive live event, nebula-plato maintainer Hao Tong and nebula-algorithm maintainer Nicole from the Nebula R&D team shared their views on graph computing.

Guests

  • Wang Changyuan: Forum ID: Nicole, maintainer of nebula-algorithm;
  • Hao Tong: Forum ID: caton-hpg, maintainer of nebula-plato;

First up is Hao Tong, the maintainer of nebula-plato.

nebula-plato for graph computing


The nebula-plato talk mainly covers an overview of graph computing systems, an introduction to the Gemini graph computing system, an introduction to the Plato graph computing system, and how Nebula integrates with Plato.

Graph Computing Systems

Division of Graphs


This overview of graph computing systems focuses on how graphs are divided, sharded, and stored.

A graph consists of vertices and edges, and the graph structure itself is divergent, without natural boundaries. To partition a graph, you must cut either vertices or edges.


With vertex cutting, a vertex is split into multiple parts, and each partition stores some of them. This causes two problems: keeping the vertex data consistent, and network overhead. Vertex cutting also brings data redundancy, since multiple copies of a vertex are stored.


With edge cutting, an edge is split into two parts stored in two different partitions, which incurs network overhead during (iterative) computation. Likewise, edge cutting causes storage overhead from data redundancy.


Besides cutting the graph structure itself, dividing a graph also involves partitioning the data for storage. There are usually two ways to partition graph data: one is hash sharding, the other is range sharding. Range sharding divides the data into continuous ranges determined by the shard key. For example, machine 1 holds keys in an early alphabetical range, such as "banana" and "car", while machine 2 holds keys in the range that includes "dark"; the data is sharded according to rules like these.
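To make the two partitioning schemes concrete, here is a minimal sketch; the two-shard setup, the alphabetical boundary at 'c', and the example keys are illustrative assumptions, not Nebula's actual implementation:

```scala
object Sharding {
  val numShards = 2

  // Hash sharding: the shard is derived from the key's hash,
  // so related keys scatter evenly across machines.
  def hashShard(key: String): Int =
    math.floorMod(key.hashCode, numShards)

  // Range sharding: contiguous key ranges map to machines, e.g.
  // keys starting with 'a'..'c' ("banana", "car") on machine 0,
  // keys from 'd' onward ("dark") on machine 1.
  def rangeShard(key: String): Int =
    if (key.head <= 'c') 0 else 1
}
```

Range sharding keeps neighboring keys on the same machine, which helps range scans; hash sharding usually balances load better.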

How the graph is stored


The storage medium of a graph computing system is either memory or external storage. The in-memory approach keeps all the data in memory for iterative computation, while external storage requires constant reads from and writes to disk, which is very inefficient. Mainstream graph computing systems today are memory-based, but memory has a drawback: if the data does not fit in memory, things become very troublesome.

The main storage formats for graph computing are the adjacency matrix, the adjacency list, and CSR/CSC. The first two are familiar to most readers, so here is a brief introduction to CSR/CSC. CSR (Compressed Sparse Row) stores the out-edge information of vertices. For example:


Now we compress this matrix (above), keeping only the cells that contain data and discarding the empty ones, which gives the rightmost picture. But then how do we tell which vertex the stored neighbors 2 and 5 belong to? This is where the offset field comes in.


For example, suppose we want the neighbors of vertex 1, which sits at the first offset. What are its neighbors? As the red box in the figure above shows, vertex 1's offset range is 0 to 2 (2 excluded), and the corresponding neighbors are 2 and 5.


To get the neighbors of vertex 2, whose range is 2 to 6, the corresponding neighbors are 1, 3, 4, and 5. That is CSR. Similarly, CSC (Compressed Sparse Column) works like CSR but compresses by column, and it stores in-edge information.
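The CSR lookup described above fits in a few lines. This is an illustrative sketch, not Plato's actual data structure; the arrays encode exactly the article's example (vertex 1 of the article is index 0 here):

```scala
object Csr {
  // offsets(v) .. offsets(v + 1) delimit vertex v's slice in neighbors
  val offsets   = Array(0, 2, 6)
  // vertex 1's neighbors (2, 5), then vertex 2's neighbors (1, 3, 4, 5)
  val neighbors = Array(2, 5, 1, 3, 4, 5)

  def outNeighbors(v: Int): Seq[Int] =
    neighbors.slice(offsets(v), offsets(v + 1)).toSeq
}
```

The offset array is what lets the flat neighbor array be sliced back into per-vertex adjacency lists without storing any empty cells.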

Graph Computing Modes


There are usually two graph computing modes: BSP (Bulk Synchronous Parallel) and GAS (Gather, Apply, Scatter). BSP is a bulk-synchronous model: computation runs in parallel within a superstep, with synchronization between supersteps. Common graph computing frameworks such as Pregel all use the BSP programming model.

PowerGraph uses the GAS programming model. Compared with BSP, GAS subdivides each iteration into three phases: Gather, Apply, and Scatter. In the Gather phase, a vertex first collects data from its one-hop neighbors; it then performs a local computation (Apply); and if its data changes, the change is distributed to its neighbors (Scatter). That is the processing flow of the GAS programming model.
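The three phases can be sketched with a toy PageRank-style update on one machine; the tiny graph, the damping factor 0.85, and the update rule are assumptions for illustration, and real GAS engines such as PowerGraph distribute these phases across machines:

```scala
object GasSketch {
  // tiny directed graph: (src, dst) pairs
  val edges = Seq((0, 1), (0, 2), (1, 2), (2, 0))
  val n = 3
  val outDeg: Map[Int, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  def iterate(rank: Vector[Double]): Vector[Double] = {
    // Gather: each vertex collects rank / outDegree from its in-neighbors
    val gathered = Array.fill(n)(0.0)
    for ((src, dst) <- edges)
      gathered(dst) += rank(src) / outDeg(src)
    // Apply: compute the new value locally
    val applied = gathered.map(g => 0.15 / n + 0.85 * g)
    // Scatter: changed values are made visible to neighbors for the next
    // round (in this single-machine sketch, returning the vector plays that role)
    applied.toVector
  }
}
```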

Synchronous and asynchronous graph computation

Two terms come up often in graph computing systems: synchronous and asynchronous. Synchronous means the results produced in this round take effect in the next iteration; asynchronous means the results produced in this round take effect immediately within the same round.

The advantage of the asynchronous approach is that algorithms converge very quickly, but it has problems: because asynchronous results take effect immediately in the current round, different execution orders over the same vertices can lead to different results, and parallel asynchronous execution can also cause data inconsistencies.
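The difference is simply where the update lands. In this minimal sketch (the "average with your right neighbor" rule over a ring is an arbitrary example), the synchronous round reads only old values, while the asynchronous round lets later vertices see earlier vertices' new values, so the two disagree:

```scala
object SyncVsAsync {
  // synchronous: every vertex reads only the previous round's values
  def syncRound(xs: Vector[Double]): Vector[Double] =
    xs.indices.map(i => (xs(i) + xs((i + 1) % xs.size)) / 2).toVector

  // asynchronous: updates take effect immediately, so the execution
  // order changes the result
  def asyncRound(xs: Vector[Double]): Vector[Double] = {
    val a = xs.toArray
    for (i <- a.indices) a(i) = (a(i) + a((i + 1) % a.length)) / 2
    a.toVector
  }
}
```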

Programming Patterns for Graph Computing Systems

Graph computing system programming models are usually divided into two types, one is a vertex-centric programming model, and the other is an edge-centric programming model.


(Figure: Vertex-Centric Programming Model)


(Figure: Edge-centric programming model)

Of these two, the vertex-centric programming model is the more common. Vertex-centric means that all operations target vertices: for example, vertex v in the figures above is a vertex variable v, and all operations such as scatter and gather are performed on that vertex's data.

Gemini graph computing system

The Gemini graph computing system is a computation-centric distributed graph computing system. Here are its features:

  • CSR/CSC
  • Sparse graph/Dense graph
  • push/pull
  • master/mirror
  • Computation/Communication Collaboration
  • Partitioning method: chunk-based, with support for NUMA architectures

The name Gemini refers to the twin constellation, and fittingly, many of the techniques in the Gemini paper come in pairs: not only do CSR and CSC pair up in the storage structure, but graphs are also divided into sparse and dense in graph partitioning.

Sparse graphs use the push approach, which can be understood as a vertex sending out its own data to change other vertices' data; dense graphs use the pull approach, pulling one-hop neighbors' data to change the vertex's own data.

In addition, Gemini divides vertices into master and mirror copies, and overlaps computation with communication to reduce time and improve overall efficiency. Finally, its partitioning method is chunk-based and supports NUMA architectures.

Here is an example to aid understanding:


The yellow and pink areas in the picture above are two different partitions.

  • During the push operation on a sparse graph, the master vertex (the v vertex in the yellow area of the left image) synchronizes its data over the network to all of its mirror vertices (the v vertex in the pink area of the left image), and each mirror then executes the push, modifying its one-hop neighbors' data.
  • During the pull operation on a dense graph, each mirror vertex (the v vertex in the yellow area of the right image) pulls the data of its one-hop neighbors and synchronizes it over the network to its master vertex (the v vertex in the pink area of the right image), which modifies its own data.


Combined with the example in the figure above: this is a dense graph stored as CSC. Vertex 0 in the Example Graph on the left needs to perform a pull. Its mirror vertices (on Node1 and Node2) pull the data of its one-hop neighbors and synchronize it to the master (on Node0), which updates its own data.

That is the pull operation for vertex 0 of the dense graph.

Next, let's briefly introduce Plato.

Introduction to the graph computing system Plato


Plato is Tencent's open source graph computing framework. Here we focus on the differences between Plato and Gemini.

We covered Gemini's pull/push modes above: CSR with part_by_dst suits push mode, and CSC with part_by_src suits pull mode. On top of those, Plato adds CSC with part_by_dst for pull mode and CSR with part_by_src for push mode. That is how Plato differs from Gemini in partitioning. Plato also supports hash partitioning.

In terms of encoding, Gemini supports only int vertex IDs, not string IDs, while Plato supports string ID encoding. Plato hashes each string ID to find its owning machine, sends the string ID there to be encoded, and then sends the encoded data back to be aggregated. After this large mapping step, every machine holds the global string ID encoding. When the computation finishes and the results must be output, the global encoding lets each machine convert int IDs back to string IDs locally. That is how Plato's string ID encoding works.
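The encoding scheme can be simulated in a single process. This is an illustrative sketch only: the round-robin ID assignment per machine (machine m hands out m, m + numMachines, …) is an assumption chosen so the per-machine encodings stay globally unique without coordination, not Plato's exact layout:

```scala
object StringIdEncoding {
  val numMachines = 2

  // each string ID is hashed to its owning machine
  def owner(id: String): Int = math.floorMod(id.hashCode, numMachines)

  // machine m assigns ids m, m + numMachines, m + 2 * numMachines, ...
  def encodeAll(ids: Seq[String]): Map[String, Long] =
    ids.distinct.groupBy(owner).flatMap { case (m, owned) =>
      owned.zipWithIndex.map { case (s, i) => s -> (i.toLong * numMachines + m) }
    }
}
```

Because each machine's IDs fall in a distinct residue class modulo the machine count, no two machines can hand out the same numeric ID.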

Nebula Graph integration with Plato

First, Nebula optimizes string ID encoding. The global encoding map described above is very memory-intensive, especially in production environments. On Nebula's side, each machine keeps only its own part of the string IDs, and when results are output, the owning machine decodes them and writes them to disk.

The integration also supports reading from and writing to Nebula Graph: data can be read through the scan interface and written back to Nebula Graph through nGQL.

On the algorithm side, Nebula Graph has also added algorithms such as SSSP (single-source shortest path), APSP (all-pairs shortest path), Jaccard similarity, triangle counting, and clustering coefficient.

An article on our Plato practice will be published soon, with a more detailed introduction to the integration.

nebula-algorithm for graph computing

Before introducing nebula-algorithm, here is its open source address: https://github.com/vesoft-inc/nebula-algorithm .

Nebula graph computing


At present, Nebula integrates two different graph computing frameworks, giving a total of two products: nebula-algorithm and nebula-plato.

nebula-algorithm is the community version. Unlike nebula-plato, nebula-algorithm provides an API for calling algorithms. Its biggest advantage is its integration with [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html), which connects seamlessly to the Spark ecosystem. Precisely because nebula-algorithm is implemented on top of GraphX, its underlying data structure is the RDD abstraction, which consumes a lot of memory during computation and makes it comparatively slow.

As mentioned above, nebula-plato needs to encode and map IDs internally; even int IDs must be encoded if they do not increment from 0. nebula-plato's advantage is its relatively low memory consumption, so with the same data and resources it runs algorithms faster.

The left side of the figure above shows the architectures of nebula-algorithm and nebula-plato; both pull data from the Nebula Storage layer. The GraphX-based nebula-algorithm mainly reads and writes data through the Spark Connector.

How to use nebula-algorithm

Jar package submission


nebula-algorithm currently offers two ways to use it: submitting a jar package directly, or calling the API.

The jar package flow is shown in the figure above. The data source is set through a configuration file; currently the supported sources are Nebula Graph, CSV files on HDFS, and local files. After the data is read, it is built into a GraphX graph, which calls the nebula-algorithm algorithm library. When the algorithm finishes, it produces a DataFrame (DF) of results, which is essentially a two-dimensional table. Based on this table, the Spark Connector writes the data out, either back into the graph database or to HDFS.

API call


We recommend calling through the API. With the jar package, the data cannot be processed further before it is written out; with the API, you can preprocess the data before writing, or run statistical analysis on the algorithm results.
The API call process is shown in the figure above, which is mainly divided into 4 steps:

  1. Define a custom data source df (IDs are numeric)
  2. Define the algorithm configuration louvainConfig
  3. Execute the algorithm
  4. Run statistics on, or directly display, the algorithm results

The code in the figure above is a concrete calling example. First define a Spark entry point, SparkSession, then read the data source df through Spark. This enriches the possible data sources: you are not limited to reading CSV on HDFS, but can also read HBase or Hive data. This example applies to graph data whose vertex IDs are numeric; string IDs are covered later.

After the data is read, configure the algorithm. The example above calls the Louvain algorithm, so you need to set the LouvainConfig parameters, i.e., the parameters the Louvain algorithm requires, such as the number of iterations and the threshold.
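Assembled, the four steps look roughly like this. This is a sketch modeled on the examples in the nebula-algorithm repository; the CSV path, the header option, and the LouvainConfig values (max iterations, internal iterations, tolerance) are placeholder assumptions, so check the repository's example code for the exact API:

```scala
import org.apache.spark.sql.SparkSession
import com.vesoft.nebula.algorithm.config.LouvainConfig
import com.vesoft.nebula.algorithm.lib.LouvainAlgo

object LouvainExample {
  def main(args: Array[String]): Unit = {
    // 1. Spark entry point and a numeric-ID edge data source
    val spark = SparkSession.builder().master("local").getOrCreate()
    val df = spark.read.option("header", true).csv("hdfs://path/to/edges.csv")
    // 2. algorithm configuration (placeholder values)
    val louvainConfig = LouvainConfig(10, 5, 0.5)
    // 3. execute the algorithm (last flag: whether edges carry weights)
    val louvain = LouvainAlgo.apply(spark, df, louvainConfig, false)
    // 4. display (or further analyze) the result DataFrame
    louvain.show()
  }
}
```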

After the algorithm runs, you can perform custom statistical analysis on the results or display them. The example above simply displays the results with louvain.show().

ID Mapping: Principle and Implementation


Let's introduce ID mapping and String ID processing.

Those familiar with GraphX may know that it does not support string IDs. What should we do when our data source IDs are strings?

In daily exchanges with community users on GitHub issues and the forum, many users have raised this question. Here is a code example: https://github.com/vesoft-inc/nebula-algorithm/blob/master/example/src/main/scala/com/vesoft/nebula/algorithm/PageRankExample.scala

The flow chart above shows that the process is the same as the earlier calling flow, with two extra steps: ID mapping before the algorithm, and an inverse mapping of IDs and results afterwards. Because the algorithm operates on numeric IDs, the inverse mapping is needed to convert the results back to string IDs.


The figure above shows the ID mapping process. The data source the algorithm is called on (box 1) is edge data with string-typed endpoints (a, b, c, d); the columns with 1.0, 2.0, etc. hold the edge weights. In step 2, the vertex data is extracted from the edge data: here a, b, c, and d. After extraction, ID mapping generates a Long-typed numeric ID for each vertex (blue box). With the numeric IDs in hand, we join the mapped ID data (blue box) with the original edge data (box 1) to obtain encoded edge data (box 4). This encoded data set is the input to the algorithm. After the algorithm runs, we get the result data (yellow box), which, as you can see, is a structure resembling a two-dimensional table.
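The five steps can be simulated with plain collections; this is an illustrative sketch of the mapping logic only (the edge data and weights are made up), whereas the actual implementation joins Spark DataFrames:

```scala
object IdMappingSketch {
  // step 1: string-typed edge data with weights
  val edges = Seq(("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 3.0))

  // steps 2 and 3: extract distinct vertices and assign Long IDs
  val encode: Map[String, Long] =
    edges.flatMap(e => Seq(e._1, e._2)).distinct
      .zipWithIndex.map { case (s, i) => s -> i.toLong }.toMap

  // step 4: join the mapping back onto the edges to get numeric input
  val encodedEdges: Seq[(Long, Long, Double)] =
    edges.map { case (s, d, w) => (encode(s), encode(d), w) }

  // step 5: inverse mapping, to translate numeric results back to strings
  val decode: Map[Long, String] = encode.map(_.swap)
}
```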

For ease of understanding, assume this is a run of the PageRank algorithm; the right column of the result data (yellow box: 2.2, 2.4, 3.1, 1.4) then holds the computed PR values. But this is not yet the final result. Do not forget that our original vertex data is string-typed, so we still have to perform step 5 of the flow, the inverse mapping of IDs and results, which yields the final result (green box).

Note that the figure uses PageRank as the example: since the PageRank results (the right column of the yellow box) are double values, they themselves need no inverse ID mapping. But if the algorithm were connected components or label propagation, the second column of the result data would also need to be inverse-mapped.

Here is the relevant excerpt from the PageRank example code:

def pagerankWithIdMaping(spark: SparkSession, df: DataFrame): Unit = {
    val encodedDF      = convertStringId2LongId(df)
    val pageRankConfig = PRConfig(3, 0.85)
    val pr             = PageRankAlgo.apply(spark, encodedDF, pageRankConfig, false)
    val decodedPr      = reconvertLongId2StringId(spark, pr)
    decodedPr.show()
}

Before the algorithm is called, val encodedDF = convertStringId2LongId(df) maps the string IDs to Long IDs. After that statement executes, we call the algorithm, and once it finishes, val decodedPr = reconvertLongId2StringId(spark, pr) maps the IDs back.

The live video (Bilibili: https://www.bilibili.com/video/BV1rR4y1T73V ) walks through the implementation of the PageRank sample code; interested viewers can watch 24'31 ~ 25'24, which also covers the implementation of the encoding map.

Algorithms supported by nebula-algorithm


The picture above shows the graph algorithms we will support in v3.0. Some of them are already supported in v2.0, which I will not detail here; you can check the GitHub documentation: https://github.com/vesoft-inc/nebula-algorithm .

By category, we divide the currently supported algorithms into six classes: community detection, node importance, relevance, graph structure, path, and graph representation. Only nebula-algorithm's algorithm categories are listed here; the enterprise nebula-plato uses a similar classification, with richer algorithms within each major category. According to current community feedback, the most used algorithms fall into the community detection and node importance classes shown above, so we are enriching those two classes in a targeted way. If you develop a new algorithm implementation while using nebula-algorithm, remember to open a PR on GitHub to enrich its algorithm library.

The figures below briefly introduce Louvain and label propagation, two algorithms commonly asked about in the community:


Since related algorithm introductions have been written before, I will not repeat them here; see "GraphX Graph Computing Practice in Nebula Graph".

Here is a brief introduction to the connected component algorithm.


Connected components here generally means weakly connected components. The algorithm targets undirected graphs, and its computation is relatively simple. As shown on the right of the figure above, for the five small communities divided by the dotted lines, the links between communities (red boxes) are not included in the computation. You can think of it as extracting one subgraph from the graph database and computing its connected components, which yields five small connected components. Then, based on analysis of the whole graph, connecting edges (red boxes) are added between the different small communities to join them.

Application Scenarios of Community Algorithms

Banking field


Let's look at a concrete application scenario from banking. One ID number can correspond to multiple mobile phone numbers, mobile devices, debit cards, credit cards, and apps, and the bank stores this data in scattered places. For correlation analysis, you can first compute the small communities with the connected component algorithm. For example, the devices, phone numbers, and other data belonging to the same person fall into the same connected component and are treated as one cardholder entity before the next step of computation. In simple terms, the algorithm aggregates the scattered data into large nodes for unified analysis.

Security field


The picture above shows the Louvain algorithm applied in the security field. As you can see, the algorithm itself is only a small fraction of the overall business process: roughly 80% of the processing is data preprocessing and analysis of the subsequent results. Different domains have different data, and the domain's source data is modeled as a graph according to the business scenario.
Taking public security as an example, entities such as people, vehicles, Internet cafes, and hotels are extracted from police data, so the graph database can use these four tags (people, vehicles, Internet cafes, hotels). Relationships such as ownership, co-travel, and cohabitation are abstracted from dynamic user data and correspond to edge types in the graph database. After data modeling comes algorithm modeling: depending on the business scenario, you can extract the full graph or a subgraph for graph computation. After extraction, the data must be preprocessed. Preprocessing covers many operations, such as splitting the data into two parts, one for model training and one for model validation, or numerically converting data weights and features.

After preprocessing, algorithms such as Louvain and node importance are executed. Once the computation finishes, the new features derived for the vertices are usually written back to the graph database: a new property is added to the corresponding tag, and that property is the tag's new feature. After the results are written back, the data can be read from the graph database onto the Studio canvas for visual analysis, where domain experts analyze it for specific business needs; alternatively, the computed data can be fed into a GCN for model training, eventually producing a blacklist.
The above is an overview of this graph calculation, and the following are some related questions from the community.

Community Questions

Here are some excerpts from community users' questions. All of the questions can be found in the Q&A section starting at 33'05 of the live video.

Algorithm Internals

dengchang: I want to understand the internal principles of the various mature algorithms listed above. An explanation combined with actual scenarios and data would be even better.

Nicole: You can read our previous articles, for example:

Some relatively complex algorithms are inconvenient to explain in a live broadcast; articles introducing them in detail will be published later.

Graph Computational Planning

en,coder: At present, Nebula Algorithm computation requires exporting the database data to Spark and importing the results back after computation. Will non-export computation be supported in the future, at least for lightweight algorithms, with the results displayed in Studio?

Nicole: To answer the first part: it is not actually necessary to import the results into the graph database after computing with nebula-algorithm. Currently both the API call and the jar package submission allow results to be written to HDFS; whether to import the result data into the graph database depends on what you plan to do with the results afterwards. As for "supporting non-export, at least lightweight computation", I understand it as: first query data from the graph database, display it on the canvas, run a lightweight computation on that small portion of displayed data, and show the results immediately in Studio rather than writing them back to the graph database. If that is the case, Nebula will consider building a graph computing platform that combines AP and TP in the future, and simple lightweight computation on canvas data can also be considered, with the user configuring whether results are written back to the graph database. Back to the requirement itself, whether to run lightweight computation on canvas data still depends on your specific scenario and whether the operation is really needed.

Tide and Tiger: Does nebula-algorithm plan to support Flink?

Nicole: This may refer to Flink's Gelly doing graph calculations. Currently, there are no related plans.

Fanfan: Are there plans for nGQL-based pattern matching? Among full-graph OLAP computing tasks, real scenarios include some pattern matching tasks; we generally develop the code ourselves, but the efficiency is too low.

Hao Tong: Pattern matching is an OLTP scenario. TP is limited by slow disk speeds, which is why people want to do it in OLAP, but OLAP systems usually handle classic algorithms and do not support pattern matching. After AP and TP are merged later, the graph data will sit in memory, and the speed will improve.

Best Practice Cases for Graph Computing

Qi Mingyu: What are the best practices in the industry regarding the use of graph computing power to profile equipment risks? For example, in the identification of group control and the mining of black gangs, are there any best practices in the industry to share?

Nicole: For specific business problems, we need to rely on our operations colleagues and community users to contribute articles; explaining graph computing through real cases is more valuable.

Graph computing memory resource configuration

Liu Yi: How do you estimate the total amount of memory required for graph computation?

Hao Tong: Different graph computing systems are designed differently, so their memory usage also differs; even within the same system, different algorithms have somewhat different memory requirements. You can first run tests with two or three datasets of different sizes to work out the best resource allocation.

That concludes the graph computing session of nLive. You can watch the complete session on Bilibili: https://www.bilibili.com/video/BV1rR4y1T73V . If you run into any problems using nebula-algorithm or nebula-plato, come to the forum and talk to us. Forum portal: https://discuss.nebula-graph.com.cn/ .


Want to exchange graph database technology? To join the Nebula exchange group, please first fill in your Nebula card at , and the Nebula assistant will add you to the group~

