This article was first published on the Nebula Graph Community public account
Whether you are a regulator, a company, or an individual, sooner or later you need to run a background check on a company or a legal person: legal proceedings, public shareholding, employment records, and so on. This background information supports important business decisions and helps avoid risk; for example, a company's equity relationships can reveal conflicts of interest and inform whether you should do business with it at all.
Such relationship analysis comes with a few recurring challenges:
- How do we represent the relationships in the data so that they can be mined and exploited?
- Heterogeneous data from many sources means the relationships keep changing as the business evolves; in a structured database, every change means a schema migration.
- The analysis system should return results in (near) real time, which usually involves multi-hop relationship queries.
- Can domain experts obtain and share information quickly, flexibly, and visually?
So how do we build a system that addresses these challenges?
Where does the data reside?
Premise: dataset preparation. To demonstrate the problem properly, I wrote a small tool that randomly generates shareholding-structure data; an example of the generated data is here.
The dataset contains legal persons and companies, plus the relationships between them: company and subsidiary, company holds company shares, legal person holds a position at a company, legal person holds company shares, and kinship between legal persons.
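If you just want the flavor of the generator without cloning it, the core idea fits in a few lines of standard-library Python. The file name and columns here are illustrative; the real generator in the linked repo produces the full set of CSVs:

```python
import csv
import random

# Illustrative stand-in for the real data generator:
# a few corps, a few persons, and random "holds shares" relations.
corps = [(f"c_{i}", f"Corp {i}") for i in range(3)]
persons = [(f"p_{i}", f"Person {i}") for i in range(5)]

with open("person_corp_share.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for pid, _ in persons:
        cid, _ = random.choice(corps)
        # share: a random percentage held by this person in this corp
        writer.writerow([pid, cid, round(random.uniform(0.0, 20.0), 1)])
```

The real generator draws names and roles with Faker, but the output shape is the same: one CSV per vertex or edge type.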
Where does the data reside? This is a key question, and here is the spoiler: in a graph database. Let's briefly explain why such an equity graph system runs better on one.
Under such a simple data model, a relational database could model it directly like this:
The problem with this modeling is that, whatever the associations between the data, querying, expressing, storing, and introducing new associations are all inefficient.
Query expression is inefficient because a relational database is designed around tables, which means relationship queries have to be written as nested JOINs.
- This is challenge 1 above: the relationships can be expressed, but only awkwardly, and anything slightly complicated becomes hard.
Storage is inefficient because the table structure is designed for data records, not for the relationships between them. Expressing the association between entities (say, a legal person holding shares) as records in a separate table such as `hold_sharing_relationship` is logically feasible, but under multi-hop traversals and a high volume of relationship-jumping requests, the cost of cross-table JOINs becomes the bottleneck.
- This is challenge 3 above: it cannot meet the performance needs of multi-hop queries.
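To make the JOIN pain concrete, here is a toy two-hop version of the same kind of question using Python's built-in sqlite3: "who holds shares in a company that p_100 holds shares in?" Table and column names here are illustrative, not taken from the article's dataset files:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE corp   (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE hold_share (holder_id TEXT, corp_id TEXT, share REAL);
INSERT INTO person VALUES ('p_100', 'Alice'), ('p_200', 'Bob');
INSERT INTO corp   VALUES ('c_132', 'Chambers LLC');
INSERT INTO hold_share VALUES ('p_100', 'c_132', 3.0), ('p_200', 'c_132', 9.0);
""")

# Two hops = the relationship table joined in twice; every extra hop
# adds yet another self-JOIN, while a graph query just adds one step.
rows = con.execute("""
SELECT DISTINCT p2.name
FROM hold_share h1
JOIN hold_share h2 ON h1.corp_id = h2.corp_id
JOIN person p2     ON p2.id = h2.holder_id
WHERE h1.holder_id = 'p_100'
""").fetchall()
print(rows)  # both shareholders of c_132, including p_100 itself
```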
The cost of introducing a new relationship is high, as mentioned above. Under the table structure, expressing shareholding with a new `hold_sharing_relationship` table works, but it is inflexible and expensive: the table fixes the types of the start and end points when the relationship is introduced. Shareholding may be legal person -> company, or company -> company, and as the business evolves we may also need to add government -> company. Each such change means real work: changing the schema.
- This is challenge 2 above: it cannot keep up with the business's flexible, ever-changing requirements on data relationships.
When a general-purpose system cannot meet specific, non-negligible requirements, a new kind of system is born: the graph database. For this scenario, the graph database is designed from the ground up around relationships:
Native semantics for expressing relationships (challenge 1)
- The comparison below shows an equivalent one-hop query in a table-structured database and in a graph database. Notice how naturally the query "find all the people who hold shares in the same companies as p_100" reads in the graph database; and this is a single-hop query, so with multiple hops the gap only widens.
Graph database (property graph), nGQL:

```
GO FROM "p_100" OVER hold_share YIELD dst(edge) AS corp_with_share |\
GO FROM $-.corp_with_share OVER hold_share REVERSELY YIELD properties(vertex).name;
```

Table-structured (relational) database, SQL:

```sql
SELECT a.id, a.name, c.name
FROM person a
JOIN hold_share b ON a.id=b.person_id
JOIN corp c ON c.id=b.corp_id
WHERE c.name IN (SELECT c.name
                 FROM person a
                 JOIN hold_share b ON a.id=b.person_id
                 JOIN corp c ON c.id=b.corp_id
                 WHERE a.id = 'p_100')
```
Store associations as physical connections to minimize jump query costs. (Challenge 3, 2)
- In a graph database, expanding from a vertex (finding the other end of one or more relationships) is very cheap. As a purpose-built system centered on the "graph" structure, finding all entities (companies, legal persons) connected to a given entity (say, legal person A) by any relationship (employment, kinship, shareholding, and so on) costs O(1) per neighbor, because in the database's data structure they really are linked together.
- The quantitative figures in the table below give a glimpse of this advantage. Under multi-hop, high-concurrency load, it is the difference between "can" and "cannot" serve as an online system; between "real-time" and "offline".
- With relationship-oriented modeling and data structures, the cost of introducing new entities and relationships is also much lower, as mentioned above:
Introducing a new "government agency" entity type into the Nebula Graph data, together with a new government agency -> company "shareholding" relationship, costs much less than in a non-graph database.
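The "physically linked" claim can be illustrated with a toy adjacency map: expanding a vertex is a direct lookup whose cost depends only on that vertex's degree, not on the size of any table. This is only a sketch of the idea, not how Nebula Graph actually lays out its storage:

```python
from collections import defaultdict

# Toy adjacency structure: vertex -> list of (edge_type, neighbor).
adj = defaultdict(list)

def add_edge(src, edge_type, dst):
    adj[src].append((edge_type, dst))
    adj[dst].append((edge_type, src))  # store both directions for expansion

add_edge("p_100", "hold_share", "c_132")
add_edge("c_132", "is_branch_of", "c_245")
add_edge("p_4000", "role_as", "c_132")

# Expanding "c_132" is a single lookup: its cost is proportional
# only to its own degree, never to the total number of records.
print(adj["c_132"])
```

A relational database, by contrast, has to re-discover these links through index lookups and JOINs on every query.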
| Table-structured database | Graph database (property graph) |
| --- | --- |
| 4-hop query latency: 1,544 s | 4-hop query latency: 1.36 s |
Intuitive modeling, with built-in visualization of data connections (challenge 4)
- Comparing how the two databases model the equity analysis data in this article, especially in a relationship-centric scenario, the property-graph model matches the intuition of the human brain remarkably well; perhaps that even has something to do with how neurons are connected.
- The visualization tools built into the graph database let ordinary users easily understand data relationships, and give domain-expert users an intuitive interface for expressing and querying complex ones.
Overall comparison of table-structured databases and graph databases:
| | Table-structured database | Graph database (property graph) |
| --- | --- | --- |
| Query | nested multi-table JOINs | concise graph traversal (nGQL) |
| Modeling | entity tables plus relationship tables | vertices and edges, as drawn on a whiteboard |
| Performance | 4-hop query latency: 1,544 s | 4-hop query latency: 1.36 s |
To sum up, in this tutorial, we will use a graph database for data storage.
Graph Data Modeling
In discussing where the data resides, we already revealed how it is modeled in a graph database. Essentially, the graph has two kinds of entities:
- person
- company
And four kinds of relationships:
- person –(is relative of)–> person
- person –(serves in role at)–> company
- person or company –(holds shares of)–> company
- company –(is branch of)–> company
Both entities and relationships can carry more information, which in the graph database becomes the properties of the entities and relationships themselves. As shown in the figure below:
- person has properties `name` and `age`
- company has properties `name` and `location`
- the shareholding relationship has the property `share`
- the employment relationship has properties `role` and `level`
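As a mental model, a property-graph vertex or edge is just an identifier plus a bag of key-value properties. In plain Python (the values here are made up for illustration):

```python
# A vertex: id + tag (type) + property map.
person = {"vid": "p_100", "tag": "person",
          "props": {"name": "Alice", "age": 41}}
corp = {"vid": "c_132", "tag": "corp",
        "props": {"name": "Chambers LLC", "location": "Chicago"}}

# An edge: type + endpoints + its own property map.
hold_share = {"src": "p_100", "dst": "c_132", "type": "hold_share",
              "props": {"share": 3.0}}
role_as = {"src": "p_100", "dst": "c_132", "type": "role_as",
           "props": {"role": "CFO", "level": 1}}

print(hold_share["props"]["share"])
```

This is exactly the shape the schema statements later in this article create, just without the storage engine underneath.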
Data storage
In this tutorial, the graph database we use is called Nebula Graph, which is an open source distributed graph database under the Apache 2.0 license.
Nebula Graph in Github: https://github.com/vesoft-inc/nebula
To import data into Nebula Graph, refer to this document and this video on how to choose a tool.
Here, since the data format is CSV and a single machine's client resources suffice, we can use nebula-importer for the job.
Tip: before importing, deploy a Nebula Graph cluster. The easiest way is the nebula-up tool: a single command on a Linux machine starts both a Nebula Graph core and Nebula Graph Studio, a visual graph exploration tool. If you prefer to deploy with Docker, refer to this document.
This article assumes that we use Nebula-UP to deploy:
```bash
curl -fsSL nebula-up.siwei.io/install.sh | bash
```
The data is produced by the generator: you can generate random datasets of any size on demand, or pick an already-generated one here.
With the data in hand, we can start the import.
```bash
$ pip install Faker==2.0.5 pydbgen==1.0.5
$ python3 data_generator.py
$ ls -l data
total 1688
-rw-r--r--  1 weyl  staff   23941 Jul 14 13:28 corp.csv
-rw-r--r--  1 weyl  staff    1277 Jul 14 13:26 corp_rel.csv
-rw-r--r--  1 weyl  staff    3048 Jul 14 13:26 corp_share.csv
-rw-r--r--  1 weyl  staff  211661 Jul 14 13:26 person.csv
-rw-r--r--  1 weyl  staff  179770 Jul 14 13:26 person_corp_role.csv
-rw-r--r--  1 weyl  staff  322965 Jul 14 13:26 person_corp_share.csv
-rw-r--r--  1 weyl  staff   17689 Jul 14 13:26 person_rel.csv
```
The import tool nebula-importer is a Golang binary. To use it, you write the Nebula Graph connection information and the meaning of each field in the data source into a YAML configuration file, then invoke it from the command line. See the documentation or its GitHub repository for examples.
I have already written and prepared a nebula-importer configuration file, in the same repo as the data generator.
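For orientation, a trimmed importer config looks roughly like this. This is abridged from memory of the nebula-importer v2.x format, so treat the field names and nesting as an approximation and verify them against the full working file in the repo above:

```yaml
version: v2
description: shareholding dataset
clientSettings:
  concurrency: 10
  space: shareholding
  connection:
    user: root
    password: nebula
    address: 127.0.0.1:9669
files:
  - path: ./person.csv
    batchSize: 128
    type: csv
    csv:
      withHeader: false
    schema:
      type: vertex
      vertex:
        vid:
          index: 0        # column 0 of the CSV is the vertex id
        tags:
          - name: person
            props:
              - name: name
                type: string
                index: 1  # column 1 maps to the name property
```

Each CSV file gets one entry under `files`, either as a vertex or an edge mapping.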
Finally, just execute the following command to start data import:
Note: at the time of writing, the latest Nebula version is 2.6.1 and the matching nebula-importer is v2.6.0. If you hit an import error, it may be a version mismatch; adjust the version number in the command below accordingly.
```bash
git clone https://github.com/wey-gu/nebula-shareholding-example
cp -r data_sample /tmp/data
cp nebula-importer.yaml /tmp/data/
docker run --rm -ti \
    --network=nebula-docker-compose_nebula-net \
    -v /tmp/data:/root \
    vesoft/nebula-importer:v2.6.0 \
    --config /root/nebula-importer.yaml
```
Did you know? (TL;DR)
This importer configuration also performs the graph modeling in Nebula Graph for us. The equivalent statements are below; we do not need to execute them manually.
```
CREATE SPACE IF NOT EXISTS shareholding(partition_num=5, replica_factor=1, vid_type=FIXED_STRING(10));
USE shareholding;
CREATE TAG person(name string);
CREATE TAG corp(name string);
CREATE TAG INDEX person_name on person(name(20));
CREATE TAG INDEX corp_name on corp(name(20));
CREATE EDGE role_as(role string);
CREATE EDGE is_branch_of();
CREATE EDGE hold_share(share float);
CREATE EDGE reletive_with(degree int);
```
Querying the data in the graph database
Tip: you can also access the same dataset online, without any deployment or installation, by finding "equity penetration" in the Nebula Playground.
We can access the data with Nebula Graph Studio, at port 7001 of the server where we deployed Nebula-UP.
Assuming the server address is 192.168.8.127, we have:
- Nebula Studio address: 192.168.8.127:7001
- Nebula Graph address: 192.168.8.127:9669
- Default username: root
- Default password: nebula
Visit Nebula Studio:
Select the graph space: shareholding
After that, we can explore; for example, the equity penetration within three hops of a company. For the detailed steps, refer to: Introduction to the Equity Penetration Online Playground:
Build a graph system
The code for this part is open sourced on GitHub:
https://github.com/wey-gu/nebula-corp-rel-search
A demo of this project was also shown in my talk at PyCon China 2021: video address
On this basis, we can build an equity query system for end users. We already have a graph database as the storage engine of the graph; in theory, if the business allows, directly using or wrapping Nebula Graph Studio to provide the service is entirely feasible and compliant. In some cases, though, we need to implement the interface ourselves, or wrap an API that offers graph queries to upstream (multi-client) consumers.
To that end, I wrote a simple example project providing such a service. Its architecture is straightforward:
- The frontend accepts the legal person or company the user wants to penetrate, sends requests to the backend as needed, and renders the returned result as a relationship graph with D3.js
- The backend accepts the frontend's API requests, converts them into graph database queries, and returns results in the shape the frontend expects
```
┌───────────────┬───────────────┐
│               │   Frontend    │
│               │               │
│    ┌──────────▼──────────┐    │
│    │       Vue.JS        │    │
│    │       D3.JS         │    │
│    └──────────┬──────────┘    │
│               │   Backend     │
│    ┌──────────┴──────────┐    │
│    │       Flask         │    │
│    │    Nebula-Python    │    │
│    └──────────┬──────────┘    │
│               │ Graph Query   │
│    ┌──────────▼──────────┐    │
│    │   Graph Database    │    │
│    └─────────────────────┘    │
│                               │
└───────────────────────────────┘
```
Backend Service --> Graph Database
For a detailed analysis of the data format, refer here.
The query statement
Assume the entity the user requests is c_132; the statement for penetrating relationships within 1 to 3 hops is then:

```
MATCH p=(v)-[e:hold_share|:is_branch_of|:reletive_with|:role_as*1..3]-(v2) \
WHERE id(v) IN ["c_132"] RETURN p LIMIT 100
```
In this pattern, `()` wraps a vertex of the graph and `[]` wraps the relationship between vertices: an edge. So:

```
(v)-[e:hold_share|:is_branch_of|:reletive_with|:role_as*1..3]-(v2)
(v)-[xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]-(v2)
```

should now be easier to read: it means expanding from `v` to `v2`.

Now for the part wrapped in `[]`. Its semantics is: via four types of edges (`:` introduces an edge type, `|` means "or") over a variable hop count, `*1..3` (one to three hops).

Put simply, the expansion path is: starting from vertex `v` (here `c_132`), expand one to three hops over the four kinds of relationships to vertex `v2`, and return the whole expansion path `p`, limited to 100 path results.
Nebula Python Client / SDK
Now that we know the query statement, the backend only needs to send the query through the graph database's client and process the returned structure. In this example, I implemented the backend logic in Python, so I used nebula-python, Nebula's Python client.
Did you know? As of now, Nebula has clients on GitHub for Java, Go, Python, C++, Spark, Flink, Rust (not GA), and NodeJS (not GA), with clients for more languages being released gradually.
Below is an example of the Python client executing a query and processing the result. Note that when I implemented this code, nebula-python did not yet support returning JSON results (via `session.execute_json()`). If you implement your own, I highly recommend trying JSON, so you don't have to pick data out of the result object bit by bit; that said, with an iPython/IDLE REPL it is not that troublesome to quickly understand the structure of the returned object.
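With newer clients (2.6.0 and later), `session.execute_json()` returns the entire result as JSON bytes, so parsing becomes plain `json.loads`. The payload below is a hand-made stand-in shaped roughly like the documented response; treat the exact keys as an assumption to verify against the client docs, not as real server output:

```python
import json

# Hand-made stand-in for what session.execute_json(query) might return:
# JSON bytes with a top-level "results" list and an "errors" list.
fake_payload = b'''{
  "results": [{
    "columns": ["p"],
    "data": [
      {"row": [{"nodes": [{"vid": "c_132"}], "relationships": []}],
       "meta": [null]}
    ]
  }],
  "errors": [{"code": 0}]
}'''

result = json.loads(fake_payload)
rows = result["results"][0]["data"]
print(len(rows), result["results"][0]["columns"])
```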
```
$ python3 -m pip install nebula2-python==2.5.0  # note: I'm referencing an older record here, it is 2.5.0
$ ipython
In [1]: from nebula2.gclient.net import ConnectionPool
In [2]: from nebula2.Config import Config
In [3]: config = Config()
   ...: config.max_connection_pool_size = 10
   ...: # init connection pool
   ...: connection_pool = ConnectionPool()
   ...: # if the given servers are ok, return true, else return false
   ...: ok = connection_pool.init([('192.168.8.137', 9669)], config)
   ...: session = connection_pool.get_session('root', 'nebula')
[2021-10-13 13:44:24,242]:Get connection to ('192.168.8.137', 9669)
```
```
In [4]: resp = session.execute("USE shareholding")
In [5]: query = '''
   ...: MATCH p=(v)-[e:hold_share|:is_branch_of|:reletive_with|:role_as*1..3]-(v2) \
   ...: WHERE id(v) IN ["c_132"] RETURN p LIMIT 100
   ...: '''
In [6]: resp = session.execute(query)  # note: since Nebula Graph 2.6.0 we could use execute_json as well
In [7]: resp.col_size()
Out[7]: 1
In [8]: resp.row_size()
Out[8]: 100
```
Let's analyze the result. Each row is essentially a path, which has a `.nodes()` method and a `.relationships()` method returning the vertices and edges on the path:
```
In [11]: p=resp.row_values(22)[0].as_path()
In [12]: p.nodes()
Out[12]:
[("c_132" :corp{name: "Chambers LLC"}),
 ("p_4000" :person{name: "Colton Bailey"})]
In [13]: p.relationships()
Out[13]: [("p_4000")-[:role_as@0{role: "Editorial assistant"}]->("c_132")]
```
Edges have the methods `.edge_name()`, `.properties()`, `.start_vertex_id()`, and `.end_vertex_id()`, where `.edge_name()` returns the type of the edge.
```
In [14]: rel=p.relationships()[0]
In [15]: rel
Out[15]: ("p_4000")-[:role_as@0{role: "Editorial assistant"}]->("c_132")
In [16]: rel.edge_name()
Out[16]: 'role_as'
In [17]: rel.properties()
Out[17]: {'role': "Editorial assistant"}
In [18]: rel.start_vertex_id()
Out[18]: "p_4000"
In [19]: rel.end_vertex_id()
Out[19]: "c_132"
```
Vertices have the methods `.tags()`, `.properties()`, and `.get_id()`; here `.tags()` returns the vertex's types, called tags in Nebula.
These concepts are explained in more detail in the documentation.
```
In [20]: node=p.nodes()[0]
In [21]: node.tags()
Out[21]: ['corp']
In [22]: node.properties('corp')
Out[22]: {'name': "Chambers LLC"}
In [23]: node.get_id()
Out[23]: "c_132"
```
Frontend: rendering vertices and edges as a graph
For a detailed analysis, you can also refer here.
For ease of implementation, we use Vue.js and vue-network-d3 (a Vue binding of D3).
Thanks to vue-network-d3's abstraction, we only need to feed it data in the following shape, and the vertex and edge information gets rendered into a good-looking graph:
```js
nodes: [
  {"id": "c_132", "name": "Chambers LLC", "tag": "corp"},
  {"id": "p_4000", "name": "Colton Bailey", "tag": "person"}],
relationships: [
  {"source": "p_4000", "target": "c_132", "properties": { "role": "Editorial assistant" }, "edge": "role_as"}]
```
frontend <-- backend
Details can be found here.
From the preliminary D3 research we know the backend only needs to return data in the following JSON format.
Nodes:

```json
[{"id": "c_132", "name": "Chambers LLC", "tag": "corp"},
 {"id": "p_4000", "name": "Colton Bailey", "tag": "person"}]
```

Relationships:

```json
[{"source": "p_4000", "target": "c_132", "properties": { "role": "Editorial assistant" }, "edge": "role_as"},
 {"source": "p_1039", "target": "c_132", "properties": { "share": "3.0" }, "edge": "hold_share"}]
```
Combined with the earlier iPython analysis of the Python return results, the logic is roughly:
```python
def make_graph_response(resp) -> dict:
    nodes, relationships = list(), list()
    for row_index in range(resp.row_size()):
        path = resp.row_values(row_index)[0].as_path()
        _nodes = [
            {
                "id": node.get_id(), "tag": node.tags()[0],
                "name": node.properties(node.tags()[0]).get("name", "")
            }
            for node in path.nodes()
        ]
        nodes.extend(_nodes)
        _relationships = [
            {
                "source": rel.start_vertex_id(),
                "target": rel.end_vertex_id(),
                "properties": rel.properties(),
                "edge": rel.edge_name()
            }
            for rel in path.relationships()
        ]
        relationships.extend(_relationships)
    return {"nodes": nodes, "relationships": relationships}
```
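One caveat about this function: the same vertex appears on many of the 100 paths, so `nodes` collects duplicates (the sample JSON output further down shows `c_132` twice). A small id-keyed dedupe, sketched here with plain dicts standing in for the client's node objects, keeps the payload lean:

```python
def dedupe(items, key):
    """Keep the first occurrence of each key; preserve order."""
    seen = {}
    for item in items:
        seen.setdefault(item[key], item)
    return list(seen.values())

nodes = [
    {"id": "c_132", "name": "Chambers LLC", "tag": "corp"},
    {"id": "p_4000", "name": "Colton Bailey", "tag": "person"},
    {"id": "c_132", "name": "Chambers LLC", "tag": "corp"},  # duplicate from a 2nd path
]
unique_nodes = dedupe(nodes, key="id")
print([n["id"] for n in unique_nodes])  # ['c_132', 'p_4000']
```

The same pass works for the relationships list if you build a composite key from source, target, and edge type.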
Frontend-to-backend communication is over HTTP, so we can use Flask to wrap this function behind a RESTful API; the frontend POSTs to `/api` over HTTP (refer here):
```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)
# connection_pool is initialized elsewhere in the app (see the repo),
# via nebula2's ConnectionPool as shown in the iPython session above.

@app.route("/")
def root():
    return "Hey There?"

@app.route("/api", methods=["POST"])
def api():
    request_data = request.get_json()
    entity = request_data.get("entity", "")
    if entity:
        resp = query_shareholding(entity)
        data = make_graph_response(resp)
    else:
        data = dict()  # tbd
    return jsonify(data)

def parse_nebula_graphd_endpoint():
    ng_endpoints_str = os.environ.get(
        'NG_ENDPOINTS', '127.0.0.1:9669,').split(",")
    ng_endpoints = []
    for endpoint in ng_endpoints_str:
        if endpoint:
            parts = endpoint.split(":")  # we don't consider IPv6 for now
            ng_endpoints.append((parts[0], int(parts[1])))
    return ng_endpoints

def query_shareholding(entity):
    query_string = (
        f"USE shareholding; "
        f"MATCH p=(v)-[e:hold_share|:is_branch_of|:reletive_with|:role_as*1..3]-(v2) "
        f"WHERE id(v) IN ['{ entity }'] RETURN p LIMIT 100"
    )
    session = connection_pool.get_session('root', 'nebula')
    resp = session.execute(query_string)
    return resp
```
The result of this request is the JSON expected by the front end, like this:
curl --header "Content-Type: application/json" \
--request POST \
--data '{"entity": "c_132"}' \
http://192.168.10.14:5000/api | jq
```json
{
  "nodes": [
    {
      "id": "c_132",
      "name": "\"Chambers LLC\"",
      "tag": "corp"
    },
    {
      "id": "c_245",
      "name": "\"Thompson-King\"",
      "tag": "corp"
    },
    {
      "id": "c_132",
      "name": "\"Chambers LLC\"",
      "tag": "corp"
    },
    ...
  ],
  "relationships": [
    {
      "edge": "hold_share",
      "properties": "{'share': 0.0}",
      "source": "c_245",
      "target": "c_132"
    },
    {
      "edge": "hold_share",
      "properties": "{'share': 9.0}",
      "source": "p_1767",
      "target": "c_132"
    },
    {
      "edge": "hold_share",
      "properties": "{'share': 11.0}",
      "source": "p_1997",
      "target": "c_132"
    },
    ...
    {
      "edge": "reletive_with",
      "properties": "{'degree': 51}",
      "source": "p_7283",
      "target": "p_4723"
    }
  ]
}
```
Putting it all together
The project's code is all on GitHub; in the end it is only one or two hundred lines. Put together, it looks like this:
```
├── README.md                # You could find Design Logs here
├── corp-rel-backend
│   └── app.py               # Flask App to handle Requests and call the GDB
├── corp-rel-frontend
│   └── src
│       ├── App.vue
│       └── main.js          # Vue App to call Flask App and render the Graph
└── requirements.txt
```
The final result
We end up with a simple but informative little system. The user enters an entity ID and presses Enter:
- The frontend sends the request to the backend
- The backend assembles the Nebula Graph query statement and sends it via the Nebula Python client
- Nebula Graph receives the request, runs the penetrating query, and returns the structure to the backend
- The backend shapes the result into the format the frontend's D3 accepts and passes it on
- The frontend receives the graph-structured data and renders the equity penetration, as follows:
<video width="800" controls>
<source src="https://siwei.io/corp-rel-graph/demo.mov" type="video/mp4">
</video>
Summary
We now know that, thanks to the design of graph databases, building a convenient equity analysis system on top of one is natural and efficient. We can either use the database's built-in graph exploration and visualization capabilities or build our own, and either way provide users with efficient, intuitive multi-hop equity penetration analysis.
If you want to learn more about distributed graph databases, you are welcome to follow the open source project Nebula Graph, recognized by many teams and companies in China as a powerful storage-layer tool for the graph era. Visit here, or here, for more related talks and articles.
Going forward I will share more graph database articles, videos, and open source sample project ideas and tutorials; you are welcome to follow my website: siwei.io.
Nebula Community's first call for papers is underway! 🔗 The prizes are generous and cover every scenario: a mechanical keyboard for coding ⌨️, a wireless phone charger 🔋, a smart fitness band ⌚️, plus books on database design and knowledge graph practice 📚 waiting for you, with exquisite Nebula goodies shipping non-stop 🎁.
Friends who are interested in Nebula and like to write up their own interesting stories with Nebula are all welcome!
Want to discuss graph database technology? To join the Nebula exchange group, first fill in your Nebula card and the Nebula assistant will pull you into the group~