This article was first published on the Nebula Graph Community public account
This article is compiled from the DTCC keynote speech "Thinking and Practice of Open Source Distributed Graph Databases".
Contents
- State of the Graph Database Market
- Advantages of Graph Databases
- Take Nebula Graph as an example
- Open Source Community
State of the Graph Database Market
Before we start, let's review how the graph database market has changed. Before 2018, the market was worth roughly $650 million. Current market research reports project that it will grow 30%~100% annually over the next 4 years, reaching $4.13 billion to $8 billion.
As for the development of the graph database market as a whole, judging from public information, the concept of the graph database was first promoted by Gartner in 2013. The two most important players at the time were Neo4j and TitanDB; JanusGraph was born after the latter was acquired. Usually, once a technology matures to a certain stage it begins to decline, but the graph field is still developing vigorously: graph retains a place in Gartner's top ten data and analytics trends for 2021. Graph relates everything, and compared with traditional databases it extends into a much wider scope, including graph computing, graph processing, graph deep learning, graph machine learning models, and more.
The following picture shows the database popularity trend from DB-Engines, which ranks databases. Graph databases have been the fastest-growing category in attention over the past 10 years. The underlying data comes from social media and the web: how often a technology is mentioned on LinkedIn and Twitter, the number of related questions on Stack Overflow, and search engine trends.
Advantages of Graph Databases
In general, the most obvious advantage of graphs over other data models is that they are intuitive. Below are a table-based representation and a graph-based representation of the character relationships in Game of Thrones:
(Table structure)
(Graph structure)
Graphs also allow more concise expression. On the same data model (below), implement this query: find how many posts were created in a given time range, how many replies each received, and sort the results. The SQL might be written like this:
--PostgreSQL
WITH RECURSIVE post_all (psa_threadid
, psa_thread_creatorid, psa_messageid
, psa_creationdate, psa_messagetype
) AS (
SELECT m_messageid AS psa_threadid
, m_creatorid AS psa_thread_creatorid
, m_messageid AS psa_messageid
, m_creationdate, 'Post'
FROM message
WHERE 1=1 AND m_c_replyof IS NULL -- post, not comment
AND m_creationdate BETWEEN :startDate AND :endDate
UNION ALL
SELECT psa.psa_threadid AS psa_threadid
, psa.psa_thread_creatorid AS psa_thread_creatorid
, m_messageid, m_creationdate, 'Comment'
FROM message p, post_all psa
WHERE 1=1 AND p.m_c_replyof = psa.psa_messageid
AND m_creationdate BETWEEN :startDate AND :endDate
)
SELECT p.p_personid AS "person.id"
, p.p_firstname AS "person.firstName"
, p.p_lastname AS "person.lastName"
, count(DISTINCT psa.psa_threadid) AS threadCount
, count(DISTINCT psa.psa_messageid) AS messageCount
FROM person p left join post_all psa on (
1=1 AND p.p_personid = psa.psa_thread_creatorid
AND psa_creationdate BETWEEN :startDate AND :endDate
)
GROUP BY p.p_personid, p.p_firstname, p.p_lastname
ORDER BY messageCount DESC, p.p_personid
LIMIT 100;
Expressed in Cypher (a graph query language), the same query needs only the following statement:
--Cypher
MATCH (person:Person)<-[:HAS_CREATOR]-(post:Post)<-[:REPLY_OF*0..]-(reply:Message)
WHERE post.creationDate >= $startDate AND post.creationDate <= $endDate
AND reply.creationDate >= $startDate AND reply.creationDate <= $endDate
RETURN
person.id, person.firstName, person.lastName, count(DISTINCT post) AS threadCount,
count(DISTINCT reply) AS messageCount
ORDER BY
messageCount DESC, person.id ASC
LIMIT 100
For queries, the graph query language is clearly the more concise expression.
In addition, the graph ecosystem itself is quite rich. The following is a panorama of graph technology in 2020; more technology areas will be added in 2021.
Take Nebula Graph as an example
The following takes Nebula Graph as an example to introduce the characteristics of a distributed graph database, the technical challenges encountered during development, and how the R&D team handled them.
When the prototype of Nebula Graph was designed at the end of 2018, before it was open sourced at the end of May 2019, the R&D team set 4 goals: scale, production-readiness, OLTP, and an open source ecosystem. These 4 goals still shape Nebula Graph's product planning today.
The first design point of Nebula Graph is scale, which differs from the original design of competing products. From the start, the team assumed the data volumes a database must handle would become very large, since data growth consistently outpaces Moore's Law, and a single machine cannot cope well with massive data and future growth. The team therefore studied how a distributed database should process such data, and Nebula Graph is designed for industrial-scale data volumes: up to a trillion edges. At that scale there are also many attributes to analyze, so unlike graph databases that model only graph structure without properties, Nebula Graph was designed from the beginning as a large-scale property graph supporting hundreds of properties, letting users apply it to concrete scenarios.
The second point is production-readiness. Beyond the industrial-level availability mentioned above, this also sets requirements for the design of the query language, visualization, programmability, and operations.
The third point is OLTP. The design at the time prioritized TP scenarios, and even today Nebula Graph is a graph database focused on TP: online, high-concurrency, and low-latency. Internet business scenarios have great demand for this. For example, when you pay for a transfer, the system may need to check whether the transaction is safe, and that check may have to finish within a few tens of milliseconds.
The last point is open source. In addition to building a technical community and developer ecosystem, this mainly concerns integrating with other big data ecosystems and combining with graph computing and training frameworks.
This is the current ecosystem map of Nebula Graph. The red part is the core, divided into three modules: meta, graph, and storage, which are explained in detail below. Above the kernel is the query language layer: in addition to the self-developed nGQL, it is also compatible with openCypher. Above that are the clients, currently supporting Java, C++, Python, Go, and other languages, and above the clients a programmable SDK, then a big data ecosystem layer: Spark, Flink, the graph computing frameworks GraphX and Plato, and data source migration tools. On the left are cloud deployment, visualization, and monitoring modules. The right side focuses on data security, such as data backup and multi-data-center deployment, plus engineering concerns such as chaos testing and performance stress testing. On the far right is community and developer content.
Now for the core design of Nebula Graph in detail, covering meta, graph, and storage. The meta module in the upper right handles metadata, the graph module in the middle is the query engine, and the storage module below is the storage engine. The three modules run as separate processes, with storage and compute separated, a design that again reflects the scale considerations behind Nebula Graph.
In detail, the metadata module (Meta Service) mainly manages the Schema. Nebula Graph is not a completely schema-free design: the properties of vertices and edges must be defined in advance. In addition, space management, permission management, long-running job management, data cleaning, and other administration are all done in the meta module. In practice, meta is usually deployed on 3 nodes, which synchronize strongly with each other. The storage engine layer (Storage Service) is a multi-process system, also with strong synchronization between processes.
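A quick arithmetic aside on why 3 nodes is the common choice for meta: under a majority-consensus scheme such as Raft, a write commits once a majority of replicas acknowledge it. A minimal sketch (generic illustration, not Nebula code):

```python
# Majority-quorum arithmetic for a Raft-style replica group.
# Generic illustration only, not Nebula's implementation.

def quorum(replicas: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while writes still commit."""
    return replicas - quorum(replicas)

# A 3-node meta cluster commits with 2 acks and survives 1 failure;
# a 4th node would raise the quorum without tolerating more failures.
for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

This is why replica groups are deployed in odd sizes: 3 nodes tolerate one failure, 5 tolerate two.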
As mentioned above, Nebula Graph supports trillion-scale vertex and edge volumes, so the graph must be partitioned. Generally, there are two ways to slice a graph: vertex cut and edge cut. Nebula Graph's design is edge cut: vertices are assigned to partitions, the outgoing and incoming copies of each edge are placed on the partitions of their respective endpoints, and the graph is split into 100 or even 1,000 partitions. The partition is the fine-grained unit served by each process. Each partition, say Partition 1, may have 3 replicas that land on different machines; Partition 2 likewise lands on 3 different machines, and each partition is internally consistent. To rebalance or schedule, you move a whole partition.
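The placement rule described above can be sketched in a few lines; the hash function and partition count here are illustrative assumptions, not Nebula Graph's actual implementation:

```python
# Illustrative edge-cut placement: a vertex hashes to one partition,
# and each edge is written twice, once with its source and once with
# its destination. (Assumed scheme for illustration only.)
NUM_PARTITIONS = 100

def partition_of(vid: int) -> int:
    """Map a vertex ID to a partition, numbered 1..NUM_PARTITIONS."""
    return vid % NUM_PARTITIONS + 1

def place_edge(src: int, dst: int) -> dict:
    """Edge cut: out-edge copy with the source, in-edge copy with the destination."""
    return {
        "out_copy": partition_of(src),  # stored on src's partition
        "in_copy": partition_of(dst),   # stored on dst's partition
    }

# An edge between vertices on different partitions gets two homes:
print(place_edge(src=1001, dst=2042))  # {'out_copy': 2, 'in_copy': 43}
```

Because placement depends only on the vertex ID, any graphd can compute which partition (and hence which machine) holds a vertex without a lookup.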
Above the storage engine is the query engine layer. The query engine is stateless: the Graph Service pulls all metadata from meta and all main data from storage, keeps no state of its own, and the engines do not communicate with each other. A query lands on exactly one graphd, and that graphd fans out to multiple storage processes.
The above explains Nebula Graph's storage-compute separation architecture.
Now for the data model. As mentioned above, Nebula Graph is a property graph. Although vertices and edges are only weakly coupled, their properties are pre-defined by DDL, and you can keep multiple versions of a Schema. In Nebula Graph, a vertex type is called a Tag, an edge type an EdgeType, and a vertex can connect multiple edges. Users must specify a unique identifier in advance, an int64 or a fixed-length string. A vertex is usually addressed by a two-tuple: the string VID plus the vertex type (Tag). An edge is addressed by a four-tuple: starting vertex, edge type, rank, and ending vertex. So fetching by ID means fetching by a two-tuple or a four-tuple.
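To make the addressing scheme concrete, here is a small sketch modeling the two-tuple and four-tuple just described; the names and field types are illustrative, not Nebula Graph's actual API:

```python
# Illustrative model of Nebula Graph's addressing tuples
# (hypothetical names; not the client library's types).
from typing import NamedTuple

class VertexKey(NamedTuple):
    vid: str  # user-chosen unique ID: int64 or fixed-length string
    tag: str  # the vertex type (Tag), e.g. "Person"

class EdgeKey(NamedTuple):
    src: str        # starting vertex ID
    edge_type: str  # the EdgeType, e.g. "follow"
    rank: int       # distinguishes parallel edges of the same type
    dst: str        # ending vertex ID

v = VertexKey(vid="player100", tag="Person")
e = EdgeKey(src="player100", edge_type="follow", rank=0, dst="player101")
# Fetching "by ID" means fetching by one of these tuples.
```

The rank field is what allows two vertices to be connected by several edges of the same type, for example multiple transfers between the same pair of accounts.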
In terms of data types, common types such as Boolean, Int, and Double are supported, as well as composite types such as List and Set, and the graph-specific Path and Subgraph types. Long text properties are generally handed to Elasticsearch for full-text indexing.
Here is a supplement to the storage engine described earlier. To the query engine (graphd), the storage engine exposes a distributed graph service, but if needed it can also be exposed as a distributed KV service. Within the storage engine, each partition uses the majority consensus protocol Raft. Vertices and edges are stored partitioned; here is how KV modeling realizes that partitioned storage:
The picture above shows a simple graph: a starting vertex, an edge, and an ending vertex, modeled as KV. The upper key models the starting-vertex side, the lower key models the ending-vertex side, and the value part holds the serialized properties. In this example the edge becomes 2 keys. In general, the three graph elements (two vertices plus one edge) are stored on two different partitions: the outgoing edge is stored together with the starting vertex, and the incoming edge together with the ending vertex. The advantage of this design is that breadth-first expansion from the starting vertex to find its edges is very convenient, and reverse breadth-first traversal from the ending vertex back to the starting vertex is equally convenient.
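A toy byte-level encoding shows why this layout makes neighborhood scans cheap: putting the local vertex first in the key means all of a vertex's outgoing (or incoming) edges share a key prefix and sort contiguously in the KV store. This is an assumed, simplified layout, not Nebula Graph's actual on-disk format:

```python
# Toy edge-key encoding (assumed layout, for illustration only).
# Big-endian packing keeps byte order consistent with numeric order,
# so a sorted KV store clusters keys by their leading vertex ID.
import struct

def out_edge_key(src: int, etype: int, rank: int, dst: int) -> bytes:
    """Key stored on src's partition; src leads, so all its out-edges are adjacent."""
    return struct.pack(">QIIQ", src, etype, rank, dst)

def in_edge_key(src: int, etype: int, rank: int, dst: int) -> bytes:
    """Key stored on dst's partition; dst leads, enabling reverse expansion."""
    return struct.pack(">QIIQ", dst, etype, rank, src)

# All outgoing edges of vertex 7 share the same 8-byte prefix, so a
# single prefix scan yields its whole neighborhood:
keys = sorted(out_edge_key(7, 1, 0, d) for d in (42, 9, 100))
prefix = struct.pack(">Q", 7)
assert all(k.startswith(prefix) for k in keys)
```

With this shape, "expand one hop from vertex 7" is a range scan over one partition rather than a scatter of point lookups.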
So the above is how the graph is cut:
In Nebula Graph, one edge is thus stored as 2 copies of data. As mentioned above, the storage layer shards by VID and guarantees strong consistency via the Raft protocol; Raft is also extended with a Listener role that synchronizes data to other processes.
Now for Nebula Graph's features, such as indexing. Currently Nebula Graph uses Elasticsearch for full-text indexing. Starting from v2.x, the R&D team optimized native index write performance, and since v2.5.0 the data-expiration TTL feature can be combined with indexes. There is also the TOSS feature: starting from v2.6.0, Nebula Graph supports eventual consistency between forward and reverse edges. When an edge is inserted or modified, the forward and reverse edges either both succeed or both fail.
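The both-or-neither guarantee can be sketched with a staged write over a toy KV store. This is a deliberate simplification: the real TOSS protocol coordinates across partitions with markers and background reconciliation, which this sketch does not attempt to model:

```python
# Toy illustration of the TOSS invariant: inserting one edge must
# produce BOTH the forward and the reverse record, or neither.
def insert_edge(store: dict, src, etype, dst, props, inject_failure=False):
    staged = {
        ("out", src, etype, dst): props,  # forward edge, src's partition
        ("in", dst, etype, src): props,   # reverse edge, dst's partition
    }
    if inject_failure:
        # Abort before commit: the store is left untouched, so no
        # half-written edge is ever visible.
        return False
    store.update(staged)  # commit both records together
    return True

store = {}
insert_edge(store, "a", "follow", "b", {"since": 2021})
assert len(store) == 2  # both copies present
insert_edge(store, "a", "follow", "c", {}, inject_failure=True)
assert ("out", "a", "follow", "c") not in store  # neither copy present
```

Without such a guarantee, a crash between the two writes would leave an edge visible from one direction but not the other, which is exactly the anomaly TOSS exists to prevent.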
On the query engine and the query language nGQL: there is currently no unified standard for graph database queries. The field is split into two camps. One is the international Graph Query Language (GQL) standard being pushed by Neo4j, TigerGraph, Oracle, Google, and the LDBC organization, largely an extension of the ISO SQL standard. The other is TinkerPop's Gremlin on the Apache side, which cloud vendors and programmers tend to favor. The June draft from the ISO SQL working group shows that the syntax and semantics of the GQL standard have been largely settled and the major database vendors have reached agreement; I personally expect to see a public text next year.
Now for the evolution of Nebula Graph's query language, nGQL. In v1.x the query language was entirely self-developed, in a GO ... STEPS style, with multiple clauses connected by PIPE, as in the following example:
GO N TO M STEPS FROM $ids OVER $edge_type WHERE $filters | FETCH PROP
Starting with v2.x, nGQL is compatible with openCypher, the part of the Cypher language that Neo4j opened up in 2017. Currently nGQL supports openCypher's DQL (traversal and pattern match); for DDL and DML, v2.x retains its native nGQL style.
One of the four design goals mentioned above is production-readiness. For operations, multi-graph-space deployments must consider data isolation, user permission authentication, per-space replica configuration, and different VID types. At the cluster level, the majority consensus protocol implements the CP scheme of CAP, and an AP scheme can also be implemented across data centers. For deployment, the Nebula Operator released at the end of April supports Kubernetes. For monitoring, in addition to the Nebula Dashboard developed by the Nebula team, community users have contributed varying degrees of Grafana and Prometheus support. Bulk data import is also well supported: because Nebula Graph's bottom layer uses RocksDB, the underlying data files can be computed directly on a Spark cluster, for example on 300 Spark machines, and then ingested into Nebula Graph, so a hundred billion edges can be refreshed every hour.
Regarding performance, most Nebula Graph performance tests are run by community users, such as the technical teams at Meituan, WeChat, 360 Finance, and WeBank. Official Nebula Graph performance reports generally compare against its own previous versions, based on the public LDBC_SNB_SF100 dataset; approximate results are shown in the following figures:
In general, performance improves significantly for deep traversals.
The third design goal of Nebula Graph mentioned above is OLTP, so how are users' AP needs met? Nebula Graph connects to Spark's GraphX and supports Plato, the graph computing engine from Tencent's WeChat team. The Plato integration is really a data bridge between the two engines: Nebula Graph's internal data format must be converted into Plato's, with partitions mapped one to one. Related articles will be published on this public account later.
Open Source Community
The last part covers the Nebula Graph community. The following picture is a timeline of Nebula Graph's development:
The Nebula Graph project was open sourced in May 2019 with an alpha release, and v1.0 GA was released in June 2020, by which time some enterprise users had already put Nebula Graph into production. The second major version, v2.0 GA, was released in March this year; its biggest difference from 1.x is openCypher support. As for the DB-Engines ranking mentioned above, Nebula Graph entered the list in December 2019 in last place, and two years later has risen to 15th:
A domestic university also ranked the community popularity of domestic open source products, and vesoft, the company behind Nebula Graph, ranked eighth.
Finally, some thoughts on open source. In the graph field, open source is actually very common and closed source is the exception. Because graph was a niche field until its recent surge in popularity, choosing open source is a good way to build one's own technology brand. Open source also attracts more users, enables exchanging graph techniques with more people, and promotes mutual learning. And the feedback gathered as users adopt the product lets it iterate and improve quickly.
Q&A
The following is an excerpt of audience questions from DTCC:
Q: Are graph databases and multimodal databases conflicting? What is the relationship between the two?
A: Compared with a multi-model database, the biggest characteristic of a graph database is full association: how to realize multi-hop queries. In database design terms, that means deciding what data format the bottom layer should use and how to put data and computation together. This does not conflict with multi-model itself. To improve performance, each multi-model database handles this differently: some use different storage engines per model, others use one storage engine with differently shaped data structures.
Q: What is the starting point of graph query design? Why not consider developing based on Gremlin from the beginning?
A: For data analysts, Gremlin is not a low-threshold language; it is a little unfriendly. At the time, Gremlin's design required each operator to be executed as soon as it was emitted. For example, to do a .out followed by a .in, the engine had to execute .out first and then .in, which rules out global optimization. In 2018, openCypher was not yet mature: it had some minor problems and needed APOC to supplement its expressiveness, and APOC's conventions were not as well settled as openCypher's. So on that basis we developed nGQL ourselves. In 2019 the GQL (query language standardization) effort began, and the relationship among GQL, Cypher, and openCypher became fairly clear, so nGQL has evolved into its current form.
If there are any errors or omissions in this article, please raise an issue on GitHub: https://github.com/vesoft-inc/nebula, or post in the Suggestions & Feedback category of the official forum: https://discuss.nebula-graph.com.cn 👏. Want to discuss graph database technology? Fill in your Nebula card first, and the Nebula assistant will add you to the Nebula exchange group~~