Graph database platform construction and business adoption

1. What is a graph database

A graph database is a database for storing and querying graph-structured data. Unlike other databases, it treats relationships as first-class citizens, so an application does not have to infer connections between data items through foreign keys or out-of-band processing such as MapReduce. Compared with relational databases or other NoSQL databases, the graph data model is also simpler and more expressive.
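To make the contrast with foreign keys concrete, here is a minimal sketch of a property graph held in memory. The class, method names, and sample data are made up for illustration and do not correspond to the API of any particular graph database.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph: vertices and edges both carry properties."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> property dict
        self.out_edges = defaultdict(list)   # vertex id -> [(edge type, dst id, props)]

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, edge_type, dst, **props):
        # The relationship is stored directly; no join table or foreign key is involved.
        self.out_edges[src].append((edge_type, dst, props))

    def neighbors(self, vid, edge_type):
        """Return the ids reachable from vid over one edge of the given type."""
        return [dst for etype, dst, _ in self.out_edges[vid] if etype == edge_type]


g = Graph()
g.add_vertex("alice", name="Alice")
g.add_vertex("bob", name="Bob")
g.add_vertex("carol", name="Carol")
g.add_edge("alice", "follows", "bob", since=2019)
g.add_edge("bob", "follows", "carol", since=2020)

# Two-hop traversal: who do the people Alice follows follow?
two_hop = [n2 for n1 in g.neighbors("alice", "follows")
           for n2 in g.neighbors(n1, "follows")]
print(two_hop)
```

The two-hop question is answered by following stored edges twice, where a relational model would typically need a self-join per hop.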

Graph databases are widely used in the fields of social networks, knowledge graphs, financial risk control, personalized recommendations, and network security.

2. Graph database survey

2.1 Research background

As business data such as knowledge graphs keeps growing, our existing graph database, JanusGraph, has struggled to keep up, and its import time no longer meets business requirements. Finding a better-performing open source property graph database therefore became an urgent task.

The new graph database should meet the following requirements:

Able to support a large-scale graph with 1 billion nodes, 10 billion edges, and 17 billion properties
Full import time does not exceed 10 hours
Average response time of second-degree neighbor queries does not exceed 50 ms, with QPS of 5000+
Open source, distributed property graph database
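As a rough sanity check on the import requirement above, the sustained write rate it implies can be worked out directly. This sketch assumes that every node and every edge counts as one imported record and that properties travel with their node or edge; that accounting is an assumption, not something stated in the requirements.

```python
NODES = 1_000_000_000
EDGES = 10_000_000_000
MAX_IMPORT_HOURS = 10


def required_throughput(records, hours):
    """Average records per second needed to load `records` within `hours`."""
    return records / (hours * 3600)


rate = required_throughput(NODES + EDGES, MAX_IMPORT_HOURS)
print(f"sustained import rate needed: {rate:,.0f} records/s")
```

Under these assumptions, the import has to sustain roughly 300,000 records per second for the whole 10 hours.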

2.2 Research process

The first step was to collect the common open source distributed property graph databases, as shown in the following table:

In the second step, based on graph database test reports from Meituan, LightGraph, TigerGraph, GalaxyBase, and others, the relative performance of these graph databases was ranked as follows:

Import: NebulaGraph > HugeGraph > JanusGraph > ArangoDB > OrientDB
Query: NebulaGraph > HugeGraph > JanusGraph > ArangoDB > OrientDB

NebulaGraph performs best in both import and query.

In the third step, to verify NebulaGraph's performance, we ran a comparison test between NebulaGraph and JanusGraph. The test results are as follows:

In the above figure, JanusGraph's performance is normalized to 1. NebulaGraph's import performance is an order of magnitude higher than JanusGraph's, and its query performance is 4-7 times higher. The gap widens further as concurrency increases: from 20 threads upward, JanusGraph's third-degree neighbor queries start to return errors, while NebulaGraph returns none.

A full import of 1 billion nodes and 10 billion edges into NebulaGraph takes only 10 hours, which meets the requirement. We are currently investigating SST import, which could greatly increase the import speed.

In a stress test of second-degree neighbor queries with 120 threads, NebulaGraph reached a final QPS of 6000+, slightly better than on a single machine. The success rate is close to five nines, and the response time is relatively stable, with an average of 18.81 ms, a p95 of 38 ms, and a p99 of 115.6 ms, which meets the requirement.
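The stress-test figures above can be cross-checked with Little's law. This sketch assumes a closed-loop test in which each of the 120 threads sends its next query as soon as the previous one returns; the source does not describe the test harness, so that is an assumption.

```python
def closed_loop_qps(threads, avg_latency_ms):
    """Little's law for a closed loop: throughput = concurrency / average latency."""
    return threads / (avg_latency_ms / 1000.0)


estimate = closed_loop_qps(threads=120, avg_latency_ms=18.81)
print(f"expected throughput: {estimate:.0f} QPS")
```

The estimate comes out at roughly 6,380 QPS, which is consistent with the reported 6000+ QPS at an average latency of 18.81 ms.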

2.3 Research conclusion

NebulaGraph's import performance, response time, and stability all meet the requirements. It supports data sharding, its distributed version is free and open source, and it is used by many companies. It has comprehensive documentation, including Chinese documentation, and an active community, which makes it an ideal choice among open source graph databases.

3. Introduction to NebulaGraph

The picture comes from the official website of NebulaGraph

NebulaGraph is an open source, distributed, easily scalable native graph database. It can host very large data sets with hundreds of billions of vertices and trillions of edges while providing millisecond-level queries.

NebulaGraph is written in C++ and designed around the characteristics of graph data. It adopts a shared-nothing architecture, supports scaling out and in without stopping the database service, and provides many native tools, such as Nebula Graph Studio, Nebula Console, and Nebula Exchange, which greatly lower the barrier to using a graph database.

The picture comes from the official website of NebulaGraph

NebulaGraph is composed of three services: Graph service, Meta service and Storage service. It is an architecture that separates storage and computing.

The Meta service is responsible for data management, such as schema operations, cluster management, and user permission management. The service is provided by the nebula-metad process. In a production environment, it is recommended to deploy 3 nebula-metad processes in the NebulaGraph cluster, on different machines, to ensure high availability. All nebula-metad processes form a Raft-based cluster in which one process is the leader and the others are followers.
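The recommendation of 3 nebula-metad processes follows from how Raft majorities work: a Raft group stays available as long as a majority of its members are up. The helper names below are illustrative only.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three members is the smallest group that survives the loss of one machine; two members tolerate no failures at all, which is why the processes should sit on different machines.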

The Graph service is mainly responsible for processing query requests in four major steps: parsing the query statement, validating it, generating an execution plan, and executing that plan. The service is provided by the nebula-graphd process, and multiple instances can be deployed.

The Storage service is responsible for storing data. The service is provided by the nebula-storaged process. All nebula-storaged processes form a cluster based on the Raft protocol. Data is stored in partitions on the nebula-storaged processes; each partition has several replicas, one of which is the leader while the others are followers.
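The partition and replica layout described above can be sketched as follows. This is only an illustration: the CRC32 hash, the round-robin replica placement, and the host names are assumptions chosen to keep the example deterministic, and NebulaGraph's actual hashing and placement logic may differ by version.

```python
import zlib


def partition_of(vid: str, num_parts: int) -> int:
    """Map a vertex id to a partition id in 1..num_parts (assumed hash-mod scheme)."""
    return zlib.crc32(vid.encode("utf-8")) % num_parts + 1


def replicas_of(part_id: int, hosts: list, replica_factor: int) -> list:
    """Place the replicas of a partition on consecutive hosts (assumed round-robin)."""
    start = (part_id - 1) % len(hosts)
    return [hosts[(start + i) % len(hosts)] for i in range(replica_factor)]


hosts = ["storaged-0", "storaged-1", "storaged-2", "storaged-3"]  # hypothetical hosts
part = partition_of("player100", num_parts=10)
print(part, replicas_of(part, hosts, replica_factor=3))
```

Within each replica set, Raft elects one leader and the remaining replicas act as followers; the sketch does not model the election.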

4. Graph database platform construction

When we used JanusGraph, we encountered problems such as slow imports, slow queries, OOM under high concurrency (the JanusGraph thread pool uses an unbounded queue), and full GC (business Gremlin statements embedded literal values, which caused the metaspace to keep growing). After switching to NebulaGraph, these problems were basically resolved.
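The full GC problem above can be illustrated with a toy model. The sketch assumes that the server compiles each distinct script text once and caches the result, so statements that embed literal values each produce a new cache entry. The ScriptCache class is a stand-in written for this example, not a Gremlin Server API, and passing values as bindings is shown as a general mitigation rather than as what the authors did.

```python
class ScriptCache:
    """Stand-in for a server-side cache of compiled scripts, keyed by script text."""

    def __init__(self):
        self.compiled = {}

    def run(self, script, bindings=None):
        if script not in self.compiled:
            # Placeholder for a compiled class that would occupy metaspace.
            self.compiled[script] = object()
        return len(self.compiled)


# Literal values inside the statement: every user id yields a new script text.
inline = ScriptCache()
for uid in range(1000):
    inline.run(f"g.V().has('user', 'id', {uid}).out('follows')")

# Same statement with the value passed separately: one script text in total.
bound = ScriptCache()
for uid in range(1000):
    bound.run("g.V().has('user', 'id', uid).out('follows')", {"uid": uid})

print(len(inline.compiled), len(bound.compiled))
```

In this model the inlined form accumulates one cached entry per distinct value, while the parameterized form stays at a single entry.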

JanusGraph does not have an easy-to-use management interface. As shown in the above figure, we developed our own set of management interfaces covering multi-graph management, schema management, graph visualization, graph import, and permission management.

NebulaGraph Studio, by contrast, provides multi-graph management, schema management, graph visualization, graph import, and other functions out of the box, which saves a lot of development work and lowers the barrier to use.

The structure of the entire graph database platform is shown in the figure above. Built on NebulaGraph and its official tools, the platform provides full import, incremental import, graph export, backup/restore, and a query service (graph retrieval).

The official import tool requires an import configuration file. To make it easier for business teams, we designed a schema configuration form, and the business only needs to fill it in. At import time, the platform automatically creates the graph, creates its schema, generates the import configuration file, imports the data, balances the data, balances the leaders, creates indexes, and runs the compact task. Imports currently still use batch writes. We will investigate SST import in the future, which could further improve import performance.
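The first automated steps, creating the graph and its schema from the form, could look roughly like the sketch below. The form layout and field names are hypothetical, and the generated statements follow nGQL 1.x DDL syntax as best understood here; they should be checked against the nGQL reference for the version in use.

```python
def ddl_from_form(form):
    """Turn a filled-in schema form (hypothetical layout) into nGQL DDL statements."""
    stmts = [
        f"CREATE SPACE IF NOT EXISTS {form['space']}"
        f"(partition_num={form['partition_num']}, replica_factor={form['replica_factor']});",
        f"USE {form['space']};",
    ]
    for section, keyword in (("tags", "TAG"), ("edges", "EDGE")):
        for name, props in form[section].items():
            cols = ", ".join(f"{prop} {ptype}" for prop, ptype in props.items())
            stmts.append(f"CREATE {keyword} IF NOT EXISTS {name}({cols});")
    return stmts


form = {
    "space": "kg",
    "partition_num": 100,
    "replica_factor": 3,
    "tags": {"person": {"name": "string", "age": "int"}},
    "edges": {"spouse": {"since": "int"}},
}
stmts = ddl_from_form(form)
for stmt in stmts:
    print(stmt)
```

The same form could also drive generation of the import configuration file, so the business never edits configuration by hand.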

The official import tool uses an asynchronous client, which makes the import rate extremely difficult to control. If the rate is set too high, requests easily pile up in the graph database and affect cluster stability. If it is set too low, the import falls short of the achievable speed and takes too long. We modified the source code of the official import tool to replace the asynchronous client with a synchronous one, which balances performance and stability.
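The reason a synchronous client is easier to control is that each worker blocks until its request completes, so the number of in-flight requests can never exceed the number of workers. The sketch below demonstrates this with a fake server; it is not the code of the official import tool, and all names are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeGraphServer:
    """Stand-in for the graph database; records how many writes are in flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.max_in_flight = 0
        self.written = 0

    def write_batch(self, batch):
        with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
        time.sleep(0.001)  # simulated server-side work
        with self._lock:
            self.in_flight -= 1
            self.written += len(batch)


def sync_import(server, batches, workers):
    """Each worker sends one batch and waits for it before taking the next."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(server.write_batch, batches))


server = FakeGraphServer()
batches = [list(range(100)) for _ in range(200)]
sync_import(server, batches, workers=8)
print(server.written, server.max_in_flight)
```

With this design, the worker count is the single knob for trading import speed against load on the cluster.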

No official export tool is available, so we developed one based on the official nebula-spark. In addition to data, it can also export the schema configuration and index configuration, which makes business data migration easier.

To support data rollback, we developed a function that quickly backs up and restores the data of a specified graph. However, this function cannot back up the graph's metadata. A full import deletes and rebuilds the graph, and because the metadata changes, earlier backups become unusable. In the future, we will try having the full import only clear the data without deleting the graph, to avoid this problem.

The knowledge graph business has many edge types, and a single query often needs to cover dozens or even hundreds of them. In practice, only the top 10 results (sorted by rank) are needed for each edge type. This is very difficult to express in nGQL: you can either query all the data on these edges, or query the top N across all edges together. The former has performance problems, and the latter often returns data for only some of the edge types, so neither meets the requirement. As a workaround, we classify the edge types. For edge types with few edges, a single statement queries all the data. For edge types with many edges, multiple threads query the top 10 of each edge type in parallel.
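The workaround described above can be sketched as follows. The in-memory data stands in for real graph queries, the function names are hypothetical, and sorting by descending rank is an assumption, since the source only says results are sorted by rank.

```python
from concurrent.futures import ThreadPoolExecutor

TOP_N = 10


def top_n(edges, n=TOP_N):
    """Stand-in for one 'top N by rank' query on a single edge type."""
    return sorted(edges, key=lambda edge: edge[1], reverse=True)[:n]


def fetch_neighbors(edges_by_type, small_limit=TOP_N, workers=8):
    """Small edge types are fetched together; large ones are queried in parallel."""
    small = {t: e for t, e in edges_by_type.items() if len(e) <= small_limit}
    large = {t: e for t, e in edges_by_type.items() if len(e) > small_limit}

    # Small types: one combined query would return everything, so just order the rows.
    result = {t: top_n(e) for t, e in small.items()}

    # Large types: one top-N query per edge type, issued from a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for edge_type, rows in zip(large, pool.map(top_n, large.values())):
            result[edge_type] = rows
    return result


edges_by_type = {
    "spouse": [("v1", 5)],
    "acted_in": [(f"m{i}", i) for i in range(1000)],
}
result = fetch_neighbors(edges_by_type)
print({edge_type: len(rows) for edge_type, rows in result.items()})
```

Every edge type is represented in the result, and no edge type contributes more than 10 rows.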

To ensure high availability, we deploy across two data centers. So that upper-layer businesses do not notice a data center switch, we built a query service (graph retrieval) on top of the graph database. Businesses call the query service directly, and it selects an appropriate graph database cluster based on cluster status. In addition, to shield upper-layer businesses from changes and version upgrades of the underlying graph database, the query service manages all business query statements. When a version upgrade makes a query statement incompatible, only the graph query statement in the query service needs to be adjusted, and upper-layer businesses are unaffected. The query service also caches query results, which can greatly improve graph query throughput.
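A minimal sketch of such a query layer is shown below, covering the three responsibilities described above: choosing a healthy cluster, owning the query statements, and caching results. All class names, cluster names, and the statement template are hypothetical, and the cache has no expiry, which a real service would need.

```python
class Cluster:
    """Stand-in for one graph database cluster in one data center."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.calls = 0

    def execute(self, statement):
        self.calls += 1
        return f"{self.name}: {statement}"


class QueryService:
    def __init__(self, clusters, statements):
        self.clusters = clusters      # in order of preference
        self.statements = statements  # query name -> statement template
        self.cache = {}

    def query(self, name, **params):
        key = (name, tuple(sorted(params.items())))
        if key in self.cache:
            return self.cache[key]
        # Businesses pass only a query name; the statement text lives here.
        statement = self.statements[name].format(**params)
        for cluster in self.clusters:
            if cluster.healthy:
                self.cache[key] = cluster.execute(statement)
                return self.cache[key]
        raise RuntimeError("no healthy graph database cluster available")


primary = Cluster("dc-a", healthy=False)
backup = Cluster("dc-b")
service = QueryService([primary, backup], {"followees": "GO FROM {vid} OVER follow"})

first = service.query("followees", vid=100)
second = service.query("followees", vid=100)  # served from the cache
print(first, backup.calls)
```

Because callers never see statement text or cluster addresses, a version upgrade or a data center switch only requires changes inside the service.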

Of course, we also encountered some problems, such as rank sorting failing because of endianness (big-endian versus little-endian) issues, and query results returning only the edge type ID. For reasons of space, we will not list them all here. With the help of the NebulaGraph community, these problems have been worked around or solved.

*Note: The NebulaGraph issues mentioned above apply only to version 1.2.0, and many of them have been fixed in later versions.
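As general background on why byte order can break sorting, the sketch below shows that integers encoded as little-endian bytes do not sort numerically when compared byte by byte, as key-ordered storage engines do, while big-endian encodings of non-negative integers do. The source does not describe the root cause in NebulaGraph beyond endianness, so this illustrates the class of problem rather than the specific bug.

```python
import struct

ranks = [1, 256, 2, 255]

# Sort by the raw bytes of each encoding, the way a byte-ordered key store would.
little_endian_order = sorted(ranks, key=lambda r: struct.pack("<q", r))
big_endian_order = sorted(ranks, key=lambda r: struct.pack(">q", r))

print("little-endian key order:", little_endian_order)
print("big-endian key order:   ", big_endian_order)
```

In little-endian form the least significant byte comes first, so 256 (bytes 00 01 ...) sorts before 1 (bytes 01 00 ...). Negative numbers need additional handling even in big-endian form, which this sketch does not cover.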

5. Business adoption

5.1 Knowledge graph and intelligent question answering

Before adopting the knowledge graph, Xiaobu Assistant supported only document-based question answering (DBQA). DBQA works on unstructured text and is suited to explanatory and narrative questions such as Why and How, but its accuracy and coverage on factual questions are not high.

After adopting the knowledge graph, Xiaobu Assistant supports knowledge-base question answering (KBQA), and the accuracy and coverage of answers to factual questions such as What and When have greatly improved. For example: Who is xxx's wife? What is the weight of xxx Ultraman? What is the area of Beijing?

In addition to factual question answering, Xiaobu Assistant can also use the reasoning capabilities of the graph to answer more complex questions. For example: What is the relationship between xxx and xxx? What was the first phone released by OPPO? What movies did xxx and xxx co-star in? Which Gemini celebrities were born in xx?
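A question such as "What is the relationship between xxx and xxx?" maps naturally onto a shortest-path search over the graph. The sketch below uses made-up entities and edge labels and a plain breadth-first search; it illustrates the idea and is not how Xiaobu Assistant is implemented.

```python
from collections import deque

# Hypothetical knowledge graph triples: (source, relation, target).
TRIPLES = [
    ("person_a", "spouse", "person_b"),
    ("person_b", "father", "person_c"),
    ("person_c", "employer", "company_x"),
]


def build_adjacency(triples):
    adjacency = {}
    for src, relation, dst in triples:
        adjacency.setdefault(src, []).append((relation, dst))
        # A leading "~" marks an edge followed against its direction.
        adjacency.setdefault(dst, []).append((f"~{relation}", src))
    return adjacency


def relationship_path(adjacency, start, goal):
    """Breadth-first search returning the relation labels on a shortest path."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [relation]))
    return None


adjacency = build_adjacency(TRIPLES)
print(relationship_path(adjacency, "person_a", "person_c"))
```

The returned list of relation labels can then be turned into a natural-language answer by the question-answering layer.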

A knowledge graph contains large-scale semi-structured data with many associations between data items. A relational database cannot meet the storage and query requirements, whereas a graph database can handle the challenges of large-scale graph storage and multi-hop queries.

5.2 Content tagging

In some recommendation scenarios, you need to understand the content of video, audio, or text and label it with content-related tags. For example, in short video recommendation, understanding the content of a video helps recommend it accurately to users.

For film and television videos, actors, works, and roles are built into a film and television entertainment graph. When a new short video is published, the actors can be identified from faces in the video, and the roles can be identified from the title or subtitles. The graph is then used to quickly infer the corresponding film or television work and tag the video accordingly, which improves recommendation quality.
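The inference step above amounts to finding the works connected to every recognized actor and role. The sketch below uses made-up data and a simple set intersection; real recognition output is noisy, so a production system would likely score candidates instead of requiring an exact intersection.

```python
# Hypothetical edges of the entertainment graph.
ACTED_IN = {
    "actor_a": {"series_x", "movie_y"},
    "actor_b": {"series_x", "movie_z"},
}
ROLE_IN = {
    "role_r": {"series_x"},
    "role_s": {"movie_z"},
}


def infer_works(actors, roles):
    """Return the works linked to every recognized actor and every recognized role."""
    candidates = [ACTED_IN.get(actor, set()) for actor in actors]
    candidates += [ROLE_IN.get(role, set()) for role in roles]
    if not candidates:
        return set()
    return set.intersection(*candidates)


# Faces yield actor_a and actor_b; the subtitles mention role_r.
tags = infer_works(["actor_a", "actor_b"], ["role_r"])
print(tags)
```

Here the only work shared by both actors and the role is series_x, so that becomes the content tag for the video.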

5.3 Data lineage

A data warehouse runs many ETL jobs, with many data tables and tasks. Visualizing the upstream and downstream relationships between tables and tasks has become an urgent problem.

Handling multi-level relationship queries with a relational database is cumbersome: the development workload is large and the queries are extremely slow. A graph database greatly reduces the development workload and can quickly find the upstream and downstream relationships of a table, which makes it easy to visualize data lineage.
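Finding everything downstream of a table is a reachability query over the lineage graph. The sketch below uses hypothetical table names and a breadth-first traversal over an in-memory adjacency map to show the shape of the query a graph database answers in one statement.

```python
from collections import deque

# Hypothetical lineage edges: table -> tables produced from it by ETL jobs.
LINEAGE = {
    "ods.orders": ["dwd.orders"],
    "dwd.orders": ["dws.daily_sales", "dws.user_orders"],
    "dws.daily_sales": ["ads.sales_report"],
}


def downstream(table):
    """Return every table derived from `table`, directly or indirectly, in BFS order."""
    found = []
    seen = {table}
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in LINEAGE.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


print(downstream("ods.orders"))
```

In a relational model the same question needs one self-join per level or a recursive query, which is where the development and performance cost comes from.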

5.4 Service Architecture Topology

In service resource management, business resources are divided into multiple levels, and each level has corresponding servers, services, and administrators. With a relational database, displaying multi-level resources requires cumbersome queries that perform poorly. Instead, the relationships among resources, administrators, servers, and business levels can be stored in a graph database, so that a single query statement retrieves everything needed for display, and it does so quickly.

6. Summary

By putting businesses such as the knowledge graph into production, we completed the migration from JanusGraph to NebulaGraph. Import performance improved by an order of magnitude, and query performance and concurrency improved by 3-6 times. NebulaGraph is also more stable than JanusGraph. Along the way we ran into many problems and received a lot of help from the NebulaGraph community. Many thanks to the community for its support!

Graph databases have developed rapidly in recent years. Neo4j raised US$325 million in the first half of this year, setting a new record for database funding. A report released by Gartner pointed out: “By 2023, graph technology will promote rapid decision-making scenarios for 30% of global enterprises. The annual growth rate of graph technology applications exceeds 100%.” With the spread of 5G and the Internet of Things, graph databases will become infrastructure for handling relationships.