This article was first published on the Nebula Graph Community public account

In the last nebula-storage live broadcast on nLive, Wang Yujue (Siwang) from the Nebula storage team shared the design thinking behind Nebula's storage and answered questions from community members. This article is compiled from that live broadcast; the content has been reordered by topic rather than strictly following the timeline of the broadcast.

Nebula's storage architecture

(Figure: Nebula storage architecture)

The storage is mainly divided into three layers. At the bottom is the store engine, which is RocksDB; in the middle is the Raft consensus protocol layer; and at the top, the storage service exposes RPC interfaces, such as fetching vertex properties, fetching edge properties, or finding the neighbors of a given vertex. When we create a space with a statement such as CREATE SPACE IF NOT EXISTS my_space_2 (partition_num=15, replica_factor=1, vid_type=FIXED_STRING(30));, the space is divided into multiple logical units called partitions according to the given parameters. The partitions are spread across different machines, and the replicas of the same partition form a logical unit whose consistency is guaranteed by the Raft consensus algorithm.
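As an illustration only, here is a minimal sketch of how a VID might be mapped to one of the partitions declared at CREATE SPACE time. The hash function and the helper name are assumptions, not Nebula's actual implementation.

```python
import hashlib

def vid_to_partition(vid: bytes, partition_num: int) -> int:
    """Map a vertex ID to a 1-based partition ID (hypothetical hash scheme)."""
    h = int.from_bytes(hashlib.md5(vid).digest()[:8], "little")
    return h % partition_num + 1

# With the CREATE SPACE example above (partition_num=15, FIXED_STRING(30) VIDs):
print(vid_to_partition(b"player100".ljust(30, b"\x00"), 15))   # a value in 1..15
```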

Nebula's storage data format

(Figures: Nebula storage data format, v1.x and v2.x)

Here we focus on why v2.x changed the data format. In v1.x, a Nebula VID was an int, so as shown in the v1.x format above, the VID of both vertices and edges was fixed-length, occupying 8 bytes. Starting from v2.x, in order to support string-typed VIDs, the VertexID became a variable number of bytes, which is why you need to specify the VID length when creating a space. That is the main change; there are also some minor ones, such as removing the timestamp. Overall, the current storage format is closer to the typical usage pattern of a graph, which is starting from a vertex and finding its neighbors: with the v2.x edge key format of VertexID + EdgeType, you can quickly find the outgoing edges of a given vertex.

At the same time, v2.x also changed the key encoding format (the bottom layer of Nebula stores everything as key-value pairs); put simply, vertices and edges are now separated. This way, fetching all tags of a vertex only needs a single prefix scan, avoiding the v1.x problem of sweeping across edges while scanning a vertex.
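To make the key layout concrete, below is a rough Python sketch of v2.x-style vertex and edge keys based on the description above (partition ID plus a type prefix, fixed-length VIDs, and VertexID + EdgeType for edges). The field widths, byte order, and prefix constants are simplifications for illustration, not the real on-disk format.

```python
import struct

VERTEX_PREFIX = 1   # assumed type-prefix values; the real constants differ
EDGE_PREFIX = 2
VID_LEN = 30        # fixed at CREATE SPACE time, e.g. vid_type=FIXED_STRING(30)

def pad_vid(vid: bytes) -> bytes:
    return vid.ljust(VID_LEN, b"\x00")

def vertex_key(part_id: int, vid: bytes, tag_id: int) -> bytes:
    # partition + vertex prefix + fixed-length VID + tag ID
    return struct.pack("<I", part_id) + bytes([VERTEX_PREFIX]) + pad_vid(vid) + struct.pack("<i", tag_id)

def edge_key(part_id: int, src: bytes, edge_type: int, rank: int, dst: bytes) -> bytes:
    # partition + edge prefix + source VID + edge type + rank + destination VID
    return (struct.pack("<I", part_id) + bytes([EDGE_PREFIX]) + pad_vid(src)
            + struct.pack("<i", edge_type) + struct.pack("<q", rank) + pad_vid(dst))

# Because the prefix separates vertices from edges, "all tags of a vertex"
# is a single prefix scan that never touches edge keys:
scan_prefix = struct.pack("<I", 7) + bytes([VERTEX_PREFIX]) + pad_vid(b"player100")
```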

Underlying data storage

In response to the user question "how does Nebula store data at the bottom layer", Siwang replied: Nebula's storage layer uses key-value pairs to store vertex and edge data. For a vertex, the key stores the VID and its tag type, and the value encodes each property according to the schema of that tag. For example, the tag player may have an integer field such as age; when stored, the value of the age field is encoded into the value according to the schema. The key of an edge has a few more fields, mainly the source vertex ID, the edge type, the rank, and the destination vertex ID; this four-tuple uniquely identifies an edge. The value of an edge is similar to that of a vertex: each field is encoded and stored according to the edge schema definition. One thing worth emphasizing is that Nebula stores every edge twice: an edge in GO FROM is a directed edge, and the storage layer stores both the forward edge and the reverse edge. Which vertices point to vertex A, and which vertices A points to, can then both be looked up quickly in either direction.
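As a toy illustration of "one logical edge, two KV records", the sketch below writes a forward record keyed by the source vertex and a reverse record keyed by the destination vertex, each in the partition of its leading vertex. Using the negated edge type for the reverse edge, and the stand-in hash, are assumptions made for readability.

```python
def vid_to_partition(vid: str, partition_num: int) -> int:
    return sum(vid.encode()) % partition_num + 1   # stand-in hash, not the real one

def edge_records(src: str, dst: str, edge_type: int, rank: int, props: dict, partition_num: int = 15):
    """One inserted edge becomes two KV records: forward and reverse."""
    forward = ((vid_to_partition(src, partition_num), src,  edge_type, rank, dst), props)
    reverse = ((vid_to_partition(dst, partition_num), dst, -edge_type, rank, src), props)
    return [forward, reverse]

for key, value in edge_records("player100", "team200", edge_type=10, rank=0, props={"since": 2010}):
    print(key, value)
```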

Generally speaking, graph storage is partitioned in one of two ways: edge-cut or vertex-cut. As mentioned above, Nebula adopts the edge-cut approach: each edge is stored as two KV records.

User question: Why use the edge-cut approach, and what are the advantages and disadvantages of vertex-cut versus edge-cut?

With edge-cut, each edge is stored twice, so the total amount of data is much larger than with vertex-cut: the number of edges in graph data is far larger than the number of vertices, which introduces a lot of redundancy. The corresponding benefit is that a vertex and its outgoing edges are mapped to the same partition, so queries that start from a single vertex can return results quickly. With vertex-cut, since a vertex may be distributed across multiple machines, data consistency has to be considered when updating data; vertex-cut is generally more widely used in graph computing.

You ask, I answer

The following content is collected from the AMA session announced in the event preview, as well as questions raised in the on-screen comments during the live broadcast.

Question list

  • Does the value of an edge store edge properties?
  • Reasons for the strong-schema design
  • The design of storing a single edge
  • How graph spaces are physically isolated
  • How Meta stores the schema
  • Future plans for storage
  • Traversing vertices and edges by VID
  • Data pre-check
  • Nebula Monitoring
  • Transactions in Nebula
  • The data bloat problem
  • Handling disks of uneven capacity
  • Nebula's RocksDB customizations

Does the value of an edge store edge properties?

As mentioned in "Underlying data storage" above, the properties of an edge type are specified when the edge schema is created, and these properties are stored in the value of the underlying RocksDB key. The slots those properties occupy in the value are fixed-length, which leads into the next question:

Reasons for the strong-schema design

Is the strong schema there for technical reasons or product reasons? Considering that the string type is variable-length, the length of each row is not fixed anyway, which feels no different from having no schema. If rows are not fixed-length, how do you know where to look when querying? Is there a flag bit?

The essential reason for using a strong schema is speed. First consider the common simple data types, such as int and double: their length is fixed, so we encode them directly at the corresponding position in the value. Next, the string types. There are two kinds of strings in Nebula: one is the fixed-length string, whose length is fixed and which, like the simple data types, is encoded at a fixed position in the value; the other is the variable-length string. Generally speaking, people prefer variable-length strings because they are more flexible, and variable-length strings are stored in the form of a pointer.

For example, suppose a property in the schema is a variable-length string. We do not encode it in place like a simple data type; instead, we store an offset pointer at the corresponding position, which points to, say, the 100th byte of the value, and the actual string is stored starting at byte 100. Reading a variable-length string therefore requires reading the value twice: the first read gets the offset, the second actually reads the string. In this way, properties of all types become "fixed-length". The advantage of this design is that the position of the field to be read can be computed directly from the sizes of all the fields in front of it, and the field can then be read straight out of the value. During reading, irrelevant fields never need to be touched, which avoids the weak-schema problem of having to decode the entire value.
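The sketch below illustrates the "fixed slot plus offset pointer" idea described above. The slot widths and layout are invented for readability and are not Nebula's actual encoding.

```python
import struct

# Toy value encoder: int64 and double are written in place; a variable-length
# string occupies a fixed 8-byte slot holding (offset, length) into a string
# area appended after all fixed slots.

def encode_value(schema, row):
    fixed, strings = b"", b""
    fixed_len = sum(8 for _ in schema)          # every slot is 8 bytes here
    for name, ftype in schema:
        v = row[name]
        if ftype == "int":
            fixed += struct.pack("<q", v)
        elif ftype == "double":
            fixed += struct.pack("<d", v)
        elif ftype == "string":
            off = fixed_len + len(strings)      # where this string will live
            fixed += struct.pack("<ii", off, len(v))
            strings += v.encode()
    return fixed + strings

def read_field(schema, value, field):
    idx = [n for n, _ in schema].index(field)
    slot = value[idx * 8:(idx + 1) * 8]         # position computed, not scanned
    ftype = schema[idx][1]
    if ftype == "int":
        return struct.unpack("<q", slot)[0]
    if ftype == "double":
        return struct.unpack("<d", slot)[0]
    off, ln = struct.unpack("<ii", slot)        # first read: the pointer
    return value[off:off + ln].decode()         # second read: the bytes

schema = [("age", "int"), ("name", "string")]
value = encode_value(schema, {"age": 33, "name": "Tim Duncan"})
print(read_field(schema, value, "name"))        # -> Tim Duncan
```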

A graph database like Neo4j is generally schema-free, which is more flexible when writing, but it consumes some CPU on serialization and deserialization and requires decoding when reading.

Follow-up: If there is a variable-length string, will the data length of each row differ?

Yes, the length of the value may differ, because the variable-length string makes it longer.

Follow-up: If the length of each row differs, why do you still need to enforce a schema?

The RocksDB used for Nebula's underlying storage is organized in blocks, each of which may be 4 KB in size. Reads are also done block by block, and the values within a block may well have different lengths. The advantage of a strong schema is that reading a single record is faster.

The design of storing a single edge

Nebula stores each edge twice; could only one copy be stored? Would storing a single edge cause problems for reverse queries?

This is actually a good question. In Nebula's earliest design, the edge properties were stored only once, which suits some business scenarios, for example when you never need reverse traversal; in that case you do not need to store reverse edges at all. At present, the biggest value of storing reverse edges is to make reverse queries convenient. To be precise, earlier versions of Nebula only stored the key of the reverse edge; the edge properties were not stored on it and existed only on the forward edge. That can bring some problems: for bidirectional traversal or reverse queries, the whole code logic, including the processing flow, becomes more complicated.
If only one copy of the edge is stored, reverse lookups do indeed become a problem.

How graph spaces are physically isolated

When you use Nebula, you first create a graph space with CREATE SPACE. When a graph space is created, the system assigns it a unique graph space ID called the spaceId; you can get the spaceId with DESCRIBE SPACE. Then, when storage finds that a machine needs to hold some data of that space, it first creates a separate directory and then creates a separate RocksDB instance on top of it to store the data; this is how physical isolation is done. There is a drawback in this design: although the RocksDB instances, or the space directories, are isolated from each other, they may sit on the same disk, so the current resource isolation is not good enough.
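A rough sketch of the idea of one store instance per graph space per data path, each under its own directory. The directory naming and the placeholder Engine class are assumptions for illustration; a real build would open a RocksDB instance at that path.

```python
import os

class Engine(dict):
    """Placeholder for a RocksDB instance (a dict keeps the sketch runnable)."""

def open_engine(data_path: str, space_id: int) -> Engine:
    # hypothetical layout: <data_path>/nebula/<spaceId>/data
    space_dir = os.path.join(data_path, "nebula", str(space_id), "data")
    os.makedirs(space_dir, exist_ok=True)
    return Engine()                       # a real implementation opens RocksDB here

engines: dict = {}                        # (space_id, data_path) -> Engine
def get_engine(space_id: int, data_path: str) -> Engine:
    key = (space_id, data_path)
    if key not in engines:
        engines[key] = open_engine(data_path, space_id)
    return engines[key]
```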

How Meta stores the schema

Take CREATE TAG as an example. When we create a tag, a request is first sent to meta so that it writes down this information. The write itself is very simple: first obtain a tagId, then save the tag name. The key stored in the underlying RocksDB is the tagId or the tag name, and the value is the definition of each of its fields; for example, the first field is age of type int, and the second field is name of type string. The schema stores the types and names of all fields in the value and writes them to RocksDB in serialized form.
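As a rough illustration of that write path (not the real serialization format), storing a tag boils down to a couple of KV writes in meta's own store. The key names and JSON encoding here are invented for readability.

```python
import json

meta_kv: dict = {}       # stand-in for meta's own RocksDB instance
next_tag_id = 1

def create_tag(space_id: int, tag_name: str, fields: list) -> int:
    """Hypothetical sketch: allocate a tagId, index the name, store the schema."""
    global next_tag_id
    tag_id, next_tag_id = next_tag_id, next_tag_id + 1
    # name -> id lookup, and id -> serialized field definitions
    meta_kv[f"tag_index:{space_id}:{tag_name}"] = tag_id
    meta_kv[f"tag:{space_id}:{tag_id}"] = json.dumps(
        {"name": tag_name, "fields": [{"name": n, "type": t} for n, t in fields]}
    )
    return tag_id

create_tag(1, "player", [("age", "int"), ("name", "string")])
```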

Here, the bottom layer of both services, meta and storage, is RocksDB used as KV storage, but they provide different interfaces. For example, the interfaces provided by meta are things like saving a tag and the properties on that tag, or meta information about machines, spaces, user permissions and configuration, all of which are stored in meta. Storage is also KV storage, but the data stored is vertex and edge data, and the interfaces provided are graph operations such as fetching vertices, fetching edges, and fetching all outgoing edges of a vertex. Overall, the KV storage layer code of meta and storage is exactly the same; what differs is the interfaces exposed on top.

Finally, storage and meta store their data separately: they are not the same process, and the directories they store data in, which are specified at startup, are different.

Follow-up: What if the meta machine goes down?

In general, Nebula recommends deploying meta with three replicas; in that case losing a single machine is not a problem. If a single-replica meta deployment goes down, no schema operations can be performed at all, including creating spaces. Because storage and graph do not strongly depend on meta (they only fetch information from meta at startup and then periodically refresh it), if the cluster is already running and meta goes down, it has no effect on graph and storage as long as you do not modify the schema.

Future plans for storage

Are there any plans for Nebula's storage layer going forward, in terms of performance, availability and stability?

In terms of performance, the bottom layer of Nebula uses RocksDB, and its performance depends largely on how it is used and how well the parameters are tuned; frankly, even for Facebook's own engineers, parameter tuning is something of a dark art. Furthermore, as introduced above for Nebula's underlying key format, the relative position of, say, the VID or the EdgeType in the underlying storage determines to some extent how certain queries perform. Beyond RocksDB itself, there is actually a lot of performance work to do: first, when writing vertices or edges, indexes may need to be processed, which adds extra overhead; also, compactions and the actual business workload can have a big impact on performance.

In terms of stability, the bottom layer of Nebula uses the Raft protocol, which is the critical piece that ensures Nebula Graph does not lose data. Only when this layer is stable will there be no data inconsistency or data loss when writing into the RocksDB below it. In addition, Nebula is designed as a general-purpose database and therefore faces problems common to all general-purpose databases, such as DDL changes; and since Nebula is a distributed graph database, it also faces the problems of distributed systems, such as network partitions, network outages, various timeouts, or nodes dying for some reason. All of these require coping mechanisms. For example, Nebula currently supports dynamic scale-out and scale-in; the whole process is very complicated, requiring data migration on meta, on the nodes being removed, and on the remaining "alive" nodes, and if any step in the middle fails, failover handling is needed.

In terms of availability, we will introduce a primary-standby architecture in the future. In some scenarios the amount of data involved is relatively small, and it is not necessary to store three replicas; a single machine is enough. When all the data is on a single machine, unnecessary RPC calls can be removed and replaced with local calls, and performance may improve considerably. Nebula deploys three services in total: meta, graph and storage. For a stand-alone deployment, graph and storage can be placed on the same machine. Originally, graph has to call the storage interface via RPC to fetch data and then return to graph for further processing; if the query is a multi-hop query, the call path from graph to storage is traversed many times, which increases network overhead and serialization/deserialization costs. When the two processes (storaged and graphd) are merged, there are no RPC calls, so performance improves greatly. In addition, CPU utilization is much better in this single-machine setup. This is what the Nebula storage team is currently working on; it will meet you in the next major version.

Traversing vertices and edges by VID

Can vertices and edges be traversed by VID?

As you can see from the figure above, a type value is stored as part of the key. In v1.x, the type value was the same for vertices and edges, which caused the problem mentioned earlier: scanning a vertex would also sweep across many edges. Starting from v2.x, the types of vertices and edges are distinguished, so the prefix type values differ. Given a VID, whether you fetch all tags or all edges, only one prefix scan is needed, and no extra data is scanned.

Data pre-check

Nebula uses a strong schema. How does it judge whether a field conforms to the definition when data is inserted?

It works roughly like this. When creating a schema, you are required to specify whether a field is nullable, has a default value, or neither. When we insert a piece of data, the insert statement requires you to specify the value of each field. After the insert query is sent to the storage layer, the storage layer checks whether every field has been given a value, or whether the fields without values have a default or are nullable. If not all fields can be filled in, the system reports an error telling the user that the query is invalid and cannot be written. If there is no error, storage encodes the value and writes it to RocksDB through Raft. That is roughly the whole process.
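A minimal sketch of this kind of pre-check, assuming a simplified schema representation (field name, nullable flag, optional default); the class and function names are made up for illustration.

```python
# Hypothetical pre-check: every schema field must either be given a value in
# the INSERT, have a default, or be nullable; otherwise the write is rejected
# before anything is encoded or handed to Raft.

class FieldDef:
    def __init__(self, name, nullable=False, default=None):
        self.name, self.nullable, self.default = name, nullable, default

def precheck(schema: list, row: dict) -> dict:
    filled = {}
    for f in schema:
        if f.name in row:
            filled[f.name] = row[f.name]
        elif f.default is not None:
            filled[f.name] = f.default
        elif f.nullable:
            filled[f.name] = None
        else:
            raise ValueError(f"field '{f.name}' has no value, no default, and is not nullable")
    return filled          # only now would the value be encoded and written via Raft

schema = [FieldDef("name"), FieldDef("age", default=0)]
print(precheck(schema, {"name": "Tim Duncan"}))   # {'name': 'Tim Duncan', 'age': 0}
```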

Nebula Monitoring

Can Nebula produce statistics per graph space? I remember the metrics seem to be per machine.

This is a very good question, and the current answer is no; we are planning it. The main issue behind this is that there are few metrics at the moment: the metrics we currently support are only latency, QPS, and the QPS of errors. Each metric has a mean, maximum, minimum, sum and count, as well as quantiles such as p99. These are currently machine-level metrics, and two improvements will follow: one is to add more metrics; the other is to provide statistics at the space level. For each space, we will provide QPS for statements such as FETCH, GO and LOOKUP. The above are metrics on the graph side; as for storage, since it does not have strong resource isolation, it will still provide cluster-level or machine-level metrics rather than space-level ones.

Transactions in Nebula

How is the edge transaction in Nebula 2.6.0 implemented?

First, some background. As mentioned above, Nebula stores each edge as two KV records, and these two KV records may live on different nodes; if a machine goes down, one of the two edges may fail to be written. The so-called edge transaction, or TOSS, mainly solves this problem: when one of the machines is down, the storage layer can still guarantee the eventual consistency of the two edges (the outgoing edge and the incoming edge). The consistency level is eventual consistency; strong consistency was not chosen because of the error-reporting and data-handling problems encountered during development, so eventual consistency was selected in the end.

Now the overall TOSS processing flow. First, the forward-edge information is sent to the first machine, which writes the data: a mark is written on that machine, and if the mark is written successfully we go to the next step; if it fails, an error is reported directly. In the second step, the reverse-edge information is sent from the first machine to the second machine. The machine holding the forward edge can do this because, in Nebula, the forward and reverse edges differ only in that the source and destination vertices are swapped, so a machine holding a forward edge can fully reconstruct the reverse edge. After the machine storing the reverse edge receives it, it writes the edge directly and reports back to the first machine whether the write succeeded. After the first machine receives the write result, assuming it succeeded, it deletes the mark written in the first step and replaces it with a normal edge; at that point the normal write path of the whole edge is complete. So this is a chained synchronization mechanism.
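Below is a highly simplified sketch of that chained flow, including the background retry used when the reverse write fails (described next). Partitions are plain dicts, the "RPC" is a function call, and all names are invented; this is not the actual TOSS implementation.

```python
# Toy model of the TOSS chain: the partition holding the forward edge first
# persists a mark, then forwards the reverse edge to the peer partition, and
# only after hearing back does it replace the mark with the real forward edge.

def write_edge_with_toss(fwd_part: dict, rev_part: dict, edge: dict, recover_queue: list):
    src, etype, rank, dst, props = edge["src"], edge["type"], edge["rank"], edge["dst"], edge["props"]
    mark_key = ("mark", src, etype, rank, dst)
    fwd_part[mark_key] = props                        # step 1: write the mark (error out if this fails)

    rev_ok = send_reverse_edge(rev_part, (dst, -etype, rank, src), props)   # step 2: chain to the peer

    fwd_part[(src, etype, rank, dst)] = props         # step 3: normal forward edge replaces the mark
    del fwd_part[mark_key]
    if not rev_ok:                                    # peer failed: remember it for the background fixer
        recover_queue.append((rev_part, (dst, -etype, rank, src), props))

def send_reverse_edge(rev_part: dict, key, props) -> bool:
    """Stand-in for the RPC to the partition holding the reverse edge."""
    rev_part[key] = props
    return True

def background_fix(recover_queue: list):
    """Keeps retrying failed reverse edges until both sides agree (eventual consistency)."""
    while recover_queue:
        rev_part, key, props = recover_queue.pop()
        rev_part[key] = props
```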

Briefly, the failure paths. If the first machine fails to write at the very beginning, an error is reported immediately. If the first machine succeeds but the second machine fails to write, the first machine has a background thread that keeps trying to repair the edge on the second machine, guaranteeing that it ends up the same as on the first machine. The more complicated cases are handled by the first machine according to the error code returned by the second machine; at present, in all flows it deletes the mark, replaces it with the normal forward edge, and additionally writes extra markers representing the failed edges that still need to be recovered, so that the two sides become consistent eventually.

Follow-up: Are there no transactions on vertices?

That's right: because a vertex is stored only once, it does not need a transaction. People who ask this usually mean transactions spanning vertices and edges, such as checking whether a vertex exists when inserting an edge, or deleting the corresponding edges when deleting a vertex. Nebula currently allows dangling edges by design, for performance reasons. Solving the above problem would require introducing full transactions, but performance would drop by an order of magnitude. By the way, I just mentioned that TOSS synchronizes information in a chained fashion; the reason this works, as noted above, is that the first node can fully reconstruct the data of the second node. For full transactions, however, a chained approach would degrade performance much more severely, so the future transaction design will not adopt this method.

The data bloat problem

How is initially imported data stored? I found that the first batch of imported data takes up a lot of disk space.

When disk usage is high, it is generally because there are many WAL files. Since the amount of imported data is usually large, a large amount of WAL is generated. The default WAL TTL in Nebula is 4 hours, and during those 4 hours the WAL files are not deleted at all, so the disk space occupied looks very large. In addition, the data is of course also written into RocksDB, so compared with a cluster that has been running normally for a while, disk usage at import time is higher. The corresponding solution is also simple: when importing data, reduce the WAL TTL, for example to half an hour or an hour, and disk usage will drop. Of course, if the disk is large enough, keeping the default 4 hours and doing nothing is also fine, because after a few hours a background thread keeps checking which WAL files can be deleted; with the default of 4 hours, for example, expired WAL files are deleted as soon as they are found.
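To illustrate why the space is reclaimed on its own after the TTL, here is a toy version of that kind of background cleanup; the file layout, extension and deletion logic are invented for the sketch, while the real TTL is simply a storage configuration value.

```python
import os
import time

WAL_TTL_SECONDS = 4 * 3600      # the default described above is 4 hours

def purge_expired_wal(wal_dir: str, ttl: int = WAL_TTL_SECONDS) -> None:
    """Toy background task: delete WAL segments older than the TTL."""
    now = time.time()
    for name in os.listdir(wal_dir):
        path = os.path.join(wal_dir, name)
        if name.endswith(".wal") and now - os.path.getmtime(path) > ttl:
            os.remove(path)

# During a bulk import one might lower the TTL (e.g. to 30 minutes) so the
# burst of WAL files is reclaimed sooner:
# purge_expired_wal("/path/to/partition/wal", ttl=1800)
```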

Apart from the peak during the initial import, the amount of data written in real time by online business is usually not that large, and the WAL files are relatively small. It is not recommended to delete WAL files manually, because that may cause problems; they can be deleted automatically according to the TTL.

What does compaction do to improve queries and reduce the data storage footprint?

You can read the RocksDB introductions and articles. Simply put, compaction is mainly a multi-way merge sort. RocksDB has an LSM-tree structure, and writes are append-only, which leads to some redundancy in the data. Compaction is used to reduce this redundancy: it takes SST files as input, performs a merge sort, removes redundant data, and finally outputs new SST files. During this input-output process, compaction checks whether the same key appears in different levels of the LSM; if the same key appears multiple times, only the latest version is kept and the old ones are deleted, which increases the orderliness of the SSTs. At the same time, the number of SSTs and the number of LSM-tree levels may decrease, so fewer SSTs need to be read during a query and query efficiency improves.
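A compact way to see the "multi-way merge that keeps only the newest version of each key" idea is the toy merge below; real RocksDB compaction additionally handles tombstones, levels and file boundaries, so this is only a sketch.

```python
import heapq

def compact(sst_files):
    """Toy compaction: each input is a list of (key, seq, value) sorted by key.
    Merge-sort all inputs and keep only the highest-seq version of each key."""
    merged = heapq.merge(*sst_files, key=lambda kv: (kv[0], -kv[1]))
    out, last_key = [], None
    for key, seq, value in merged:
        if key != last_key:            # newest version wins; older duplicates are dropped
            out.append((key, seq, value))
            last_key = key
    return out

older = [("a", 1, "v1"), ("b", 2, "v1")]
newer = [("a", 5, "v2"), ("c", 3, "v1")]
print(compact([older, newer]))   # [('a', 5, 'v2'), ('b', 2, 'v1'), ('c', 3, 'v1')]
```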

Handling disks of uneven capacity

Is filling disks of different sizes proportionally being considered? I use two disks of different sizes, and after one of them fills up there are problems with subsequent writes.

At present this is not easy to do. The main reason is that partitions are assigned to data paths in a round-robin fashion; another reason is that Nebula shards data by hash, so the amount of data on each data disk tends to be roughly the same. As a result, if the two data disks have different sizes, one disk fills up first and later data cannot be written. A workaround is to handle it at the system level: bind the two disks into a single logical volume and mount it under one path.

Nebula's RocksDB "Magic Change"

In Nebula's RocksDB storage, are vertex properties distinguished by column family?

At present we do not really use column families; only the default column family is used. They may be used later, but not to distinguish vertex properties; rather, different partitions' data would be put into different column families, which has the advantage of direct physical isolation.

Nebula's customized WAL seems to be a global multi-raft WAL, but the directory layout makes it look as if each graph space has its own WAL. What is the principle here?
First of all, Nebula is indeed multi-raft, but there is no concept of a global WAL. Nebula's WAL is at the partition level: each partition has its own WAL, and there is no per-space WAL. As for why it is designed this way: it is relatively easy to implement for now, although there are performance losses; for example, with multiple WALs, disk writes become random writes. But for Raft, this is not the write bottleneck; the system's network overhead and the cost of replicating the user's operations are the largest.

