Detailed explanation of nebula 2.0 performance test and nebula-importer data import tuning


This article was written by community user Fanfan, who shares his hands-on experience with Nebula performance testing and with tuning data import performance. "I" in the text below refers to Fanfan.

0. Summary

I had been evaluating Nebula, and as part of that I ran a performance test against our intended use case. Along the way I consulted the Nebula team many times, and I would like to thank them here for their hard work!

Below I walk through my testing process in the hope that it gives you some ideas. If you have better suggestions, please let me know!

1. Deploy Nebula cluster

First, I prepared 4 physical machines, numbered 1 to 4, each with a 96-core CPU, 512 GB of memory, and SSD storage. The services were allocated as follows:

1: meta, storage
2: storage
3: storage
4: graphd

I will not describe the installation in detail; I used the rpm packages. The other tools, nebula-import-2.0 and nebula-bench-2.0, were built from source and installed on all 4 nodes.

2. Import data

The data set imported this time has 7 vertex types and 15 edge types. The volume is modest and the structure is very simple: 34 million records (3400w, where w = 10,000) in total, but they have to be split into that many vertex and edge tables in advance.

First create the space, with vid=100 (the VID length), replica_factor=3, and partition_num=100.
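For reference, the space settings above roughly correspond to a statement like the one built below. This is only a sketch: the space name `perf_test` is hypothetical, and mapping "vid=100" to `vid_type=FIXED_STRING(100)` is an assumption based on the Nebula 2.0 `CREATE SPACE` syntax, not something stated in the original test notes.

```python
def create_space_stmt(name: str, vid_len: int, partitions: int, replicas: int) -> str:
    """Build a Nebula 2.0 CREATE SPACE statement with a fixed-length string VID."""
    return (
        f"CREATE SPACE IF NOT EXISTS {name} "
        f"(partition_num={partitions}, replica_factor={replicas}, "
        f"vid_type=FIXED_STRING({vid_len}));"
    )


# "perf_test" is a hypothetical space name used only for illustration
stmt = create_space_stmt("perf_test", vid_len=100, partitions=100, replicas=3)
print(stmt)
```

The printed statement can be pasted into the Nebula console after adjusting the space name.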

Nebula-importer data import optimization

I started by running nebula-importer as is. The speed was only about 30,000 rows/s (3w/s), which is far too slow. According to the import documentation, the only tunable parameters overall are concurrency, channelBufferSize, and batchsize.

I first tried changing them more or less at random, with no obvious effect, so I posted on the forum to ask the experts. For details, see the forum post "nebula-import 2.0 The import speed is too slow". The advice helped a lot. The first step was to change the YAML parameters:

concurrency: 96 # number of CPU cores
channelBufferSize: 20000
batchsize: 2500

The speed rose to about 70,000-80,000 rows/s (7-8w/s), which is much faster. When I made the values larger still, graphd crashed outright. So these parameters should be as large as the cluster can handle, but no larger.
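To see why these three parameters interact, here is a minimal Python sketch of a batching pipeline with a bounded buffer and a pool of workers. It only illustrates the general pattern and is not nebula-importer's actual implementation (the importer is written in Go); the function `run_import` and the stand-in `send_batch` are hypothetical.

```python
import queue
import threading


def run_import(rows, concurrency, channel_buffer_size, batch_size):
    """Push rows through a bounded buffer to `concurrency` workers that send batches."""
    buf = queue.Queue(maxsize=channel_buffer_size)  # bounded: the reader blocks when full
    sent_batches = []
    lock = threading.Lock()

    def send_batch(batch):
        # Stand-in for one insert request to graphd; here we only record its size
        with lock:
            sent_batches.append(len(batch))

    def worker():
        batch = []
        while True:
            row = buf.get()
            if row is None:  # sentinel: stop reading, then flush what is left
                break
            batch.append(row)
            if len(batch) == batch_size:
                send_batch(batch)
                batch = []
        if batch:
            send_batch(batch)

    workers = [threading.Thread(target=worker) for _ in range(concurrency)]
    for w in workers:
        w.start()
    for row in rows:
        buf.put(row)
    for _ in workers:
        buf.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return sent_batches


batches = run_import(range(10_000), concurrency=4, channel_buffer_size=200, batch_size=250)
print(len(batches), sum(batches))
```

In this model, more workers and larger batches mean more data in flight against the server at once, which fits the observation that oversized values brought graphd down.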

Next I checked the disks and the network, and it turned out the machines were using mechanical disks and a Gigabit (1,000M) network. After switching to SSDs and a 10 Gigabit (10,000M) network, the speed roughly doubled to about 170,000 rows/s (17w/s). Hardware clearly matters a great deal.

Then I wondered whether the data itself played a role. I noticed that the vid length and partition_num were both quite large. I considered a shorter vid, but could not change it because some IDs really are that long. That left partition_num. The official guidance is 2 to 10 times the number of disks, so I changed it to 15. It did make a difference: the speed reached 250,000 rows/s (25w/s). I am quite satisfied with this. Further tuning might bring more improvement, but since the requirement has been met, I stopped here.
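The 2x to 10x rule of thumb for partition_num is easy to turn into a quick check. The sketch below assumes one data disk per storage machine (3 disks in total), which the test notes do not actually state; `partition_num_range` is a hypothetical helper name.

```python
def partition_num_range(total_disks: int) -> tuple[int, int]:
    """Return the (low, high) partition_num range for the 2x-10x disk rule of thumb."""
    if total_disks < 1:
        raise ValueError("total_disks must be at least 1")
    return 2 * total_disks, 10 * total_disks


# Assumption: 3 storage machines with one data disk each
low, high = partition_num_range(3)
print(low, high)
print(low <= 15 <= high)    # the value finally used
print(low <= 100 <= high)   # the value the space started with
```

Under that assumption, 15 falls inside the suggested range while the initial value of 100 is well above it.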

Summary

Set concurrency to the number of CPU cores, and make channelBufferSize and batchsize as large as possible without exceeding what the cluster can handle.
Use SSDs and a 10 Gigabit (10,000M) network.
Choose a reasonable partition_num for the space; do not make it too large.
My guess is that the vid length, the number of properties, and the number of graphd instances also have an impact, but I have not tested this.
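To put the import rates from this section in perspective, the short calculation below estimates how long the full import would take at each stage. It assumes the rate stays steady for the whole run and that the 34 million figure counts the rows being imported; the stage labels are only descriptive.

```python
TOTAL_ROWS = 34_000_000  # 3400w rows, where w = 10,000

# Import rates reported at each tuning stage, in rows per second
stages = {
    "initial run": 30_000,
    "tuned importer parameters": 80_000,  # upper end of the 7-8w range
    "SSD + 10 Gigabit network": 170_000,
    "partition_num = 15": 250_000,
}


def import_minutes(rows: int, rows_per_second: int) -> float:
    """Time to import `rows` at a steady rate, in minutes."""
    return rows / rows_per_second / 60


for stage, rate in stages.items():
    print(f"{stage:28s} {import_minutes(TOTAL_ROWS, rate):5.1f} min")
```

Going from the first rate to the last shortens the estimated import from roughly 19 minutes to a little over 2 minutes.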

3. Stress test

I picked one of the metrics used by the business for the test. It is computed as follows:

match (v:email)-[:emailid]->(mid:id)<-[:phoneid]-(phone:phone)-[:phoneid]->(ids:id)
where id(v)=="replace"
with v, count(distinct phone) as pnum, count(distinct mid) as midnum, count(distinct ids) as idsnum, sum(ids.isblack) as black
where pnum > 2 and midnum > 5 and midnum < 100 and idsnum > 5 and idsnum < 300 and black > 0
return v.value1, true as result

This statement is a three-hop expansion plus a conditional filter. For the densely connected (hot-spot) part of the data, each query touches roughly 200-400 vertices.

The official nebula-bench needs a few modifications. Open JMeter's go_step.jmx configuration file, set ThreadGroup.num_threads to the number of CPU cores, and then set the other parameters, such as loop and ngql, according to your situation. The variable in the ngql statement should be written as the placeholder replace.
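As an illustration of what the replace placeholder does, the sketch below substitutes a vertex ID into a shortened version of the test query. nebula-bench and JMeter perform this substitution themselves; `render_query` and the sample IDs are hypothetical and only show the idea.

```python
# Shortened version of the test query; "replace" marks where the vertex ID goes
QUERY_TEMPLATE = (
    'match (v:email)-[:emailid]->(mid:id) '
    'where id(v)=="replace" return v.value1'
)


def render_query(template: str, vid: str) -> str:
    """Substitute one vertex ID for the "replace" placeholder, escaping quotes."""
    escaped = vid.replace("\\", "\\\\").replace('"', '\\"')
    return template.replace('"replace"', f'"{escaped}"')


# Hypothetical vertex IDs; in a real run these would come from the imported data
for vid in ["a@example.com", "b@example.com"]:
    print(render_query(QUERY_TEMPLATE, vid))
```

Feeding different IDs per request matters here, because querying only hot-spot vertices and querying a broad sample give very different throughput.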

Because the test data was the densely connected part, the result for this run was 700 queries/s. When the test was expanded to cover all nodes, it reached 6,000+ queries/s. Concurrency looks acceptable, and so does query latency, at 300 ms at most.

Because I had only a single graphd node, I wanted to add another graphd to see whether concurrency would improve, so I simply started a second graphd process. The test results showed no improvement.

Then I saw that 2.0.1 had just been released, so I rebuilt the cluster and re-imported the data, this time with three graphd instances. Performance tripled: the densely connected data reached 2,100+ queries/s, and all nodes reached nearly 20,000 queries/s (2w). This is odd given the earlier result. See the forum post "nebula-bench 2.0 to add graph nodes, not be uploaded concurrently".

My guess is that this was because I did not run balance or compact after adding graphd. You can try it if you have time.

In addition, since no monitoring components were deployed and I only used Linux commands to check the machines, I did not collect accurate machine status information.
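As a rough cross-check of these numbers, Little's law relates concurrency, throughput, and average latency. The estimate below assumes a closed loop of 96 JMeter threads (num_threads set to the 96 cores) with no think time between requests; both are assumptions, and the results are estimates, not measurements.

```python
def avg_latency_ms(threads: int, queries_per_second: float) -> float:
    """Little's law for a closed loop with no think time: latency = threads / throughput."""
    return threads / queries_per_second * 1000


THREADS = 96  # assumption: ThreadGroup.num_threads set to the 96 CPU cores

hot_ms = avg_latency_ms(THREADS, 700)    # densely connected data
all_ms = avg_latency_ms(THREADS, 6000)   # test expanded to all nodes
print(round(hot_ms), round(all_ms))
```

Under these assumptions the average latency works out to about 137 ms for the hot-spot data and 16 ms overall, which is consistent with the observed maximum of 300 ms.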

Summary

Before testing, make sure the cluster load is balanced and run compact.
Tune the storage configuration appropriately to increase the number of available threads and the cache memory size.
Concurrency depends heavily on the data. A bare number means little; evaluate it against your own data distribution.

4. Configure

Below are the parameters I modified. meta and graphd use the default configuration with no special changes, so I only list and explain the storage settings.

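rocksdb_block_cache=102400  # the official recommendation is 1/3 of memory; I set 100G here
num_io_threads=48 # number of available threads, set to half the CPU core count
min_vertices_per_bucket=100 # minimum number of vertices per bucket
vertex_cache_bucket_exp=8 # the total number of buckets is 2 to the power of 8
wal_buffer_size=16777216  # 16 M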

write_buffer_size:268435456   # 256 M
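The raw numbers in these settings are easier to trust after checking the unit conversions. The sketch below assumes rocksdb_block_cache is expressed in MB (which matches the 100G comment) and that the two buffer sizes are in bytes; the variable names are only for illustration.

```python
MIB = 1024 * 1024

# rocksdb_block_cache is assumed to be in MB: 100 GB expressed in MB
block_cache_mb = 100 * 1024
# wal_buffer_size and write_buffer_size are assumed to be in bytes
wal_buffer_bytes = 16 * MIB
write_buffer_bytes = 256 * MIB
# vertex_cache_bucket_exp is an exponent: bucket count = 2 ** exp
vertex_cache_buckets = 2 ** 8
# num_io_threads: half of the 96 cores
io_threads = 96 // 2

print(block_cache_mb, wal_buffer_bytes, write_buffer_bytes, vertex_cache_buckets, io_threads)
```

Each computed value matches the corresponding setting, so the comments and the numbers agree with each other.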

I found these parameters by browsing various posts and reading the official code, so they are not necessarily accurate; I arrived at them by trial and error, and left the other parameters alone. Many parameters are not exposed, and the Nebula team does not recommend changing them arbitrarily, so if you need to understand them, look them up in the source code on GitHub.

End

Overall, this test was not particularly rigorous, but Nebula performed very well for our specific business scenario. I have not studied the individual parameters thoroughly and will need to keep investigating as we use it. If you have good tuning ideas, please speak up.

