This article was first published on the Nebula Graph Community public account

Foreword

As a relatively mature product, Nebula Graph already has a rich ecosystem. For data import alone, it offers a variety of choices: the large and comprehensive Nebula Exchange, the small and compact Nebula Importer, and the Nebula Spark Connector and Nebula Flink Connector for the Spark/Flink engines.

Among the many import methods, which one is more convenient?

Introduction to usage scenarios:

  • Nebula Exchange

    • Need to import streaming data from Kafka and Pulsar platforms into Nebula Graph database
    • Need to read batch data from relational databases (such as MySQL) or distributed file systems (such as HDFS)
    • Large batches of data need to be generated into SST files that Nebula Graph can recognize
  • Nebula Importer

    • For importing the contents of local CSV files into Nebula Graph
  • Nebula Spark Connector

    • Migrate data between different Nebula Graph clusters
    • Migrate data between different graph spaces within the same Nebula Graph cluster
    • Migrate data between Nebula Graph and other data sources
    • Graph computation with Nebula Algorithm
  • Nebula Flink Connector

    • Migrate data between different Nebula Graph clusters
    • Migrate data between different graph spaces within the same Nebula Graph cluster
    • Migrate data between Nebula Graph and other data sources

The above is taken from the official Nebula documentation: https://docs.nebula-graph.com.cn/2.6.2/1.introduction/1.what-is-nebula-graph/

In general, Exchange is large and complete: it can pull data from most storage engines and import it into Nebula, but it requires a Spark environment to be deployed.


Importer is easy to use and has few dependencies, but you need to generate the data files in advance and configure the schema up front, and it does not support resuming from a breakpoint. It is suitable for medium data volumes.

The Spark / Flink Connectors need to be used in combination with streaming data.

Choose a different tool for each scenario. If you are a newcomer to Nebula and want to import data, the Nebula Importer tool is recommended: it is easy and quick to get started with.

Use of Nebula Importer

When we first started with Nebula Graph, the ecosystem was not yet mature and only part of our business had been migrated to Nebula Graph. We imported data into Nebula Graph, in full or incrementally, by pushing Hive tables to Kafka, then consuming Kafka and writing to Nebula Graph in batches. Later, as more and more data and business were switched over, import efficiency became a serious problem: the import time kept growing, to the point where a full import was still running during peak business hours, which was unacceptable.

In response to the problems above, after trying the Nebula Spark Connector and Nebula Importer, and considering ease of maintenance and migration, we settled on a full-import pipeline of Hive table -> CSV -> Nebula server -> Nebula Importer, and the overall time spent improved greatly.
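The Hive-to-CSV step can be done with a plain HiveQL export. The statement below is only a minimal sketch; the table name, columns, and output directory are assumptions, not our production job:

-- Export a Hive table as comma-separated files that Importer can read.
-- relation_vertex and the output path are hypothetical.
INSERT OVERWRITE DIRECTORY '/tmp/csv_file/prod_relation'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT vid, name, age FROM relation_vertex;

The resulting files then need to be shipped to the Nebula server (for example with hadoop fs -get plus scp) before Importer runs.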

Related configuration of Nebula Importer

System environment

 [root@nebula-server-prod-05 importer]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-15

Disk: SSD
Memory: 128G

Cluster environment

  • Nebula Version: v2.6.1
  • Deployment method: RPM
  • Cluster size: three replicas, six nodes

Data scale

+---------+--------------------------+-----------+
| Type    | Name                     | Count     |
+---------+--------------------------+-----------+
| "Space" | "vertices"               | 559191827 |
+---------+--------------------------+-----------+
| "Space" | "edges"                  | 722490436 |
+---------+--------------------------+-----------+
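This table has the shape of the console's SHOW STATS output; statistics like these can be produced by first running a stats job in the target space:

-- Collect statistics for the current space, then display them once the job finishes.
SUBMIT JOB STATS;
SHOW STATS;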

Importer configuration

# Graph version; set this to v2 when connecting to Nebula 2.x.
version: v2

description: Relation Space import data

# Whether to remove the temporarily generated log and error data files.
removeTempFiles: false

clientSettings:

  # Number of retries for failed nGQL statements.
  retry: 3

  # Number of concurrent Nebula Graph clients.
  concurrency: 5

  # Size of the cache queue for each Nebula Graph client.
  channelBufferSize: 1024

  # The Nebula Graph space to import the data into.
  space: Relation

  # Connection information.
  connection:
    user: root
    password: ******
    address: 10.0.XXX.XXX:9669,10.0.XXX.XXX:9669

  postStart:
    # Commands to run after connecting to the Nebula Graph server and before inserting data.
    commands: |

    # Interval between running the commands above and running the insert commands.
    afterPeriod: 1s

  preStop:
    # Commands to run before disconnecting from the Nebula Graph server.
    commands: |

# File path for error and other log output.
logPath: /mnt/csv_file/prod_relation/err/test.log
....

Since there is only space to show the global configuration here, the many vertex- and edge-related settings are not expanded on; for details, please refer to GitHub. To give a feel for that part of the file, a hypothetical entry is sketched below.
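A minimal sketch of one files entry, following the nebula-importer v2 config format; the CSV path, tag name, and property names here are assumptions rather than our production settings:

files:
  # One entry per CSV file; this one maps a hypothetical person.csv to a vertex tag.
  - path: /mnt/csv_file/prod_relation/person.csv
    batchSize: 128
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: vertex
      vertex:
        vid:
          index: 0            # column 0 holds the vertex ID
        tags:
          - name: person      # hypothetical tag name
            props:
              - name: name
                type: string
                index: 1      # column 1 maps to the name property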

We set up crontab jobs: Hive generates the tables and transfers them to the Nebula server, and the Nebula Importer task runs at night when traffic is low:

 50 03 15 * * /mnt/csv_file/importer/nebula-importer -config /mnt/csv_file/importer/rel.yaml >> /root/rel.log

The whole run took about 2 hours in total, and the full data import completed around 6:00.

Part of the log is shown below; the import speed holds at a maximum of about 200,000 rows/s:

 2022/05/15 03:50:11 [INFO] statsmgr.go:62: Tick: Time(10.00s), Finished(1952500), Failed(0), Read Failed(0), Latency AVG(4232us), Batches Req AVG(4582us), Rows AVG(195248.59/s)
2022/05/15 03:50:16 [INFO] statsmgr.go:62: Tick: Time(15.00s), Finished(2925600), Failed(0), Read Failed(0), Latency AVG(4421us), Batches Req AVG(4761us), Rows AVG(195039.12/s)
2022/05/15 03:50:21 [INFO] statsmgr.go:62: Tick: Time(20.00s), Finished(3927400), Failed(0), Read Failed(0), Latency AVG(4486us), Batches Req AVG(4818us), Rows AVG(196367.10/s)
2022/05/15 03:50:26 [INFO] statsmgr.go:62: Tick: Time(25.00s), Finished(5140500), Failed(0), Read Failed(0), Latency AVG(4327us), Batches Req AVG(4653us), Rows AVG(205619.44/s)
2022/05/15 03:50:31 [INFO] statsmgr.go:62: Tick: Time(30.00s), Finished(6080800), Failed(0), Read Failed(0), Latency AVG(4431us), Batches Req AVG(4755us), Rows AVG(202693.39/s)
2022/05/15 03:50:36 [INFO] statsmgr.go:62: Tick: Time(35.00s), Finished(7087200), Failed(0), Read Failed(0), Latency AVG(4461us), Batches Req AVG(4784us), Rows AVG(202489.00/s)

Then, at 7:00, we re-consume Kafka according to timestamps to import the incremental data generated between the early morning and 7:00 that day, preventing the T+1 full data from overwriting that day's incremental data.

 50 07 15 * * python3  /mnt/code/consumer_by_time/relation_consumer_by_timestamp.py

Incremental consumption takes about 10-15 minutes.
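The re-consumption script itself is not shown here; a minimal sketch of the timestamp-seeking logic, assuming kafka-python and a hypothetical topic and broker (this is not the actual relation_consumer_by_timestamp.py), could look like this:

# Seek every partition to the first offset at or after today's 00:00,
# then consume up to 07:00. Topic, broker, and handler are assumptions.
import datetime
from kafka import KafkaConsumer, TopicPartition

TOPIC = "relation_increment"            # hypothetical topic name
midnight = datetime.datetime.combine(datetime.date.today(), datetime.time.min)
START_MS = int(midnight.timestamp() * 1000)
END_MS = START_MS + 7 * 3600 * 1000     # stop at 07:00

def handle(value: bytes) -> None:
    # Hypothetical: accumulate rows and flush batched nGQL INSERTs to Nebula.
    pass

consumer = KafkaConsumer(bootstrap_servers="10.0.XXX.XXX:9092",
                         enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# offsets_for_times returns, per partition, the earliest offset whose
# message timestamp is >= START_MS; seek there before consuming.
for tp, ot in consumer.offsets_for_times({tp: START_MS for tp in partitions}).items():
    if ot is not None:
        consumer.seek(tp, ot.offset)

for msg in consumer:
    if msg.timestamp >= END_MS:
        break
    handle(msg.value)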

Real-time import

The incremental data obtained from an MD5 comparison is imported into Kafka, and we consume the Kafka data in real time to keep the data latency under 1 minute.
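As a rough illustration of the MD5 comparison (the keying scheme, topic, and producer settings are assumptions, not our production code):

# Hash each row, compare against yesterday's digests, and push changed
# or new rows to Kafka. All names here are hypothetical.
import hashlib
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="10.0.XXX.XXX:9092")

def row_md5(row: dict) -> str:
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

def push_changed(today_rows: dict, yesterday_md5: dict) -> None:
    for key, row in today_rows.items():
        digest = row_md5(row)
        if yesterday_md5.get(key) != digest:
            producer.send("relation_increment", json.dumps(row).encode())
    producer.flush()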

In addition, unexpected data problems in the real-time path can go undetected for a long time, so a full import is performed every 30 days with Importer, as described above. A TTL of 35 days is then set on the edges in the space, ensuring that data that is not updated in time is filtered out and reclaimed later.
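In Nebula, TTL is configured with ttl_col and ttl_duration; a sketch, assuming a hypothetical edge type follow with a timestamp property update_time (35 days = 3,024,000 seconds):

-- ttl_col must name an int or timestamp property; expired edges are
-- filtered from queries and reclaimed during compaction.
ALTER EDGE follow ttl_duration = 3024000, ttl_col = "update_time";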

Some notes

The forum post https://discuss.nebula-graph.com.cn/t/topic/361 covers common problems with CSV import; you can refer to it. In addition, here are some suggestions based on our experience:

  1. Regarding concurrency: as the forum post mentions, concurrency can be set to your number of CPU cores, and it determines how many clients connect to the Nebula server at once. In practice, you have to trade off import speed against server pressure. In our tests, when concurrency was too high, disk IO climbed too far and triggered some of the alarms we had set.
  2. Importer cannot resume from a breakpoint; if an error occurs, it has to be handled manually. In practice, we programmatically analyze the Importer logs (see the sketch after this list) and handle issues case by case. If any part of the data hits an unexpected error, we raise an alarm and intervene manually to prevent accidents.
  3. After Hive generates the tables and transfers them to the Nebula server, the time this part of the task takes is closely tied to Hadoop resource availability. Insufficient resources can delay the Hive and CSV table generation while Importer still starts on schedule, so this needs to be anticipated. We compare the end time of the Hive task with the start time of the Importer task to decide whether the Importer run should proceed.
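A minimal sketch of the log check from note 2, matching the Importer log format shown earlier; the alert hook is a hypothetical stand-in for our notification channel:

# Scan an Importer log and alert if any Failed(...) counter is non-zero
# (the pattern also matches the cumulative Read Failed(...) counter).
import re

FAILED = re.compile(r"Failed\((\d+)\)")

def importer_has_failures(log_path: str) -> bool:
    with open(log_path) as f:
        for line in f:
            if any(int(n) > 0 for n in FAILED.findall(line)):
                return True
    return False

def send_alert(msg: str) -> None:
    # Hypothetical hook: wire this to your actual alerting channel.
    print("ALERT:", msg)

if importer_has_failures("/root/rel.log"):
    send_alert("Nebula Importer reported failed rows")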

Want to exchange ideas about graph database technology? To join the Nebula exchange group, please fill in your Nebula card first, and the Nebula assistant will bring you into the group~

