This article was first published on the Nebula Graph Community public account
This practice was driven by business requirements and planned future expansion; Nebula Graph was chosen after a technology selection process. The first step was to verify the batch-import performance of Nebula Graph in a real business scenario. The import is performed as a distributed Spark on Yarn job, with the CSV files stored on HDFS. This article shares my best practices with the Nebula Spark Connector.
1. Concept, applicable scenarios and advantages of Nebula Spark Connector
I won't go into details here. For more information, please refer to: https://docs.nebula-graph.com.cn/nebula-spark-connector/ .
2. Environmental information
- Hardware environment

name | value | recommended |
---|---|---|
local disk (SSD) | 2T | at least 2T |
CPU | 16C * 4 | 128C |
memory | 128GB | 128G |
- Software environment

name | version number |
---|---|
Nebula Graph | 3.0.0 |
Nebula Spark Connector | 3.0.0 |
Hadoop | 2.7.2U17-10 |
Spark | 2.4.5U5 |
- Data volume

name | value |
---|---|
total data size | 200G |
entities (Vertex) | 930 million |
relationships (Edge) | 970 million |
3. Deployment plan
- Deployment method: distributed, 3 nodes
- Refer to the official documentation: https://docs.nebula-graph.com.cn/3.0.1/4.deployment-and-installation/2.compile-and-install-nebula-graph/deploy-nebula-graph-cluster/
There are roughly three steps:
- Download the kernel RPM package and install it;
- Modify configuration files in batches;
- Start the cluster service.
The following operations are performed as root; if you are not using root, prefix the commands with sudo.
Download the Nebula Graph RPM package and install it
Execute the following command:
wget https://oss-cdn.nebula-graph.com.cn/package/3.0.0/nebula-graph-3.0.0.el7.x86_64.rpm
wget https://oss-cdn.nebula-graph.com.cn/package/3.0.0/nebula-graph-3.0.0.el7.x86_64.rpm.sha256sum.txt
rpm -ivh nebula-graph-3.0.0.el7.x86_64.rpm
Note: The default installation path is /usr/local/nebula/; make sure the disk has enough space.
Modify configuration files in batches
# Run on each node in the configuration directory /usr/local/nebula/etc/.
# Node 172.16.10.149:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.10.149?g' *.conf
# Node 172.16.8.15:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.8.15?g' *.conf
# Node 172.16.8.176:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.8.176?g' *.conf
Note: These IP addresses are the intranet addresses used for communication within the cluster.
After startup, register the Storage hosts:
ADD HOSTS 172.x.x.15:9779,172.1x.x.176:9779,172.x.1x.149:9779;
Note: Registering Storage hosts with ADD HOSTS is required in v3.x and later; if you are using v2.x, you can skip this step.
Start the cluster service
/usr/local/nebula/scripts/nebula.service start all
The command above starts all services. Run the following command to check whether they started successfully:
ps aux|grep nebula
The output should contain the following 3 service processes:
/usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
/usr/local/nebula/bin/nebula-graphd --flagfile /usr/local/nebula/etc/nebula-graphd.conf
/usr/local/nebula/bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged.conf
Note: If fewer than 3 processes are shown, run /usr/local/nebula/scripts/nebula.service start all again; if that still does not help, restart the services as shown in the sketch below.
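A minimal sketch of checking and recovering the services, assuming the stock nebula.service management script shipped with the RPM:

```bash
# Check the status of all Nebula services
/usr/local/nebula/scripts/nebula.service status all

# If some processes are missing, a full restart usually brings them back
/usr/local/nebula/scripts/nebula.service restart all
```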
Visualization service
I chose Nebula Graph Studio. Visit http://n01v:7001 to open Studio (note: this address is in my own network environment and is not accessible to readers).
- Host: 10.x.x.1 (any node):9669
- Username / Password: root / nebula
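If you prefer a command-line client over Studio, the connection looks roughly like this; a sketch assuming the nebula-console binary has been downloaded into the current directory:

```bash
# Connect to the Graph service on any node with the default account
./nebula-console -addr 10.x.x.1 -port 9669 -u root -p nebula
```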
Here you can read the common nGQL commands in the official documentation: https://docs.nebula-graph.com.cn/3.0.1/2.quick-start/4.nebula-graph-crud
Get started with Nebula Graph
Register the Nebula cluster:
ADD HOSTS 172.x.x.121:9779, 172.16.11.218:9779,172.16.12.12:9779;
List all nodes with SHOW HOSTS; or SHOW HOSTS META; and check that the STATUS column is ONLINE.
Create a space, which is equivalent to a database in a traditional database system:
CREATE SPACE mylove (partition_num = 15, replica_factor = 3, vid_type = FIXED_STRING(256)); // The recommended partition_num is about 5 times the number of nodes; replica_factor should be an odd number, usually 3; if the VID type is a string, make the length just large enough, otherwise it takes up too much disk space.
Create a Tag, which is equivalent to the entity Vertex:
CREATE TAG entity (name string NULL, version string NULL);
Create Edge, which is equivalent to the relation Edge:
CREATE EDGE relation (name string NULL);
When querying, be sure to add a LIMIT, otherwise the query can easily overwhelm the database:
match (v) return v limit 100;
4. (The focus of this article) Use the Spark Connector to read CSV files and write them into Nebula Graph
Here are 2 references:
- Official NebulaSparkWriterExample (Scala, JSON format): https://github.com/vesoft-inc/nebula-spark-utils/blob/master/example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkWriterExample.scala
- A community-contributed NebulaSparkWriterExample (Java, JSON format): https://www.jianshu.com/p/930e0343a28c
Attached is the sample code for NebulaSparkWriterExample:
import com.facebook.thrift.protocol.TCompactProtocol
// Brings in the implicit that adds the .nebula(...) method to DataFrameWriter
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{
  NebulaConnectionConfig,
  WriteMode,
  WriteNebulaEdgeConfig,
  WriteNebulaVertexConfig
}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object NebulaSparkWriter {
  private val LOG = LoggerFactory.getLogger(this.getClass)
  var ip = ""

  def main(args: Array[String]): Unit = {
    val part = args(0)
    ip = args(1)

    val sparkConf = new SparkConf
    sparkConf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array[Class[_]](classOf[TCompactProtocol]))
    val spark = SparkSession
      .builder()
      .master("local")
      .config(sparkConf)
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    if ("1".equalsIgnoreCase(part)) writeVertex(spark)
    if ("2".equalsIgnoreCase(part)) writeEdge(spark)

    spark.close()
  }

  def getNebulaConnectionConfig(): NebulaConnectionConfig = {
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress(ip + ":9559")
        .withGraphAddress(ip + ":9669")
        .withTimeout(Integer.MAX_VALUE)
        .withConenctionRetry(5)
        .build()
    config
  }

  def writeVertex(spark: SparkSession): Unit = {
    LOG.info("start to write nebula vertices: 1 entity")
    val df = spark.read
      .option("sep", "\t")
      .csv("/home/2022/project/origin_file/csv/tag/entity/")
      .toDF("id", "name", "version")

    val config = getNebulaConnectionConfig()
    val nebulaWriteVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
      .builder()
      .withSpace("mywtt")
      .withTag("entity")
      .withVidField("id")
      .withVidAsProp(false)
      .withUser("root")
      .withPasswd("nebula")
      .withBatch(1800)
      .build()
    df.coalesce(1400).write.nebula(config, nebulaWriteVertexConfig).writeVertices()
  }

  def writeEdge(spark: SparkSession): Unit = {
    LOG.info("start to write nebula edges: 2 entityRel")
    val df = spark.read
      .option("sep", "\t")
      .csv("/home/2022/project/origin_file/csv/out/rel/relation/")
      .toDF("src", "dst", "name")

    val config = getNebulaConnectionConfig()
    val nebulaWriteEdgeConfig: WriteNebulaEdgeConfig = WriteNebulaEdgeConfig
      .builder()
      .withSpace("mywtt")
      .withEdge("relation")
      .withSrcIdField("src")
      .withDstIdField("dst")
      .withSrcAsProperty(false)
      .withDstAsProperty(false)
      .withUser("root")
      .withPasswd("nebula")
      .withBatch(1800)
      .build()
    df.coalesce(1400).write.nebula(config, nebulaWriteEdgeConfig).writeEdges()
  }
}
Key points of the NebulaSparkWriterExample code
Notes on the main configuration items (a write-mode sketch follows this list):
- spark.sparkContext.setLogLevel("WARN"): set the log level to WARN so that INFO output does not drown out useful messages;
- withTimeout(Integer.MAX_VALUE): make the connection timeout as large as possible; the default is 1 minute, and the Spark task fails once the number of timeouts exceeds the number of retries;
- option("sep", "\t"): specify the delimiter of the CSV file, otherwise the whole line is parsed as a single column;
- toDF("src", "dst", "name"): give the dataset a schema, i.e. convert the Dataset<Row> to a DataFrame, otherwise the VidField cannot be specified;
- withVidField("id"): this function only accepts a column name, so the schema must be defined;
- withVidAsProp(false): the ID is already used as the VID, so there is no need to store it again as a property and waste disk space;
- withSrcIdField("src"): set the IdField of the source vertex;
- withDstIdField("dst"): set the IdField of the destination vertex;
- withSrcAsProperty(false): save space;
- withDstAsProperty(false): save space;
- withBatch(1000): batch size; with WriteMode.UPDATE the default limit is <= 512, with WriteMode.INSERT it can be set larger (1500, local SSD);
- coalesce(1500): adjust according to the number of concurrent tasks; if a single partition holds too much data, the executor can easily run out of memory (OOM).
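To complement the withBatch note above, here is a minimal sketch of the same vertex config in UPDATE mode with a smaller batch; it assumes the connector's withWriteMode builder option (WriteMode is already imported in the example above) and is an illustration rather than part of the original job:

```scala
// Sketch: UPDATE mode keeps the batch at or below 512,
// in contrast to the INSERT-mode config above that uses withBatch(1800).
val updateVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
  .builder()
  .withSpace("mywtt")
  .withTag("entity")
  .withVidField("id")
  .withVidAsProp(false)
  .withWriteMode(WriteMode.UPDATE) // assumption: update existing vertices instead of inserting
  .withBatch(512)
  .withUser("root")
  .withPasswd("nebula")
  .build()
```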
5. Submit the task to the Spark cluster
nohup spark-submit --master yarn --deploy-mode client --class com.xxx.nebula.connector.NebulaSparkWriter --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=10g --conf spark.blacklist.enabled=false --conf spark.default.parallelism=1000 --driver-memory 10G --executor-memory 12G --executor-cores 4 --num-executors 180 ./example-3.0-SNAPSHOT.jar > run-csv-nebula.log 2>&1 &
Auxiliary monitoring with the iotop command
Total DISK READ : 26.61 K/s | Total DISK WRITE : 383.77 M/s
Actual DISK READ: 26.61 K/s | Actual DISK WRITE: 431.75 M/s
Auxiliary monitoring with the top command
top - 16:03:01 up 8 days, 28 min, 1 user, load average: 6.16, 6.53, 4.58
Tasks: 205 total, 1 running, 204 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.3 us, 14.2 sy, 0.0 ni, 56.0 id, 0.6 wa, 0.0 hi, 0.4 si, 0.5 st
KiB Mem : 13186284+total, 1135004 free, 31321240 used, 99406592 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 99641296 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27979 root 20 0 39.071g 0.026t 9936 S 564.6 20.8 83:22.03 nebula-storaged
27920 root 20 0 2187476 804036 7672 S 128.2 0.6 17:13.75 nebula-graphd
27875 root 20 0 6484644 1.990g 8588 S 58.5 1.6 14:14.22 nebula-metad
Other resource monitoring
Service optimization
nebula-storaged.conf configuration optimization
Here are the nebula-storaged.conf configuration items I modified:
# Default number of bytes reserved for one batch operation
--rocksdb_batch_size=4096
# Default block cache size used by BlockBasedTable, in MB.
# The server has 128 GB of memory; this is usually set to about one third of it.
--rocksdb_block_cache=44024
############## rocksdb Options ##############
--rocksdb_disable_wal=true
# rocksdb DBOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}
# rocksdb ColumnFamilyOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_block_based_table_options={"block_size":"8192"}
# Maximum number of handlers per request
--max_handlers_per_req=10
# Heartbeat interval between cluster nodes, in seconds
--heartbeat_interval_secs=10
--raft_rpc_timeout_ms=5000
--raft_heartbeat_interval_secs=10
--wal_ttl=14400
# Maximum batch size
--max_batch_size=1800
# Reduces memory usage
--enable_partitioned_index_filter=true
# Filters data at the bottom storage layer; in production this avoids the trouble of queries hitting super nodes
--max_edge_returned_per_vertex=10000
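For these changes to take effect, the Storage service has to be restarted; a minimal sketch using the same management script as in the deployment section:

```bash
# Restart only the Storage service after editing nebula-storaged.conf
/usr/local/nebula/scripts/nebula.service restart storaged
```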
Linux system optimization
ulimit -c unlimited                              # no limit on core dump size
ulimit -n 130000                                 # raise the maximum number of open file descriptors
sysctl -w net.ipv4.tcp_slow_start_after_idle=0   # keep the TCP congestion window after idle
sysctl -w net.core.somaxconn=2048                # larger queue of pending connections
sysctl -w net.ipv4.tcp_max_syn_backlog=2048      # larger SYN backlog
sysctl -w net.core.netdev_max_backlog=3000       # larger backlog of incoming packets
sysctl -w kernel.core_uses_pid=1                 # append the PID to core dump file names
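Note that sysctl -w only changes the running kernel. A minimal sketch of persisting the settings across reboots, assuming they are managed in /etc/sysctl.conf:

```bash
# Append the parameters to /etc/sysctl.conf and reload
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_slow_start_after_idle=0
net.core.somaxconn=2048
net.ipv4.tcp_max_syn_backlog=2048
net.core.netdev_max_backlog=3000
kernel.core_uses_pid=1
EOF
sysctl -p
```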
6. Verify the import results
SUBMIT JOB STATS;
SHOW JOB ${ID}
SHOW STATS;
- The entity (vertex) insertion rate is about 27,837 records/s (applies only to this import).
- The relationship (edge) insertion rate is about 26,276 records/s (applies only to this import).
- With a better server configuration the performance would be higher; bandwidth, whether the import crosses data centers, disk I/O, and even network fluctuations also affect performance.
[root@node02 nebula]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 2.2G 48G 5% /
/dev/sdb1 2.0T 283G 1.6T 16% /usr/local/nebula
tmpfs 13G 0 13G 0% /run/user/62056
7. Performance test
Query a vertex by property:
MATCH (v:entity) WHERE v.entity.name == 'Lifespan' RETURN v;
Execution time: 0.002558 (s)
One-hop query:
MATCH (v1:entity)-[e:propertiesRel]->(v2:attribute) WHERE id(v1) == '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' RETURN v2 limit 100;
Execution time: 0.003571 (s)
Two-hop query:
MATCH p=(v1:entity)-[e:propertiesRel*1..2]->(v2) WHERE id(v1) == '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' RETURN p;
Execution time: 0.005143 (s)
Fetch all property values of an edge:
FETCH PROP ON propertiesRel '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' -> '0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256' YIELD properties(edge);
Execution time: 0.001304 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1]->(v2) return p;
Execution time: 0.02986 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*2]->(v2) return p;
Execution time: 0.07937 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*3]->(v2) return p;
Execution time: 0.269 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*4]->(v2) return p;
Execution time: 3.524859 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..2]->(v2) return p;
Execution time: 0.072367 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..3]->(v2) return p;
Execution time: 0.279011 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..4]->(v2) return p;
Execution time: 3.728018 (s)
Query the shortest path (bidirectional) from vertex A_vid to vertex B_vid, carrying the properties of vertices and edges:
FIND SHORTEST PATH WITH PROP FROM "70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a" TO "0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256" OVER * BIDIRECT YIELD path AS p;
Execution time: 0.003096 (s)
Find all paths from A_vid to B_vid with a filter on an edge property:
FIND ALL PATH FROM "70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a" TO "0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256" OVER * WHERE propertiesRel.name is not EMPTY or propertiesRel.name >=0 YIELD path AS p;
Execution time: 0.003656 (s)
8. Problems encountered:
1. Guava dependency version conflict
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
After investigation, one of our dependencies was found to use guava 22.0, while the Spark cluster ships with guava 14.0, and the two conflict. For tasks running on the Spark cluster, Spark's own guava jar is loaded with a higher priority than the one bundled with the job.
The dependency calls a method that was introduced in guava 22.0 and does not exist in 14.0. Since the third-party code cannot be modified, there are two options:
- Upgrade the guava package on the Spark cluster, which is risky and may cause unknown problems.
- Relocate (rename) our own guava package with a Maven plugin.
The second approach is used here: the maven-shade-plugin (link: https://maven.apache.org/plugins/maven-shade-plugin/ ) relocates the package to resolve the conflict.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>my_guava.common</shadedPattern>
          </relocation>
        </relocations>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/maven/**</exclude>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
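To confirm the relocation worked, the shaded jar can be inspected; a quick check, assuming the jar name used in the spark-submit command above:

```bash
# After shading, the guava classes should appear under my_guava/ instead of com/google/common/
jar tf example-3.0-SNAPSHOT.jar | grep my_guava | head
```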
2. Spark blacklist mechanism problem
Blacklisting behavior is configured via the spark.blacklist.* options. spark.blacklist.enabled defaults to false; when it is set to true, Spark no longer schedules tasks onto executors that have been blacklisted for too many task failures. The blacklist algorithm can be further tuned with the other spark.blacklist configuration options. In the spark-submit command above, spark.blacklist.enabled=false is passed to disable this mechanism for the import job (see the sketch below).
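Equivalently, the flag can be set in code; a minimal sketch against the SparkConf used in NebulaSparkWriter above, shown only as an illustration:

```scala
// Disable the blacklist mechanism so that transient task failures do not
// exclude executors for the remainder of the bulk import
// (same effect as --conf spark.blacklist.enabled=false on spark-submit).
sparkConf.set("spark.blacklist.enabled", "false")
```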
Exchange feedback
Welcome to discuss with the author on the forum: https://discuss.nebula-graph.com.cn