This article walks through writing SST files with Nebula Exchange in a minimal setup (a single machine running containerized Spark, Hadoop, and Nebula Graph). It applies to Nebula-Exchange v2.5 and above.
Original links:
- International: https://siwei.io/nebula-exchange-sst-2.x/
- China mainland: https://cn.siwei.io/nebula-exchange-sst-2.x/
What is Nebula Exchange?
As I introduced in Nebula Data Import Options, Nebula Exchange is an open-source Spark application from the Nebula Graph community, built specifically for importing data into the Nebula Graph database in batch or streaming mode.
Nebula Exchange supports a wide variety of data sources (Apache Parquet, ORC, JSON, CSV, HBase, Hive, MaxCompute, Neo4j, MySQL, ClickHouse, Kafka, Pulsar, and more are being added).
As shown in the figure above, inside Exchange different Readers read from different data sources, the data is processed by the Processor, and then written (sinked) to Nebula Graph through a Writer. Besides the normal ServerBaseWriter write path, Exchange can bypass the write process entirely and use Spark's compute power to generate the underlying RocksDB SST files in parallel, achieving very high-performance data import. This SST-file import scenario is what this article focuses on.
For more information see: Nebula Graph Handbook: What is Nebula Exchange
The Nebula Graph official blog also has more hands-on articles about Nebula Exchange.
Step overview
- Prepare the experimental environment
- Configure Exchange
- Generate the SST files
- Write the SST files into Nebula Graph
Experimental environment preparation
To exercise Nebula Exchange's SST feature with a minimal footprint, we need to:
- Build a Nebula Graph cluster and create the schema the data will be imported into. We deploy it with Docker Compose via Nebula-Up for a quick setup, and slightly modify its network so that the containerized Exchange job can reach it.
- Build a containerized Spark runtime environment
- Build a containerized HDFS
1. Build a Nebula Graph cluster
With the help of Nebula-Up, we can deploy a Nebula Graph cluster on a Linux machine with one command:
curl -fsSL nebula-up.siwei.io/install.sh | bash
After the deployment succeeds, we need to make two changes to the environment:
- Keep only one metaD service
- Switch to an external Docker network
For details of the modifications, refer to Appendix I.
Apply the docker-compose modifications:
cd ~/.nebula-up/nebula-docker-compose
vim docker-compose.yaml # see Appendix I
docker network create nebula-net # the external network needs to be created first
docker-compose up -d --remove-orphans
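Before moving on, it is worth confirming that all the services came up healthy. A minimal check from the same directory (a sketch):
cd ~/.nebula-up/nebula-docker-compose
docker-compose ps
# every service should eventually show a State of "Up (healthy)"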
After that, let's create the graph space for the test and define its schema. For this we can use nebula-console; Nebula-Up ships with a containerized nebula-console as well.
- Enter the container where Nebula-Console is located
~/.nebula-up/console.sh
/ #
- Initiate a connection to the graph database from inside the console container, where 192.168.x.y is the address of the first NIC of my Linux VM; replace it with yours
/ # nebula-console -addr 192.168.x.y -port 9669 -user root -p password
[INFO] connection pool is initialized successfully
Welcome to Nebula Graph!
- Create the graph space (named sst) and the schema
create space sst(partition_num=5,replica_factor=1,vid_type=fixed_string(32));
:sleep 20
use sst
create tag player(name string, age int);
Sample output
(root@nebula) [(none)]> create space sst(partition_num=5,replica_factor=1,vid_type=fixed_string(32));
Execution succeeded (time spent 1468/1918 us)
(root@nebula) [(none)]> :sleep 20
(root@nebula) [(none)]> use sst
Execution succeeded (time spent 1253/1566 us)
Wed, 18 Aug 2021 08:18:13 UTC
(root@nebula) [sst]> create tag player(name string, age int);
Execution succeeded (time spent 1312/1735 us)
Wed, 18 Aug 2021 08:18:23 UTC
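Optionally, the schema can be double-checked non-interactively. The sketch below assumes nebula-console's -e flag for one-off statements; adjust the address as before:
# inside the console container (~/.nebula-up/console.sh)
nebula-console -addr 192.168.x.y -port 9669 -user root -p password -e 'USE sst; SHOW TAGS;'
# should list the player tag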
2. Build a containerized Spark environment
Thanks to the work done by big-data-europe, the process is very easy.
It is worth noting that:
- The current Nebula Exchange has specific Spark version requirements; as of August 2021 I am using spark-2.4.5-hadoop2.7.
- For convenience, I run Spark on the same machine as Nebula Graph and attach it to the same Docker network
docker run --name spark-master --network nebula-net \
-h spark-master -e ENABLE_INIT_DAEMON=false -d \
bde2020/spark-master:2.4.5-hadoop2.7
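A quick sanity check that the master container is running and Spark is usable (a sketch; /spark is the install path used later by spark-submit):
docker exec -it spark-master /spark/bin/spark-submit --version
# should report Spark version 2.4.5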
Then, we can enter the environment:
docker exec -it spark-master bash
Once inside the Spark container, you can install maven like this:
export MAVEN_VERSION=3.5.4
export MAVEN_HOME=/usr/lib/mvn
export PATH=$MAVEN_HOME/bin:$PATH
wget http://archive.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz && \
tar -zxvf apache-maven-$MAVEN_VERSION-bin.tar.gz && \
rm apache-maven-$MAVEN_VERSION-bin.tar.gz && \
mv apache-maven-$MAVEN_VERSION /usr/lib/mvn
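You can optionally confirm that Maven works in the same shell where MAVEN_HOME and PATH were exported:
mvn --version
# should print Apache Maven 3.5.4 and the JDK in use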
You can also download the nebula-exchange jar package inside the container like this:
cd ~
wget https://repo1.maven.org/maven2/com/vesoft/nebula-exchange/2.1.0/nebula-exchange-2.1.0.jar
3. Build a containerized HDFS
Also thanks to the work of big-data-europe, this is very simple, but we need a small modification to its docker-compose.yml so that it uses the nebula-net Docker network created earlier.
For details of the modifications, refer to Appendix II.
git clone https://github.com/big-data-europe/docker-hadoop.git
cd docker-hadoop
vim docker-compose.yml
docker-compose up -d
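Once the containers are up, a quick way to confirm HDFS is healthy is to ask the namenode for a cluster report (a sketch; the container is named namenode by this docker-compose.yml):
docker exec namenode hdfs dfsadmin -report | head -n 20
# "Live datanodes (1)" indicates the datanode has registered with the namenode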
Configure Exchange
The main information in this configuration is the Nebula Graph cluster itself, the Space Name the data will be written to, the data-source settings (here we use csv as an example), and finally the output (sink), which is set to sst:
- Nebula Graph
  - GraphD address
  - MetaD address
  - credentials
  - Space Name
- Data source
  - source: csv
  - path
  - fields
  - etc.
- sink: sst
The detailed configuration is given in Appendix III.
Note that the address of metaD can be obtained as follows; 0.0.0.0:49377->9559 indicates that 49377 is the externally mapped port of metaD's 9559.
$ docker ps | grep meta
887740c15750 vesoft/nebula-metad:v2.0.0 "./bin/nebula-metad …" 6 hours ago Up 6 hours (healthy) 9560/tcp, 0.0.0.0:49377->9559/tcp, :::49377->9559/tcp, 0.0.0.0:49376->19559/tcp, :::49376->19559/tcp, 0.0.0.0:49375->19560/tcp, :::49375->19560/tcp nebula-docker-compose_metad0_1
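The same mapped port can also be read with docker port, which is less error-prone than scanning the docker ps output (a sketch; the container name follows the compose project naming above):
docker port nebula-docker-compose_metad0_1 9559
# 0.0.0.0:49377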
Generate SST file
1. Prepare source files and configuration files
docker cp exchange-sst.conf spark-master:/root/
docker cp player.csv spark-master:/root/
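A quick check that both files landed where spark-submit will look for them:
docker exec spark-master ls -l /root/
# should list exchange-sst.conf, player.csv and the nebula-exchange jar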
An example of player.csv:
1100,Tim Duncan,42
1101,Tony Parker,36
1102,LaMarcus Aldridge,33
1103,Rudy Gay,32
1104,Marco Belinelli,32
1105,Danny Green,31
1106,Kyle Anderson,25
1107,Aron Baynes,32
1108,Boris Diaw,36
1109,Tiago Splitter,34
1110,Cory Joseph,27
1111,David West,38
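If you do not yet have player.csv on the host, a minimal way to create it with the sample rows above, before running the docker cp commands, is:
cat > player.csv << 'EOF'
1100,Tim Duncan,42
1101,Tony Parker,36
1102,LaMarcus Aldridge,33
1103,Rudy Gay,32
1104,Marco Belinelli,32
1105,Danny Green,31
1106,Kyle Anderson,25
1107,Aron Baynes,32
1108,Boris Diaw,36
1109,Tiago Splitter,34
1110,Cory Joseph,27
1111,David West,38
EOF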
2. Execute the exchange program
Enter the spark-master container and submit the exchange application.
docker exec -it spark-master bash
cd /root/
/spark/bin/spark-submit --master local \
--class com.vesoft.nebula.exchange.Exchange nebula-exchange-2.1.0.jar \
-c exchange-sst.conf
Check the execution result. The spark-submit output:
21/08/17 03:37:43 INFO TaskSetManager: Finished task 31.0 in stage 2.0 (TID 33) in 1093 ms on localhost (executor driver) (32/32)
21/08/17 03:37:43 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
21/08/17 03:37:43 INFO DAGScheduler: ResultStage 2 (foreachPartition at VerticesProcessor.scala:179) finished in 22.336 s
21/08/17 03:37:43 INFO DAGScheduler: Job 1 finished: foreachPartition at VerticesProcessor.scala:179, took 22.500639 s
21/08/17 03:37:43 INFO Exchange$: SST-Import: failure.player: 0
21/08/17 03:37:43 WARN Exchange$: Edge is not defined
21/08/17 03:37:43 INFO SparkUI: Stopped Spark web UI at http://spark-master:4040
21/08/17 03:37:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
Verify the generated SST files on HDFS:
docker exec -it namenode /bin/bash
root@2db58903fb53:/# hdfs dfs -ls /sst
Found 10 items
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/1
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/10
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/2
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/3
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/4
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/5
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/6
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/7
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/8
drwxr-xr-x - root supergroup 0 2021-08-17 03:37 /sst/9
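From the same namenode shell you can also look at the sizes and the contents of one partition directory (the files will be tiny with this sample data):
hdfs dfs -du -h /sst
hdfs dfs -ls /sst/1
# each numbered directory corresponds to one partition of the graph space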
Write SST to Nebula Graph
The operations here follow the documentation: SST import. In short, two steps are performed from the console:
- Download
- Ingest
Among them, Download triggers the Nebula Graph server to invoke an HDFS client, fetch the SST files from HDFS, and place them in a local path that storageD can access; this requires the HDFS dependencies to be deployed on the server side. Since this is a minimal practice, I skip that and perform the Download step manually instead.
1. Manual download
Here we need to know the download path on the Nebula Graph server, which is /data/storage/nebula/<space_id>/download/. The Space ID has to be looked up manually; in this example, our Space Name is sst and the Space ID is 49:
(root@nebula) [sst]> DESC space sst
+----+-------+------------------+----------------+---------+------------+--------------------+-------------+-----------+
| ID | Name | Partition Number | Replica Factor | Charset | Collate | Vid Type | Atomic Edge | Group |
+----+-------+------------------+----------------+---------+------------+--------------------+-------------+-----------+
| 49 | "sst" | 10 | 1 | "utf8" | "utf8_bin" | "FIXED_STRING(32)" | "false" | "default" |
+----+-------+------------------+----------------+---------+------------+--------------------+-------------+-----------+
Therefore, the following operations manually fetch (hdfs dfs -get) the SST files from HDFS and then copy them into each storageD container.
docker exec -it namenode /bin/bash
$ hdfs dfs -get /sst /sst
exit
docker cp namenode:/sst .
docker exec -it nebula-docker-compose_storaged0_1 mkdir -p /data/storage/nebula/49/download/
docker exec -it nebula-docker-compose_storaged1_1 mkdir -p /data/storage/nebula/49/download/
docker exec -it nebula-docker-compose_storaged2_1 mkdir -p /data/storage/nebula/49/download/
docker cp sst nebula-docker-compose_storaged0_1:/data/storage/nebula/49/download/
docker cp sst nebula-docker-compose_storaged1_1:/data/storage/nebula/49/download/
docker cp sst nebula-docker-compose_storaged2_1:/data/storage/nebula/49/download/
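The same mkdir and copy steps can also be written as a small loop over the three storageD containers (a sketch, assuming the same container names and Space ID 49 as above):
for i in 0 1 2; do
  docker exec nebula-docker-compose_storaged${i}_1 mkdir -p /data/storage/nebula/49/download/
  docker cp sst nebula-docker-compose_storaged${i}_1:/data/storage/nebula/49/download/
done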
2. SST file import
- Enter the container where Nebula-Console is located
~/.nebula-up/console.sh
/ #
- Initiate a connection to the graph database from inside the console container, where 192.168.x.y is the address of the first NIC of my Linux VM; replace it with yours
/ # nebula-console -addr 192.168.x.y -port 9669 -user root -p password
[INFO] connection pool is initialized successfully
Welcome to Nebula Graph!
- Execute INGEST to make storageD ingest the SST files
(root@nebula) [(none)]> use sst
(root@nebula) [sst]> INGEST;
We can watch the Nebula Graph server logs in real time as follows:
tail -f ~/.nebula-up/nebula-docker-compose/logs/*/*
Successful INGEST log:
I0817 08:03:28.611877 169 EventListner.h:96] Ingest external SST file: column family default, the external file path /data/storage/nebula/49/download/8/8-6.sst, the internal file path /data/storage/nebula/49/data/000023.sst, the properties of the table: # data blocks=1; # entries=1; # deletions=0; # merge operands=0; # range deletions=0; raw key size=48; raw average key size=48.000000; raw value size=40; raw average value size=40.000000; data block size=75; index block size (user-key? 0, delta-value? 0)=66; filter block size=0; (estimated) table size=141; filter policy name=N/A; prefix extractor name=nullptr; column family ID=N/A; column family name=N/A; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=0; time stamp of earliest key=0; file creation time=0;
E0817 08:03:28.611912 169 StorageHttpIngestHandler.cpp:63] SSTFile ingest successfully
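Beyond the logs, one way to confirm that the vertices are actually queryable is to run a stats job from the console (a sketch; it assumes nebula-console's -e flag, and SHOW STATS needs the job submitted by SUBMIT JOB STATS to finish first):
# inside the console container (~/.nebula-up/console.sh)
nebula-console -addr 192.168.x.y -port 9669 -user root -p password -e 'USE sst; SUBMIT JOB STATS;'
# wait a few seconds for the job to finish, then:
nebula-console -addr 192.168.x.y -port 9669 -user root -p password -e 'USE sst; SHOW STATS;'
# the player vertex count should match the number of rows in player.csv (12)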
Appendix
Appendix I
docker-compose.yaml
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 48854de..cfeaedb 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -6,11 +6,13 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --local_ip=metad0
- --ws_ip=metad0
- --port=9559
- --ws_http_port=19559
+ - --ws_storage_http_port=19779
- --data_path=/data/meta
- --log_dir=/logs
- --v=0
@@ -34,81 +36,14 @@ services:
cap_add:
- SYS_PTRACE
- metad1:
- image: vesoft/nebula-metad:v2.0.0
- environment:
- USER: root
- TZ: "${TZ}"
- command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
- - --local_ip=metad1
- - --ws_ip=metad1
- - --port=9559
- - --ws_http_port=19559
- - --data_path=/data/meta
- - --log_dir=/logs
- - --v=0
- - --minloglevel=0
- healthcheck:
- test: ["CMD", "curl", "-sf", "http://metad1:19559/status"]
- interval: 30s
- timeout: 10s
- retries: 3
- start_period: 20s
- ports:
- - 9559
- - 19559
- - 19560
- volumes:
- - ./data/meta1:/data/meta
- - ./logs/meta1:/logs
- networks:
- - nebula-net
- restart: on-failure
- cap_add:
- - SYS_PTRACE
-
- metad2:
- image: vesoft/nebula-metad:v2.0.0
- environment:
- USER: root
- TZ: "${TZ}"
- command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
- - --local_ip=metad2
- - --ws_ip=metad2
- - --port=9559
- - --ws_http_port=19559
- - --data_path=/data/meta
- - --log_dir=/logs
- - --v=0
- - --minloglevel=0
- healthcheck:
- test: ["CMD", "curl", "-sf", "http://metad2:19559/status"]
- interval: 30s
- timeout: 10s
- retries: 3
- start_period: 20s
- ports:
- - 9559
- - 19559
- - 19560
- volumes:
- - ./data/meta2:/data/meta
- - ./logs/meta2:/logs
- networks:
- - nebula-net
- restart: on-failure
- cap_add:
- - SYS_PTRACE
-
storaged0:
image: vesoft/nebula-storaged:v2.0.0
environment:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --local_ip=storaged0
- --ws_ip=storaged0
- --port=9779
@@ -119,8 +54,8 @@ services:
- --minloglevel=0
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://storaged0:19779/status"]
interval: 30s
@@ -146,7 +81,7 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --local_ip=storaged1
- --ws_ip=storaged1
- --port=9779
@@ -157,8 +92,8 @@ services:
- --minloglevel=0
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://storaged1:19779/status"]
interval: 30s
@@ -184,7 +119,7 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --local_ip=storaged2
- --ws_ip=storaged2
- --port=9779
@@ -195,8 +130,8 @@ services:
- --minloglevel=0
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://storaged2:19779/status"]
interval: 30s
@@ -222,17 +157,19 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --port=9669
- --ws_ip=graphd
- --ws_http_port=19669
+ - --ws_meta_http_port=19559
- --log_dir=/logs
- --v=0
- --minloglevel=0
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://graphd:19669/status"]
interval: 30s
@@ -257,17 +194,19 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --port=9669
- --ws_ip=graphd1
- --ws_http_port=19669
+ - --ws_meta_http_port=19559
- --log_dir=/logs
- --v=0
- --minloglevel=0
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://graphd1:19669/status"]
interval: 30s
@@ -292,17 +231,21 @@ services:
USER: root
TZ: "${TZ}"
command:
- - --meta_server_addrs=metad0:9559,metad1:9559,metad2:9559
+ - --meta_server_addrs=metad0:9559
- --port=9669
- --ws_ip=graphd2
- --ws_http_port=19669
+ - --ws_meta_http_port=19559
- --log_dir=/logs
- --v=0
- --minloglevel=0
+ - --storage_client_timeout_ms=60000
+ - --local_config=true
depends_on:
- metad0
- - metad1
- - metad2
healthcheck:
test: ["CMD", "curl", "-sf", "http://graphd2:19669/status"]
interval: 30s
@@ -323,3 +266,4 @@ services:
networks:
nebula-net:
+ external: true
Appendix II
https://github.com/big-data-europe/docker-hadoop docker-compose.yml
diff --git a/docker-compose.yml b/docker-compose.yml
index ed40dc6..66ff1f4 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -14,6 +14,8 @@ services:
- CLUSTER_NAME=test
env_file:
- ./hadoop.env
+ networks:
+ - nebula-net
datanode:
image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
@@ -25,6 +27,8 @@ services:
SERVICE_PRECONDITION: "namenode:9870"
env_file:
- ./hadoop.env
+ networks:
+ - nebula-net
resourcemanager:
image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8
@@ -34,6 +38,8 @@ services:
SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864"
env_file:
- ./hadoop.env
+ networks:
+ - nebula-net
nodemanager1:
image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
@@ -43,6 +49,8 @@ services:
SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
env_file:
- ./hadoop.env
+ networks:
+ - nebula-net
historyserver:
image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8
@@ -54,8 +62,14 @@ services:
- hadoop_historyserver:/hadoop/yarn/timeline
env_file:
- ./hadoop.env
+ networks:
+ - nebula-net
volumes:
hadoop_namenode:
hadoop_datanode:
hadoop_historyserver:
+
+networks:
+ nebula-net:
+ external: true
Appendix III
nebula-exchange-sst.conf
{
# Spark relation config
spark: {
app: {
name: Nebula Exchange 2.1
}
master:local
driver: {
cores: 1
maxResultSize: 1G
}
executor: {
memory:1G
}
cores:{
max: 16
}
}
# Nebula Graph relation config
nebula: {
address:{
graph:["192.168.8.128:9669"]
meta:["192.168.8.128:49377"]
}
user: root
pswd: nebula
space: sst
# parameters for SST import, not required
path:{
local:"/tmp"
remote:"/sst"
hdfs.namenode: "hdfs://192.168.8.128:9000"
}
# nebula client connection parameters
connection {
# socket connect & execute timeout, unit: millisecond
timeout: 30000
}
error: {
# max number of failures, if the number of failures is bigger than max, then exit the application.
max: 32
# failed import job will be recorded in output path
output: /tmp/errors
}
# use google's RateLimiter to limit the requests send to NebulaGraph
rate: {
# the stable throughput of RateLimiter
limit: 1024
# Acquires a permit from RateLimiter, unit: MILLISECONDS
# if it can't be obtained within the specified timeout, then give up the request.
timeout: 1000
}
}
# Processing tags
# There are tag config examples for different dataSources.
tags: [
# HDFS csv
# Import mode is sst, just change type.sink to client if you want to use client import mode.
{
name: player
type: {
source: csv
sink: sst
}
path: "file:///root/player.csv"
# if your csv file has no header, then use _c0,_c1,_c2,.. to indicate fields
fields: [_c1, _c2]
nebula.fields: [name, age]
vertex: {
field:_c0
}
separator: ","
header: false
batch: 256
partition: 32
}
]
}
If you find any errors or omissions in this article, please raise an issue on GitHub: https://github.com/vesoft-inc/nebula, or leave feedback and suggestions on the official forum 👏. Want to exchange ideas on graph database technology? Fill in your Nebula business card to join the Nebula community group, and the Nebula assistant will add you in.