This article was first published on the Nebula Graph Community public account
This practice was driven by business requirements and planned future expansion; Nebula Graph was chosen after a technology selection process. The first step was to verify the batch-import performance of Nebula Graph in a real business scenario. The import is performed as a distributed Spark on Yarn job, with the CSV files stored on HDFS. This article shares my best practices with the Nebula Spark Connector.
1. Concept, applicable scenarios and advantages of Nebula Spark Connector
I won't go into details here. For more information, please refer to: https://docs.nebula-graph.com.cn/nebula-spark-connector/ .
2. Environmental information
- Hardware environment

name | value | recommended |
---|---|---|
local disk (SSD) | 2T | at least 2T |
CPU | 16C * 4 | 128C |
memory | 128GB | 128G |
- Software environment

name | version number |
---|---|
Nebula Graph | 3.0.0 |
Nebula Spark Connector | 3.0.0 |
Hadoop | 2.7.2U17-10 |
Spark | 2.4.5U5 |
- Data volume

name | value |
---|---|
total data size | 200G |
entities (Vertex) | 930 million |
relationships (Edge) | 970 million |
3. Deployment plan
- Deployment method: distributed, 3 nodes
- Refer to the official documentation: https://docs.nebula-graph.com.cn/3.0.1/4.deployment-and-installation/2.compile-and-install-nebula-graph/deploy-nebula-graph-cluster/
There are roughly three steps:
- Download the kernel RPM package and install it;
- Modify configuration files in batches;
- Start the cluster service.
The following operations are performed as root; if you are not using root, prefix the commands with sudo.
Download the Nebula Graph RPM package and install it
Execute the following command:
wget https://oss-cdn.nebula-graph.com.cn/package/3.0.0/nebula-graph-3.0.0.el7.x86_64.rpm
wget https://oss-cdn.nebula-graph.com.cn/package/3.0.0/nebula-graph-3.0.0.el7.x86_64.rpm.sha256sum.txt
rpm -ivh nebula-graph-3.0.0.el7.x86_64.rpm
Note: The default installation path is /usr/local/nebula/; make sure the disk has enough space.
Modify configuration files in batches
# Run on each node in the configuration directory /usr/local/nebula/etc/.
# Node 172.16.10.149:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.10.149?g' *.conf
# Node 172.16.8.15:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.8.15?g' *.conf
# Node 172.16.8.176:
sed -i 's?--meta_server_addrs=127.0.0.1:9559?--meta_server_addrs=172.16.8.15:9559,172.16.8.176:9559,172.16.10.149:9559?g' *.conf
sed -i 's?--local_ip=127.0.0.1?--local_ip=172.16.8.176?g' *.conf
Note: These IP addresses are the intranet addresses used for communication within the cluster.
After startup, register the Storage hosts:
ADD HOSTS 172.x.x.15:9779,172.1x.x.176:9779,172.x.1x.149:9779;
Note: Registering Storage hosts with ADD HOSTS is required in v3.x and later; if you are using v2.x, you can skip this step.
Start the cluster service
/usr/local/nebula/scripts/nebula.service start all
The command above starts all services. Run the following command to check whether they started successfully:
ps aux|grep nebula
The output should contain the following 3 service processes:
/usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
/usr/local/nebula/bin/nebula-graphd --flagfile /usr/local/nebula/etc/nebula-graphd.conf
/usr/local/nebula/bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged.conf
Note: If fewer than 3 processes are shown, run /usr/local/nebula/scripts/nebula.service start all again; if that still does not help, restart the services as shown in the sketch below.
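A minimal sketch of checking and recovering the services, assuming the stock nebula.service management script shipped with the RPM:

```bash
# Check the status of all Nebula services
/usr/local/nebula/scripts/nebula.service status all

# If some processes are missing, a full restart usually brings them back
/usr/local/nebula/scripts/nebula.service restart all
```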
Visualization service
I chose Nebula Graph Studio. Visit http://n01v:7001 to open Studio (note: this address is in my own network environment and is not accessible to readers).
- Host: 10.x.x.1 (any node):9669
- Username / Password: root / nebula
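If you prefer a command-line client over Studio, the connection looks roughly like this; a sketch assuming the nebula-console binary has been downloaded into the current directory:

```bash
# Connect to the Graph service on any node with the default account
./nebula-console -addr 10.x.x.1 -port 9669 -u root -p nebula
```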
Here you can read the common nGQL commands in the official documentation: https://docs.nebula-graph.com.cn/3.0.1/2.quick-start/4.nebula-graph-crud
Get started with Nebula Graph
Register the Nebula cluster:
ADD HOSTS 172.x.x.121:9779, 172.16.11.218:9779,172.16.12.12:9779;
List all nodes with SHOW HOSTS; or SHOW HOSTS META; and check that the STATUS column is ONLINE.
Create a space, which is equivalent to a database in a traditional database system:
CREATE SPACE mylove (partition_num = 15, replica_factor = 3, vid_type = FIXED_STRING(256)); // The recommended partition_num is about 5 times the number of nodes; replica_factor should be an odd number, usually 3; if the VID type is a string, make the length just large enough, otherwise it takes up too much disk space.
Create a Tag, which is equivalent to the entity Vertex:
CREATE TAG entity (name string NULL, version string NULL);
Create Edge, which is equivalent to the relation Edge:
CREATE EDGE relation (name string NULL);
When querying, be sure to add a LIMIT, otherwise the query can easily overwhelm the database:
match (v) return v limit 100;
4. (The focus of this article) Use the Spark Connector to read CSV files and write them into Nebula Graph
Here are 2 references:
- Official NebulaSparkWriterExample (Scala, JSON format): https://github.com/vesoft-inc/nebula-spark-utils/blob/master/example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkWriterExample.scala
- A community-contributed NebulaSparkWriterExample (Java, JSON format): https://www.jianshu.com/p/930e0343a28c
Attached is the sample code for NebulaSparkWriterExample:
import com.facebook.thrift.protocol.TCompactProtocol
// Brings in the implicit that adds the .nebula(...) method to DataFrameWriter
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{
  NebulaConnectionConfig,
  WriteMode,
  WriteNebulaEdgeConfig,
  WriteNebulaVertexConfig
}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object NebulaSparkWriter {
  private val LOG = LoggerFactory.getLogger(this.getClass)
  var ip = ""

  def main(args: Array[String]): Unit = {
    val part = args(0)
    ip = args(1)

    val sparkConf = new SparkConf
    sparkConf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array[Class[_]](classOf[TCompactProtocol]))
    val spark = SparkSession
      .builder()
      .master("local")
      .config(sparkConf)
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    if ("1".equalsIgnoreCase(part)) writeVertex(spark)
    if ("2".equalsIgnoreCase(part)) writeEdge(spark)

    spark.close()
  }

  def getNebulaConnectionConfig(): NebulaConnectionConfig = {
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress(ip + ":9559")
        .withGraphAddress(ip + ":9669")
        .withTimeout(Integer.MAX_VALUE)
        .withConenctionRetry(5)
        .build()
    config
  }

  def writeVertex(spark: SparkSession): Unit = {
    LOG.info("start to write nebula vertices: 1 entity")
    val df = spark.read
      .option("sep", "\t")
      .csv("/home/2022/project/origin_file/csv/tag/entity/")
      .toDF("id", "name", "version")

    val config = getNebulaConnectionConfig()
    val nebulaWriteVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
      .builder()
      .withSpace("mywtt")
      .withTag("entity")
      .withVidField("id")
      .withVidAsProp(false)
      .withUser("root")
      .withPasswd("nebula")
      .withBatch(1800)
      .build()
    df.coalesce(1400).write.nebula(config, nebulaWriteVertexConfig).writeVertices()
  }

  def writeEdge(spark: SparkSession): Unit = {
    LOG.info("start to write nebula edges: 2 entityRel")
    val df = spark.read
      .option("sep", "\t")
      .csv("/home/2022/project/origin_file/csv/out/rel/relation/")
      .toDF("src", "dst", "name")

    val config = getNebulaConnectionConfig()
    val nebulaWriteEdgeConfig: WriteNebulaEdgeConfig = WriteNebulaEdgeConfig
      .builder()
      .withSpace("mywtt")
      .withEdge("relation")
      .withSrcIdField("src")
      .withDstIdField("dst")
      .withSrcAsProperty(false)
      .withDstAsProperty(false)
      .withUser("root")
      .withPasswd("nebula")
      .withBatch(1800)
      .build()
    df.coalesce(1400).write.nebula(config, nebulaWriteEdgeConfig).writeEdges()
  }
}
Key points of the NebulaSparkWriterExample code
Notes on the main configuration items (a write-mode sketch follows this list):
- spark.sparkContext.setLogLevel("WARN"): set the log level to WARN so that INFO output does not drown out useful messages;
- withTimeout(Integer.MAX_VALUE): make the connection timeout as large as possible; the default is 1 minute, and the Spark task fails once the number of timeouts exceeds the number of retries;
- option("sep", "\t"): specify the delimiter of the CSV file, otherwise the whole line is parsed as a single column;
- toDF("src", "dst", "name"): give the dataset a schema, i.e. convert the Dataset<Row> to a DataFrame, otherwise the VidField cannot be specified;
- withVidField("id"): this function only accepts a column name, so the schema must be defined;
- withVidAsProp(false): the ID is already used as the VID, so there is no need to store it again as a property and waste disk space;
- withSrcIdField("src"): set the IdField of the source vertex;
- withDstIdField("dst"): set the IdField of the destination vertex;
- withSrcAsProperty(false): save space;
- withDstAsProperty(false): save space;
- withBatch(1000): batch size; with WriteMode.UPDATE the default limit is <= 512, with WriteMode.INSERT it can be set larger (1500, local SSD);
- coalesce(1500): adjust according to the number of concurrent tasks; if a single partition holds too much data, the executor can easily run out of memory (OOM).
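To complement the withBatch note above, here is a minimal sketch of the same vertex config in UPDATE mode with a smaller batch; it assumes the connector's withWriteMode builder option (WriteMode is already imported in the example above) and is an illustration rather than part of the original job:

```scala
// Sketch: UPDATE mode keeps the batch at or below 512,
// in contrast to the INSERT-mode config above that uses withBatch(1800).
val updateVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
  .builder()
  .withSpace("mywtt")
  .withTag("entity")
  .withVidField("id")
  .withVidAsProp(false)
  .withWriteMode(WriteMode.UPDATE) // assumption: update existing vertices instead of inserting
  .withBatch(512)
  .withUser("root")
  .withPasswd("nebula")
  .build()
```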
5. Submit the task to the Spark cluster
nohup spark-submit --master yarn --deploy-mode client --class com.xxx.nebula.connector.NebulaSparkWriter --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=10g --conf spark.blacklist.enabled=false --conf spark.default.parallelism=1000 --driver-memory 10G --executor-memory 12G --executor-cores 4 --num-executors 180 ./example-3.0-SNAPSHOT.jar > run-csv-nebula.log 2>&1 &
Auxiliary monitoring with the iotop command
Total DISK READ : 26.61 K/s | Total DISK WRITE : 383.77 M/s
Actual DISK READ: 26.61 K/s | Actual DISK WRITE: 431.75 M/s
Auxiliary monitoring with the top command
top - 16:03:01 up 8 days, 28 min, 1 user, load average: 6.16, 6.53, 4.58
Tasks: 205 total, 1 running, 204 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.3 us, 14.2 sy, 0.0 ni, 56.0 id, 0.6 wa, 0.0 hi, 0.4 si, 0.5 st
KiB Mem : 13186284+total, 1135004 free, 31321240 used, 99406592 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 99641296 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27979 root 20 0 39.071g 0.026t 9936 S 564.6 20.8 83:22.03 nebula-storaged
27920 root 20 0 2187476 804036 7672 S 128.2 0.6 17:13.75 nebula-graphd
27875 root 20 0 6484644 1.990g 8588 S 58.5 1.6 14:14.22 nebula-metad
Other resource monitoring
Service optimization
nebula-storaged.conf configuration optimization
Here are the nebula-storaged.conf configuration items I modified:
# Default number of bytes reserved for one batch operation
--rocksdb_batch_size=4096
# Default block cache size used by BlockBasedTable, in MB.
# The server has 128 GB of memory; this is usually set to about one third of it.
--rocksdb_block_cache=44024
############## rocksdb Options ##############
--rocksdb_disable_wal=true
# rocksdb DBOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}
# rocksdb ColumnFamilyOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in JSON; each option name and value is a string, e.g. "option_name":"option_value", comma separated
--rocksdb_block_based_table_options={"block_size":"8192"}
# Maximum number of handlers per request
--max_handlers_per_req=10
# Heartbeat interval between cluster nodes, in seconds
--heartbeat_interval_secs=10
--raft_rpc_timeout_ms=5000
--raft_heartbeat_interval_secs=10
--wal_ttl=14400
# Maximum batch size
--max_batch_size=1800
# Reduces memory usage
--enable_partitioned_index_filter=true
# Filters data at the bottom storage layer; in production this avoids the trouble of queries hitting super nodes
--max_edge_returned_per_vertex=10000
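For these changes to take effect, the Storage service has to be restarted; a minimal sketch using the same management script as in the deployment section:

```bash
# Restart only the Storage service after editing nebula-storaged.conf
/usr/local/nebula/scripts/nebula.service restart storaged
```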
Linux system optimization
ulimit -c unlimited                              # no limit on core dump size
ulimit -n 130000                                 # raise the maximum number of open file descriptors
sysctl -w net.ipv4.tcp_slow_start_after_idle=0   # keep the TCP congestion window after idle
sysctl -w net.core.somaxconn=2048                # larger queue of pending connections
sysctl -w net.ipv4.tcp_max_syn_backlog=2048      # larger SYN backlog
sysctl -w net.core.netdev_max_backlog=3000       # larger backlog of incoming packets
sysctl -w kernel.core_uses_pid=1                 # append the PID to core dump file names
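Note that sysctl -w only changes the running kernel. A minimal sketch of persisting the settings across reboots, assuming they are managed in /etc/sysctl.conf:

```bash
# Append the parameters to /etc/sysctl.conf and reload
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_slow_start_after_idle=0
net.core.somaxconn=2048
net.ipv4.tcp_max_syn_backlog=2048
net.core.netdev_max_backlog=3000
kernel.core_uses_pid=1
EOF
sysctl -p
```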
6. Verify the import results
SUBMIT JOB STATS;
SHOW JOB ${ID}
SHOW STATS;
- The entity (vertex) insertion rate is about 27,837 records/s (applies only to this import).
- The relationship (edge) insertion rate is about 26,276 records/s (applies only to this import).
- With a better server configuration the performance would be higher; bandwidth, whether the import crosses data centers, disk I/O, and even network fluctuations also affect performance.
[root@node02 nebula]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 2.2G 48G 5% /
/dev/sdb1 2.0T 283G 1.6T 16% /usr/local/nebula
tmpfs 13G 0 13G 0% /run/user/62056
7. Performance test
Query a vertex by property:
MATCH (v:entity) WHERE v.entity.name == 'Lifespan' RETURN v;
Execution time: 0.002558 (s)
One-hop query:
MATCH (v1:entity)-[e:propertiesRel]->(v2:attribute) WHERE id(v1) == '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' RETURN v2 limit 100;
Execution time: 0.003571 (s)
Two-hop query:
MATCH p=(v1:entity)-[e:propertiesRel*1..2]->(v2) WHERE id(v1) == '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' RETURN p;
Execution time: 0.005143 (s)
Fetch all property values of an edge:
FETCH PROP ON propertiesRel '70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a' -> '0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256' YIELD properties(edge);
Execution time: 0.001304 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1]->(v2) return p;
Execution time: 0.02986 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*2]->(v2) return p;
Execution time: 0.07937 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*3]->(v2) return p;
Execution time: 0.269 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*4]->(v2) return p;
Execution time: 3.524859 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..2]->(v2) return p;
Execution time: 0.072367 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..3]->(v2) return p;
Execution time: 0.279011 (s)
match p=(v:entity{name:"张三"})-[e:entityRel|propertiesRel*1..4]->(v2) return p;
Execution time: 3.728018 (s)
Query the shortest path (bidirectional) from vertex A_vid to vertex B_vid, carrying the properties of vertices and edges:
FIND SHORTEST PATH WITH PROP FROM "70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a" TO "0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256" OVER * BIDIRECT YIELD path AS p;
Execution time: 0.003096 (s)
Find all paths from A_vid to B_vid with a filter on an edge property:
FIND ALL PATH FROM "70da43c5e46f56c634547c7aded3639aa8a1565975303218e2a92af677a7ee3a" TO "0000002d2e88d7ba6659db83893dedf3b8678f3f80de4ffe3f8683694b63a256" OVER * WHERE propertiesRel.name is not EMPTY or propertiesRel.name >=0 YIELD path AS p;
Execution time: 0.003656 (s)
8. Problems encountered:
1. Guava dependency version conflict
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
After investigation, one of our dependencies was found to use guava 22.0, while the Spark cluster ships with guava 14.0, and the two conflict. For tasks running on the Spark cluster, Spark's own guava jar is loaded with a higher priority than the one bundled with the job.
The dependency calls a method that was introduced in guava 22.0 and does not exist in 14.0. Since the third-party code cannot be modified, there are two options:
- Upgrade the guava package on the Spark cluster, which is risky and may cause unknown problems.
- Relocate (rename) our own guava package with a Maven plugin.
The second approach is used here: the maven-shade-plugin (link: https://maven.apache.org/plugins/maven-shade-plugin/ ) relocates the package to resolve the conflict.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>my_guava.common</shadedPattern>
          </relocation>
        </relocations>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/maven/**</exclude>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
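To confirm the relocation worked, the shaded jar can be inspected; a quick check, assuming the jar name used in the spark-submit command above:

```bash
# After shading, the guava classes should appear under my_guava/ instead of com/google/common/
jar tf example-3.0-SNAPSHOT.jar | grep my_guava | head
```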
2. Spark blacklist mechanism problem
Blacklisting behavior is configured via the spark.blacklist.* options. spark.blacklist.enabled defaults to false; when it is set to true, Spark no longer schedules tasks onto executors that have been blacklisted for too many task failures. The blacklist algorithm can be further tuned with the other spark.blacklist configuration options. In the spark-submit command above, spark.blacklist.enabled=false is passed to disable this mechanism for the import job (see the sketch below).
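Equivalently, the flag can be set in code; a minimal sketch against the SparkConf used in NebulaSparkWriter above, shown only as an illustration:

```scala
// Disable the blacklist mechanism so that transient task failures do not
// exclude executors for the remainder of the bulk import
// (same effect as --conf spark.blacklist.enabled=false on spark-submit).
sparkConf.set("spark.blacklist.enabled", "false")
```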
Exchange feedback
Welcome to discuss with the author on the forum: https://discuss.nebula-graph.com.cn