This article was first published on the Nebula Graph Community public account
Recently, I built a few Spark-related projects around NebulaGraph that you can spin up and play with in one click. Today I am organizing them into a document to share with you. I also figured out how to use Nebula Spark Connector from PySpark, and I will contribute that to the documentation later.
Three Spark sub-projects of NebulaGraph
I have drawn a sketch covering all the data import methods of NebulaGraph, which already included Nebula Spark Connector and a brief introduction to Nebula Exchange. In this article I explore these two a little further, along with a third project, Nebula Algorithm.
Note: this document also clearly lays out how to choose between the different import tools.
TL;DR
- Nebula Spark Connector is a Spark library that enables Spark applications to read graph data from, and write graph data to, NebulaGraph in the form of DataFrames.
- Nebula Exchange is built on top of Nebula Spark Connector. It is a Spark application that can be submitted directly as a JAR package with spark-submit, and its design goal is to exchange data between different data sources and NebulaGraph (for the open source version this is one-way: write; the enterprise version is bidirectional). The many data sources supported by Nebula Exchange include MySQL, Neo4j, PostgreSQL, ClickHouse, Hive, etc. Besides writing directly into NebulaGraph, it can optionally generate SST files and ingest them into NebulaGraph, using computing power outside of the NebulaGraph cluster to do the underlying sorting.
- Nebula Algorithm, built on top of Nebula Spark Connector and GraphX, is both a Spark library and a Spark application that runs common graph algorithms (PageRank, LPA, etc.) on graphs in NebulaGraph.
Nebula Spark Connector
- Code: https://github.com/vesoft-inc/nebula-spark-connector
- Documentation: https://docs.nebula-graph.io/3.1.0/nebula-spark-connector/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/
- Code example: example
NebulaGraph Spark Reader
To read data from NebulaGraph, for example to read vertices, Nebula Spark Connector scans all Nebula StorageD instances for a given tag: withLabel("player") means to scan the tag player. We can also specify which vertex properties to return: withReturnCols(List("name", "age")).
After specifying all the read-related configurations for the tag, calling spark.read.nebula.loadVerticesToDF returns the graph data scanned from NebulaGraph, converted into a DataFrame, like this:
  def readVertex(spark: SparkSession): Unit = {
    LOG.info("start to read nebula vertices")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withConenctionRetry(2)
        .build()
    val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("player")
      .withNoColumn(false)
      .withReturnCols(List("name", "age"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
    vertex.printSchema()
    vertex.show(20)
    println("vertex count: " + vertex.count())
  }
I will not list write examples here; there are more detailed ones in the code example link given above. It is worth mentioning that, in order to serve graph analysis and graph computing scenarios involving large amounts of data, Spark Connector reads data very differently from most other clients: it bypasses GraphD entirely and obtains data by scanning MetaD and StorageD directly, whereas writes go through GraphD by issuing nGQL DML statements.
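For reference, here is a minimal sketch of what a write could look like with the connector's writer API, adapted from the examples in the repository linked above. The df DataFrame and its "id" column are assumptions for illustration; the space and tag reuse the basketballplayer/player names from this article:

```scala
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}

// Writes go through GraphD, so a graph address is needed in addition to the meta address.
val writeConfig = NebulaConnectionConfig
  .builder()
  .withMetaAddress("metad0:9559")
  .withGraphAddress("graphd:9669")
  .withConenctionRetry(2)
  .build()

// Map the DataFrame onto a tag: which column holds the vertex ID, and how many rows per batch.
val writeVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
  .builder()
  .withSpace("basketballplayer")
  .withTag("player")
  .withVidField("id") // hypothetical column name holding the vertex ID
  .withBatch(512)
  .build()

// df is a DataFrame prepared elsewhere whose columns match the tag's properties.
df.write.nebula(writeConfig, writeVertexConfig).writeVertices()
```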
Now let's do a hands-on exercise.
Getting started with Nebula Spark Connector
Prerequisites: Assuming the following program is running on a Linux machine with an internet connection, preferably with Docker and Docker-Compose pre-installed.
Pull up the environment
First, let's deploy the container-based NebulaGraph Core v3, Nebula Studio, Nebula Console and Spark, Hadoop environments with Nebula-Up , which will also try to install Docker and Docker-Compose for us if not already installed.
# Install Core with Spark Connector, Nebula Algorithm, Nebula Exchange
curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash -s -- v3 spark
Did you know? Nebula-UP can install even more with one click. If your environment is a bit beefier (e.g. 8 GB of RAM), curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash installs more components, but please note that Nebula-UP is not intended for production environments.
After the above script has finished, let's connect to the cluster with Nebula-Console (the command-line client for NebulaGraph).
# Connect to nebula with console
~/.nebula-up/console.sh
# Execute any queries, like
~/.nebula-up/console.sh -e "SHOW HOSTS"
Load a piece of data into it and execute a graph query:
# Load the sample dataset
~/.nebula-up/load-basketballplayer-dataset.sh
# Wait about a minute
# Make a Graph Query the sample dataset
~/.nebula-up/console.sh -e 'USE basketballplayer; FIND ALL PATH FROM "player100" TO "team204" OVER * WHERE follow.degree is EMPTY or follow.degree >=0 YIELD path AS p;'
Enter the Spark environment
Execute the following line, we can enter the Spark environment:
docker exec -it spark_master_1 bash
If we want to compile inside the container, we can install mvn there:
docker exec -it spark_master_1 bash
# in the container shell
export MAVEN_VERSION=3.5.4
export MAVEN_HOME=/usr/lib/mvn
export PATH=$MAVEN_HOME/bin:$PATH
wget http://archive.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz && \
tar -zxvf apache-maven-$MAVEN_VERSION-bin.tar.gz && \
rm apache-maven-$MAVEN_VERSION-bin.tar.gz && \
mv apache-maven-$MAVEN_VERSION /usr/lib/mvn
Example of running Spark Connector
Option 1 (recommended): Via PySpark
- Enter PySpark Shell
~/.nebula-up/nebula-pyspark.sh
- Invoke Nebula Spark Reader
# call Nebula Spark Connector Reader
df = spark.read.format(
"com.vesoft.nebula.connector.NebulaDataSource").option(
"type", "vertex").option(
"spaceName", "basketballplayer").option(
"label", "player").option(
"returnCols", "name,age").option(
"metaAddress", "metad0:9559").option(
"partitionNumber", 1).load()
# show the dataframe with limit of 2
df.show(n=2)
- Example of the returned result
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 2.7.16 (default, Jan 14 2020 07:22:06)
SparkSession available as 'spark'.
>>> df = spark.read.format(
... "com.vesoft.nebula.connector.NebulaDataSource").option(
... "type", "vertex").option(
... "spaceName", "basketballplayer").option(
... "label", "player").option(
... "returnCols", "name,age").option(
... "metaAddress", "metad0:9559").option(
... "partitionNumber", 1).load()
>>> df.show(n=2)
+---------+--------------+---+
|_vertexId| name|age|
+---------+--------------+---+
|player105| Danny Green| 31|
|player109|Tiago Splitter| 34|
+---------+--------------+---+
only showing top 2 rows
Option 2: Compile, submit the sample JAR package
- First clone the Spark Connector and its sample code repository, then compile:
Note that we use the master branch here because the current master branch is compatible with NebulaGraph 3.x. Make sure the Spark Connector version matches the database kernel version; the version mapping is listed in the repository's README.md.
cd ~/.nebula-up/nebula-up/spark
git clone https://github.com/vesoft-inc/nebula-spark-connector.git
docker exec -it spark_master_1 bash
cd /root/nebula-spark-connector
- Replace the code of the example project
echo > example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
vi example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
- Paste the following code into it. Here we read the vertices and edges of the basketballplayer graph loaded earlier, by calling readVertex and readEdges respectively:
package com.vesoft.nebula.examples.connector

import com.facebook.thrift.protocol.TCompactProtocol
import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object NebulaSparkReaderExample {

  private val LOG = LoggerFactory.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    sparkConf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array[Class[_]](classOf[TCompactProtocol]))
    val spark = SparkSession
      .builder()
      .master("local")
      .config(sparkConf)
      .getOrCreate()

    readVertex(spark)
    readEdges(spark)

    spark.close()
    sys.exit()
  }

  def readVertex(spark: SparkSession): Unit = {
    LOG.info("start to read nebula vertices")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withConenctionRetry(2)
        .build()
    val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("player")
      .withNoColumn(false)
      .withReturnCols(List("name", "age"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
    vertex.printSchema()
    vertex.show(20)
    println("vertex count: " + vertex.count())
  }

  def readEdges(spark: SparkSession): Unit = {
    LOG.info("start to read nebula edges")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withTimeout(6000)
        .withConenctionRetry(2)
        .build()
    val nebulaReadEdgeConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("follow")
      .withNoColumn(false)
      .withReturnCols(List("degree"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val edge = spark.read.nebula(config, nebulaReadEdgeConfig).loadEdgesToDF()
    edge.printSchema()
    edge.show(20)
    println("edge count: " + edge.count())
  }

}
- Then package it into a JAR package
/usr/lib/mvn/bin/mvn install -Dgpg.skip -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
- Finally, submit it to Spark for execution:
cd example
/spark/bin/spark-submit --master "local" \
--class com.vesoft.nebula.examples.connector.NebulaSparkReaderExample \
--driver-memory 4g target/example-3.0-SNAPSHOT.jar
# exit the spark container
exit
- After success, we will get the return result:
22/04/19 07:29:34 INFO DAGScheduler: Job 1 finished: show at NebulaSparkReaderExample.scala:57, took 0.199310 s
+---------+------------------+---+
|_vertexId| name|age|
+---------+------------------+---+
|player105| Danny Green| 31|
|player109| Tiago Splitter| 34|
|player111| David West| 38|
|player118| Russell Westbrook| 30|
|player143|Kristaps Porzingis| 23|
|player114| Tracy McGrady| 39|
|player150| Luka Doncic| 20|
|player103| Rudy Gay| 32|
|player113| Dejounte Murray| 29|
|player121| Chris Paul| 33|
|player128| Carmelo Anthony| 34|
|player130| Joel Embiid| 25|
|player136| Steve Nash| 45|
|player108| Boris Diaw| 36|
|player122| DeAndre Jordan| 30|
|player123| Ricky Rubio| 28|
|player139| Marc Gasol| 34|
|player142| Klay Thompson| 29|
|player145| JaVale McGee| 31|
|player102| LaMarcus Aldridge| 33|
+---------+------------------+---+
only showing top 20 rows
22/04/19 07:29:36 INFO DAGScheduler: Job 4 finished: show at NebulaSparkReaderExample.scala:82, took 0.135543 s
+---------+---------+-----+------+
| _srcId| _dstId|_rank|degree|
+---------+---------+-----+------+
|player105|player100| 0| 70|
|player105|player104| 0| 83|
|player105|player116| 0| 80|
|player109|player100| 0| 80|
|player109|player125| 0| 90|
|player118|player120| 0| 90|
|player118|player131| 0| 90|
|player143|player150| 0| 90|
|player114|player103| 0| 90|
|player114|player115| 0| 90|
|player114|player140| 0| 90|
|player150|player120| 0| 80|
|player150|player137| 0| 90|
|player150|player143| 0| 90|
|player103|player102| 0| 70|
|player113|player100| 0| 99|
|player113|player101| 0| 99|
|player113|player104| 0| 99|
|player113|player105| 0| 99|
|player113|player106| 0| 99|
+---------+---------+-----+------+
only showing top 20 rows
In fact, there are many more examples in this repository, especially for GraphX; feel free to explore that part yourself.
Note that GraphX assumes vertex IDs are numeric, so for string-typed vertex IDs an on-the-fly conversion is required; see the Nebula Algorithm example for how to work around this.
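As a rough illustration of that conversion (and not Nebula Algorithm's own implementation), the sketch below encodes the string endpoints of an edge DataFrame, such as the follow edges read above, into Long IDs and then builds a GraphX graph from them. The column names _srcId/_dstId follow the schema shown earlier; the helper name and everything else are assumptions:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

def buildGraphFromStringIds(edgeDf: DataFrame) = {
  // 1. Build a dictionary that maps every distinct string vertex ID to a Long ID.
  val idMap = edgeDf.select(col("_srcId").as("strId"))
    .union(edgeDf.select(col("_dstId").as("strId")))
    .distinct()
    .withColumn("longId", monotonically_increasing_id())

  // 2. Replace both endpoints of each edge with their Long counterparts.
  val encoded = edgeDf
    .join(idMap.withColumnRenamed("strId", "_srcId").withColumnRenamed("longId", "srcLong"), Seq("_srcId"))
    .join(idMap.withColumnRenamed("strId", "_dstId").withColumnRenamed("longId", "dstLong"), Seq("_dstId"))

  // 3. Feed the numeric edges into GraphX (the degree property could be used as a weight instead of 1L).
  val edges = encoded.rdd.map(row =>
    Edge(row.getAs[Long]("srcLong"), row.getAs[Long]("dstLong"), 1L))

  // Keep idMap around to translate algorithm results back to the original string IDs.
  (Graph.fromEdges(edges, defaultValue = 0L), idMap)
}
```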
Nebula Exchange
- Code: https://github.com/vesoft-inc/nebula-exchange/
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-exchange/about-exchange/ex-ug-what-is-exchange/
- JAR package: https://github.com/vesoft-inc/nebula-exchange/releases
- Configuration example: exchange-common/src/test/resources/application.conf
Nebula Exchange is a Spark library and also a Spark application that can be submitted directly for execution. It reads data from multiple data sources and writes it into NebulaGraph, or outputs NebulaGraph SST files.
Using Nebula Exchange via spark-submit is straightforward:
- First create a configuration file that tells Exchange how to fetch and write the data
- Then invoke the Exchange JAR package with that configuration file
Now, let's do a real test with the same environment created in the previous chapter.
Try Exchange with one click
Let's run it first
Please refer to the previous chapter on pulling up the environment to install the environment with one click.
One-click execution:
~/.nebula-up/nebula-exchange-example.sh
Congratulations, you have successfully executed an Exchange data import task for the first time!
A look at the details
In this example, we are actually using Exchange to read data to the NebulaGraph cluster from a CSV file, one of the supported data sources. The first column in this CSV file is the vertex ID, and the second and third columns are the "name" and "age" attributes:
player800,"Foo Bar",23
player801,"Another Name",21
- We can go into the Spark environment to take a look:
docker exec -it spark_master_1 bash
cd /root
You can see that exchange.conf, the configuration file we specified when submitting the Exchange task, is a file in HOCON format:
- Information about the NebulaGraph cluster is described under .nebula
- Under .tags we describe how fields of the data source (here, the CSV file) map to the required properties, plus other vertex-related settings
{
# Spark relation config
spark: {
app: {
name: Nebula Exchange
}
master:local
driver: {
cores: 1
maxResultSize: 1G
}
executor: {
memory: 1G
}
cores:{
max: 16
}
}
# Nebula Graph relation config
nebula: {
address:{
graph:["graphd:9669"]
meta:["metad0:9559", "metad1:9559", "metad2:9559"]
}
user: root
pswd: nebula
space: basketballplayer
# parameters for SST import, not required
path:{
local:"/tmp"
remote:"/sst"
hdfs.namenode: "hdfs://localhost:9000"
}
# nebula client connection parameters
connection {
# socket connect & execute timeout, unit: millisecond
timeout: 30000
}
error: {
# max number of failures, if the number of failures is bigger than max, then exit the application.
max: 32
# failed import job will be recorded in output path
output: /tmp/errors
}
# use google's RateLimiter to limit the requests send to NebulaGraph
rate: {
# the stable throughput of RateLimiter
limit: 1024
# Acquires a permit from RateLimiter, unit: MILLISECONDS
# if it can't be obtained within the specified timeout, then give up the request.
timeout: 1000
}
}
# Processing tags
# There are tag config examples for different dataSources.
tags: [
# HDFS csv
# Import mode is client; change type.sink to sst if you want to use SST import mode.
{
name: player
type: {
source: csv
sink: client
}
path: "file:///root/player.csv"
# if your csv file has no header, then use _c0,_c1,_c2,.. to indicate fields
fields: [_c1, _c2]
nebula.fields: [name, age]
vertex: {
field:_c0
}
separator: ","
header: false
batch: 256
partition: 32
}
]
}
- We should see that the CSV data source is in the same directory as this configuration file:
bash-5.0# ls -l
total 24
drwxrwxr-x 2 1000 1000 4096 Jun 1 04:26 download
-rw-rw-r-- 1 1000 1000 1908 Jun 1 04:23 exchange.conf
-rw-rw-r-- 1 1000 1000 2593 Jun 1 04:23 hadoop.env
drwxrwxr-x 7 1000 1000 4096 Jun 6 03:27 nebula-spark-connector
-rw-rw-r-- 1 1000 1000 51 Jun 1 04:23 player.csv
- Then, we can manually submit this Exchange task again ourselves:
/spark/bin/spark-submit --master local \
--class com.vesoft.nebula.exchange.Exchange download/nebula-exchange.jar \
-c exchange.conf
- Partial return result
22/06/06 03:56:26 INFO Exchange$: Processing Tag player
22/06/06 03:56:26 INFO Exchange$: field keys: _c1, _c2
22/06/06 03:56:26 INFO Exchange$: nebula keys: name, age
22/06/06 03:56:26 INFO Exchange$: Loading CSV files from file:///root/player.csv
...
22/06/06 03:56:41 INFO Exchange$: import for tag player cost time: 3.35 s
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchSuccess.player: 2
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchFailure.player: 0
...
For more data sources, please refer to the documentation and configuration examples.
For the practice of exporting SST files with Exchange, you can refer to the documentation and my earlier article Nebula Exchange SST 2.x Practice Guide.
Nebula Algorithm
- Code repository: https://github.com/vesoft-inc/nebula-algorithm
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-algorithm/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/
- Example code: example/src/main/scala/com/vesoft/nebula/algorithm
Submit tasks through spark-submit
I gave an example in this code repository, and today we can try it out even more easily with the help of Nebula-UP.
Refer to the previous chapter on pulling up the environment and install everything with one click first. After deploying the required dependencies through Nebula-UP's Spark mode as above:
- Load LiveJournal dataset
~/.nebula-up/load-LiveJournal-dataset.sh
- Execute a PageRank algorithm on the LiveJournal dataset and output the results to a CSV file
~/.nebula-up/nebula-algo-pagerank-example.sh
- Check the output:
docker exec -it spark_master_1 bash
head /output/part*000.csv
_id,pagerank
637100,0.9268620883822242
108150,1.1855749056722755
957460,0.923720299211093
257320,0.9967932799358413
Configuration file interpretation
The full file is here; below we introduce the main fields:
- .data specifies that the source is nebula, meaning the graph data is read from the cluster, and that the output sink is csv, meaning the result is written to a local file.
data: {
# data source. optional of nebula,csv,json
source: nebula
# data sink, means the algorithm result will be write into this sink. optional of nebula,csv,text
sink: csv
# if your algorithm needs weight
hasWeight: false
}
- .nebula.read specifies how to read from the NebulaGraph cluster. Here we read the data of all edges of type follow, i.e. the whole graph.
nebula: {
# algo's data source from Nebula. If data.source is nebula, then this nebula.read config can be valid.
read: {
# Nebula metad server address, multiple addresses are split by English comma
metaAddress: "metad0:9559"
# Nebula space
space: livejournal
# Nebula edge types, multiple labels means that data from multiple edges will union together
labels: ["follow"]
# Nebula edge property name for each edge type, this property will be as weight col for algorithm.
# Make sure the weightCols are corresponding to labels.
weightCols: []
}
- .algorithm configures which algorithm we want to run and its parameters
algorithm: {
executeAlgo: pagerank
# PageRank parameter
pagerank: {
maxIter: 10
resetProb: 0.15 # default 0.15
}
Calling Nebula Algorithm in Spark as a library
On the other hand, note that we can also call Nebula Algorithm as a library, which has these benefits:
- More control/customization of the output format of the algorithm
- Non-numeric vertex IDs can be handled via conversion, see here
I won't give a full end-to-end example here. If you are interested, you can file a request against Nebula-UP, and I will add a corresponding example.
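For orientation only, here is a rough sketch of what calling the library might look like, based on the nebula-algorithm README; the class and parameter names (PageRankAlgo, PRConfig) may differ slightly between versions, and the edge DataFrame is assumed to already have numeric (or encoded) src/dst IDs in its first two columns:

```scala
import com.vesoft.nebula.algorithm.config.PRConfig
import com.vesoft.nebula.algorithm.lib.PageRankAlgo
import org.apache.spark.sql.{DataFrame, SparkSession}

// edgeDf: an edge DataFrame whose first two columns are the (numeric) src and dst IDs,
// obtained e.g. from the Nebula Spark Connector reader shown earlier.
def runPageRank(spark: SparkSession, edgeDf: DataFrame): DataFrame = {
  // maxIter = 10, resetProb = 0.15, matching the configuration file above
  val prConfig = new PRConfig(10, 0.15)
  // last argument: hasWeight = false, i.e. ignore edge properties as weights
  PageRankAlgo.apply(spark, edgeDf, prConfig, false)
}
```

The returned DataFrame holds the PageRank score per vertex, so you keep full control over how to post-process or persist the results, which is the first benefit listed above.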
Want to exchange ideas about graph database technology? To join the NebulaGraph discussion group, please fill in your Nebula card first, and the Nebula assistant will add you to the group~