This article was first published on the Nebula Graph Community public account
Recently, I built a few Spark-related projects around NebulaGraph that you can spin up and play with in one click. Today I am organizing them into a document to share with you. I also figured out how to use Nebula Spark Connector from PySpark, and I will contribute that to the documentation later.
Three Spark sub-projects of NebulaGraph
I have drawn a sketch covering all the data import methods of NebulaGraph, which already included Nebula Spark Connector and a brief introduction to Nebula Exchange. In this article I explore these two a little further, along with a third project, Nebula Algorithm.
Note: this document also clearly lays out how to choose between the different import tools.
TL;DR
- Nebula Spark Connector is a Spark library that enables Spark applications to read graph data from, and write graph data to, NebulaGraph in the form of DataFrames.
- Nebula Exchange is built on top of Nebula Spark Connector. It is a Spark application that can be submitted directly as a JAR package with spark-submit, and its design goal is to exchange data between different data sources and NebulaGraph (for the open source version this is one-way: write; the enterprise version is bidirectional). The many data sources supported by Nebula Exchange include MySQL, Neo4j, PostgreSQL, ClickHouse, Hive, etc. Besides writing directly into NebulaGraph, it can optionally generate SST files and ingest them into NebulaGraph, using computing power outside of the NebulaGraph cluster to do the underlying sorting.
- Nebula Algorithm, built on top of Nebula Spark Connector and GraphX, is both a Spark library and a Spark application that runs common graph algorithms (PageRank, LPA, etc.) on graphs in NebulaGraph.
Nebula Spark Connector
- Code: https://github.com/vesoft-inc/nebula-spark-connector
- Documentation: https://docs.nebula-graph.io/3.1.0/nebula-spark-connector/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/
- Code example: example
NebulaGraph Spark Reader
To read data from NebulaGraph, for example to read vertices, Nebula Spark Connector scans all Nebula StorageD instances for a given tag: withLabel("player") means to scan the tag player. We can also specify which vertex properties to return: withReturnCols(List("name", "age")).
After specifying all the read-related configurations for the tag, calling spark.read.nebula.loadVerticesToDF returns the graph data scanned from NebulaGraph, converted into a DataFrame, like this:
  def readVertex(spark: SparkSession): Unit = {
    LOG.info("start to read nebula vertices")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withConenctionRetry(2)
        .build()
    val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("player")
      .withNoColumn(false)
      .withReturnCols(List("name", "age"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
    vertex.printSchema()
    vertex.show(20)
    println("vertex count: " + vertex.count())
  }
I will not list write examples here; there are more detailed ones in the code example link given above. It is worth mentioning that, in order to serve graph analysis and graph computing scenarios involving large amounts of data, Spark Connector reads data very differently from most other clients: it bypasses GraphD entirely and obtains data by scanning MetaD and StorageD directly, whereas writes go through GraphD by issuing nGQL DML statements.
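For reference, here is a minimal sketch of what a write could look like with the connector's writer API, adapted from the examples in the repository linked above. The df DataFrame and its "id" column are assumptions for illustration; the space and tag reuse the basketballplayer/player names from this article:

```scala
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}

// Writes go through GraphD, so a graph address is needed in addition to the meta address.
val writeConfig = NebulaConnectionConfig
  .builder()
  .withMetaAddress("metad0:9559")
  .withGraphAddress("graphd:9669")
  .withConenctionRetry(2)
  .build()

// Map the DataFrame onto a tag: which column holds the vertex ID, and how many rows per batch.
val writeVertexConfig: WriteNebulaVertexConfig = WriteNebulaVertexConfig
  .builder()
  .withSpace("basketballplayer")
  .withTag("player")
  .withVidField("id") // hypothetical column name holding the vertex ID
  .withBatch(512)
  .build()

// df is a DataFrame prepared elsewhere whose columns match the tag's properties.
df.write.nebula(writeConfig, writeVertexConfig).writeVertices()
```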
Now let's do a hands-on exercise.
Getting started with Nebula Spark Connector
Prerequisites: Assuming the following program is running on a Linux machine with an internet connection, preferably with Docker and Docker-Compose pre-installed.
Pull up the environment
First, let's deploy the container-based NebulaGraph Core v3, Nebula Studio, Nebula Console and Spark, Hadoop environments with Nebula-Up , which will also try to install Docker and Docker-Compose for us if not already installed.
# Install Core with Spark Connector, Nebula Algorithm, Nebula Exchange
curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash -s -- v3 spark
Did you know? Nebula-UP can install even more with one click. If your environment is a bit beefier (e.g. 8 GB of RAM), curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash installs more components, but please note that Nebula-UP is not intended for production environments.
After the above script has finished, let's connect to the cluster with Nebula-Console (the command-line client for NebulaGraph).
# Connect to nebula with console
~/.nebula-up/console.sh
# Execute any queries, like
~/.nebula-up/console.sh -e "SHOW HOSTS"
Load a piece of data into it and execute a graph query:
# Load the sample dataset
~/.nebula-up/load-basketballplayer-dataset.sh
# Wait about a minute
# Make a Graph Query the sample dataset
~/.nebula-up/console.sh -e 'USE basketballplayer; FIND ALL PATH FROM "player100" TO "team204" OVER * WHERE follow.degree is EMPTY or follow.degree >=0 YIELD path AS p;'
Enter the Spark environment
Execute the following line, we can enter the Spark environment:
docker exec -it spark_master_1 bash
If we want to compile inside the container, we can install mvn there:
docker exec -it spark_master_1 bash
# in the container shell
export MAVEN_VERSION=3.5.4
export MAVEN_HOME=/usr/lib/mvn
export PATH=$MAVEN_HOME/bin:$PATH
wget http://archive.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz && \
tar -zxvf apache-maven-$MAVEN_VERSION-bin.tar.gz && \
rm apache-maven-$MAVEN_VERSION-bin.tar.gz && \
mv apache-maven-$MAVEN_VERSION /usr/lib/mvn
Example of running Spark Connector
Option 1 (recommended): Via PySpark
- Enter PySpark Shell
~/.nebula-up/nebula-pyspark.sh
- Invoke Nebula Spark Reader
# call Nebula Spark Connector Reader
df = spark.read.format(
"com.vesoft.nebula.connector.NebulaDataSource").option(
"type", "vertex").option(
"spaceName", "basketballplayer").option(
"label", "player").option(
"returnCols", "name,age").option(
"metaAddress", "metad0:9559").option(
"partitionNumber", 1).load()
# show the dataframe with limit of 2
df.show(n=2)
- Example of the returned result
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 2.7.16 (default, Jan 14 2020 07:22:06)
SparkSession available as 'spark'.
>>> df = spark.read.format(
... "com.vesoft.nebula.connector.NebulaDataSource").option(
... "type", "vertex").option(
... "spaceName", "basketballplayer").option(
... "label", "player").option(
... "returnCols", "name,age").option(
... "metaAddress", "metad0:9559").option(
... "partitionNumber", 1).load()
>>> df.show(n=2)
+---------+--------------+---+
|_vertexId| name|age|
+---------+--------------+---+
|player105| Danny Green| 31|
|player109|Tiago Splitter| 34|
+---------+--------------+---+
only showing top 2 rows
Option 2: Compile, submit the sample JAR package
- First clone the Spark Connector and its sample code repository, then compile:
Note that we use the master branch here because the current master branch is compatible with NebulaGraph 3.x. Make sure the Spark Connector version matches the database kernel version; the version mapping is listed in the repository's README.md.
cd ~/.nebula-up/nebula-up/spark
git clone https://github.com/vesoft-inc/nebula-spark-connector.git
docker exec -it spark_master_1 bash
cd /root/nebula-spark-connector
- Replace the code of the example project
echo > example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
vi example/src/main/scala/com/vesoft/nebula/examples/connector/NebulaSparkReaderExample.scala
- Paste the following code into it. Here we read the vertices and edges of the basketballplayer graph loaded earlier, by calling readVertex and readEdges respectively:
package com.vesoft.nebula.examples.connector

import com.facebook.thrift.protocol.TCompactProtocol
import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object NebulaSparkReaderExample {

  private val LOG = LoggerFactory.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    sparkConf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array[Class[_]](classOf[TCompactProtocol]))
    val spark = SparkSession
      .builder()
      .master("local")
      .config(sparkConf)
      .getOrCreate()

    readVertex(spark)
    readEdges(spark)

    spark.close()
    sys.exit()
  }

  def readVertex(spark: SparkSession): Unit = {
    LOG.info("start to read nebula vertices")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withConenctionRetry(2)
        .build()
    val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("player")
      .withNoColumn(false)
      .withReturnCols(List("name", "age"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()
    vertex.printSchema()
    vertex.show(20)
    println("vertex count: " + vertex.count())
  }

  def readEdges(spark: SparkSession): Unit = {
    LOG.info("start to read nebula edges")
    val config =
      NebulaConnectionConfig
        .builder()
        .withMetaAddress("metad0:9559,metad1:9559,metad2:9559")
        .withTimeout(6000)
        .withConenctionRetry(2)
        .build()
    val nebulaReadEdgeConfig: ReadNebulaConfig = ReadNebulaConfig
      .builder()
      .withSpace("basketballplayer")
      .withLabel("follow")
      .withNoColumn(false)
      .withReturnCols(List("degree"))
      .withLimit(10)
      .withPartitionNum(10)
      .build()
    val edge = spark.read.nebula(config, nebulaReadEdgeConfig).loadEdgesToDF()
    edge.printSchema()
    edge.show(20)
    println("edge count: " + edge.count())
  }

}
- Then package it into a JAR package
/usr/lib/mvn/bin/mvn install -Dgpg.skip -Dmaven.javadoc.skip=true -Dmaven.test.skip=true
- Finally, submit it to Spark for execution:
cd example
/spark/bin/spark-submit --master "local" \
--class com.vesoft.nebula.examples.connector.NebulaSparkReaderExample \
--driver-memory 4g target/example-3.0-SNAPSHOT.jar
# exit the spark container
exit
- After success, we will get the return result:
22/04/19 07:29:34 INFO DAGScheduler: Job 1 finished: show at NebulaSparkReaderExample.scala:57, took 0.199310 s
+---------+------------------+---+
|_vertexId| name|age|
+---------+------------------+---+
|player105| Danny Green| 31|
|player109| Tiago Splitter| 34|
|player111| David West| 38|
|player118| Russell Westbrook| 30|
|player143|Kristaps Porzingis| 23|
|player114| Tracy McGrady| 39|
|player150| Luka Doncic| 20|
|player103| Rudy Gay| 32|
|player113| Dejounte Murray| 29|
|player121| Chris Paul| 33|
|player128| Carmelo Anthony| 34|
|player130| Joel Embiid| 25|
|player136| Steve Nash| 45|
|player108| Boris Diaw| 36|
|player122| DeAndre Jordan| 30|
|player123| Ricky Rubio| 28|
|player139| Marc Gasol| 34|
|player142| Klay Thompson| 29|
|player145| JaVale McGee| 31|
|player102| LaMarcus Aldridge| 33|
+---------+------------------+---+
only showing top 20 rows
22/04/19 07:29:36 INFO DAGScheduler: Job 4 finished: show at NebulaSparkReaderExample.scala:82, took 0.135543 s
+---------+---------+-----+------+
| _srcId| _dstId|_rank|degree|
+---------+---------+-----+------+
|player105|player100| 0| 70|
|player105|player104| 0| 83|
|player105|player116| 0| 80|
|player109|player100| 0| 80|
|player109|player125| 0| 90|
|player118|player120| 0| 90|
|player118|player131| 0| 90|
|player143|player150| 0| 90|
|player114|player103| 0| 90|
|player114|player115| 0| 90|
|player114|player140| 0| 90|
|player150|player120| 0| 80|
|player150|player137| 0| 90|
|player150|player143| 0| 90|
|player103|player102| 0| 70|
|player113|player100| 0| 99|
|player113|player101| 0| 99|
|player113|player104| 0| 99|
|player113|player105| 0| 99|
|player113|player106| 0| 99|
+---------+---------+-----+------+
only showing top 20 rows
In fact, there are many more examples in this repository, especially for GraphX; feel free to explore that part yourself.
Note that GraphX assumes vertex IDs are numeric, so for string-typed vertex IDs an on-the-fly conversion is required; see the Nebula Algorithm example for how to work around this.
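As a rough illustration of that conversion (and not Nebula Algorithm's own implementation), the sketch below encodes the string endpoints of an edge DataFrame, such as the follow edges read above, into Long IDs and then builds a GraphX graph from them. The column names _srcId/_dstId follow the schema shown earlier; the helper name and everything else are assumptions:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

def buildGraphFromStringIds(edgeDf: DataFrame) = {
  // 1. Build a dictionary that maps every distinct string vertex ID to a Long ID.
  val idMap = edgeDf.select(col("_srcId").as("strId"))
    .union(edgeDf.select(col("_dstId").as("strId")))
    .distinct()
    .withColumn("longId", monotonically_increasing_id())

  // 2. Replace both endpoints of each edge with their Long counterparts.
  val encoded = edgeDf
    .join(idMap.withColumnRenamed("strId", "_srcId").withColumnRenamed("longId", "srcLong"), Seq("_srcId"))
    .join(idMap.withColumnRenamed("strId", "_dstId").withColumnRenamed("longId", "dstLong"), Seq("_dstId"))

  // 3. Feed the numeric edges into GraphX (the degree property could be used as a weight instead of 1L).
  val edges = encoded.rdd.map(row =>
    Edge(row.getAs[Long]("srcLong"), row.getAs[Long]("dstLong"), 1L))

  // Keep idMap around to translate algorithm results back to the original string IDs.
  (Graph.fromEdges(edges, defaultValue = 0L), idMap)
}
```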
Nebula Exchange
- Code: https://github.com/vesoft-inc/nebula-exchange/
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-exchange/about-exchange/ex-ug-what-is-exchange/
- JAR package: https://github.com/vesoft-inc/nebula-exchange/releases
- Configuration example: exchange-common/src/test/resources/application.conf
Nebula Exchange is a Spark library and also a Spark application that can be submitted directly for execution. It reads data from multiple data sources and writes it into NebulaGraph, or outputs NebulaGraph SST files.
Using Nebula Exchange via spark-submit is straightforward:
- First create a configuration file that tells Exchange how to fetch and write the data
- Then invoke the Exchange JAR package with that configuration file
Now, let's do a real test with the same environment created in the previous chapter.
Try Exchange with one click
Let's run it first
Please refer to the previous chapter on pulling up the environment to install the environment with one click.
One-click execution:
~/.nebula-up/nebula-exchange-example.sh
Congratulations, you have successfully executed an Exchange data import task for the first time!
A look at the details
In this example, we are actually using Exchange to read data to the NebulaGraph cluster from a CSV file, one of the supported data sources. The first column in this CSV file is the vertex ID, and the second and third columns are the "name" and "age" attributes:
player800,"Foo Bar",23
player801,"Another Name",21
- We can go into the Spark environment to take a look:
docker exec -it spark_master_1 bash
cd /root
You can see that exchange.conf, the configuration file we specified when submitting the Exchange task, is a file in HOCON format:
- Information about the NebulaGraph cluster is described under .nebula
- Under .tags we describe how fields of the data source (here, the CSV file) map to the required properties, plus other vertex-related settings
{
# Spark relation config
spark: {
app: {
name: Nebula Exchange
}
master:local
driver: {
cores: 1
maxResultSize: 1G
}
executor: {
memory: 1G
}
cores:{
max: 16
}
}
# Nebula Graph relation config
nebula: {
address:{
graph:["graphd:9669"]
meta:["metad0:9559", "metad1:9559", "metad2:9559"]
}
user: root
pswd: nebula
space: basketballplayer
# parameters for SST import, not required
path:{
local:"/tmp"
remote:"/sst"
hdfs.namenode: "hdfs://localhost:9000"
}
# nebula client connection parameters
connection {
# socket connect & execute timeout, unit: millisecond
timeout: 30000
}
error: {
# max number of failures, if the number of failures is bigger than max, then exit the application.
max: 32
# failed import job will be recorded in output path
output: /tmp/errors
}
# use google's RateLimiter to limit the requests send to NebulaGraph
rate: {
# the stable throughput of RateLimiter
limit: 1024
# Acquires a permit from RateLimiter, unit: MILLISECONDS
# if it can't be obtained within the specified timeout, then give up the request.
timeout: 1000
}
}
# Processing tags
# There are tag config examples for different dataSources.
tags: [
# HDFS csv
# Import mode is client; change type.sink to sst if you want to use SST import mode.
{
name: player
type: {
source: csv
sink: client
}
path: "file:///root/player.csv"
# if your csv file has no header, then use _c0,_c1,_c2,.. to indicate fields
fields: [_c1, _c2]
nebula.fields: [name, age]
vertex: {
field:_c0
}
separator: ","
header: false
batch: 256
partition: 32
}
]
}
- We should see that the CSV data source is in the same directory as this configuration file:
bash-5.0# ls -l
total 24
drwxrwxr-x 2 1000 1000 4096 Jun 1 04:26 download
-rw-rw-r-- 1 1000 1000 1908 Jun 1 04:23 exchange.conf
-rw-rw-r-- 1 1000 1000 2593 Jun 1 04:23 hadoop.env
drwxrwxr-x 7 1000 1000 4096 Jun 6 03:27 nebula-spark-connector
-rw-rw-r-- 1 1000 1000 51 Jun 1 04:23 player.csv
- Then, we can manually submit this Exchange task again ourselves:
/spark/bin/spark-submit --master local \
--class com.vesoft.nebula.exchange.Exchange download/nebula-exchange.jar \
-c exchange.conf
- Partial return result
22/06/06 03:56:26 INFO Exchange$: Processing Tag player
22/06/06 03:56:26 INFO Exchange$: field keys: _c1, _c2
22/06/06 03:56:26 INFO Exchange$: nebula keys: name, age
22/06/06 03:56:26 INFO Exchange$: Loading CSV files from file:///root/player.csv
...
22/06/06 03:56:41 INFO Exchange$: import for tag player cost time: 3.35 s
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchSuccess.player: 2
22/06/06 03:56:41 INFO Exchange$: Client-Import: batchFailure.player: 0
...
For more data sources, please refer to the documentation and configuration examples.
For the practice of exporting SST files with Exchange, you can refer to the documentation and my earlier article Nebula Exchange SST 2.x Practice Guide.
Nebula Algorithm
- Code repository: https://github.com/vesoft-inc/nebula-algorithm
- Documentation: https://docs.nebula-graph.com.cn/3.1.0/nebula-algorithm/
- JAR package: https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/
- Example code: example/src/main/scala/com/vesoft/nebula/algorithm
Submit tasks through spark-submit
I gave an example in this code repository, and today we can try it out even more easily with the help of Nebula-UP.
Refer to the previous chapter on pulling up the environment and install everything with one click first. After deploying the required dependencies through Nebula-UP's Spark mode as above:
- Load LiveJournal dataset
~/.nebula-up/load-LiveJournal-dataset.sh
- Execute a PageRank algorithm on the LiveJournal dataset and output the results to a CSV file
~/.nebula-up/nebula-algo-pagerank-example.sh
- Check the output:
docker exec -it spark_master_1 bash
head /output/part*000.csv
_id,pagerank
637100,0.9268620883822242
108150,1.1855749056722755
957460,0.923720299211093
257320,0.9967932799358413
Configuration file interpretation
The full file is here; below we introduce the main fields:
- .data specifies that the source is nebula, meaning the graph data is read from the cluster, and that the output sink is csv, meaning the result is written to a local file.
data: {
# data source. optional of nebula,csv,json
source: nebula
# data sink, means the algorithm result will be write into this sink. optional of nebula,csv,text
sink: csv
# if your algorithm needs weight
hasWeight: false
}
- .nebula.read specifies how to read from the NebulaGraph cluster. Here we read the data of all edges of type follow, i.e. the whole graph.
nebula: {
# algo's data source from Nebula. If data.source is nebula, then this nebula.read config can be valid.
read: {
# Nebula metad server address, multiple addresses are split by English comma
metaAddress: "metad0:9559"
# Nebula space
space: livejournal
# Nebula edge types, multiple labels means that data from multiple edges will union together
labels: ["follow"]
# Nebula edge property name for each edge type, this property will be as weight col for algorithm.
# Make sure the weightCols are corresponding to labels.
weightCols: []
}
- .algorithm configures which algorithm we want to run and its parameters
algorithm: {
executeAlgo: pagerank
# PageRank parameter
pagerank: {
maxIter: 10
resetProb: 0.15 # default 0.15
}
Calling Nebula Algorithm in Spark as a library
On the other hand, note that we can also call Nebula Algorithm as a library, which has these benefits:
- More control/customization of the output format of the algorithm
- Non-numeric vertex IDs can be handled via conversion, see here
I won't give a full end-to-end example here. If you are interested, you can file a request against Nebula-UP, and I will add a corresponding example.
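For orientation only, here is a rough sketch of what calling the library might look like, based on the nebula-algorithm README; the class and parameter names (PageRankAlgo, PRConfig) may differ slightly between versions, and the edge DataFrame is assumed to already have numeric (or encoded) src/dst IDs in its first two columns:

```scala
import com.vesoft.nebula.algorithm.config.PRConfig
import com.vesoft.nebula.algorithm.lib.PageRankAlgo
import org.apache.spark.sql.{DataFrame, SparkSession}

// edgeDf: an edge DataFrame whose first two columns are the (numeric) src and dst IDs,
// obtained e.g. from the Nebula Spark Connector reader shown earlier.
def runPageRank(spark: SparkSession, edgeDf: DataFrame): DataFrame = {
  // maxIter = 10, resetProb = 0.15, matching the configuration file above
  val prConfig = new PRConfig(10, 0.15)
  // last argument: hasWeight = false, i.e. ignore edge properties as weights
  PageRankAlgo.apply(spark, edgeDf, prConfig, false)
}
```

The returned DataFrame holds the PageRank score per vertex, so you keep full control over how to post-process or persist the results, which is the first benefit listed above.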
Want to exchange ideas about graph database technology? To join the NebulaGraph discussion group, please fill in your Nebula card first, and the Nebula assistant will add you to the group~