This article was first published on the Nebula Graph Community WeChat official account.

Using nebula-spark-connector and nebula-algorithm in a Nebula K8s Cluster

## Solutions

The most convenient way to solve the problem of not being able to connect to the cluster after deploying Nebula Graph on K8s is to run nebula-algorithm / nebula-spark in the same network namespace as the cluster managed by nebula-operator, and to fill the MetaD address shown by `show hosts meta` (in domain:port form) into the configuration.

Note: version 2.6.2 or later is required here; only from that version on do nebula-spark-connector / nebula-algorithm support MetaD addresses given as domain names.

Here is the actual operation:

  • Get MetaD address
(root@nebula) [(none)]> show hosts meta
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
| Host                                                             | Port | Status   | Role   | Git Info Sha | Version |
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
| "nebula-metad-0.nebula-metad-headless.default.svc.cluster.local" | 9559 | "ONLINE" | "META" | "d113f4a"    | "2.6.2" |
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
Got 1 rows (time spent 1378/2598 us)

Mon, 14 Feb 2022 08:22:33 UTC

Record the Host value here; it will be used in the configuration files below.

  • Fill in the configuration file of nebula-algorithm

Reference: https://github.com/vesoft-inc/nebula-algorithm/blob/master/nebula-algorithm/src/main/resources/application.conf . There are two ways to provide the configuration: edit the application.conf file, or set the addresses in code through nebula-spark-connector.

Method 1: Edit the application.conf file

# ...
  nebula: {
    # algo's data source from Nebula. If data.source is nebula, then this nebula.read config can be valid.
    read: {
        # Fill in the MetaD Host obtained above; separate multiple addresses with commas
        metaAddress: "nebula-metad-0.nebula-metad-headless.default.svc.cluster.local:9559"
#...

Method 2: Set the addresses in nebula-spark-connector code

Ref: https://github.com/vesoft-inc/nebula-spark-connector
  val config = NebulaConnectionConfig
    .builder()
// Fill in the MetaD Host obtained above
    .withMetaAddress("nebula-metad-0.nebula-metad-headless.default.svc.cluster.local:9559")
    .withConenctionRetry(2)
    .build()
  val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
    .builder()
    .withSpace("foo_bar_space")
    .withLabel("person")
    .withNoColumn(false)
    .withReturnCols(List("birthday"))
    .withLimit(10)
    .withPartitionNum(10)
    .build()
  val vertex = spark.read.nebula(config, nebulaReadVertexConfig).loadVerticesToDF()

Ok, so far, the process looks pretty straightforward. So why is such a simple process worth an article?

## Configuration details that are easy to overlook

We have covered the hands-on operation; behind it sits a bit of theory:

a. Implicitly, the StorageD addresses must be reachable from the Spark environment;

b. The StorageD addresses are obtained from MetaD;

c. In Nebula K8s Operator deployments, the StorageD addresses stored in MetaD (via service discovery) come from the StorageD configuration files, and are therefore K8s-internal addresses.

### Background knowledge

a. This one is relatively straightforward and comes down to Nebula's architecture: graph data lives in the Storage Service. Ordinary queries are transparently relayed through the Graph Service, so a connection to GraphD alone is enough. But nebula-spark-connector's scenario for using Nebula Graph is scanning the full graph or a subgraph, and there the separation of compute and storage lets us bypass the query/compute layer and read graph data directly and efficiently.

So the question becomes: why do we need, and only need, the MetaD address?

This, too, comes down to the architecture. The Meta Service holds the metadata of the full graph, including how every shard of the distributed Storage Service is distributed across instances. So on the one hand, only Meta has the full-graph information (it is needed); on the other hand, this information can (only) be obtained from Meta. This also contains the answer to b.

Let's look at the logic behind c.:

c. In Nebula K8s Operator deployments, the StorageD addresses stored in MetaD (via service discovery) come from the StorageD configuration files, and are therefore K8s-internal addresses.

This is tied to the service discovery mechanism in Nebula Graph: in a Nebula Graph cluster, both the Graph Service and the Storage Service report their information to the Meta Service through heartbeats, and each service's own address comes from the network configuration in its own configuration file.
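As an illustration, the address a StorageD reports is determined by flags like the following in its nebula-storaged.conf. This is an illustrative excerpt only, with values matching the Operator-managed cluster shown in this article, not copied from a real file:

```
# nebula-storaged.conf (illustrative excerpt)
# MetaD address(es) this StorageD registers itself with
--meta_server_addrs=nebula-metad-0.nebula-metad-headless.default.svc.cluster.local:9559
# the address this StorageD reports as its own through heartbeat
--local_ip=nebula-storaged-0.nebula-storaged-headless.default.svc.cluster.local
# storage service port
--port=9779
```

Whatever is in `--local_ip` is exactly what clients like nebula-spark-connector will later be told to connect to.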

(Figure: communication among MetaD, StorageD, and GraphD)

Finally, recall that Nebula Operator is an application that automatically creates, maintains, and scales Nebula clusters on the K8s control plane according to configuration in the K8s cluster. It has to abstract away some internal resource configuration, including that of the GraphD and StorageD instances, whose configured addresses are actually headless Service addresses.

These addresses (shown below) cannot be reached from outside the K8s network by default, so for GraphD and MetaD we can easily create Services to expose them.

(root@nebula) [(none)]> show hosts meta
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
| Host                                                             | Port | Status   | Role   | Git Info Sha | Version |
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
| "nebula-metad-0.nebula-metad-headless.default.svc.cluster.local" | 9559 | "ONLINE" | "META" | "d113f4a"    | "2.6.2" |
+------------------------------------------------------------------+------+----------+--------+--------------+---------+
Got 1 rows (time spent 1378/2598 us)

Mon, 14 Feb 2022 09:22:33 UTC

(root@nebula) [(none)]> show hosts graph
+---------------------------------------------------------------+------+----------+---------+--------------+---------+
| Host                                                          | Port | Status   | Role    | Git Info Sha | Version |
+---------------------------------------------------------------+------+----------+---------+--------------+---------+
| "nebula-graphd-0.nebula-graphd-svc.default.svc.cluster.local" | 9669 | "ONLINE" | "GRAPH" | "d113f4a"    | "2.6.2" |
+---------------------------------------------------------------+------+----------+---------+--------------+---------+
Got 1 rows (time spent 2072/3403 us)

Mon, 14 Feb 2022 10:03:58 UTC

(root@nebula) [(none)]> show hosts storage
+------------------------------------------------------------------------+------+----------+-----------+--------------+---------+
| Host                                                                   | Port | Status   | Role      | Git Info Sha | Version |
+------------------------------------------------------------------------+------+----------+-----------+--------------+---------+
| "nebula-storaged-0.nebula-storaged-headless.default.svc.cluster.local" | 9779 | "ONLINE" | "STORAGE" | "d113f4a"    | "2.6.2" |
| "nebula-storaged-1.nebula-storaged-headless.default.svc.cluster.local" | 9779 | "ONLINE" | "STORAGE" | "d113f4a"    | "2.6.2" |
| "nebula-storaged-2.nebula-storaged-headless.default.svc.cluster.local" | 9779 | "ONLINE" | "STORAGE" | "d113f4a"    | "2.6.2" |
+------------------------------------------------------------------------+------+----------+-----------+--------------+---------+
Got 3 rows (time spent 1603/2979 us)

Mon, 14 Feb 2022 10:05:24 UTC
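Exposing GraphD (or MetaD) really is just one extra Service. A minimal NodePort sketch for GraphD; the Service name and selector labels here are assumptions, so verify yours with `kubectl get pods --show-labels` before applying:

```yaml
# illustrative only: a NodePort Service exposing GraphD outside the cluster
apiVersion: v1
kind: Service
metadata:
  name: nebula-graphd-nodeport   # hypothetical name
  namespace: default
spec:
  type: NodePort
  selector:                      # assumed labels; check your pods' actual labels
    app.kubernetes.io/cluster: nebula
    app.kubernetes.io/component: graphd
  ports:
    - name: thrift
      port: 9669
      targetPort: 9669
```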

However, because nebula-spark-connector obtains the StorageD addresses through the Meta Service, and those addresses come from service discovery, what nebula-spark-connector actually gets are the headless Service addresses shown above, which cannot be accessed directly from outside the cluster.

Therefore, when circumstances allow, simply running Spark inside the same K8s network as the Nebula cluster solves everything. Otherwise, we need to:

  1. Expose the L4 (TCP) addresses of MetaD and StorageD by means of Ingress.

    You can refer to the documentation of Nebula Operator: https://github.com/vesoft-inc/nebula-operator

  2. Make the headless Service names resolve, through a reverse proxy and DNS, to the corresponding StorageD endpoints.
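To make 2. concrete: suppose each StorageD is fronted by a TCP proxy reachable at some external address; the Spark host then needs name resolution mapping the headless Service names to those addresses. A purely illustrative /etc/hosts fragment, where the 192.0.2.x addresses are placeholders for your proxy endpoints:

```
# /etc/hosts on the Spark host (illustrative placeholders)
192.0.2.10  nebula-storaged-0.nebula-storaged-headless.default.svc.cluster.local
192.0.2.11  nebula-storaged-1.nebula-storaged-headless.default.svc.cluster.local
192.0.2.12  nebula-storaged-2.nebula-storaged-headless.default.svc.cluster.local
```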

So, is there a more convenient way?

Unfortunately, the most convenient way is still the one described at the beginning of this article: run Spark inside the Nebula cluster's network. In fact, I am pushing the Nebula Spark community to support a configurable StorageAddresses option; with it, step 2. above would no longer be needed.

## A more convenient nebula-algorithm + nebula-operator experience

To make life easier for early adopters of Nebula Graph and nebula-algorithm on K8s, I wrote a small tool called Nebula-Operator-KinD. With one command it deploys a K8s cluster inside a Docker environment, installs Nebula Operator together with all of its dependencies (including the storage provider), and on top of that automatically deploys a small Nebula cluster. The steps are as follows:

The first step is to deploy K8s + nebula-operator + Nebula Cluster:

curl -sL nebula-kind.siwei.io/install.sh | bash


The second step: follow the "what's next" section of the tool's documentation to

a. Connect to the cluster with the console and load the sample dataset;

b. Run a graph algorithm on that dataset.

  • Create a Spark environment
kubectl create -f http://nebula-kind.siwei.io/deployment/spark.yaml
kubectl wait pod --timeout=-1s --for=condition=Ready -l '!job-name'
  • Once the wait above completes, enter the Spark pod:
kubectl exec -it deploy/spark-deployment -- bash

  • Download the nebula-algorithm jar and its configuration file:
# download nebula-algorithm-2.6.2.jar
wget https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/2.6.2/nebula-algorithm-2.6.2.jar
# download the nebula-algorithm configuration file
wget https://github.com/vesoft-inc/nebula-algorithm/raw/v2.6/nebula-algorithm/src/main/resources/application.conf
  • Modify the meta and graph address information in the nebula-algorithm configuration:
sed -i '/^        metaAddress/c\        metaAddress: \"nebula-metad-0.nebula-metad-headless.default.svc.cluster.local:9559\"' application.conf
sed -i '/^        graphAddress/c\        graphAddress: \"nebula-graphd-0.nebula-graphd-svc.default.svc.cluster.local:9669\"' application.conf
##### change space
sed -i '/^        space/c\        space: basketballplayer' application.conf
##### read data from nebula graph
sed -i '/^    source/c\    source: nebula' application.conf
##### execute algorithm: labelpropagation
sed -i '/^    executeAlgo/c\    executeAlgo: labelpropagation' application.conf
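The sed commands above match lines by their exact leading indentation, so a silent mismatch is easy to run into. A self-contained sanity check against a scratch copy (the file path and contents here are made up for illustration):

```shell
# build a scratch conf with the same indentation as application.conf
cat > /tmp/app.conf <<'EOF'
    source: csv
        metaAddress: "127.0.0.1:9559"
        graphAddress: "127.0.0.1:9669"
        space: test
    executeAlgo: pagerank
EOF

# apply two of the edits, then confirm each substitution landed exactly once
sed -i '/^        metaAddress/c\        metaAddress: "nebula-metad-0.nebula-metad-headless.default.svc.cluster.local:9559"' /tmp/app.conf
sed -i '/^    source/c\    source: nebula' /tmp/app.conf
grep -c 'nebula-metad-headless' /tmp/app.conf   # prints 1
grep -c 'source: nebula' /tmp/app.conf          # prints 1
```

If either grep prints 0, the leading whitespace of the pattern does not match your copy of the file.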
  • Run the LPA algorithm on the basketballplayer graph space:
/spark/bin/spark-submit --master "local" --conf spark.rpc.askTimeout=6000s \
    --class com.vesoft.nebula.algorithm.Main \
    nebula-algorithm-2.6.2.jar \
    -p application.conf
  • The result is as follows:
bash-5.0# ls /tmp/count/
_SUCCESS                                                  part-00000-5475f9f4-66b9-426b-b0c2-704f946e54d3-c000.csv
bash-5.0# head /tmp/count/part-00000-5475f9f4-66b9-426b-b0c2-704f946e54d3-c000.csv
_id,lpa
1100,1104
2200,2200
2201,2201
1101,1104
2202,2202

Now, Happy Graphing!

