This article was first published on the Nebula Graph Community public account

1. Reasons for choosing Nebula

Superior performance

  • Extremely fast queries
  • A storage-compute separated architecture that is easy to scale out (our current machine configuration is modest, so we may expand it later)
  • High availability (being distributed, it has had no downtime since we put it into use)

Easy to get started

  • Gentle learning curve (it did not take long to get familiar with the architecture and performance)
  • Fast deployment (after working through the manual, we quickly stood up a simple cluster)
  • Easy to use (when you need some data, look up the manual for the corresponding nGQL and run a targeted query)
  • Excellent Q&A (when you hit a problem, search the forum first; if there is no answer, start a thread, and the developers respond very promptly)

Open source and technically stable

  • Many enterprises already use it in production, so you can adopt it with confidence.

2. Background introduction to business requirements

To facilitate data governance, metadata management, and data quality monitoring, we persist the data warehouse lineage generated by the scheduling system.

Lineage data flow

The end-to-end flow of data collection, storage, and platform display:

[Figure: lineage data flow]

Part of the lineage data displayed on the query platform:

[Figure: data query display]

3. My specific practice

1. Version selection

Here we use Nebula v3.0.0 together with Nebula Java Client v3.0.0. Note that the NebulaGraph server and the Java client must be compatible, so keep their version numbers aligned.
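For reference, the Java client can be pulled in as a Maven dependency like this; the coordinates `com.vesoft:client` are the ones published for nebula-java 3.x, so verify them against the repository you use:

```xml
<dependency>
  <groupId>com.vesoft</groupId>
  <artifactId>client</artifactId>
  <version>3.0.0</version>
</dependency>
```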

2. Cluster deployment

Machine configuration

Four physical machines with the same configuration:
2 × 10-core CPUs / 8 × 16 GB RAM / 600 GB disk

3. Installation method

Here we use the RPM installation.
a. Pull the installation package with wget and install it.
b. Edit the configuration files; the main parameters to change are:

  • meta_server_addrs: all machines that run the Meta service, e.g. meta_server_addrs=ip1:9559,ip2:9559,ip3:9559
  • local_ip: the IP of the current machine (in the metad / graphd / storaged config, fill in the IP of the corresponding metad / graphd / storaged machine)
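For reference, the corresponding entries in nebula-metad.conf look roughly like this (ip1 is a placeholder for the current machine; the graphd and storaged config files take the same two flags):

```ini
# All machines that run the Meta service
--meta_server_addrs=ip1:9559,ip2:9559,ip3:9559
# IP of the machine this config file belongs to
--local_ip=ip1
```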

c. After startup, run a simple test through the Console.
Run add hosts ip:port to register your Storage machines (users whose kernel version is lower than v3.0.0 can skip this step), then run show hosts; once the hosts are online, you can start testing nGQL.
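A minimal console session might look like the following; the addresses, ports, and credentials are placeholders:

```
./nebula-console -addr ip1 -port 9669 -u root -p nebula
# inside the console (only needed for kernel v3.0.0 and later):
ADD HOSTS "ip1":9779, "ip2":9779;
SHOW HOSTS;   -- the Storage hosts should show as ONLINE
```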

4. Data import

Currently, the data is updated in two ways.

a. Real-time monitoring of the scheduling platform

We monitor every task instance, derive upstream/downstream relationships from its dependency nodes, and write those relationships into MySQL and NebulaGraph in real time; the NebulaGraph data is updated through the Spark Connector. (MySQL serves as a backup: since Nebula does not support transactions, the data may occasionally drift.)

b. Scheduled data correction

A scheduled Spark task periodically corrects the NebulaGraph data against the lineage stored in MySQL; these updates also go through the Spark Connector.

Using the Spark Connector: initialize the connection with NebulaConnectionConfig, then build WriteNebulaVertexConfig and WriteNebulaEdgeConfig objects from the connection info, the parameters of the vertices and edges to insert, and the target Tag and Edge type, and use them to write the vertex and edge data.
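The write path above can be sketched as follows. This is a minimal sketch based on the nebula-spark-connector 3.x builder API; the space name lineage, tag name field, and the DataFrame column names are hypothetical (the edge type field_rely is the one used in our queries):

```scala
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig, WriteNebulaEdgeConfig}
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import org.apache.spark.sql.DataFrame

// Connection info for the Meta and Graph services (addresses are placeholders)
val config = NebulaConnectionConfig.builder()
  .withMetaAddress("ip1:9559")
  .withGraphAddress("ip1:9669")
  .build()

// How vertices are written: target tag and the column used as the vertex ID
val vertexConfig = WriteNebulaVertexConfig.builder()
  .withSpace("lineage")   // hypothetical space name
  .withTag("field")       // hypothetical tag name
  .withVidField("name")   // DataFrame column used as the VID
  .withBatch(512)
  .build()

// How edges are written: target edge type plus source/destination ID columns
val edgeConfig = WriteNebulaEdgeConfig.builder()
  .withSpace("lineage")
  .withEdge("field_rely")
  .withSrcIdField("src")  // hypothetical column names
  .withDstIdField("dst")
  .withBatch(512)
  .build()

def writeLineage(vertexDf: DataFrame, edgeDf: DataFrame): Unit = {
  vertexDf.write.nebula(config, vertexConfig).writeVertices()
  edgeDf.write.nebula(config, edgeConfig).writeEdges()
}
```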

5. Data platform query

How the data platform queries lineage:

a. How we obtain data from Nebula

We initialize a NebulaPool connection pool inside a singleton utility class, so Sessions are easy to obtain and use throughout the project.

Note that there should be only one connection pool. The maximum number of connections is set through setMaxConnSize; choose the value according to the actual workload (the more frequently the platform is queried, the more connections you need). Also, every Session must be released after use.
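For illustration, the intended usage pattern is sketched below; the user name and password are placeholders, and NebulaUtil is the utility class shared later in this article:

```scala
// Acquire a Session from the single pool, use it, and always release it
val session = NebulaUtil.getPool().getSession("root", "nebula", false)
try {
  // run any nGQL through the utility class
  val rs = NebulaUtil.executeResultSet("SHOW SPACES;", session)
  // ... consume rs ...
} finally {
  session.release()  // return the connection to the pool
}
```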

b. Query the data and convert it into the JSON required by ECharts

① Fetch all upstream or downstream vertices related to the current table or field with GET SUBGRAPH; getting the subgraph makes this very convenient.
② Parse the vertices in both directions out of the result, then parse recursively, ending up with a Bean class that references itself recursively.
③ Write a toString method that produces the JSON string the front end needs, and the result is ready.

Tool class and core logic code

Here are the utility class and the core logic code I use.

Utility class

 object NebulaUtil {

  private val log: Logger = LoggerFactory.getLogger(NebulaUtil.getClass)

  private val pool: NebulaPool = new NebulaPool

  private var success: Boolean = false

  {
    // First, initialize the connection pool
    val nebulaPoolConfig = new NebulaPoolConfig
    nebulaPoolConfig.setMaxConnSize(100)

    // Initialize the Graph service address (IP and port)
    val addresses = util.Arrays.asList(new HostAddress("10.88.100.88", 9669))
    success = pool.init(addresses, nebulaPoolConfig)
  }

  def getPool(): NebulaPool = {
    pool
  }

  def isSuccess(): Boolean = {
    success
  }

  // Execute any nGQL statement: create a space, use a space, create tag and edge types, insert vertices and edges, run queries
  def executeResultSet(query: String, session: Session): ResultSet = {
    val resp: ResultSet = session.execute(query)
    if (!resp.isSucceeded) {
      log.error(String.format("Execute: `%s', failed: %s", query, resp.getErrorMessage))
      System.exit(1)
    }
    resp
  }

  def executeJSON(queryForJson: String, session: Session): String = {
    val resp: String = session.executeJson(queryForJson)
    val errors: JSONObject = JSON.parseObject(resp).getJSONArray("errors").getJSONObject(0)
    if (errors.getInteger("code") != 0) {
      log.error(String.format("Execute: `%s', failed: %s", queryForJson, errors.getString("message")))
      System.exit(1)
    }
    resp
  }

  def executeNGqlWithParameter(query: String, paramMap: util.Map[String, Object], session: Session): Unit = {
    val resp: ResultSet = session.executeWithParameter(query, paramMap)
    if (!resp.isSucceeded) {
      log.error(String.format("Execute: `%s', failed: %s", query, resp.getErrorMessage))
      System.exit(1)
    }
  }

  // Extract the column names and row data from a ResultSet
  // _1: list of column names
  // _2: list of rows; each row is a list of the values of every column in that row
  def getInfoForResult(resultSet: ResultSet): (util.List[String], util.List[util.List[Object]]) = {
    // Get the column names
    val colNames: util.List[String] = resultSet.keys

    // Container for the row data
    val data: util.List[util.List[Object]] = new util.ArrayList[util.List[Object]]

    // Iterate over every row
    for (i <- 0 until resultSet.rowsSize) {
      val curData = new util.ArrayList[Object]
      // Get the record holding the values of row i
      val record = resultSet.rowValues(i)
      import scala.collection.JavaConversions._

      // Convert every value in the record to a String
      for (value <- record.values) {
        if (value.isString) curData.add(value.asString)
        else if (value.isLong) curData.add(value.asLong.toString)
        else if (value.isBoolean) curData.add(value.asBoolean.toString)
        else if (value.isDouble) curData.add(value.asDouble.toString)
        else if (value.isTime) curData.add(value.asTime.toString)
        else if (value.isDate) curData.add(value.asDate.toString)
        else if (value.isDateTime) curData.add(value.asDateTime.toString)
        else if (value.isVertex) curData.add(value.asNode.toString)
        else if (value.isEdge) curData.add(value.asRelationship.toString)
        else if (value.isPath) curData.add(value.asPath.toString)
        else if (value.isList) curData.add(value.asList.toString)
        else if (value.isSet) curData.add(value.asSet.toString)
        else if (value.isMap) curData.add(value.asMap.toString)
      }
      // Append this row to the result
      data.add(curData)
    }

    (colNames, data)
  }

  def close(): Unit = {
    pool.close()
  }

}

Core code

 // The bean's children pointer is a mutable array
  // Get the subgraph
  // field_name: the start vertex; direct: direction of the subgraph (true = downstream, false = upstream)
  def getSubgraph(field_name: String, direct: Boolean, nebulaSession: Session): FieldRely = {

    // Root node for field_name
    val relyResult = new FieldRely(field_name, new mutable.ArrayBuffer[FieldRely])

    // out = downstream, in = upstream
    val downOrUp = if (direct) "out" else "in"

    // 1. Query: fetch the whole subgraph in the given direction
    val query =
      s"""
         | get subgraph 100 steps from "$field_name" $downOrUp field_rely yield edges as field_rely;
         |""".stripMargin

    val resultSet = NebulaUtil.executeResultSet(query, nebulaSession)

    // Sample result: [[:field_rely "dws.dws_order+ds_code"->"dws.dws_order_day+ds_code" @0 {}], [:field_rely "dws.dws_order+ds_code"->"tujia_qlibra.dws_order+p_ds_code" @0 {}], [:field_rely "dws.dws_order+ds_code"->"tujia_tmp.dws_order_execution+ds_code" @0 {}]]
    // If the result is non-empty, extract and parse the data
    if (!resultSet.isEmpty) {
      val data = NebulaUtil.getInfoForResult(resultSet)
      val curData: util.List[util.List[Object]] = data._2

      // Regex that matches the content inside double quotes
      val pattern = Pattern.compile("\"([^\"]*)\"")

      // All nodes of the previous step, used to find each node's parent when storing
      var parentNode = new mutable.ArrayBuffer[FieldRely]()

      // 2. First handle the edges at step 1
      curData.get(0).get(0).toString.split(",").foreach(curEdge => {
        // Extract the source and destination vertices of the edge
        val matcher = pattern.matcher(curEdge)
        var startPoint = ""
        var endPoint = ""

        // Assign the two vertices
        while (matcher.find()) {
          val curValue = matcher.group().replaceAll("\"", "")
          // The direction differs between upstream and downstream, so swap how start and end are read
          // out = downstream, the structure is startPoint -> endPoint
          if (direct) {
            if ("".equals(startPoint)) {
              startPoint = curValue
            } else {
              endPoint = curValue
            }
          } else {
            // in = upstream, the structure is endPoint -> startPoint
            if ("".equals(endPoint)) {
              endPoint = curValue
            } else {
              startPoint = curValue
            }
          }
        }
        // Append to the children of the root bean
        relyResult.children.append(new FieldRely(endPoint, new ArrayBuffer[FieldRely]()))
      })

      // 3. Initialize the parent-node array
      parentNode = relyResult.children

      // 4. Process all remaining edges
      for (i <- 1 until curData.size - 1) {
        // Parent-node set for the next step
        val nextParentNode = new mutable.ArrayBuffer[FieldRely]()
        val curEdges = curData.get(i).get(0).toString

        // Parse each edge to obtain its destination vertex
        curEdges.split(",").foreach(curEdge => {

          // Extract the source and destination vertices of the edge
          val matcher = pattern.matcher(curEdge)
          var startPoint = ""
          val endNode = new FieldRely("")

          // Assign the two vertices
          while (matcher.find()) {
            val curValue = matcher.group().replaceAll("\"", "")
            if (direct) {
              if ("".equals(startPoint)) {
                startPoint = curValue
              } else {
                endNode.name = curValue
                endNode.children = new mutable.ArrayBuffer[FieldRely]()
                nextParentNode.append(endNode)
              }
            } else {
              if ("".equals(endNode.name)) {
                endNode.name = curValue
                endNode.children = new mutable.ArrayBuffer[FieldRely]()
                nextParentNode.append(endNode)
              } else {
                startPoint = curValue
              }
            }
          }

          // Find the parent via startPoint and append endNode to its children;
          // once found, this edge has been inserted
          var flag = true
          for (curFieldRely <- parentNode if flag) {
            if (curFieldRely.name.equals(startPoint)) {
              curFieldRely.children.append(endNode)
              flag = false
            }
          }

        })

        // Move on to the next step's parents
        parentNode = nextParentNode
      }

    }
    relyResult
  }

Bean toString

 class FieldRely {

  @BeanProperty
  var name: String = _  // Field name of the current node

  @BeanProperty
  var children: mutable.ArrayBuffer[FieldRely] = _  // All upstream or downstream child fields of this node

  def this(name: String, children: mutable.ArrayBuffer[FieldRely]) = {
    this()
    this.name = name
    this.children = children
  }

  def this(name: String) = {
    this()
    this.name = name
  }

  override def toString(): String = {
    var resultString = ""
    // Quote character
    val quote = "\""

    // No children: emit a JSON object with an empty children array
    if (children.isEmpty) {
      resultString += s"{${quote}name${quote}: ${quote}$name${quote}, ${quote}children${quote}: []}"
    } else {
      // Children exist: serialize each child recursively and join them
      var childrenStr = ""

      for (curRely <- children) {
        val curRelyStr = curRely.toString
        childrenStr += curRelyStr + ", "
      }

      // Drop the trailing ', '
      if (childrenStr.length > 2) {
        childrenStr = childrenStr.substring(0, childrenStr.length - 2)
      }

      resultString += s"{${quote}name${quote}: ${quote}$name${quote}, ${quote}children${quote}: [$childrenStr]}"
    }
    resultString
  }
}

Results

Even when the queried subgraph is close to 20 steps deep, the data returned by the interface can generally be delivered within 200 ms (including complex backend processing logic).

This article is participating in the voting of the first Nebula Call for Papers. If you found it helpful, you can vote for me as encouragement~
Thank you (#^.^#)

I am a data development intern and have been in this position for about four months, during which I was responsible for developing features of the data platform.
Because the read/write performance of some of our data was low, after investigation I chose to deploy a Nebula cluster. Its technology is relatively mature, the community is well developed, and it is very friendly to newcomers, so it was quickly put into use. Here I share my experience, along with some opinions and the problems and solutions I encountered along the way.



