Author: vivo internet server team - Hao Guangshi
1. Background
Field lineage (column lineage) records how each field is derived while a table is being produced. Why do we need field lineage?
With lineage between fields, we know where the data comes from and how fields are transformed along the way, which is of great help to data quality and data governance.
Compared with Hive, Spark SQL is usually more efficient, bringing clear gains in running time and resource usage.
The platform plans to migrate Hive tasks to Spark SQL, so the field lineage feature needs to be implemented there as well.
2. Preliminary research
Before development, we did quite a bit of research and learned that Spark supports extension points: users can extend Spark SQL's SQL parsing, logical plan analysis and checking, logical plan optimization, and physical plan generation.
This approach is feasible, requires no changes to Spark's source code, and its cost is relatively small, so we decided to adopt it.
3. Spark SQL extension
3.1 What Spark allows us to extend
SparkSessionExtensions is the key class here; it defines the methods for injecting rules and currently supports the following:
- 【Analyzer Rules】 logical plan analysis rules
- 【Check Analysis Rules】 logical plan check rules
- 【Optimizer Rules】 logical plan optimization rules
- 【Planning Strategies】 strategies for forming the physical plan
- 【Customized Parser】 custom SQL parser
- 【(External) Catalog Listeners】 catalog listeners
Among the six extension points above, we chose [Check Analysis Rules]: a check rule returns nothing when it is invoked, which means the logical plan tree being traversed does not need to be modified, and that is exactly what we need.
[Analyzer Rules] and [Optimizer Rules], on the other hand, have to modify the current logical plan, which makes it hard for us to traverse the whole tree and collect the results we want. (The available injection methods are sketched below.)
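For reference, the hooks above map to injection methods on SparkSessionExtensions roughly as follows. This is a sketch, not part of the article's code; only method names from Spark's public API are used, and the bodies are left as comments.

import org.apache.spark.sql.SparkSessionExtensions

// Sketch only: which inject* method corresponds to which extension point.
// Each method takes a builder so the injected rule can capture the SparkSession.
class HooksOverview extends (SparkSessionExtensions => Unit) {
  override def apply(e: SparkSessionExtensions): Unit = {
    // e.injectResolutionRule(session => ...)      // Analyzer Rules
    // e.injectCheckRule(session => ...)           // Check Analysis Rules
    // e.injectOptimizerRule(session => ...)       // Optimizer Rules
    // e.injectPlannerStrategy(session => ...)     // Planning Strategies
    // e.injectParser((session, delegate) => ...)  // Customized Parser
  }
}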
3.2 Implement your own extensions
class ExtralSparkExtension extends (SparkSessionExtensions => Unit) {
  override def apply(spark: SparkSessionExtensions): Unit = {
    // field lineage
    spark.injectCheckRule(FieldLineageCheckRuleV3)
    // SQL parser
    spark.injectParser { case (_, parser) => new ExtraSparkParser(parser) }
  }
}
That is all it takes: the rules we need are injected into SparkSessionExtensions inside the apply method, and other rules besides the ones shown above can be injected in the same way. For ExtralSparkExtension to take effect when a Spark job starts, we need to configure spark.sql.extensions=org.apache.spark.sql.hive.ExtralSparkExtension in spark-defaults.conf.
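Besides spark-defaults.conf, the extension can also be activated when building the session in code; a minimal sketch, assuming the ExtralSparkExtension class above is on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("field-lineage-demo")
  .config("spark.sql.extensions", "org.apache.spark.sql.hive.ExtralSparkExtension")
  .getOrCreate()

// or, when constructing the session programmatically (handy in unit tests):
// SparkSession.builder().withExtensions(new ExtralSparkExtension).getOrCreate()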
Notice that we also implemented a custom SQL parser, which does not really do much: when it detects that the statement contains an insert, it stores the SQL text in FIELD_LINE_AGE_SQL. The reason the SQL text is stashed in FIELD_LINE_AGE_SQL is that the SparkPlan is not available inside the check rule, so we have to parse the SQL again later to obtain the SparkPlan. The implementation of FieldLineageCheckRuleV3 is therefore very simple; the important work happens in a separate thread.
We only care about insert statements here, because an insert statement reads from some input tables and writes the result into a target table.
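The snippets below reference the thread-local holders FIELD_LINE_AGE_SQL and FIELD_LINE_AGE_SQL_COULD_SET, whose definitions are not shown; a minimal sketch, assuming they are plain java.lang.ThreadLocal values kept in a shared object (the names are taken from the code):

object LineageThreadLocals {
  // holds the raw SQL text of the current insert statement
  val FIELD_LINE_AGE_SQL: ThreadLocal[String] = new ThreadLocal[String]()

  // guards against overwriting the SQL if parsePlan is entered again
  // while the same statement is being processed
  val FIELD_LINE_AGE_SQL_COULD_SET: ThreadLocal[java.lang.Boolean] =
    ThreadLocal.withInitial(() => java.lang.Boolean.TRUE)
}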
class ExtraSparkParser(delegate: ParserInterface) extends ParserInterface with Logging {

  override def parsePlan(sqlText: String): LogicalPlan = {
    val lineAgeEnabled = SparkSession.getActiveSession
      .get.conf.getOption("spark.sql.xxx-xxx-xxx.enable").getOrElse("false").toBoolean
    logDebug(s"SqlText: $sqlText")
    if (sqlText.toLowerCase().contains("insert")) {
      if (lineAgeEnabled) {
        if (FIELD_LINE_AGE_SQL_COULD_SET.get()) {
          // the thread-local variable is set here
          FIELD_LINE_AGE_SQL.set(sqlText)
        }
        FIELD_LINE_AGE_SQL_COULD_SET.remove()
      }
    }
    delegate.parsePlan(sqlText)
  }

  // delegate to the original SQL parser
  override def parseExpression(sqlText: String): Expression = {
    delegate.parseExpression(sqlText)
  }

  // delegate to the original SQL parser
  override def parseTableIdentifier(sqlText: String): TableIdentifier = {
    delegate.parseTableIdentifier(sqlText)
  }

  // delegate to the original SQL parser
  override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier = {
    delegate.parseFunctionIdentifier(sqlText)
  }

  // delegate to the original SQL parser
  override def parseTableSchema(sqlText: String): StructType = {
    delegate.parseTableSchema(sqlText)
  }

  // delegate to the original SQL parser
  override def parseDataType(sqlText: String): DataType = {
    delegate.parseDataType(sqlText)
  }
}
3.3 The extended rule class
case class FieldLineageCheckRuleV3(sparkSession: SparkSession) extends (LogicalPlan => Unit) {

  val executor: ThreadPoolExecutor =
    ThreadUtils.newDaemonCachedThreadPool("spark-field-line-age-collector", 3, 6)

  override def apply(plan: LogicalPlan): Unit = {
    val sql = FIELD_LINE_AGE_SQL.get
    FIELD_LINE_AGE_SQL.remove()
    if (sql != null) {
      // we have the SQL here, so start a thread to do the remaining parsing work
      val task = new FieldLineageRunnableV3(sparkSession, sql)
      executor.execute(task)
    }
  }
}
Very simple: we just take the SQL text and start a thread to obtain the SparkPlan; the actual logic lives in FieldLineageRunnableV3.
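The article only shows the body of run() (see 3.4.1 below); the enclosing class is presumably a plain Runnable along these lines, a sketch whose constructor signature is taken from the new FieldLineageRunnableV3(sparkSession, sql) call above:

import org.apache.spark.sql.SparkSession

// hypothetical skeleton; the real run() body is shown in section 3.4.1
class FieldLineageRunnableV3(sparkSession: SparkSession, sql: String) extends Runnable {
  override def run(): Unit = {
    // re-parse `sql`, build the SparkPlan and walk it for lineage (see 3.4.1)
  }
}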
3.4 Specific implementation methods
3.4.1 Get SparkPlan
We get SparkPlan in the run method:
override def run(): Unit = {
  val parser = sparkSession.sessionState.sqlParser
  val analyzer = sparkSession.sessionState.analyzer
  val optimizer = sparkSession.sessionState.optimizer
  val planner = sparkSession.sessionState.planner
  ............
  val newPlan = parser.parsePlan(sql)
  PASS_TABLE_AUTH.set(true)
  val analyzedPlan = analyzer.executeAndCheck(newPlan)
  val optimizerPlan = optimizer.execute(analyzedPlan)
  // obtain the SparkPlan
  val sparkPlan = planner.plan(optimizerPlan).next()
  ...............
  if (targetTable != null) {
    val levelProject = new ArrayBuffer[ArrayBuffer[NameExpressionHolder]]()
    val predicates = new ArrayBuffer[(String, ArrayBuffer[NameExpressionHolder])]()
    // projection
    projectionLineAge(levelProject, sparkPlan.child)
    // predicates
    predicationLineAge(predicates, sparkPlan.child)
  ...............
Why work with the SparkPlan (the physical plan)? When we weighed the options, the physical plan proved more accurate for extracting field relationships, and the path through it is shorter and more direct.
As a quick recap, the Spark SQL parsing process is as follows:
After the SqlParser we get a logical plan; at this point table names, functions and so on have not been resolved, so it cannot be executed. The Analyzer then binds this information, such as table validation, field information and function information. The Optimizer then optimizes the logical plan according to built-in rules (RBO; Spark also supports CBO). Finally, the SparkPlanner turns it into an executable physical plan.
Let's look at an example of a logical plan versus a physical plan:
An SQL statement:
select item_id,TYPE,v_value,imei from t1
union all
select item_id,TYPE,v_value,imei from t2
union all
select item_id,TYPE,v_value,imei from t3
The logical plan is this:
The physical plan looks like this:
Clearly, it is simplified a lot.
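The plan at each of these phases can also be inspected directly through the QueryExecution API (a sketch, assuming an active SparkSession named spark and that tables t1, t2, t3 exist):

val df = spark.sql(
  """select item_id, TYPE, v_value, imei from t1
    |union all
    |select item_id, TYPE, v_value, imei from t2
    |union all
    |select item_id, TYPE, v_value, imei from t3""".stripMargin)

println(df.queryExecution.logical)        // logical plan right after parsing
println(df.queryExecution.analyzed)       // after the Analyzer
println(df.queryExecution.optimizedPlan)  // after the Optimizer (RBO)
println(df.queryExecution.sparkPlan)      // physical plan chosen by SparkPlanner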
Once we have the SparkPlan, we can iterate over it and handle each kind of SparkPlan node accordingly.
We divide field lineage into two kinds: projection (the fields in the select list) and predicate (the conditions in the where clause).
Both express a point-to-point relationship: each field of the target table is mapped back to the fields of the source tables it is derived from.
Think of a query as a tree: the iteration starts at the top of the tree and descends to the leaf nodes, which are the source tables:
The result of our iteration should then be
id -> tab1.id,
name -> tab1.name, tabb2.name,
age -> tabb2.age.
Note the variable val levelProject = new ArrayBuffer[ArrayBuffer[NameExpressionHolder]](): after iterating with projectionLineAge, levelProject holds the top-level columns id, name, age mapped to (tab1.id), (tab1.name, tabb2.name) and (tabb2.age) respectively.
Of course, it is not a simple recursion; special cases have to be handled. For example, Join, ExpandExec, Aggregate, Explode, GenerateExec and so on all need special treatment.
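To illustrate the top-down traversal (this is not the article's projectionLineAge, which additionally resolves attributes across the special nodes just mentioned), here is a deliberately simplified sketch that records, for each projected column, the attributes it is computed from:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}

// Simplified sketch: walk the physical plan and, for every ProjectExec,
// record which source attributes each output column references.
def collectProjection(plan: SparkPlan,
                      acc: ArrayBuffer[(String, Seq[String])]): Unit = plan match {
  case p: ProjectExec =>
    p.projectList.foreach { named =>
      // every AttributeReference inside the expression is a source column
      val sources = named.collect { case a: AttributeReference => a.name }
      acc += named.name -> sources
    }
    collectProjection(p.child, acc)
  case other =>
    other.children.foreach(collectProjection(_, acc))
}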
Example and its result:
SQL:
with A as (select id,name,age from tab1 where id > 100 ) ,
C as (select id,name,max(age) from A group by A.id,A.name) ,
B as (select id,name,age from tabb2 where age > 28)
insert into tab3
select C.id,concat(C.name,B.name) as name, B.age from
B,C where C.id = B.id
Result:
{
"edges": [
{
"sources": [
3
],
"targets": [
0
],
"expression": "id",
"edgeType": "PROJECTION"
},
{
"sources": [
4,
7
],
"targets": [
1
],
"expression": "name",
"edgeType": "PROJECTION"
},
{
"sources": [
5
],
"targets": [
2
],
"expression": "age",
"edgeType": "PROJECTION"
},
{
"sources": [
6,
3
],
"targets": [
0,
1,
2
],
"expression": "INNER",
"edgeType": "PREDICATE"
},
{
"sources": [
6,
5
],
"targets": [
0,
1,
2
],
"expression": "((((default.tabb2.`age` IS NOT NULL) AND (CAST(default.tabb2.`age` AS INT) > 28)) AND (B.`id` > 100)) AND (B.`id` IS NOT NULL))",
"edgeType": "PREDICATE"
},
{
"sources": [
3
],
"targets": [
0,
1,
2
],
"expression": "((default.tab1.`id` IS NOT NULL) AND (default.tab1.`id` > 100))",
"edgeType": "PREDICATE"
}
],
"vertices": [
{
"id": 0,
"vertexType": "COLUMN",
"vertexId": "default.tab3.id"
},
{
"id": 1,
"vertexType": "COLUMN",
"vertexId": "default.tab3.name"
},
{
"id": 2,
"vertexType": "COLUMN",
"vertexId": "default.tab3.age"
},
{
"id": 3,
"vertexType": "COLUMN",
"vertexId": "default.tab1.id"
},
{
"id": 4,
"vertexType": "COLUMN",
"vertexId": "default.tab1.name"
},
{
"id": 5,
"vertexType": "COLUMN",
"vertexId": "default.tabb2.age"
},
{
"id": 6,
"vertexType": "COLUMN",
"vertexId": "default.tabb2.id"
},
{
"id": 7,
"vertexType": "COLUMN",
"vertexId": "default.tabb2.name"
}
]
}
4. Summary
For the field lineage implementation in Spark SQL, we first intercepted the insert statement through Spark's own extension mechanism, then picked the SQL text up in our check rule, and finally obtained the physical plan by running it through SparkSqlParser, Analyzer, Optimizer and SparkPlanner.
By iterating over the physical plan and applying the appropriate handling for each kind of node, we obtain the correspondence between fields. The current implementation is relatively simple: it records only the direct mapping between source and target fields and ignores the intermediate steps, but extending it to capture the entire chain of field transformations would not be a problem.