1. Project Background
Under the group's data mining program, the user profiling project needs to build many basic tags, such as age, gender, marital status and asset level. These tags cannot be produced by simply processing each of the group's data sources and treating that output as the final result, because data quality varies widely across sources and the confidence of the result derived from each source also differs.
So we need models such as a decision tree combined with XGBoost to predict user tags comprehensively. For example, the decision tree for the age-group tag looks like the figure below:
2. Model Abstraction
At first I planned to follow machine learning's PMML standard to implement the decision tree abstraction. But after a full day with the official PMML documentation and a pile of PMML articles, I concluded the approach was far too complex for me to implement: PMML files in machine learning are mostly generated automatically by model code, which does not fit our scenario (we would have to convert our decision trees into PMML files by hand). Writing that generation code ourselves would mean mastering all kinds of PMML tag definitions, figuring out how to load the file in a Spark environment, and making sure the loaded Model carries exactly the logic we intend; the follow-up work could get very complicated.
So I had no choice but to think about how to write our own code to abstract the decision tree model for our particular requirements.
It involves at least the following difficulties:
1. How to abstract the decision tree
2. How to design a PMML-like configuration file
3. How to parse the configuration file
4. How to turn it into code that Spark can execute
Now for the main course~
3. Implementation
From the cases analyzed above, the in-memory model fits a binary tree naturally, and the configuration file can be expressed as XML or JSON. XML can be parsed with an XML utility class, and JSON with Scala's json4s or Alibaba's fastjson. I chose a JSON configuration file parsed with json4s, which I know best. From there the approach is clear, and the implementation steps are:
1. Parse the JSON configuration file into a binary tree
2. Bind data to each tree node according to the expressions in the JSON
3. Convert the data-bound tree into a Spark physical execution plan
4. Submit it to the Spark cluster and write the results to the specified output location
The flow chart is as follows:
The main entry class is as follows:
package com.yang
import com.yang.decision.Source
import com.yang.decision.build.{Binder, Executor, Parser}
import com.yang.utils.{FileUtils, SparkUtils}
import org.slf4j.LoggerFactory
/**
* Application entry class of the decision tree model abstraction project.
* 1. Parse the input args to get the process day (processDay),
* the config file path (configFilePath) and the parallelism (parallelism).
* 2. Read the config file from HDFS and parse the fields (schema),
* the data source class (source), the output table (output) and the rule set (rules).
* 3. Parse rules into a binary tree and bind each tree node's rule at the same time.
* 4. Traverse the parsed binary tree (parsedTree) and bind each node's data set (dataFrame)
* according to its rule, producing the data-bound tree (boundTree).
* 5. Traverse all leaf nodes (Calculation); each leaf's data is the result set under one rule.
* Union all leaf data sets to obtain the final result set.
* During execution the parsed tree, the bound tree and the physical plan are printed.
* Users need to provide two things:
* 1. A configuration file; see AgeGroupRule.json for the format.
* 2. A custom data source, implementing the [[com.yang.decision.Source]] abstract class.
* For a new model requirement, implement the [[com.yang.decision.model.Model]] abstract class to extend the decision tree's model categories.
*
* @author yangfan
* @since 2021/3/18
* @version 1.0.0
*/
object DecisionTreeApplication {
private final val logger = LoggerFactory.getLogger(this.getClass)
private def checkInputArgs(args: Array[String]): (String, String, Int) = {
if (null == args || args.length != 3) {
throw new IllegalArgumentException(
s"Input args exception, args length mast be 3 but get `${args.mkString(",")}` !")
}
logger.error("== args ==\n{}", args.mkString(","))
(args(0), args(1), args(2).toInt)
}
def main(args: Array[String]): Unit = {
/**
* processDay: the process day
* configFilePath: the config file path
* parallelism: the parallelism
*/
val (processDay, configFilePath, parallelism) = checkInputArgs(args)
//val configText = FileUtils.readFromLocal(configFilePath)
val configText = FileUtils.readFromHDFS(configFilePath)
logger.error("== config text ==\n{}", configText)
/**
* Parse the JSON into a binary tree and bind each node's rule at the same time
*/
val (sc, parsedTree) = Parser.parse(configText)
logger.error("== schema and source ==\n{}", sc)
logger.error("== parsed tree ==\n{}", parsedTree)
val spark = SparkUtils.initSession(isLocal = false, this.getClass.getSimpleName)
/**
* Pre-order traverse parsedTree, bind each child node's data set according to its rule, and produce the bound tree (boundTree)
*/
val boundTree = Binder.bind(parsedTree,
Source.loadData(sc.sourceClass, spark, processDay, parallelism))
logger.error("== bound tree ==\n{}", boundTree)
/**
* Traverse all leaf nodes (Calculation); each leaf's data is the result set under one rule.
* Union all leaf data sets to obtain the final result set
*/
val df = Executor.execute(boundTree)
df.explain(extended = true)
//df.show()
Executor.executeAndSaveResultData(spark, sc, processDay, boundTree)
spark.close()
}
}
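One thing the article does not list is the Source abstraction referenced above (Source.loadData in the main class, and the custom data source users must implement). Purely as an illustration, here is a minimal sketch that is consistent with how it is called; the names and signatures are my assumptions, not the project's actual code:
package com.yang.decision
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Hypothetical sketch of the data source abstraction: a concrete source builds
* the root DataFrame for a given process day and parallelism.
*/
abstract class Source extends Serializable {
  def loadData(spark: SparkSession, processDay: String, parallelism: Int): DataFrame
}
object Source {
  /** Reflectively instantiate the source class named in the config and load the root data set. */
  def loadData(sourceClass: String, spark: SparkSession,
               processDay: String, parallelism: Int): DataFrame =
    Class.forName(sourceClass)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[Source]
      .loadData(spark, processDay, parallelism)
}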
Step 1:
How do we parse a JSON configuration file into a binary tree as the in-memory model? This part is fairly simple.
First, a simple JSON configuration file:
{
  "schema": "id, jr_age, xinan_age, jl_age, umc_age, yc_age, vector",
  "source": "com.yang.decision.source.AgeGroupSource",
  "output": "xxxx_ods.ods_decision_tree_output_data, age_group",
  "rules": {
    "rule": "jr_age != null",
    "left": {
      "rule": "GeneralRule, jr_age, 1.0"
    },
    "right": {
      "rule": "xinan_age != null",
      "left": {
        "rule": "GeneralRule, xinan_age, 0.997349184"
      },
      "right": {
        "rule": "jl_age != null || umc_age != null || yc_age != null",
        "left": {
          "rule": "jl_age == umc_age || jl_age == yc_age || umc_age == yc_age",
          "left": {
            "rule": "jl_age != null",
            "left": {
              "rule": "GeneralRule, jl_age, 0.992448605"
            },
            "right": {
              "rule": "GeneralRule, umc_age, 0.992448605"
            }
          },
          "right": {
            "rule": "jl_age != null",
            "left": {
              "rule": "GeneralRule, jl_age, 0.982582546"
            },
            "right": {
              "rule": "umc_age != null",
              "left": {
                "rule": "GeneralRule, umc_age, 0.974128879"
              },
              "right": {
                "rule": "GeneralRule, yc_age, 0.920175899"
              }
            }
          }
        },
        "right": {
          "rule": "vector != null",
          "left": {
            "rule": "ModelRule, com.yang.decision.model.XGBoostPipelineModel, /xxxx/age/age-applist-xgb-0901"
          }
        }
      }
    }
  }
}
Parsing the configuration file, straight to the code:
package com.yang.decision.build
import com.yang.decision.ModelProperties
import com.yang.decision.tree._
import org.apache.spark.sql.DataFrame
import org.json4s.jackson.JsonMethods
import org.json4s.{DefaultFormats, JValue}
/**
* Parser: parses the configuration file into a binary tree
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
object Parser {
implicit val formats: DefaultFormats.type = DefaultFormats
def parse(text: String): (ModelProperties, Tree[DataFrame]) = try {
val jv = JsonMethods.parse(text)
val schema = (jv \ "schema").extract[String]
val arr = schema.split(",").map(_.trim)
val source = (jv \ "source").extract[String].trim
val output = (jv \ "output").extract[String].split(",").map(_.trim)
val rules = jv \ "rules"
(ModelProperties(arr, source, output), parse2Tree(rules, null))
} catch {
case _: Exception =>
throw new RuntimeException(
s"`parse` method Exception, config text parse failed !")
}
def parse2Tree(jv: JValue,
parent: Tree[DataFrame]): Tree[DataFrame] = parse2Node(jv) match {
case (rule, l, r) if null == l && null == r =>
new Calculation(parent, rule)
case (rule, l, r) =>
new Condition(parent, rule,
parse2Tree(l, null), parse2Tree(r, null))
case _ => new Calculation(parent, null)
}
def parse2Node(jv: JValue): (String, JValue, JValue) = jv match {
case _: JValue => (
try {
(jv \ "rule").extract[String]
} catch {
case _: Exception =>
throw new RuntimeException(
s"`parse2Node` method Exception, `rule` mast not be null !")
},
if (containsKey(jv, "left")) jv \ "left" else null,
if (containsKey(jv, "right")) jv \ "right" else null
)
case _ => null
}
def containsKey(jv: JValue, key: String): Boolean = {
jv.findField(_._1.equals(key)).nonEmpty
}
}
Above, the parse(text: String): (ModelProperties, Tree[DataFrame]) method extracts schema, source, output and rules. At this point rules is still a JValue object rather than a Tree, so the parse2Tree(jv: JValue, parent: Tree[DataFrame]): Tree[DataFrame] method is called to turn the JValue into a Tree via recursive, depth-first traversal. Once parsing is done, I refer to this tree as the ParsedTree.
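As an aside, the ModelProperties type returned by parse is never shown in the article. A minimal sketch consistent with how it is used by the parser, Source.loadData and the executor; the field names (sourceClass, output) are inferred from sc.sourceClass and sc.output and are assumptions, not the project's actual definition:
package com.yang.decision
/**
* Hypothetical sketch: schema field names, the data source class name,
* and the output as a (table, model tag) pair.
*/
case class ModelProperties(schema: Array[String],
                           sourceClass: String,
                           output: Array[String])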
Below is a partial before/after comparison of the parsing:
Step 2:
After step 1, the binary tree model (parsedTree) looks like this:
== schema and source ==
schema: id,jr_age,xinan_age,jl_age,umc_age,yc_age,vector
source: com.bj58.decision.source.AgeGroupSource
output: hdp_jinrong_tech_ods.ods_decision_tree_output_data,age_group
== parsed tree ==
Branch[DataFrame]: {
DataFrame: null
Condition: jr_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: jr_age, confidence: 1.0]
}
RightNode: Branch[DataFrame]: {
DataFrame: null
Condition: xinan_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: xinan_age, confidence: 0.997349184]
}
RightNode: Branch[DataFrame]: {
DataFrame: null
Condition: jl_age != null || umc_age != null || yc_age != null
LeftNode: Branch[DataFrame]: {
DataFrame: null
Condition: jl_age == umc_age || jl_age == yc_age || umc_age == yc_age
LeftNode: Branch[DataFrame]: {
DataFrame: null
Condition: jl_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: jl_age, confidence: 0.992448605]
}
RightNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: umc_age, confidence: 0.992448605]
}
}
RightNode: Branch[DataFrame]: {
DataFrame: null
Condition: jl_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: jl_age, confidence: 0.982582546]
}
RightNode: Branch[DataFrame]: {
DataFrame: null
Condition: umc_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: umc_age, confidence: 0.974128879]
}
RightNode: Leaf[DataFrame]: {
DataFrame: null
GeneralRule[fieldName: yc_age, confidence: 0.920175899]
}
}
}
}
RightNode: Branch[DataFrame]: {
DataFrame: null
Condition: vector != null
LeftNode: Leaf[DataFrame]: {
DataFrame: null
ModelRule[modelClass: com.bj58.decision.model.XGBoostPipelineModel, modelFilePath: /xxxx/age/age-applist-xgb-0901, config: null]
}
RightNode: Leaf[DataFrame]: {
DataFrame: null
null
}
}
}
}
}
So how is the data bound to each node?
This part is more complex. First, a partial before/after comparison of the binding:
As you can see, each node has been bound to the data set it needs (note: the data set here is a DataFrame abstraction, not the data itself; anyone familiar with Spark RDDs will know what I mean, so I won't dwell on it).
So how is this step done?
First, the nodes of a binary tree are either branch nodes (Branch) or leaf nodes (Leaf). In our scenario, every branch node carries a conditional expression whose result decides whether a record goes to the left or the right child, and every leaf node is a node that computes a result set according to a rule. So I defined two classes: Condition, a condition node that extends Branch, and Calculation, a calculation node that extends Leaf.
Why have Condition and Calculation extend Branch and Leaf instead of extending Tree directly? My lead asked me exactly this, and at first I thought the extra layer was probably unnecessary. But after going back through the Spark SQL source code I am comfortable with it: Spark SQL's TreeNode hierarchy also has abstract classes such as UnaryNode and BinaryNode, and only beneath them sit the concrete classes like Join, CoGroup and SetOperation.
If you have doubts about the data-binding logic, take a look at the org.apache.spark.sql.execution.QueryExecution class in the Spark SQL source, specifically the step from Unresolved Logical Plan to Resolved Logical Plan: the SQL text is parsed by ANTLR4 into an Unresolved Logical Plan, and the analyzer then binds it against the catalog (which stores the metadata) to produce the Resolved Logical Plan. My implementation borrows directly from that idea.
Here's the code:
1. The binding process:
package com.yang.decision.build
import com.yang.decision.tree._
import org.apache.spark.sql.DataFrame
/**
* Binder: binds each node's data set
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
object Binder {
def bind(parsedTree: Tree[DataFrame], dataFrame: DataFrame): Tree[DataFrame] = {
parsedTree.data = dataFrame
bindChild(parsedTree)
}
def bindChild(parsedTree: Tree[DataFrame]): Tree[DataFrame] = parsedTree match {
case leaf: Calculation if null != leaf.rule =>
leaf.data = leaf.rule.execute(leaf.data)
leaf
case branch: Condition =>
branch.left.data = branch.leftDF
branch.right.data = branch.rightDF
bindChild(branch.left)
bindChild(branch.right)
branch
case other@_ => other
}
}
As shown above, binding proceeds in the following steps:
1. Pass in the parsed binary tree parsedTree and the root node's data set rootDataFrame
2. Bind the root node's data set: parsedTree.data = dataFrame
3. Recursively traverse the left and right children and bind each child's data set (a concrete sketch of the split follows after this list)
If the node is a branch node (Condition), bind the left and right children's data sets according to the conditional expression,
then recursively bind those children and their subtrees, and return the node:
branch.left.data = branch.leftDF
branch.right.data = branch.rightDF
If the node is a leaf node (Calculation), compute the node's data set from its rule and return the node:
leaf.data = leaf.rule.execute(leaf.data)
4. If the node is neither a branch node nor a leaf node, return it unchanged without binding
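To make the branch binding concrete, here is a minimal, self-contained sketch of what a Condition node effectively does to its DataFrame. It uses a plain SQL filter instead of the QLExpress evaluation the real Condition class performs, and a local SparkSession with toy data, so it is illustrative only:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("binder-sketch")
  .getOrCreate()
import spark.implicits._

// Toy input: one row where jr_age is present, one where it is null
val df = Seq(("u1", "35"), ("u2", null)).toDF("id", "jr_age")

// Rows satisfying the condition flow to the left child (GeneralRule, jr_age, 1.0) ...
val leftDF = df.filter("jr_age is not null")
// ... and the remaining rows flow to the right child (the next Condition)
val rightDF = df.filter("jr_age is null")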
2. The binary tree abstraction
package com.yang.decision.tree
/**
* Binary tree abstraction
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
abstract class Tree[T](parent: Tree[T]) extends Serializable {
var data: T = _
/**
* Number of nodes
*/
def nodeNumber: Int = go(this, 0)
private def go[A](tree: Tree[A], t: Int): Int = tree match {
case Leaf(_) => t + 1
case Branch(_, left, right) => t + 1 + go(left, 0) + go(right, 0)
case _ => 0
}
/**
* Depth of the tree
*/
val depth: Int = deepFirstSearch(this)(_ => 1)((a, b) => (a max b) + 1)
/**
* Map: transform the data carried by each node (direct recursion over the tree)
*/
def map[A, B](tree: Tree[A])(f: A => B): Tree[B] = tree match {
  case l@Leaf(_) =>
    val leaf = Leaf[B](null)
    leaf.data = f(l.data)
    leaf
  case b@Branch(_, left, right) =>
    val branch = Branch[B](null, map(left)(f), map(right)(f))
    branch.data = f(b.data)
    branch
}
/**
* Depth-first traversal: apply f at the leaves and combine the two subtree results with g
*/
def deepFirstSearch[A, B](tree: Tree[A])(f: A => B)(g: (B, B) => B): B = tree match {
  case l@Leaf(_) => f(l.data)
  case Branch(_, left, right) =>
    g(deepFirstSearch(left)(f)(g), deepFirstSearch(right)(f)(g))
}
override def toString: String = toString(0)
def toString(depth: Int): String = super.toString
}
/**
* Leaf node
*
* @since 2021/3/16
* @version 1.0.0
*/
case class Leaf[T](parent: Tree[T]) extends Tree[T](parent)
/**
* Branch node
*
* @since 2021/3/16
* @version 1.0.0
*/
case class Branch[T](parent: Tree[T], left: Tree[T], right: Tree[T]) extends Tree[T](parent)
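As a quick sanity check of the traversal helpers above (a throwaway example, not part of the project):
// A two-level tree: one branch holding two leaves
val root = Branch[Int](null, Leaf[Int](null), Leaf[Int](null))
println(root.nodeNumber) // 3: the branch plus its two leaves
println(root.depth)      // 2: a leaf counts as depth 1, each branch level adds 1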
3. The condition node implementation:
package com.yang.decision.tree
import com.ql.util.express.{DefaultContext, ExpressRunner}
import com.yang.express.EqualsOperator
import org.apache.spark.sql.{DataFrame, Row}
/**
* Condition node
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
class Condition(parent: Tree[DataFrame],
ruleText: String,
left: Tree[DataFrame],
right: Tree[DataFrame])
extends Branch[DataFrame](parent, left, right) {
lazy val runner: ExpressRunner = {
val runner = new ExpressRunner()
/**
* Override the evaluation logic of the original `==` operator:
* before: null == null returns true
* after:  null == null returns false
*/
runner.replaceOperator("==", new EqualsOperator())
runner
}
lazy val fields: Array[String] = runner.getOutVarNames(ruleText)
lazy val leftDF: DataFrame = data.filter(r => isLeft(r))
lazy val rightDF: DataFrame = data.filter(!isLeft(_))
def isLeft(row: Row): Boolean = {
val context = new DefaultContext[String, AnyRef]
fields.foreach(field =>
context.put(field, row.getAs(field).asInstanceOf[AnyRef])
)
try {
runner.execute(ruleText, context,
null, true, false)
.asInstanceOf[Boolean]
} catch {
case _: Exception =>
throw new RuntimeException(
s"`isLeft` method Exception, condition expression compute failed !")
}
}
override def toString(depth: Int): String = {
val prefix = Seq.fill(depth)('\t').mkString
s"""Branch[DataFrame]: {
|${prefix + '\t'}DataFrame: $data
|${prefix + '\t'}Condition: $ruleText
|${prefix + '\t'}LeftNode: ${if (null == left) "null" else left.toString(depth + 1)}
|${prefix + '\t'}RightNode: ${if (null == right) "null" else right.toString(depth + 1)}
|$prefix}""".stripMargin
}
}
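For readers who have not used QLExpress before, here is a minimal sketch of how the runner evaluates a condition over a field context, mirroring the five-argument execute call used in isLeft above (toy values, illustrative only):
import com.ql.util.express.{DefaultContext, ExpressRunner}

val runner = new ExpressRunner()
val context = new DefaultContext[String, AnyRef]
context.put("jr_age", "35")

// Same execute call as in Condition.isLeft: (expression, context, errorList, isCache, isTrace)
val goLeft = runner.execute("jr_age != null", context, null, true, false)
  .asInstanceOf[Boolean] // true here, so the row would be routed to the left child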
4. The calculation node implementation:
package com.yang.decision.tree
import com.yang.decision._
import org.apache.spark.sql.DataFrame
/**
* Calculation node
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
class Calculation(parent: Tree[DataFrame],
ruleText: String)
extends Leaf[DataFrame](parent) {
val rule: Rule = {
if (null == ruleText) {
null
} else {
val arr: Array[String] = ruleText.split(",").map(_.trim)
arr(0) match {
case "GeneralRule" => GeneralRule(arr(1), arr(2).toDouble)
case "ModelRule" => ModelRule(arr(1), arr(2), null)
case par@_ =>
throw new RuntimeException(s"`Calculation` class Exception, " +
s"`rule` mast be `GeneralRule` or `ModelRule` but get `$par` !")
}
}
}
override def toString(depth: Int): String = {
val prefix = Seq.fill(depth)('\t').mkString
s"""Leaf[DataFrame]: {
|${prefix + '\t'}DataFrame: $data
|${prefix + '\t'}$rule
|$prefix}""".stripMargin
}
}
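For reference, this is how the leaf rule texts from the config above end up as Rule objects, following the match in Calculation directly:
// "GeneralRule, jr_age, 1.0"
//     -> GeneralRule(fieldName = "jr_age", confidence = 1.0)
// "ModelRule, com.yang.decision.model.XGBoostPipelineModel, /xxxx/age/age-applist-xgb-0901"
//     -> ModelRule(modelClass = "com.yang.decision.model.XGBoostPipelineModel",
//                  modelFilePath = "/xxxx/age/age-applist-xgb-0901", config = null)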
Here the calculation node's rule object matters a lot: it describes which rule the node should use to compute its result set when it is evaluated. It is a Rule object. Rule is an abstract class with a single method, execute(data: DataFrame): DataFrame, and once the node is submitted to the cluster this is the method that actually runs the computation. Two calculation rules are implemented so far: a general calculation and a model calculation. The code is as follows:
package com.yang.decision
import com.yang.decision.model.Model
import org.apache.spark.sql.{DataFrame, Encoders}
/**
* Calculation rule abstraction
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
trait Rule {
def execute(data: DataFrame): DataFrame
}
/**
* General calculation
*
* @since 2021/3/16
* @version 1.0.0
*/
case class GeneralRule(fieldName: String, confidence: Double) extends Rule {
override def execute(data: DataFrame): DataFrame = {
data.map(row =>
OutputData(
row.getAs[String]("id"),
row.getAs[String](fieldName),
confidence
)
)(Encoders.product[OutputData]).toDF()
}
override def toString: String = s"GeneralRule[fieldName: $fieldName, confidence: $confidence]"
}
/**
* Model calculation
*
* @since 2021/3/16
* @version 1.0.0
*/
case class ModelRule(modelClass: String,
modelFilePath: String,
config: Map[String, String]) extends Rule {
lazy val model: Model = Class.forName(modelClass)
.getConstructor(classOf[String], classOf[Map[String, String]])
.newInstance(modelFilePath, config)
.asInstanceOf[Model]
override def execute(data: DataFrame): DataFrame = model.execute(data)
override def toString: String = s"ModelRule[modelClass: $modelClass, " +
s"modelFilePath: $modelFilePath, config: $config]"
}
In the model calculation rule you can see that model.execute(data) is what returns the result set, so the real computation logic lives in the individual models, e.g. the XGBoost model or any other. The model object's type, Model, is again an abstract class: whenever a node needs to compute with a different kind of model, you implement Model's execute(data: DataFrame): DataFrame method. That code follows:
1. The model abstract class (it has two members to implement: val model: Any, the model object, and def execute(data: DataFrame): DataFrame, the model's computation logic):
package com.yang.decision.model
import org.apache.spark.sql.DataFrame
/**
* Model abstraction
*
* @author yangfan
* @since 2021/3/8
* @version 1.0.0
*/
abstract class Model(modelFilePath: String,
config: Map[String, String]) extends Serializable {
val model: Any
def execute(data: DataFrame): DataFrame
}
2. The XGBoost model implementation:
package com.yang.decision.model
import com.yang.decision.OutputData
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Encoders}
import scala.collection.mutable
/**
* XGBoost model
*
* @author yangfan
* @since 2021/3/10
* @version 1.0.0
*/
class XGBoostPipelineModel(modelFilePath: String,
config: Map[String, String])
extends Model(modelFilePath, config) {
override lazy val model: PipelineModel = PipelineModel.load(modelFilePath)
override def execute(data: DataFrame): DataFrame = {
if (null == model)
throw new NullPointerException(
s"`execute` method exception, `model` mast not be null !")
if (null == data)
throw new NullPointerException(
s"`execute` method exception, `data` mast not be null !")
val labeledData = data.map(l => {
LabeledPoint(0d,
Vectors.dense(
l.getAs[mutable.WrappedArray[Double]]("vector").toArray
),
l.getAs[String]("id"))
})(Encoders.product[LabeledPoint])
model.transform(labeledData)
.select("id", "probability", "predictedLabel")
.map(r => {
val v = r.getAs[Vector](1)
OutputData(r.getString(0), r.getString(2), v(v.argmax))
})(Encoders.product[OutputData])
.toDF()
}
}
case class LabeledPoint(label: Double,
features: Vector,
id: String)
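To summarize the input/output contract implied by the code above (derived from the reads and writes in execute, for illustration only):
// input  DataFrame : [id: string, vector: array<double>]  -- the app-list feature vector
// output DataFrame : [id: string, featureValue: string, confidenceScore: double]
//                    where featureValue is the predicted label and
//                    confidenceScore = probability(probability.argmax)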
Below is a partial diagram of the calculation logic:
Step 3:
After steps 1 and 2, the binary tree model (boundTree) looks like this:
== bound tree ==
Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: jr_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: jr_age, confidence: 1.0]
}
RightNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: xinan_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: xinan_age, confidence: 0.997349184]
}
RightNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: jl_age != null || umc_age != null || yc_age != null
LeftNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: jl_age == umc_age || jl_age == yc_age || umc_age == yc_age
LeftNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: jl_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: jl_age, confidence: 0.992448605]
}
RightNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: umc_age, confidence: 0.992448605]
}
}
RightNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: jl_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: jl_age, confidence: 0.982582546]
}
RightNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: umc_age != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: umc_age, confidence: 0.974128879]
}
RightNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
GeneralRule[fieldName: yc_age, confidence: 0.920175899]
}
}
}
}
RightNode: Branch[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
Condition: vector != null
LeftNode: Leaf[DataFrame]: {
DataFrame: [id: string, featureValue: string ... 1 more field]
ModelRule[modelClass: com.bj58.decision.model.XGBoostPipelineModel, modelFilePath: /xxxx/age/age-applist-xgb-0901, config: null]
}
RightNode: Leaf[DataFrame]: {
DataFrame: [id: string, jr_age: string ... 5 more fields]
null
}
}
}
}
}
So how do we turn such a data-bound Tree into a physical plan that Spark can execute?
If the previous steps made sense, this is the easiest part. The code:
package com.yang.decision.build
import com.yang.decision.ModelProperties
import com.yang.decision.tree._
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Executor: converts the Tree into a Spark-executable physical plan
*
* @author yangfan
* @since 2021/3/16
* @version 1.0.0
*/
object Executor {
def executeAndSaveResultData(spark: SparkSession,
sc: ModelProperties,
processDay: String,
tree: Tree[DataFrame]): Unit = {
execute(tree).createOrReplaceTempView("output_data_temp_view")
spark.sql(
s"""
|insert overwrite table ${sc.output(0)}
|partition(dt = '$processDay', model = '${sc.output(1)}')
|select
| id
| ,featureValue
| ,confidenceScore
|from output_data_temp_view
""".stripMargin
)
}
def execute(boundTree: Tree[DataFrame]): DataFrame = boundTree match {
case leaf: Calculation if null != leaf.rule => leaf.data
case branch: Condition => union(execute(branch.left), execute(branch.right))
case _ => null
}
def union(l: DataFrame, r: DataFrame): DataFrame = (l, r) match {
case _ if null == l => r
case _ if null == r => l
case _ => l.union(r)
}
}
In fact, after the previous steps every node already has the data set it needs bound to it, and the final result we want is exactly the data sets under the leaf (calculation) nodes. So at the end we only need to traverse the leaf nodes recursively and union all of their data sets to obtain the final result set. Once the resulting DataFrame triggers computation, the physical plan is generated automatically and executed on the Spark cluster; here the computation is of course triggered by the insert statement.
Finally, here is the physical plan (Physical Plan), along with the monitoring view during execution:
== Physical Plan ==
*Union
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#210, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#211, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#212]
: +- *MapElements <function1>, obj#209: com.bj58.decision.OutputData
: +- *Filter <function1>.apply
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#208: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#224, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#225, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#226]
: +- *MapElements <function1>, obj#223: com.bj58.decision.OutputData
: +- *Filter (<function1>.apply && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#222: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#242, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#243, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#244]
: +- *MapElements <function1>, obj#241: com.bj58.decision.OutputData
: +- *Filter ((((<function1>.apply && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#240: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#254, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#255, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#256]
: +- *MapElements <function1>, obj#253: com.bj58.decision.OutputData
: +- *Filter ((((<function1>.apply && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#252: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#268, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#269, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#270]
: +- *MapElements <function1>, obj#267: com.bj58.decision.OutputData
: +- *Filter ((((<function1>.apply && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#266: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#282, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#283, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#284]
: +- *MapElements <function1>, obj#281: com.bj58.decision.OutputData
: +- *Filter (((((<function1>.apply && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#280: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
:- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#294, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#295, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#296]
: +- *MapElements <function1>, obj#293: com.bj58.decision.OutputData
: +- *Filter (((((<function1>.apply && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply) && <function1>.apply)
: +- *DeserializeToObject createexternalrow(id#29.toString, jr_age#30.toString, xinan_age#31.toString, jl_age#32.toString, umc_age#33.toString, yc_age#34.toString, staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, DoubleType, true), vector#28, None).array, true, false), StructField(id,StringType,true), StructField(jr_age,StringType,true), StructField(xinan_age,StringType,true), StructField(jl_age,StringType,true), StructField(umc_age,StringType,true), StructField(yc_age,StringType,true), StructField(vector,ArrayType(DoubleType,true),true)), obj#292: org.apache.spark.sql.Row
: +- *Exchange RoundRobinPartitioning(300)
: +- *Project [coalesce(userid#121, userid#27) AS id#29, cast(jr_age#22 as string) AS jr_age#30, cast(xinan_age#23 as string) AS xinan_age#31, cast(jl_age#24 as string) AS jl_age#32, cast(umc_age#25 as string) AS umc_age#33, cast(yc_age#26 as string) AS yc_age#34, vector#28]
: +- *SortMergeJoin [userid#121], [userid#27], FullOuter
: :- *SortAggregate(key=[userid#121], functions=[max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, jr_age#22, xinan_age#23, jl_age#24, umc_age#25, yc_age#26])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#121, 500, false)
: : +- *SortAggregate(key=[userid#121], functions=[partial_max(CASE WHEN (ds#2 = jinrong) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = xinan) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = jianli) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = umc) THEN feature_value#1 ELSE null END), partial_max(CASE WHEN (ds#2 = yingcai) THEN feature_value#1 ELSE null END)], output=[userid#121, max#507, max#508, max#509, max#510, max#511])
: : +- *Sort [userid#121 ASC NULLS FIRST], false, 0
: : +- *Union
: : :- *Project [cast(user_id#39L as string) AS userid#121, birthday#43 AS feature_value#1, xinan AS ds#2]
: : : +- *Filter ((((((((isnotnull(user_id#39L) && isnotnull(birthday#43)) && isnotnull(cast(birthday#43 as int))) && (cast(birthday#43 as int) > 19200101)) && (cast(birthday#43 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#43, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (cast(user_id#39L as string) = 0)) && (cast(cast(user_id#39L as string) as bigint) > 0)) && NOT cast(user_id#39L as string) RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#43, user_id#39L], HiveTableRelation `hdp_ubu_xxzl_defaultdb`.`dw_realname_userinfo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#37, access_party_id#38, user_id#39L, name_md5#40, idno_md5#41, gender#42, birthday#43, address#44], [dt#45], [isnotnull(dt#45), (dt#45 = 20210321)]
: : :- *Project [wb_id#46 AS userid#5, birthday#3 AS feature_value#6, jinrong AS ds#7]
: : : +- *Filter ((((((isnotnull(rank1#4) && isnotnull(birthday#3)) && (rank1#4 = 1)) && isnotnull(cast(birthday#3 as int))) && (cast(birthday#3 as int) > 19200101)) && isnotnull(unix_timestamp(birthday#3, yyyyMMdd, Some(Asia/Shanghai)))) && (cast(birthday#3 as int) < 20170101))
: : : +- *Window [row_number() windowspecdefinition(wb_id#46, bill_no#50 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank1#4], [wb_id#46], [bill_no#50 DESC NULLS LAST]
: : : +- *Sort [wb_id#46 ASC NULLS FIRST, bill_no#50 DESC NULLS LAST], false, 0
: : : +- *Exchange hashpartitioning(wb_id#46, 500, false)
: : : +- *Project [wb_id#46, CASE WHEN ((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) THEN substring(idcard#47, 7, 8) WHEN ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))) THEN concat(19, substring(idcard#47, 7, 6)) END AS birthday#3, bill_no#50]
: : : +- *Filter ((((isnotnull(wb_id#46) && (((length(idcard#47) = 18) && isnotnull(cast(substring(idcard#47, 7, 8) as int))) || ((length(idcard#47) = 15) && isnotnull(cast(substring(idcard#47, 7, 6) as int))))) && NOT (wb_id#46 = 0)) && (cast(wb_id#46 as bigint) > 0)) && NOT wb_id#46 RLIKE ^0.*)
: : : +- *HiveTableScan [bill_no#50, idcard#47, wb_id#46], HiveTableRelation `hdp_jinrong_qiangui_defaultdb`.`itf_58haojie_user_baihang`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [wb_id#46, idcard#47, name#48, phone#49, bill_no#50]
: : :- *Project [userid#51, CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END AS feature_value#10, umc AS ds#11]
: : : +- *Filter (((((((((isnotnull(userid#51) && (length(birthday#66) > 5)) && (((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) || (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1))) && isnotnull(cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int))) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) > 19200101)) && (cast(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END as int) < 20170101)) && isnotnull(unix_timestamp(CASE WHEN ((instr(trim(birthday#66, None), CST) > 0) || (instr(trim(birthday#66, None), CDT) > 0)) THEN from_unixtime(unix_timestamp(trim(birthday#66, None), EEEMMMddHH:mm:sszzzyyyy, Some(Asia/Shanghai)), yyyyMMdd, Some(Asia/Shanghai)) WHEN (length(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0)) > 1) THEN regexp_replace(regexp_extract(birthday#66, ^d{4}-d{2}-d{2}, 0), -, ) END, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#51 = 0)) && (cast(userid#51 as bigint) > 0)) && NOT userid#51 RLIKE ^0.*)
: : : +- *HiveTableScan [birthday#66, userid#51], HiveTableRelation `hdp_teu_dpd_defaultdb`.`ds_umc_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#51, register_name#52, register_email#53, register_mobile#54, verified_mobile#55, verified_realname#56, verified_business#57, reg_time#58, reg_ip#59, reg_cityid#60, reg_platform#61, locked#62, md5_realname#63, nick_name#64, sex#65, birthday#66, md5_qq#67, md5_msn#68, md5_mobile#69, md5_phone#70, postzip#71, face#72, verified_face#73, address#74, ... 17 more fields], [ds#92], [isnotnull(ds#92), (ds#92 = umc)]
: : :- *Project [userid#93, concat(regexp_replace(feature_value#95, -, ), 01) AS feature_value#12, jianli AS ds#13]
: : : +- *Filter ((((((((((isnotnull(userid#93) && isnotnull(feature_key#94)) && (feature_key#94 = zp_avg_birthday_rsm_val_1d)) && (length(feature_value#95) = 7)) && isnotnull(cast(regexp_replace(feature_value#95, -, ) as int))) && (cast(regexp_replace(feature_value#95, -, ) as int) > 192001)) && (cast(regexp_replace(feature_value#95, -, ) as int) < 201701)) && isnotnull(unix_timestamp(concat(regexp_replace(feature_value#95, -, ), 01), yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#93 = 0)) && (cast(userid#93 as bigint) > 0)) && NOT userid#93 RLIKE ^0.*)
: : : +- *FileScan parquet hdp_jinrong_tech_dw.stf_taie_feature_topic[userid#93,feature_key#94,feature_value#95,service_type#98,period#99,task#100,dt#101] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/resultdata/hdp_jinrong_tech_dw/..., PartitionCount: 1, PartitionFilters: [isnotnull(service_type#98), isnotnull(task#100), isnotnull(dt#101), isnotnull(period#99), (dt#10..., PushedFilters: [IsNotNull(userid), IsNotNull(feature_key), EqualTo(feature_key,zp_avg_birthday_rsm_val_1d), Not(..., ReadSchema: struct<userid:string,feature_key:string,feature_value:string>
: : +- *Project [userid#14, feature_value#15, ds#16]
: : +- *Filter (count(1)#117L = 1)
: : +- *SortAggregate(key=[userid#102], functions=[max(birthday#103), count(1)], output=[userid#14, feature_value#15, ds#16, count(1)#117L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Exchange hashpartitioning(userid#102, 500, false)
: : +- *SortAggregate(key=[userid#102], functions=[partial_max(birthday#103), partial_count(1)], output=[userid#102, max#166, count#167L])
: : +- *Sort [userid#102 ASC NULLS FIRST], false, 0
: : +- *Project [userid#102, birthday#103]
: : +- *Filter ((((((((isnotnull(birthday#103) && isnotnull(userid#102)) && isnotnull(cast(birthday#103 as int))) && (cast(birthday#103 as int) > 19200101)) && (cast(birthday#103 as int) < 20170101)) && isnotnull(unix_timestamp(birthday#103, yyyyMMdd, Some(Asia/Shanghai)))) && NOT (userid#102 = 0)) && (cast(userid#102 as bigint) > 0)) && NOT userid#102 RLIKE ^0.*)
: : +- *FileScan parquet hdp_jinrong_tech_ods.user_chinahr_bir_20181231[userid#102,birthday#103,dt#104] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[viewfs://58-cluster/home/hdp_jinrong_tech/warehouse/hdp_jinrong_tech_dw/d..., PartitionCount: 1, PartitionFilters: [isnotnull(dt#104), (dt#104 = 20181231)], PushedFilters: [IsNotNull(birthday), IsNotNull(userid), Not(EqualTo(userid,0))], ReadSchema: struct<userid:string,birthday:string>
: +- *Sort [userid#27 ASC NULLS FIRST], false, 0
: +- *Exchange hashpartitioning(userid#27, 500, true)
: +- *Project [UDF:des_encoder(userid#105, decode) AS userid#27, UDF:vector_trans(applistall#110, 4399游戏盒|代练通|b612咔叽|wifi万能钥匙|掌上wegame|翼支付|个人所得税|掌上英雄联盟|比心|饿了么|三国杀|知到|inshot-视频编辑|手机淘宝|美图秀秀|节奏大师|识货|部落冲突|qq|前程无忧51job|百度极速版|美柚|一甜相机|虎牙直播|迷你世界|云班课|轻颜相机|tt语音|火山小视频|驾考宝典|买单吧|宝宝树孕育|一起小学学生| 360手机卫士|家长通|同桌游戏|vsco|意见反馈|picsart美易照片编辑|哔哩哔哩|爱奇艺|一起作业学生|安全教育平台|小猿口算|熊猫优选|y2002电音|滴滴车主|百度网盘|dnf助手|作业帮|剪映|毒|微光|纳米盒|qq音乐|欢乐斗地主|汽车之家|uu加速器|最右|steam|内涵段子|小天才|百度手机助手|店长直聘|美篇|好看视频|uki|云课堂智慧职教|一键锁屏|蓝墨云班课|360手机助手|学习通|掌上飞车|抖音插件|支付宝|萤石云视频|全民小视频|小红书|手机营业厅|yy|铃声多多|腾讯动漫|来分期|快手|发现精彩|今日头条极速版|捷信金融|交易猫|faceu激萌|天天酷跑|蘑菇街|中国大学mooc|斗米|同花顺|分期乐|企鹅电竞|一起学|学而思网校|抖音短视频|快手直播伴侣|芒果tv|酷狗音乐|平安口袋银行|应用宝|绝地求生刺激战场|网易云音乐|易班|qq飞车|好游快爆|微博|王者营地|影视大全|球球大作战|boss直聘|皮皮虾|作业盒子小学|天天p图|小盒家长|皮皮搞笑|u净|全球购骑士特权|小猿搜题|mix|momo陌陌|掌上穿越火线|今日头条|超级课程表|穿越火线:枪战王者|掌通家园|学小易|王者荣耀|pubgmobile|工银融e联|快手小游戏|知乎|小恩爱|心悦俱乐部|平安普惠|百度贴吧|优酷|keep|美团|得物(毒)|百词斩|找靓机|晓黑板|到梦空间|美团外卖|智联招聘|华为主题动态引擎|情侣空间|抖音火山版|闲鱼|中国建设银行|今日校园|玩吧|相册管家|便签|智学网|浦发信用卡|熊猫直播|交管12123|掌上生活|腾讯新闻|猫咪|迅雷|探探|腾讯欢乐麻将|qq同步助手|凤凰新闻|平安好车主|和平精英|比心陪练|优酷视频|快猫|狼人杀|第五人格|快看漫画|斗鱼直播|途虎养车|搜狗输入法|西瓜视频|快影|人人视频|学习强国|倒数日|韩剧tv|音遇|soul|菜鸟裹裹|云电脑|无他相机|迅游手游加速器|租号玩|平安金管家|qq安全中心|360清理大师|葫芦侠|王者荣耀助手|taptap|全民k歌|腾讯视频|唱鸭|掌上道聚城|百度, \|) AS vector#28]
: +- *HiveTableScan [applistall#110, userid#105], HiveTableRelation `hdp_jinrong_tech_dw`.`dw_taie_tb_app_action_applist_info`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [userid#105, lastdate#106, applistlast#107, applist180d#108, applist365d#109, applistall#110], [daystr#111], [isnotnull(daystr#111), (daystr#111 = 20210321)]
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).id, true, false) AS id#472, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.bj58.decision.OutputData, true]).featureValue, true, false) AS featureValue#473, assertnotnull(input[0, com.bj58.decision.OutputData, true]).confidenceScore AS confidenceScore#474]
+- *MapElements <function1>, obj#471: com.bj58.decision.OutputData
+- *DeserializeToObject createexternalrow(id#385.toString, newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize, predictedLabel#454.toString, StructField(id,StringType,true), StructField(probability,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true), StructField(predictedLabel,StringType,true)), obj#470: org.apache.spark.sql.Row
+- *Project [id#385, UDF(_probability#389) AS probability#406, UDF(knownotnull(UDF(_probability#389))) AS predictedLabel#454]
+- *Scan ExistingRDD[label#383,features#384,id#385,classIndex#386,indexedFeatures#387,_rawPrediction#388,_probability#389]
DAG execution graph:
Stage execution graph:
4. Optimizing the computation
1. Optimizing the parsing and evaluation of condition expressions
To parse the various condition expressions defined by users, I introduced Alibaba's open-source QLExpress, which can parse all kinds of mathematical expressions and also supports overriding and customizing operators. QLExpress project source: https://github.com/alibaba/QL...
When I first used it, performance was not great; it turned out to be caused by how my own code used it. Part of that code is shown below:
lazy val runner: ExpressRunner = {
  val runner = new ExpressRunner()
  /**
   * Override the built-in `==` operator:
   * before the change, `null == null` returned true;
   * after the change, `null == null` returns false.
   */
  runner.replaceOperator("==", new EqualsOperator())
  runner
}

// Variable names referenced by the rule expression
lazy val fields: Array[String] = runner.getOutVarNames(ruleText)

lazy val leftDF: DataFrame = data.filter(r => isLeft(r))
lazy val rightDF: DataFrame = data.filter(!isLeft(_))

def isLeft(row: Row): Boolean = {
  val context = new DefaultContext[String, AnyRef]
  fields.foreach(field =>
    context.put(field, row.getAs(field).asInstanceOf[AnyRef])
  )
  try {
    // errorList = null, isCache = true, isTrace = false
    runner.execute(ruleText, context,
      null, true, false)
      .asInstanceOf[Boolean]
  } catch {
    case _: Exception =>
      throw new RuntimeException(
        s"`isLeft` method Exception, condition expression compute failed !")
  }
}
With the initial implementation it is easy to see that every record passed ruleText into execute and re-parsed the expression, which naturally drags performance down. The fix is simple: enable the expression cache, so the rule text is parsed only once and the compiled result is reused for every row.
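As a minimal, self-contained sketch of that fix (the object name ExpressCacheSketch, the sample values and the rule text are only illustrative; it relies on the same five-argument execute overload used above, whose fourth argument is the cache flag):
import com.ql.util.express.{DefaultContext, ExpressRunner}

object ExpressCacheSketch {
  def main(args: Array[String]): Unit = {
    val runner = new ExpressRunner()
    // Sample rule in the same form as the rules in AgeGroupRule.json
    val ruleText = "jr_age != null"
    val samples = Seq[AnyRef]("19900101", "19851230")

    samples.foreach { value =>
      val context = new DefaultContext[String, AnyRef]
      context.put("jr_age", value)
      // isCache = true: the parsed instruction set for `ruleText` is cached
      // and reused, instead of being rebuilt for every call (i.e. every row)
      val matched = runner.execute(ruleText, context, null, true, false)
        .asInstanceOf[Boolean]
      println(s"jr_age = $value -> $matched")
    }
  }
}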
The second problem is the semantics of ==: under Hive and Spark logic, a comparison with null on either side must return false, yet QLExpress returned true for null == null. So I redefined the == operator myself and replaced the built-in one with runner.replaceOperator("==", new EqualsOperator()). The code is as follows:
package com.yang.express;

import com.ql.util.express.Operator;
import com.ql.util.express.OperatorOfNumber;

/**
 * Overrides the `==` operator of Alibaba's QLExpress.
 * Before the change: null == null returned true.
 * After the change:  null == null returns false.
 *
 * @author yangfan
 * @since 2021/3/19
 * @version 1.0.0
 */
public class EqualsOperator extends Operator {

    @Override
    public Boolean executeInner(Object[] objects) {
        return executeInner(objects[0], objects[1]);
    }

    private Boolean executeInner(Object op1, Object op2) {
        // Any comparison involving null is false, matching Hive/Spark semantics
        if (null != op1 && null != op2) {
            int compareResult;
            if (op1 instanceof Character || op2 instanceof Character) {
                if (op1 instanceof Character && op2 instanceof Character) {
                    return op1.equals(op2);
                }
                if (op1 instanceof Number) {
                    compareResult = OperatorOfNumber.compareNumber((Number) op1, Integer.valueOf((Character) op2));
                    return compareResult == 0;
                }
                if (op2 instanceof Number) {
                    compareResult = OperatorOfNumber.compareNumber(Integer.valueOf((Character) op1), (Number) op2);
                    return compareResult == 0;
                }
            }
            if (op1 instanceof Number && op2 instanceof Number) {
                // Numeric comparison that tolerates mixed numeric types
                compareResult = OperatorOfNumber.compareNumber((Number) op1, (Number) op2);
                return compareResult == 0;
            } else {
                return op1.equals(op2);
            }
        } else {
            return false;
        }
    }
}
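A quick usage sketch (the object name EqualsOperatorSketch is only for illustration, and it assumes the replaced operator is invoked even when both operands are null, which is exactly the behaviour the change targets): register the operator on the runner as in the lazy val runner shown earlier, and null == null now evaluates to false.
import com.ql.util.express.{DefaultContext, ExpressRunner}
import com.yang.express.EqualsOperator

object EqualsOperatorSketch {
  def main(args: Array[String]): Unit = {
    val runner = new ExpressRunner()
    // Swap in the operator above so `==` follows Hive/Spark null semantics
    runner.replaceOperator("==", new EqualsOperator())

    val context = new DefaultContext[String, AnyRef]
    context.put("a", null)
    context.put("b", null)
    // With the default operator this evaluated to true; now it is false
    println(runner.execute("a == b", context, null, true, false))
  }
}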
2. Optimizing the array vector passed into XGBoost
This is covered in detail in another article: https://segmentfault.com/a/11...
3. Optimizing the XGBoost array vector into a sparse array
In production we found that the array vectors XGBoost consumes took up a great deal of memory during data transfer, so two optimizations were applied:
1. Convert the dense array into a sparse array
2. Serialize the in-memory sparse array to a string for transfer
Here is part of the code before and after the optimization:
Before:
val labeledData = data.map(l => {
  LabeledPoint(0d,
    Vectors.dense(
      l.getAs[mutable.WrappedArray[Double]]("vector").toArray
    ),
    l.getAs[String]("id"))
})(Encoders.product[LabeledPoint])
After:
val labeledData = data.map(l => {
  /*LabeledPoint(0d,
    Vectors.dense(
      l.getAs[mutable.WrappedArray[Double]]("vector").toArray
    ),
    l.getAs[String]("id"))*/
  /**
   * @version 2021/3/29 1.0.2 Arrays are inefficient to ship around during
   *          data transfer, so the vector is stored as a sparse array and
   *          passed as a string until the point of use.
   */
  val sparseVector = l.getAs[String]("vector")
  val vector = sparseJson2Array(sparseVector)
  LabeledPoint(0d, Vectors.dense(vector), l.getAs[String]("id"))
})(Encoders.product[LabeledPoint])

// Template array reused across rows: allocated once with the dense length,
// then cloned and filled with the non-zero entries of each sparse vector
var arr0: Array[Double] = _

def sparseJson2Array(sparseJson: String): Array[Double] = try {
  implicit val formats: DefaultFormats.type = DefaultFormats
  val spArr = JsonMethods.parse(sparseJson).extract[Array[Array[Double]]]
  // The first element carries the dense length; the rest are [index, value] pairs
  if (null == arr0) arr0 = new Array[Double](spArr(0)(1).toInt)
  val arr1 = arr0.clone()
  spArr.tail.foreach(c => arr1(c(0).toInt) = c(1))
  arr1
} catch {
  case _: Exception =>
    throw new RuntimeException(
      s"`sparseJson2Array` method Exception, `$sparseJson` can not parse to array !")
}
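For completeness, here is a minimal sketch of the producer side of this format. It is an assumption inferred from sparseJson2Array above: the first element stores the dense length in its second slot (the 0 in its first slot is a guess, since only the second slot is read back), and every following element is an [index, value] pair. The helper name array2SparseJson and the use of json4s jackson Serialization are hypothetical, not part of the project.
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

object SparseVectorSketch {

  // Hypothetical helper: dense Array[Double] -> sparse JSON string of the shape
  // [[0, length], [index, value], ...] that sparseJson2Array above can read back.
  def array2SparseJson(dense: Array[Double]): String = {
    implicit val formats: DefaultFormats.type = DefaultFormats
    val header = Array(0d, dense.length.toDouble)
    val nonZero = dense.zipWithIndex.collect {
      case (v, i) if v != 0d => Array(i.toDouble, v)
    }
    Serialization.write(header +: nonZero)
  }

  def main(args: Array[String]): Unit = {
    val dense = Array(0d, 0d, 1d, 0d, 2.5)
    println(array2SparseJson(dense)) // [[0.0,5.0],[2.0,1.0],[4.0,2.5]]
  }
}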
After these optimizations the job runs about twice as fast and memory consumption drops to roughly 30% of what it was. Stage execution graph:
5. Project summary
Project source code: https://github.com/dafanpangg...
To be updated~