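Background
The company needs to forecast daily website traffic for the coming days from the daily traffic of a past period, so that operations and ops staff can be warned and prepare before a traffic peak arrives.
This is a typical time series forecasting problem. There are many approaches to it: rule-based methods, machine learning, classical statistical modeling, and so on.
This article covers the machine learning approach.
Since our data processing at work runs mainly on the Spark stack, Spark ML is used here as well. There are of course many other machine learning libraries, and sklearn would do the job just as well. In fact, for the data analysis stage I used pandas, numpy, and sklearn, which was more efficient.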
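Data analysis
The raw data is simple, with only two columns: PV and date.
Plot it as a line chart and take a look:
The chart shows that the data before and after 2019-07 differ sharply overall, which was caused by a business change. Since the goal is to predict PV for the next few days, the forecast has to rest on the current business; older data is just noise and is discarded.
Take the most recent half year of data and look again:
This data is relatively stable.
Overall, the data is periodic with a period of one week, and PV is higher on workdays than on weekends.
Locally, holidays are peaks (but not every holiday is a peak; this again depends on the business, so you need to compile a holiday table for your own business).
In addition, there are peaks on non-holidays, probably due to trending events (February 2020, during the epidemic, had many of them). Traffic peaks caused by trending events cannot be predicted, so we try to reduce the influence of such samples; the data processing below therefore includes a "hotspot removal" step.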
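Model selection
Linear regression is chosen as the machine learning model here, not because it is optimal, but because it is the natural first choice for trend prediction and makes a good baseline. Other models can be tried later to improve on it.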
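Feature extraction
1. Time features
The analysis above shows that weekends, workdays, and holidays strongly affect PV, so these values can serve as features:
day_of_week // day of the week, 1 to 7 (with Spark SQL's dayofweek, 1 = Sunday and 7 = Saturday)
is_weekend // whether the day is a weekend, 0 or 1; Saturday and Sunday are weekends
is_holiday // whether the day is a holiday, 0 or 1; the holiday table is maintained according to the business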
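As a rough illustration of the three time features, here is a plain-Java sketch that does not depend on Spark. It assumes the Spark SQL numbering for day_of_week (1 = Sunday, 7 = Saturday); the class name, method names, and the two-entry holiday set are made up for the example.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TimeFeatures {
    // Spark SQL numbering: 1 = Sunday, 2 = Monday, ..., 7 = Saturday.
    static int dayOfWeek(LocalDate day) {
        // java.time uses 1 = Monday ... 7 = Sunday, so shift by one.
        return day.getDayOfWeek().getValue() % 7 + 1;
    }

    static int isWeekend(LocalDate day) {
        int d = dayOfWeek(day);
        return (d == 1 || d == 7) ? 1 : 0;
    }

    static int isHoliday(LocalDate day, Set<LocalDate> holidays) {
        return holidays.contains(day) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical holiday table; a real one depends on the business.
        Set<LocalDate> holidays = new HashSet<>(Arrays.asList(
                LocalDate.parse("2020-05-01"), LocalDate.parse("2020-05-04")));
        for (String s : new String[]{"2019-11-15", "2019-11-16", "2019-11-17", "2020-05-01"}) {
            LocalDate day = LocalDate.parse(s);
            System.out.println(s + " day_of_week=" + dayOfWeek(day)
                    + " is_weekend=" + isWeekend(day)
                    + " is_holiday=" + isHoliday(day, holidays));
        }
    }
}
```

For 2019-11-15 (a Friday) this gives day_of_week=6 and is_weekend=0.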
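2. Mean features
Since the data is periodic:
Monday's PV is related to the average over all Mondays
Tuesday's PV is related to the average over all Tuesdays
...
So the average for each day_of_week can serve as a feature.
Likewise, weekends and holidays have similar mean features.
day_of_week_avg // mean PV grouped by day_of_week
is_weekend_avg // mean PV grouped by is_weekend
is_holiday_avg // mean PV grouped by is_holiday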
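The idea of a grouped mean that is written back onto every row can be sketched in plain Java as follows. The class and method names and the PV numbers are made up for illustration; the article itself computes these features with Spark SQL joins.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class GroupMean {
    /** For each row, returns the mean of pv over all rows that share the row's key. */
    static double[] groupMean(int[] keys, double[] pv) {
        Map<Integer, double[]> acc = new HashMap<>(); // key -> {sum, count}
        for (int i = 0; i < keys.length; i++) {
            double[] a = acc.computeIfAbsent(keys[i], k -> new double[2]);
            a[0] += pv[i];
            a[1] += 1;
        }
        double[] out = new double[keys.length];
        for (int i = 0; i < keys.length; i++) {
            double[] a = acc.get(keys[i]);
            out[i] = a[0] / a[1];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dayOfWeek = {6, 7, 1, 6, 7, 1};      // two weeks of Fri, Sat, Sun
        double[] pv = {100, 60, 50, 120, 80, 70};  // made-up PV values
        // prints [110.0, 70.0, 60.0, 110.0, 70.0, 60.0]
        System.out.println(Arrays.toString(groupMean(dayOfWeek, pv)));
    }
}
```

Passing is_weekend or is_holiday as the key array would give the other two mean features in the same way.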
3. Median features
Analogous to the mean features, there are median features:
day_of_week_med // median PV grouped by day_of_week
is_weekend_med // median PV grouped by is_weekend
is_holiday_med // median PV grouped by is_holiday
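One reason to keep a median next to the mean is that a single traffic spike moves the mean much more than the median. The sketch below shows this on made-up numbers. It uses the textbook median, whereas the article's code uses Spark's percentile_approx, so values on real data may differ slightly.

```java
import java.util.Arrays;

final class MedianFeature {
    static double median(double[] values) {
        if (values.length == 0) throw new IllegalArgumentException("empty input");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // Odd count: middle element; even count: average of the two middle elements.
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Made-up PV for five Fridays, one of them a spike.
        double[] fridays = {100, 110, 105, 400, 95};
        System.out.println("mean   = " + mean(fridays));   // 162.0
        System.out.println("median = " + median(fridays)); // 105.0
    }
}
```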
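4. Lag features
The mean and median features reflect the overall picture, but a given day's PV likely depends on the PV of the most recent N days.
Which N to use? That takes some experimenting. Here N ranges from 1 to 14, giving a set of features:
lag_1 // shifted by 1 day, i.e. yesterday's PV
lag_2 // shifted by 2 days, i.e. PV two days earlier
...
lag_7 // shifted by 7 days, PV on the same weekday last week
...
lag_14 // shifted by 14 days, PV on the same weekday two weeks ago
The data after shifting: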
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| pv| day| lag_1| lag_2| lag_3| lag_4| lag_5| lag_6| lag_7| lag_8| lag_9| lag_10| lag_11| lag_12| lag_13| lag_14|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|15156440|2019-11-01| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
|12633297|2019-11-02|15156440| null| null| null| null| null| null| null| null| null| null| null| null| null|
|11818845|2019-11-03|12633297|15156440| null| null| null| null| null| null| null| null| null| null| null| null|
|15130911|2019-11-04|11818845|12633297|15156440| null| null| null| null| null| null| null| null| null| null| null|
|14332734|2019-11-05|15130911|11818845|12633297|15156440| null| null| null| null| null| null| null| null| null| null|
|15972959|2019-11-06|14332734|15130911|11818845|12633297|15156440| null| null| null| null| null| null| null| null| null|
|16366371|2019-11-07|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null| null| null| null| null| null|
|16969708|2019-11-08|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null| null| null| null| null|
|12983425|2019-11-09|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null| null| null| null|
|11759009|2019-11-10|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null| null| null|
|13700888|2019-11-11|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null| null|
|15490684|2019-11-12|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null| null|
|15275479|2019-11-13|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null| null|
|14978239|2019-11-14|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| null|
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|
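The shift shown in the table above can be reproduced with a small plain-Java sketch. The class and method names are hypothetical, and the input is the first five PV values from the table.

```java
import java.util.Arrays;

final class LagFeatures {
    /** out[i] is values[i - k], or null when there is no row k steps back. Assumes k >= 1. */
    static Long[] lag(long[] values, int k) {
        Long[] out = new Long[values.length];
        for (int i = k; i < values.length; i++) {
            out[i] = values[i - k];
        }
        return out;
    }

    public static void main(String[] args) {
        // PV for 2019-11-01 .. 2019-11-05, taken from the table above.
        long[] pv = {15156440L, 12633297L, 11818845L, 15130911L, 14332734L};
        // prints [null, 15156440, 12633297, 11818845, 15130911]
        System.out.println(Arrays.toString(lag(pv, 1)));
        // prints [null, null, 15156440, 12633297, 11818845]
        System.out.println(Arrays.toString(lag(pv, 2)));
    }
}
```

The leading nulls are why the first 14 rows cannot be used for training once lag_14 is included.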
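Feature processing code:
// Imports used by the snippets in this section (DateUtils is assumed to come from commons-lang3).
// `spark` (the SparkSession), `schema` (the CSV schema) and `holidays`
// (a comma-separated string of yyyy-MM-dd dates) are defined elsewhere.
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.stream.Collectors;
import org.apache.commons.lang3.time.DateUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().schema(schema).option("header", "false").csv("file:///Users/sun/Downloads/pv_data.csv");
df.createOrReplaceTempView("tmp");
// Keep only the recent, stable period
df = spark.sql("select * from tmp where day>='2019-11-01'");
df.createOrReplaceTempView("tmp");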
// Append the dates to be predicted:
int predDays = 7;
String lastDay = spark.sql("select max(day) as day from tmp").first().getAs("day");
Date lastDate = DateUtils.parseDate(lastDay, new String[]{"yyyy-MM-dd"});
String sql = "select pv, day from tmp";
for (int i = 0; i < predDays; i++) {
    Date date = DateUtils.addDays(lastDate, (i + 1));
    String day = new SimpleDateFormat("yyyy-MM-dd").format(date);
    // One placeholder row per future day, with pv = 0
    sql += " union (select 0, '" + day + "' from tmp limit 1)";
}
sql += " order by day asc";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
// Lag features:
int lagStart = 1;
int lagEnd = 14;
sql = "select *, ";
for (int i = lagStart; i <= lagEnd; i++) {
    // lag(pv, i) over the whole table ordered by day: the PV from i days earlier
    sql += "lag(pv, " + i + ") over (partition by null order by day) as lag_" + i;
    if (i <= lagEnd - 1)
        sql += ", ";
}
sql += " from tmp";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
// Time features (dayofweek returns 1 for Sunday and 7 for Saturday):
sql = "select *, " +
        "dayofweek(day) as day_of_week, " +
        "case when dayofweek(day)==1 or dayofweek(day)==7 then 1 else 0 end as is_weekend, " +
        "case when day in (" + Arrays.stream(holidays.split(",")).map(s -> "'" + s + "'").collect(Collectors.joining(",")) + ") then 1 else 0 end as is_holiday " +
        "from tmp";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
// Mean features:
// Note: the pv = 0 placeholder rows for the days to be predicted are part of tmp,
// so they are included in these aggregates.
sql = "select tmp.*, t1.day_of_week_avg, t2.is_weekend_avg, t3.is_holiday_avg from tmp " +
        "left join (select day_of_week, avg(pv) as day_of_week_avg from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
        "left join (select is_weekend, avg(pv) as is_weekend_avg from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
        "left join (select is_holiday, avg(pv) as is_holiday_avg from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
// Median features (percentile_approx at 0.5 gives an approximate median):
sql = "select tmp.*, t1.day_of_week_med, t2.is_weekend_med, t3.is_holiday_med from tmp " +
        "left join (select day_of_week, percentile_approx(pv, 0.5) as day_of_week_med from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
        "left join (select is_weekend, percentile_approx(pv, 0.5) as is_weekend_med from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
        "left join (select is_holiday, percentile_approx(pv, 0.5) as is_holiday_med from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
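Hotspot removal (outlier handling)
As mentioned earlier, some samples are not holidays but have very high PV, possibly because of trending events.
There are roughly two cases: 1. operations ran campaigns that drove a traffic surge; 2. trending social events (think Weibo hot search).
In fact, further data analysis shows that the main cause was PV fluctuation brought about indirectly by the epidemic.
Unlike holidays, trending events follow no pattern; they are somewhat random and sudden. To keep things simple, we apply a fixed policy to handle outliers.
The policy here: if the PV on a non-holiday exceeds 1.5 times the median for that day of the week (day_of_week_med), replace it with that median. The code: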
// Outlier handling:
// A non-holiday sample whose traffic exceeds 1.5x the median is treated as an outlier
// (possibly caused by a trending event) and is set to the median
df = spark.sql("select *, " +
        "if(is_holiday=0 and pv>day_of_week_med*1.5, day_of_week_med, pv) as y " +
        "from tmp order by day asc");
// Drop rows containing nulls, i.e. the first 14 days, whose lag columns are incomplete
df = df.na().drop();
df.createOrReplaceTempView("tmp");
// Lag features: treat 0 as missing and replace it with day_of_week_avg
sql = "select *, ";
for (int i = lagStart; i <= lagEnd; i++) {
    sql += "case when lag_" + i + ">0 then lag_" + i + " else day_of_week_avg end as lag_" + i + "_fix";
    if (i <= lagEnd - 1)
        sql += ", ";
}
sql += " from tmp";
df = spark.sql(sql);
df.createOrReplaceTempView("tmp");
// Save the data:
df.write().option("header", "true").csv("file:///Users/sun/Downloads/df");
Sample of the resulting data:
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| pv| day| lag_1| lag_2| lag_3| lag_4| lag_5| lag_6| lag_7| lag_8| lag_9| lag_10| lag_11| lag_12| lag_13| lag_14|day_of_week|is_weekend|is_holiday| day_of_week_avg| is_weekend_avg| is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med| y| lag_1_fix| lag_2_fix| lag_3_fix| lag_4_fix| lag_5_fix| lag_6_fix| lag_7_fix| lag_8_fix| lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7| 2.047994621387283E7| 18144580| 17914823| 17128256|16900067| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7| 1.1759009E7| 1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7| 2.047994621387283E7| 15728601| 15623119| 17128256|15668745| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7| 1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7| 2.047994621387283E7| 15245430| 15623119| 17128256|15102373| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911| 2| 0| 0|1.9976350222222224E7|2.1140308681818184E7| 2.047994621387283E7| 16614896| 17914823| 17128256|16475787| 1.5102373E7| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734| 3| 0| 0|2.0061554769230768E7|2.1140308681818184E7| 2.047994621387283E7| 17121601| 17914823| 17128256|16946753| 1.6475787E7| 1.5102373E7| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|
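The outlier rule from this section (on a non-holiday, a PV above 1.5 times the day-of-week median is replaced by that median) can be checked on a few numbers with a small plain-Java sketch. The class and method names are hypothetical; the median 18144580 is the Friday value from the table above, and the 30,000,000 spike is made up.

```java
final class HotspotClip {
    /** Returns the training label y for one row, following the rule in this section. */
    static double clip(double pv, double dayOfWeekMedian, boolean isHoliday) {
        boolean outlier = !isHoliday && pv > dayOfWeekMedian * 1.5;
        return outlier ? dayOfWeekMedian : pv;
    }

    public static void main(String[] args) {
        double fridayMedian = 18144580;  // day_of_week_med for day_of_week = 6
        // Threshold is 1.5 * 18144580 = 27216870.
        System.out.println(clip(16900067, fridayMedian, false)); // below threshold: kept
        System.out.println(clip(30000000, fridayMedian, false)); // above threshold: replaced by the median
        System.out.println(clip(30000000, fridayMedian, true));  // holiday: kept as is
    }
}
```

Holiday peaks are kept on purpose, since the is_holiday feature is meant to explain them.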
Model training
Normalization
The lag_*, *_avg, and *_med features are PV values on the order of tens of millions, so they are rescaled:
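// Normalization: rescale the lag_*, *_avg and *_med features
import org.apache.spark.ml.feature.MinMaxScaler;
import org.apache.spark.ml.feature.VectorAssembler;

// Note: the raw lag_N columns are assembled here; the lag_N_fix columns built above are not used.
VectorAssembler vectorAssembler = new VectorAssembler()
        .setInputCols(new String[]{"lag_1", "lag_2", "lag_3", "lag_4", "lag_5", "lag_6", "lag_7", "lag_8", "lag_9", "lag_10", "lag_11", "lag_12", "lag_13", "lag_14", "day_of_week_avg", "is_weekend_avg", "is_holiday_avg", "day_of_week_med", "is_weekend_med", "is_holiday_med"})
        .setOutputCol("feature_vec");
df = vectorAssembler.transform(df);
MinMaxScaler scaler = new MinMaxScaler()
        .setInputCol("feature_vec")
        .setOutputCol("features");  // the name the training and prediction queries select
df = scaler.fit(df).transform(df);
// Re-register the view so the queries below can see the features column
df.createOrReplaceTempView("tmp");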
VectorAssembler combines Dataset columns into a single Vector column (the algorithm APIs used below require a vector as input).
MinMaxScaler rescales each feature to the [0, 1] range.
Result:
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
| pv| day| lag_1| lag_2| lag_3| lag_4| lag_5| lag_6| lag_7| lag_8| lag_9| lag_10| lag_11| lag_12| lag_13| lag_14|day_of_week|is_weekend|is_holiday| day_of_week_avg| is_weekend_avg| is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med| y| lag_1_fix| lag_2_fix| lag_3_fix| lag_4_fix| lag_5_fix| lag_6_fix| lag_7_fix| lag_8_fix| lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix| feature_vec| features|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7| 18144580| 17914823| 17128256|16900067|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|[1.4978239E7,1.52...|[0.02878242285919...|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7| 15728601| 15623119| 17128256|15668745|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|[1.6900067E7,1.49...|[0.05339190949495...|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7| 15245430| 15623119| 17128256|15102373|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|[1.5668745E7,1.69...|[0.03762452432660...|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911| 2| 0| 0|1.9976350222222224E7|2.1140308681818184E7|2.047994621387283E7| 16614896| 17914823| 17128256|16475787|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|[1.5102373E7,1.56...|[0.03037198967476...|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734| 3| 0| 0|2.0061554769230768E7|2.1140308681818184E7|2.047994621387283E7| 17121601| 17914823| 17128256|16946753|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|[1.6475787E7,1.51...|[0.04795889832547...|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959| 4| 0| 0| 2.172384296153846E7|2.1140308681818184E7|2.047994621387283E7| 17928108| 17914823| 17128256|17422016|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|[1.6946753E7,1.64...|[0.05398973536338...|
|18010112|2019-11-21|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371| 5| 0| 0|2.1671804769230768E7|2.1140308681818184E7|2.047994621387283E7| 17962984| 17914823| 17128256|18010112|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|[1.7422016E7,1.69...|[0.06007559655750...|
|17935725|2019-11-22|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7| 18144580| 17914823| 17128256|17935725|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|[1.8010112E7,1.74...|[0.06760631244495...|
|15623119|2019-11-23|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7| 15728601| 15623119| 17128256|15623119|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|[1.7935725E7,1.80...|[0.06665376836589...|
|14637174|2019-11-24|15623119|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7| 15245430| 15623119| 17128256|14637174|1.5623119E7|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|[1.5623119E7,1.79...|[0.03704027202242...|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
features is the processed feature column.
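For reference, min-max scaling of a single feature column can be sketched in plain Java as follows. The names and numbers are made up. Returning 0.5 for a constant column mirrors what Spark's MinMaxScaler documents for its default [0, 1] range.

```java
import java.util.Arrays;

final class MinMax {
    /** Rescales one feature column to [0, 1] using (x - min) / (max - min). */
    static double[] scale(double[] col) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : col) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            // A constant column has no spread; map it to the middle of the range.
            out[i] = (range == 0) ? 0.5 : (col[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0.0, 0.5, 0.25, 1.0]
        System.out.println(Arrays.toString(scale(new double[]{10, 20, 15, 30})));
    }
}
```

Because min and max are taken over every row passed in, any extreme row stretches the scale for all the others.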
Training
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;

// Training: use the data up to and including lastDay
Dataset<Row> trainDataset = spark.sql("select day, features, pv, y from tmp where day<='" + lastDay + "' order by day asc");
double maxR2 = Double.NEGATIVE_INFINITY;  // so that the first model is always recorded
double bestParam = 0.0D;
LinearRegressionModel bestModel = null;
// Search for the best parameter:
for (int i = 1; i <= 10; i++) {
    LinearRegression lr = new LinearRegression()
            .setLabelCol("y")
            .setFeaturesCol("features")
            .setMaxIter(10000)
            .setRegParam(0.03)             // regularization strength, not a step size
            .setElasticNetParam(0.1 * i);  // L1/L2 mixing ratio, the value being searched
    LinearRegressionModel model = lr.fit(trainDataset);
    // Metrics computed on the training data
    LinearRegressionTrainingSummary trainingSummary = model.summary();
    System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
    System.out.println("r2: " + trainingSummary.r2());
    if (trainingSummary.r2() > maxR2) {
        bestParam = 0.1 * i;
        maxR2 = trainingSummary.r2();
        bestModel = model;
    }
}
System.out.println("best param -> " + bestParam);
System.out.println("best r2 -> " + maxR2);
LinearRegression is used here, tuning mainly the value passed to setElasticNetParam; see the Spark documentation for details on the parameters.
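For orientation, the objective that Spark's documentation gives for elastic-net regularized linear regression has the form below, where λ corresponds to setRegParam (fixed at 0.03 in the code) and α to setElasticNetParam. The symbols w, b, x_i, y_i, and n are notation introduced for this note, and Spark's internal feature standardization is left out.

```latex
\min_{w,\,b}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(w^{\top}x_i + b - y_i\bigr)^2
\;+\;
\lambda\Bigl(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\Bigr),
\qquad \lambda \ge 0,\quad \alpha \in [0, 1]
```

With α = 0 the penalty is pure L2 (ridge) and with α = 1 it is pure L1 (lasso), so the loop moves the penalty from mostly L2 toward pure L1.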
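The search runs from 0.1 to 1.0 and looks for the value that gives the highest r2 on the training data; the model trained with that value is kept as the best model.
In the end, an elasticNetParam of 0.5 was best, with an r2 of 0.7008858790650143.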
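The two metrics printed during the search can be computed by hand as in the plain-Java sketch below, using the usual definitions RMSE = sqrt(mean squared error) and r2 = 1 - SS_res / SS_tot. The class name and the toy numbers are made up for illustration.

```java
final class FitMetrics {
    static double rmse(double[] y, double[] pred) {
        double sumSq = 0;
        for (int i = 0; i < y.length; i++) {
            double e = y[i] - pred[i];
            sumSq += e * e;
        }
        return Math.sqrt(sumSq / y.length);
    }

    static double r2(double[] y, double[] pred) {
        double mean = 0;
        for (double v : y) mean += v;
        mean /= y.length;
        double ssRes = 0;  // squared error of the model
        double ssTot = 0;  // squared error of always predicting the mean
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - pred[i]) * (y[i] - pred[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 3, 4};
        double[] pred = {1.5, 2, 3, 3.5};
        System.out.println("RMSE: " + rmse(y, pred));
        System.out.println("r2: " + r2(y, pred));  // SS_res = 0.5, SS_tot = 5, so r2 = 0.9
    }
}
```

An r2 of about 0.70, as reported above, means the model's squared error on the training data is about 30% of that of always predicting the mean.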
Prediction
Predict the data for the next 7 days:
Dataset<Row> predDataset = spark.sql("select day, features, pv, y from tmp where day>'" + lastDay + "' order by day asc");
bestModel.setPredictionCol("pv_pred");
bestModel.transform(predDataset).show();
The result:
+----------+--------------------+---+---+--------------------+
| day| features| pv| y| pv_pred|
+----------+--------------------+---+---+--------------------+
|2020-04-28|[0.03159798985245...| 0| 0|1.7333708553490087E7|
|2020-04-29|[0.11516156320975...| 0| 0|1.7833363920196097E7|
|2020-04-30|[0.11449520118456...| 0| 0|1.7624262847742468E7|
|2020-05-01|[0.12214671526351...| 0| 0|3.6077728160918914E7|
|2020-05-02|[0.11879605768944...| 0| 0| 1.518647529881512E7|
|2020-05-03|[0.09805043124337...| 0| 0|1.5407320504048364E7|
|2020-05-04|[0.09278448304737...| 0| 0| 3.56043256732697E7|
+----------+--------------------+---+---+--------------------+
May 1 and May 4 are holidays, and traffic peaks are expected on both days.
Author: Sun, engineer at 易企秀