原文连接:https://segmentfault.com/a/1190000022783071
背景:
数据表字段有IP,IP省份,IP城市,最新访问时间,假如目前一共有100W的IP我第一次初始化的时候完成初始化表插入,然后每天都有5W左右的IP活跃,然后我要更新这5W个IP的记录到初始化的表,他的IP省份如果有变更也要更新保留最新的数据,更新最新的访问时间等。
所有操作均在hive表上
1.需要2个表 数据拉链表,每日更新数据表
DDL语句如下
每日更新表
create table dm.day_tbl(
ip string comment'ip',
country string comment'国家',
province string comment'省份',
city string comment'城市',
day string comment'day',
)
comment '每日更新表'
row format delimited
fields terminated by '\u0001'
stored as parquet;
数据拉链表
create table dm.zipper_tbl(
ip string comment'ip',
country string comment'国家',
province string comment'省份',
city string comment'城市',
t_start_date string comment'start_date',
t_end_date string comment'end_date'
)
comment '数据拉链表'
row format delimited
fields terminated by '\u0001'
stored as parquet;
ETL语句如下
INSERT OVERWRITE TABLE dm.zipper_tbl
SELECT * FROM
(
SELECT zipper_tbl.ip,
zipper_tbl.country,
zipper_tbl.province,
zipper_tbl.city,
zipper_tbl.t_start_date,
CASE
WHEN zipper_tbl.t_end_date = '9999-12-31' AND day_tbl.ip IS NOT NULL THEN date_add(day_tbl.day,-1)
ELSE zipper_tbl.t_end_date
END AS t_end_date
FROM dm.zipper_tbl AS zipper_tbl
LEFT JOIN dm.day_tbl AS day_tbl
ON zipper_tbl.ip = day_tbl.ip
UNION ALL
SELECT union_day_tbl.ip,
union_day_tbl.country,
union_day_tbl.province,
union_day_tbl.city,
union_day_tbl.day AS t_start_date,
'9999-12-31' AS t_end_date
FROM dm.day_tbl AS union_day_tbl
) AS main_tbl WHERE main_tbl.t_end_date = "9999-12-31"
上面ETL是参考常见的拉链表,最后过滤一下时间,就可以得到全量最新的数据
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。