Flink+Hologres&#39; real-time UV precision deduplication best practice for billion-level users

Introduction to Flink+Hologres The best practice of real-time UV precision deduplication for hundreds of millions of users

UV and PV calculations are usually divided into two scenarios because of different business requirements:

Offline calculation scenario: mainly T+1, calculating historical data
Real-time calculation scenario: real-time calculation of daily new data, de-duplication of user tags

For offline computing scenarios, Hologres provides ultra-high base UV calculations based on RoaringBitmap. It only needs to perform the most fine-grained pre-aggregation calculation once, and only generates a most fine-grained pre-aggregation result table, which can achieve sub-second queries. . For specific details, please refer to articles >> 160c195508236a How does Hologres support ultra-high base UV calculation (based on RoaringBitmap implementation)

For real-time computing scenarios, you can use the Flink+Hologres method and based on RoaringBitmap to de-duplicate user tags in real time. In this way, user UV and PV data can be obtained in real-time with fine-grained data, and it is convenient to adjust the minimum statistical window (such as the UV in the last 5 minutes) according to the needs, to achieve a similar real-time monitoring effect, and better display on the big screen and other BI . Compared with deduplication in units of days, weeks, months, etc., it is more suitable for fine-grained statistics on event dates, and through simple aggregation, statistical results in larger time units can also be obtained.

Main idea

Flink converts streaming data into tables and dimension tables for JOIN operations, and then into streaming data. This can use the _ insertIfNotExists _ feature of the Hologres dimension table combined with the auto-increment field achieve efficient uid mapping.
Flink processes the associated result data according to the time window, uses RoaringBitmap to aggregate according to the query dimension, and stores the query dimension and the aggregated uid in the aggregation result table, where the aggregated uid result is put into the RoaringBitmap type field of Hologres.
When querying, similar to the offline method, you can directly query the aggregation result table according to the query conditions, and perform an OR operation on the key RoaringBitmap field and calculate the base number to obtain the corresponding number of users.
The processing flow is shown in the figure below

Best practice

1. Create related basic tables

1) Create table uid\_mapping as uid mapping table, used to map uid to 32-bit int type.

The RoaringBitmap type requires that the user ID must be a 32-bit int type and the denser the better (that is, the user ID is preferably continuous). Many user IDs in common business systems or embedded points are of string type or Long type, so a mapping table needs to be constructed using the uid\_mapping type. The mapping table uses Hologres' SERIAL type (auto-incremented 32-bit int) to realize automatic management and stable mapping of user mapping.
Since it is real-time data, set this table as a row-stored table to improve the QPS of real-time JOIN of Flink dimension tables.

BEGIN;
CREATE TABLE public.uid_mapping (
uid text NOT NULL,
uid_int32 serial,
PRIMARY KEY (uid)
);
--将uid设为clustering_key和distribution_key便于快速查找其对应的int32值
CALL set_table_property('public.uid_mapping', 'clustering_key', 'uid');
CALL set_table_property('public.uid_mapping', 'distribution_key', 'uid');
CALL set_table_property('public.uid_mapping', 'orientation', 'row');
COMMIT;

2) Create the table dws\_app as the basic aggregation table, which is used to store the aggregated results on the basic dimensions.

RoaringBitmap extention needs to be created before using RoaringBitmap, and Hologres instance is also required to be version 0.10

CREATE EXTENSION IF NOT EXISTS roaringbitmap;

For better performance, it is recommended to set the number of shards reasonably based on the amount of data in the basic aggregation table, but it is recommended that the number of shards in the basic aggregation table should not exceed the number of cores of the computing resources. It is recommended to use the following method to set the number of shards through the Table Group

--新建shard数为16的Table Group，
--因为测试数据量百万级，其中后端计算资源为100core，设置shard数为16
BEGIN;
CREATE TABLE tg16 (a int);                             --Table Group哨兵表
call set_table_property('tg16', 'shard_count', '16'); 
COMMIT;

Compared with the offline result table, this result table adds a timestamp field, which is used to implement statistics based on the Flink window period. The result table DDL is as follows:

BEGIN;
create table dws_app(
  country text,
  prov text,
  city text, 
  ymd text NOT NULL,  --日期字段
  timetz TIMESTAMPTZ,  --统计时间戳，可以实现以Flink窗口周期为单位的统计
  uid32_bitmap roaringbitmap, -- 使用roaringbitmap记录uv
  primary key(country, prov, city, ymd, timetz)--查询维度和时间作为主键，防止重复插入数据
);
CALL set_table_property('public.dws_app', 'orientation', 'column');
--日期字段设为clustering_key和event_time_column，便于过滤
CALL set_table_property('public.dws_app', 'clustering_key', 'ymd');
CALL set_table_property('public.dws_app', 'event_time_column', 'ymd');
--等价于将表放在shard数为16的table group
call set_table_property('public.dws_app', 'colocate_with', 'tg16');
--group by字段设为distribution_key
CALL set_table_property('public.dws_app', 'distribution_key', 'country,prov,city');
COMMIT;

2. Flink reads the data in real time and updates the basic aggregation table of dws\_app

For complete example source code, please see alibabacloud-hologres-connectors examples

1) Flink reads the data source (DataStream) in a streaming manner and transforms it into a source table (Table)

//此处使用csv文件作为数据源，也可以是kafka等
DataStreamSource odsStream = env.createInput(csvInput, typeInfo);
// 与维表join需要添加proctime字段，详见https://help.aliyun.com/document_detail/62506.html
Table odsTable =
    tableEnv.fromDataStream(
    odsStream,
    $("uid"),
    $("country"),
    $("prov"),
    $("city"),
    $("ymd"),
    $("proctime").proctime());
// 注册到catalog环境
tableEnv.createTemporaryView("odsTable", odsTable);

2) Associate the source table with the Hologres dimension table (uid\_mapping)

The dimension table uses the insertIfNotExists parameter, that is, it is inserted by itself when the data cannot be queried. The uid\_int32 field can be created using Hologres' serial type auto-increment.

// 创建Hologres维表，其中nsertIfNotExists表示查询不到则自行插入
String createUidMappingTable =
    String.format(
    "create table uid_mapping_dim("
    + "  uid string,"
    + "  uid_int32 INT"
    + ") with ("
    + "  'connector'='hologres',"
    + "  'dbname' = '%s'," //Hologres DB名
    + "  'tablename' = '%s',"//Hologres 表名
    + "  'username' = '%s'," //当前账号access id
    + "  'password' = '%s'," //当前账号access key
    + "  'endpoint' = '%s'," //Hologres endpoint
    + "  'insertifnotexists'='true'"
    + ")",
    database, dimTableName, username, password, endpoint);
tableEnv.executeSql(createUidMappingTable);
// 源表与维表join
String odsJoinDim =
    "SELECT ods.country, ods.prov, ods.city, ods.ymd, dim.uid_int32"
    + "  FROM odsTable AS ods JOIN uid_mapping_dim FOR SYSTEM_TIME AS OF ods.proctime AS dim"
    + "  ON ods.uid = dim.uid";
Table joinRes = tableEnv.sqlQuery(odsJoinDim);

3) Convert the association results into DataStream, process them through the Flink time window, and combine with RoaringBitmap for aggregation

DataStream<Tuple6<String, String, String, String, Timestamp, byte[]>> processedSource =
    source
    // 筛选需要统计的维度（country, prov, city, ymd）
    .keyBy(0, 1, 2, 3)
    // 滚动时间窗口；此处由于使用读取csv模拟输入流，采用ProcessingTime，实际使用中可使用EventTime
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    // 触发器，可以在窗口未结束时获取聚合结果
    .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(1)))
    .aggregate(
    // 聚合函数，根据key By筛选的维度，进行聚合
    new AggregateFunction<
        Tuple5<String, String, String, String, Integer>,
        RoaringBitmap,
        RoaringBitmap>() {
            @Override
            public RoaringBitmap createAccumulator() {
                return new RoaringBitmap();
            }
            @Override
            public RoaringBitmap add(
                Tuple5<String, String, String, String, Integer> in,
                RoaringBitmap acc) {
                // 将32位的uid添加到RoaringBitmap进行去重
                acc.add(in.f4);
                return acc;
            }
            @Override
            public RoaringBitmap getResult(RoaringBitmap acc) {
                return acc;
            }
            @Override
            public RoaringBitmap merge(
                RoaringBitmap acc1, RoaringBitmap acc2) {
                return RoaringBitmap.or(acc1, acc2);
            }
     },
    //窗口函数，输出聚合结果
    new WindowFunction<
        RoaringBitmap,
        Tuple6<String, String, String, String, Timestamp, byte[]>,
        Tuple,
        TimeWindow>() {
            @Override
            public void apply(
                Tuple keys,
                TimeWindow timeWindow,
                Iterable<RoaringBitmap> iterable,
                Collector<
                Tuple6<String, String, String, String, Timestamp, byte[]>> out)
                throws Exception {
                RoaringBitmap result = iterable.iterator().next();
                // 优化RoaringBitmap
                result.runOptimize();
                // 将RoaringBitmap转化为字节数组以存入Holo中
                byte[] byteArray = new byte[result.serializedSizeInBytes()];
                result.serialize(ByteBuffer.wrap(byteArray));
                // 其中 Tuple6.f4(Timestamp) 字段表示以窗口长度为周期进行统计，以秒为单位
                out.collect(
                    new Tuple6<>(
                        keys.getField(0),
                        keys.getField(1),
                        keys.getField(2),
                        keys.getField(3),
                        new Timestamp(
                            timeWindow.getEnd() / 1000 * 1000),
                        byteArray));
        }
    });

4) Write the result table

It should be noted that the RoaringBitmap type in Hologres corresponds to the Byte array type in Flink

// 计算结果转换为表
Table resTable =
    tableEnv.fromDataStream(
        processedSource,
        $("country"),
        $("prov"),
        $("city"),
        $("ymd"),
        $("timest"),
        $("uid32_bitmap"));
// 创建Hologres结果表, 其中Hologres的RoaringBitmap类型通过Byte数组存入
String createHologresTable =
    String.format(
        "create table sink("
        + "  country string,"
        + "  prov string,"
        + "  city string,"
        + "  ymd string,"
        + "  timetz timestamp,"
        + "  uid32_bitmap BYTES"
        + ") with ("
        + "  'connector'='hologres',"
        + "  'dbname' = '%s',"
        + "  'tablename' = '%s',"
        + "  'username' = '%s',"
        + "  'password' = '%s',"
        + "  'endpoint' = '%s',"
        + "  'connectionSize' = '%s',"
        + "  'mutatetype' = 'insertOrReplace'"
        + ")",
    database, dwsTableName, username, password, endpoint, connectionSize);
tableEnv.executeSql(createHologresTable);
// 写入计算结果到dws表
tableEnv.executeSql("insert into sink select * from " + resTable);

3. Data query

When querying, do aggregate calculations according to query dimensions from the basic aggregate table (dws\_app), query bitmap cardinality, and get the number of users under the group by condition

Query the uv of each city in a certain day

--运行下面RB_AGG运算查询，可执行参数先关闭三阶段聚合开关（默认关闭）,性能更好
set hg_experimental_enable_force_three_stage_agg=off  
SELECT  country
        ,prov
        ,city
        ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv
FROM    dws_app
WHERE   ymd = '20210329'
GROUP BY country
         ,prov
         ,city
;

Query the uv of each province in a certain period of time

--运行下面RB_AGG运算查询，可执行参数先关闭三阶段聚合开关（默认关闭）,性能更好
set hg_experimental_enable_force_three_stage_agg=off 
SELECT  country
        ,prov
        ,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv
FROM    dws_app
WHERE   time > '2021-04-19 18:00:00+08' and time < '2021-04-19 19:00:00+08'
GROUP BY country
         ,prov
;

Copyright Notice: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

Flink+Hologres' real-time UV precision deduplication best practice for billion-level users

Main idea

Best practice

1. Create related basic tables

2. Flink reads the data in real time and updates the basic aggregation table of dws\_app

3. Data query

阿里云开发者

引用和评论

福利来了！计算巢支持在已经购买的 ECS 上搭建幻兽帕鲁服务器，支持图形化管理配置

仅3步！打造更精细化的制造业供应链管理体系（内附完整版PDF）

2025年3月中国数据库排行榜：PolarDB夺魁傲群雄，GoldenDB晋位入三强

高端制造业财务数字化怎么做？思迈特提出了新思路

基于xtrabackup的MySQL 8.0物理备份与恢复

医院利用大数据技术开展“模拟审计”自查自纠，大幅减少违规问题

TiDB Chat2Query 深度解析：我们如何打造一款更高效、准确的智能 SQL 生成工具？

Flink+Hologres&#39; real-time UV precision deduplication best practice for billion-level users

Main idea

Best practice

1. Create related basic tables

2. Flink reads the data in real time and updates the basic aggregation table of dws\_app

3. Data query

阿里云开发者

引用和评论

福利来了！计算巢支持在已经购买的 ECS 上搭建幻兽帕鲁服务器，支持图形化管理配置

仅3步！打造更精细化的制造业供应链管理体系（内附完整版PDF）

2025年3月中国数据库排行榜：PolarDB夺魁傲群雄，GoldenDB晋位入三强

高端制造业财务数字化怎么做？思迈特提出了新思路

基于xtrabackup的MySQL 8.0物理备份与恢复

医院利用大数据技术开展“模拟审计”自查自纠，大幅减少违规问题

TiDB Chat2Query 深度解析：我们如何打造一款更高效、准确的智能 SQL 生成工具？

Flink+Hologres' real-time UV precision deduplication best practice for billion-level users