
This post shares content from the Shanghai Flink Meetup: a case study of Tencent Data Lake handling tens of billions of records per day.

This article is compiled from "Real-Time Data Ingestion into the Lake at the Tens-of-Billions Scale," a talk given by Chen Junjie, senior engineer on Tencent's Data Lake R&D team, at the Shanghai Flink Meetup on April 17. It covers:

  1. Introduction to Tencent Data Lake
  2. Landing tens-of-billions-scale data scenarios
  3. Future plans
  4. Summary

GitHub address
https://github.com/apache/flink
Everyone is welcome to like Flink and give the project a star~

1. Introduction to Tencent Data Lake

[Figure: overall architecture of the Tencent Data Lake platform]

As the figure shows, the platform is quite large, covering data ingestion, upper-level analysis, intermediate management (such as task management, analysis management, and engine management), down to the Table Format at the lowest level.

2. Landing tens-of-billions-scale data scenarios

1. Traditional platform architecture

[Figure: traditional Lambda and Kappa architectures]

As shown in the figure above, traditional platform architectures come in essentially two forms: the Lambda architecture and the Kappa architecture:

  • In the Lambda architecture, batch and streaming are separate, so there are two sets of clusters to operate and maintain: one for Spark/Hive and one for Flink. This causes several problems:

    • The first is high operation and maintenance cost;
    • The second is development cost: the same business logic has to be written twice, once in Spark and once in Flink or SQL, which is not particularly friendly to data analysts.
  • The second is the Kappa architecture: essentially a message queue transporting data to the bottom layer, with analysis done downstream. Its main characteristic is speed; being based on Kafka, it offers a degree of real-time performance.

Both architectures have their own advantages and disadvantages, but the biggest problem is that their storage may be inconsistent, which fragments the data pipeline. Our platform has now adopted Iceberg; the following explains the problems we encountered and how we solved them, scenario by scenario.

2. Scenario 1: Mobile QQ security data ingestion into the lake

[Figure: Mobile QQ security data ingestion pipeline]

Ingesting Mobile QQ security data into the lake is a very typical scenario.

In the current business scenario, messages from the TubeMQ queue are landed by Flink into an Iceberg ODS table; Flink then joins them with user dimension tables to build a wide table, which is queried from COS and may be analyzed further in BI scenarios.

This pipeline looks unremarkable, but consider that the Mobile QQ user dimension table has 2.8 billion rows and the message queue delivers tens of billions of records per day, so it faces some real challenges.
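
For reference, here is a minimal sketch of the landing step using the open-source Flink Iceberg connector. The TubeMQ source is stubbed out with in-memory rows and the table path is illustrative; this is not Tencent's internal code.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class OdsIngestJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Iceberg commits are driven by checkpoints, so the checkpoint interval
    // directly controls commit frequency (and hence small-file pressure).
    env.enableCheckpointing(60_000L);

    // Stand-in for the real TubeMQ source described in the talk.
    DataStream<RowData> rows = env.fromElements(
        (RowData) GenericRowData.of(
            StringData.fromString("uid-1"), StringData.fromString("payload")));

    // Illustrative table location; a catalog-based TableLoader also works.
    TableLoader ods = TableLoader.fromHadoopTable("hdfs://nn/warehouse/ods/qq_security");

    FlinkSink.forRowData(rows)
        .tableLoader(ods)
        .append();

    env.execute("qq-security-ods-ingest");
  }
}
```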

  • Small file challenges

    1. Flink writers generate small files

      Flink writes without a shuffle, so data is scattered across the writers out of order, producing many small files.

    2. Latency requirements

      The checkpoint interval is short and the commit interval small, which further amplifies the small-file problem.

    3. Small file explosion

      Within a few days, metadata files and data files exploded at the same time, putting enormous pressure on the cluster.

    4. Merging small files amplifies the problem again

      To solve the small-file problem, we launched a compaction Action, which in turn generated even more files.

    5. Deletion cannot keep up

      Deleting snapshots and orphan files requires scanning a huge number of files, putting great pressure on the NameNode.

  • Solutions

    1. Flink synchronous merge

      • Add small-file merge operators;
      • Add an automatic snapshot cleanup mechanism, controlled by two properties (see the sketch after this list):

        1)snapshot.retain-last.nums

        2)snapshot.retain-last.minutes

    2. Spark asynchronous merge

      • Add a background service that merges small files and deletes orphan files;
      • Add small-file filtering logic so that small files are deleted gradually;
      • Add per-partition merge logic to avoid generating too many delete files at once and causing task OOM.
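
The two snapshot retention properties above are custom knobs described in the talk rather than upstream Iceberg settings. Assuming they are exposed as ordinary table properties, configuring them might look like this minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotRetentionConfig {
  public static void main(String[] args) {
    // Illustrative table location.
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/ods/qq_security");

    // Property names come from the talk; their semantics and the values
    // below are assumptions, not upstream Iceberg configuration.
    table.updateProperties()
        .set("snapshot.retain-last.nums", "100")   // keep at most 100 snapshots
        .set("snapshot.retain-last.minutes", "30") // expire snapshots older than 30 minutes
        .commit();
  }
}
```
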
  • Flink synchronous merge

[Figure: Flink synchronous merge flow]

After all the data files are committed, a Commit Result is generated. We use the Commit Result to generate compaction tasks, fan them out concurrently to multiple Task Managers to do the rewrite work, and finally commit the result back to the Iceberg table.

The key, of course, is how the CompactTaskGenerator works. At first we wanted to merge as much as possible, so we scanned the whole table; but the table is very large and full of small files, and a single scan immediately hung the entire Flink job.

So we switched to scanning incrementally after each merge: starting from the last replace operation up to the present, we look only at what has been added in between and pick out the files that match the rewrite strategy.

There are many configurations here, such as how many snapshots must accumulate or how many files are needed before a merge is triggered; users can set these themselves, and we also provide default values so the feature works without any explicit tuning.
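
A minimal sketch of this incremental planning, assuming Iceberg's snapshot history and incremental append scan; the small-file threshold and the fallback when no previous compaction exists are illustrative choices, not Tencent's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

/** Sketch: plan a rewrite over files appended since the last replace (compaction). */
public class CompactTaskGeneratorSketch {
  private static final long SMALL_FILE_BYTES = 32L * 1024 * 1024; // illustrative threshold

  static List<FileScanTask> planRewrite(Table table) throws Exception {
    // Walk the snapshot log to find the most recent "replace" operation,
    // i.e. the last compaction commit.
    long fromSnapshotId = -1L;
    for (Snapshot s : table.snapshots()) {
      if ("replace".equals(s.operation())) {
        fromSnapshotId = s.snapshotId();
      }
    }
    if (fromSnapshotId < 0) {
      // No compaction yet: fall back to the oldest snapshot in the history.
      fromSnapshotId = table.history().get(0).snapshotId();
    }
    long toSnapshotId = table.currentSnapshot().snapshotId();

    // Scan only the increment, not the whole table.
    List<FileScanTask> candidates = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().appendsBetween(fromSnapshotId, toSnapshotId).planFiles()) {
      for (FileScanTask task : tasks) {
        if (task.file().fileSizeInBytes() < SMALL_FILE_BYTES) {
          candidates.add(task); // file matches the rewrite strategy
        }
      }
    }
    return candidates;
  }
}
```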

  • Pitfalls in the Writer

[Figure: Fanout Writer partitioning and bucketing]

In the Fanout Writer, large data volumes can touch many partitions. Mobile QQ data, for example, is partitioned by province and city; since it is still very large after that split, it is bucketed as well. Each Task Manager may then cover many partitions, and because every partition opens its own Writer, the number of Writers explodes and memory runs out.

Here we have done two things:

  • The first is KeyBy support. We key the stream by the user-configured partition so that rows of the same partition are gathered on one Task Manager, which avoids opening so many partition Writers. This approach does introduce some performance loss.
  • The second is an LRU Writer: we maintain a map of Writers in memory and evict the least recently used ones, as sketched below.
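
A minimal sketch of the LRU idea: a generic writer cache built on LinkedHashMap's access-order mode, which closes (and thereby flushes) the least recently used writer once a cap is exceeded. The classes are illustrative, not Tencent's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: keep at most maxOpenWriters per Task Manager, evicting LRU writers. */
class LruWriterCache<K, W extends AutoCloseable> {
  private final int maxOpenWriters;
  private final Map<K, W> writers;

  LruWriterCache(int maxOpenWriters) {
    this.maxOpenWriters = maxOpenWriters;
    // accessOrder=true makes the LinkedHashMap behave as an LRU map.
    this.writers = new LinkedHashMap<K, W>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, W> eldest) {
        if (size() > LruWriterCache.this.maxOpenWriters) {
          try {
            eldest.getValue().close(); // flush and close the coldest writer
          } catch (Exception e) {
            throw new RuntimeException("Failed to close evicted writer", e);
          }
          return true;
        }
        return false;
      }
    };
  }

  /** Returns the writer for a partition, creating it on first access. */
  W writerFor(K partition, Function<K, W> factory) {
    return writers.computeIfAbsent(partition, factory);
  }
}
```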

3. Scenario 2: News platform index analysis

[Figure: news platform index analysis architecture]

Above is the stream-batch unified architecture for the news platform's online index analysis, built on Iceberg. On the left, Spark collects the dimension tables stored on HDFS; on the right is the ingestion system. After collection, Flink performs a window-based join with the dimension table and writes the result into the index stream table.

  • Functions

    • Near-real-time detail layer;
    • Real-time streaming consumption;
    • Streaming MERGE INTO;
    • Multi-dimensional analysis;
    • Offline analysis.
  • Scenario characteristics

    The scenario above has the following characteristics:

    • Order of magnitude: the single index table exceeds 100 billion rows, with 20 million rows per batch and roughly 100 billion rows per day;
    • Latency requirement: end-to-end data visibility within minutes;
    • Data sources: full loads, near-real-time increments, and message streams;
    • Consumption patterns: streaming consumption, batch loading, point lookups, row updates, and multi-dimensional analysis.
  • Challenge: MERGE INTO

    Some users asked for MERGE INTO, so we considered it from three angles:

    • Function: merge the stream table produced by each batch join into the real-time index table for downstream use;
    • Performance: index timeliness requirements are high, so the MERGE INTO must be able to keep up with the upstream batch-consumption window;
    • Ease of use: a Table API? An Action API? Or a SQL API?
  • Solution

    1. First step

      • Design a JoinRowProcessor with reference to Delta Lake;
      • Use Iceberg's WAP (write-audit-publish) mechanism to write temporary snapshots.
    2. Second step

      • Optionally skip the cardinality check;
      • Optionally hash only, without sorting, when writing.
    3. Third step

      • Support the DataFrame API;
      • Support SQL on Spark 2.4;
      • Use the community version on Spark 3.0 (see the sketch after this list).
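
For reference, with the community Iceberg SQL extensions on Spark 3, the resulting statement looks like the sketch below; the catalog, table, and column names are illustrative, and joined_batch stands for the per-batch join result registered as a temp view:

```java
import org.apache.spark.sql.SparkSession;

public class MergeIntoExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("index-merge")
        .getOrCreate();

    // Merge the latest joined batch into the real-time index table.
    // "joined_batch" is assumed to be a temp view of the batch join result.
    spark.sql(
        "MERGE INTO iceberg_catalog.db.index_table t "
      + "USING (SELECT * FROM joined_batch) s "
      + "ON t.doc_id = s.doc_id "
      + "WHEN MATCHED THEN UPDATE SET * "
      + "WHEN NOT MATCHED THEN INSERT *");
  }
}
```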

4. Scenario 3: Advertising data analysis

  • Advertising data mainly has the following characteristics:

    • Magnitude: on the order of a hundred billion records (PB-level) per day, about 2 KB per record;
    • Data source: Spark Streaming incremental ingestion into the lake;
    • Data features: tags keep increasing and the schema keeps changing;
    • Usage: interactive query analysis.
  • Challenges encountered and the corresponding solutions:

    • Challenge 1: the schema is deeply nested, flattening to nearly ten thousand columns, and writes OOM almost immediately.

      Solution: by default each Parquet page size is set to 1 MB, and the page size needs to be set according to the executor memory (see the first sketch after this list).

    • Challenge 2: keeping 30 days of base data blows up the cluster.

      Solution: provide an Action for lifecycle management, distinguishing the file lifecycle from the data lifecycle.


    • Challenge 3: interactive queries.

      Solution (see the second sketch after this list):

      1) column projection;
      2) predicate pushdown.
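
In upstream Iceberg, the Parquet page size is exposed as the write.parquet.page-size-bytes table property, which defaults to 1 MB. A minimal sketch of shrinking it for a very wide flattened schema (the path and the chosen value are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.hadoop.HadoopTables;

public class PageSizeTuning {
  public static void main(String[] args) {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/ads/events"); // illustrative path

    // With ~10,000 flattened columns, every open column buffers roughly one
    // page in memory, so a smaller page size bounds writer memory usage.
    table.updateProperties()
        .set(TableProperties.PARQUET_PAGE_SIZE_BYTES, String.valueOf(64 * 1024)) // 64 KB
        .commit();
  }
}
```

Column projection and predicate pushdown both come essentially for free when querying Iceberg from Spark: select only the needed columns and filter on partition or well-clustered columns, and the scan prunes data files via partition values and column-level statistics. The table and column names below are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InteractiveQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ads-interactive-query")
        .getOrCreate();

    // Column projection: only the selected columns are read from Parquet.
    // Predicate pushdown: the filter is pushed into the Iceberg scan.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .load("hdfs://nn/warehouse/ads/events") // illustrative path
        .select("ad_id", "impressions", "dt")
        .where("dt = '2021-04-17' AND ad_id = 42");

    df.show();
  }
}
```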

3. Future plans

Our future plans are divided mainly into the kernel side and the platform side.

1. Kernel side

On the kernel side, we have the following plans:

  • More data ingestion

    • Incremental ingestion into the lake;
    • V2 Format support;
    • Row Identity support.
  • Faster queries

    • Index support;
    • Alluxio acceleration layer support;
    • MOR optimization.
  • Better data governance

    • Data governance Action;
    • SQL Extension support;
    • Better metadata management.

2. Platform side

We have the following plans on the platform side:

  • Data governance

    • Service-oriented metadata cleanup;
    • Service-oriented data governance.
  • Incremental ingestion support

    • Spark consuming CDC into the lake;
    • Flink consuming CDC into the lake.
  • Metrics, monitoring, and alerting

    • Write metrics;
    • Small-file monitoring and alerting.

4. Summary

After large-scale production use and practice, we have reached three conclusions:

  • Availability: through real-world use across multiple business lines, we have confirmed that Iceberg can withstand tens of billions, even hundreds of billions, of records per day.
  • Ease of use: the barrier to entry is still relatively high, and more work is needed before users can adopt it easily.
  • Scenario support: Iceberg currently supports fewer ingestion scenarios than Hudi and still lacks incremental reads; the community needs to work together to fill these gaps.
