In this article, Yu Zhaojing explains why Bilibili (Station B) chose a Flink + Hudi data lake solution and the optimizations built on top of it. The main topics are:
- Pain points of the traditional offline data warehouse
- Data lake technical solution
- Hudi job stability guarantees
- Data lake ingestion practice
- Benefits of the incremental data lake platform
- Community contributions
- Future development and thinking
1. Pain Points of Traditional Offline Data Warehouse
1. Pain points
The ingestion process of Bilibili's data warehouse used to be roughly as follows:
Under this architecture, the following core pain points have arisen:
- After large-scale log data lands on HDFS, it can only be queried and processed after the partition is archived in the early morning;
- Synchronizing large volumes of RDS data also has to wait for the early-morning partition archive before processing; generating the current day's data then requires sorting, deduplicating, and joining against the previous day's partition;
- Data can only be read at partition granularity, which causes a lot of redundant IO in scenarios such as stream splitting.
In summary:
- Scheduling starts late;
- Merging is slow;
- Data is read repeatedly.
2. Thinking about the pain points
Scheduling starts late
Idea: since Flink writes the ODS layer in near real time, there is a clear notion of file increments. File-based incremental synchronization lets the cleaning, enrichment, and splitting logic run incrementally, so data is processed before the ODS partition is archived; in theory, data latency then depends only on the processing time of the last batch of files.
Merging is slow
Idea: now that reads can be incremental, merging can be incremental too; incremental reads combined with the data lake's capabilities enable incremental merging.
Repeated reads
Idea: the main cause of repeated reads is that partition granularity is too coarse, accurate only to the hour or day. We need finer-grained data organization; with Data Skipping down to the field level, queries can be served efficiently.
3. Solution: Magneto, a Hudi-based incremental data lake platform
The ingestion process based on Magneto is as follows:
Flow
- Use streaming to unify the offline and real-time ETL pipelines
Organizer
- Reorganize data to speed up queries
- Support compaction of incremental data
Engine
- The computing layer uses Flink, and the storage layer uses Hudi
Metadata
- Consolidate the SQL logic for table computation
- Standardize the Table Format computation paradigm
2. Data Lake Technical Solution
1. Choosing between Iceberg and Hudi
1.1 Comparison of technical details
1.2 Comparison of community activity
Statistics as of 2021-08-09
1.3 Summary
The comparison can be roughly divided into the following major dimensions:
Support for Append
Append was the primary scenario Iceberg was designed for and has been optimized accordingly. Hudi added Append support in version 0.9, and in most scenarios the gap with Iceberg is now small; the 0.10 release keeps optimizing it, and performance is very close to Iceberg's.
Support for Upsert
Upsert was the primary scenario Hudi was designed for; it has clear advantages over Iceberg's design in performance and file count, and the compaction process and logic are exposed through well-abstracted interfaces. Iceberg's Upsert support started later, and the community solution still lags Hudi significantly in performance and small-file handling.
Community activity
Compared with the Iceberg community, the Hudi community is significantly more active. Thanks to that, Hudi's feature set has opened up a clear gap over Iceberg's.
Weighing all of this, we chose Hudi as our data lake component and continue to optimize the features we need on top of it (better Flink integration, clustering support, etc.).
2. Choosing Flink + Hudi as the write path
There are three main reasons why we chose Flink to integrate with Hudi:
- We maintain part of the Flink engine ourselves, and it supports real-time computation for the whole company. For cost reasons we do not want to maintain two compute engines at once, especially since our internal Spark version also carries many custom changes.
- The Spark + Hudi integration mainly offers two index options, and both have drawbacks (see the hedged config sketch after this list):
  - Bloom Index: when writing, each Spark task lists all files and reads the Bloom filter data written in the file footers, which puts frightening pressure on an HDFS cluster that is already under heavy internal load.
  - HBase Index: this achieves O(1) index lookups, but introduces an external dependency that makes the whole solution heavier.
- We need to interface with Flink's incremental processing framework.
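As a rough illustration of the trade-off above, not Station B's production setup: in a Spark + Hudi write job the index is selected through Hudi write configs such as `hoodie.index.type`. The table name, fields, and paths below are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiIndexChoiceSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-index-sketch")
                .getOrCreate();

        // Placeholder source; in practice this is the upstream dataset to upsert.
        Dataset<Row> df = spark.read().parquet("hdfs:///tmp/demo_input");

        df.write().format("hudi")
                .option("hoodie.table.name", "demo_table")
                .option("hoodie.datasource.write.recordkey.field", "id")
                .option("hoodie.datasource.write.precombine.field", "ts")
                // BLOOM: each write task lists files and reads Bloom filters from
                // footers, which stresses HDFS at scale.
                // HBASE: O(1) lookups, but requires an external HBase cluster
                // (plus hoodie.index.hbase.* connection configs).
                .option("hoodie.index.type", "BLOOM")
                .mode(SaveMode.Append)
                .save("hdfs:///tmp/demo_table");
    }
}
```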
3. Optimization of Flink + Hudi integration
3.1 The Flink integration solution in Hudi 0.8
In response to the problems exposed by the Hudi 0.8 integration, Station B worked with the community to optimize and improve it.
3.2 Bootstrap State cold start
Background: support starting a Flink job that writes into an existing Hudi table, so that the pipeline can be switched from Spark on Hudi to Flink on Hudi.
Original solution:
Problem: every task processes the full data set and then keeps only the HoodieKeys belonging to the current task in its state.
Optimized solution:
- When each Bootstrap operator initializes, it loads only the BaseFiles and log files for the fileIds assigned to the current task;
- The recordKeys in the BaseFiles and log files are assembled into HoodieKeys and sent via keyBy to the BucketAssignFunction, which stores the HoodieKeys as an index in its state (see the sketch after this list).
Effect: separating the bootstrap logic into its own operator makes index loading scalable, and loading speed improves by N times (depending on the parallelism).
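Purely as an illustration of the keyBy redistribution idea, not Hudi's actual BootstrapOperator or BucketAssignFunction: the sketch below keys loaded record locations by record key so that each downstream subtask only keeps its own slice of the index in Flink state; all class and field names are placeholders.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical record: a HoodieKey-like pair plus the file group it was found in.
class BootstrapEntry {
    public String recordKey;     // record key read from a BaseFile or log file
    public String partitionPath; // partition path of the record
    public String fileId;        // file group the record currently lives in
}

public class BootstrapKeyBySketch {
    // bootstrapStream is produced by the bootstrap operator, each subtask scanning
    // only the BaseFiles/log files of the fileIds assigned to it.
    public static DataStream<BootstrapEntry> buildIndex(DataStream<BootstrapEntry> bootstrapStream) {
        return bootstrapStream
                // Redistribute by record key so each downstream subtask only holds
                // the slice of the index it will later need for tagging records.
                .keyBy(e -> e.recordKey)
                .process(new KeyedProcessFunction<String, BootstrapEntry, BootstrapEntry>() {
                    private transient ValueState<String> fileIdState;

                    @Override
                    public void open(Configuration parameters) {
                        fileIdState = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("record-location", String.class));
                    }

                    @Override
                    public void processElement(BootstrapEntry entry, Context ctx,
                                               Collector<BootstrapEntry> out) throws Exception {
                        // Store the known location as the index entry; incoming records
                        // with the same key can later be routed to this file group.
                        fileIdState.update(entry.fileId);
                        out.collect(entry);
                    }
                });
    }
}
```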
3.3 Checkpoint consistency optimization
Background: the original writer has data consistency problems in extreme cases.
Original solution:
Problem: notifyCheckpointComplete is not part of the checkpoint (CK) lifecycle, so there are cases where the CK succeeds but the instant is never committed, which leads to data loss.
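A minimal sketch of where this gap can arise, assuming a hypothetical writer that buffers records per checkpoint and commits the Hudi instant in notifyCheckpointComplete; class and method names are placeholders, not Hudi's actual writer:

```java
import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified writer, only to show the ordering problem.
public class InstantCommitSketch<T> extends RichSinkFunction<T>
        implements CheckpointedFunction, CheckpointListener {

    private final List<T> buffer = new ArrayList<>();

    @Override
    public void invoke(T value, Context context) {
        buffer.add(value); // collect records for the in-flight instant
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Flush buffered records to Hudi data/log files for this instant.
        // This happens inside the checkpoint, so it is covered by CK success.
        flushToHudiFiles(buffer);
        buffer.clear();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // The instant commit happens here, AFTER the checkpoint has already succeeded.
        // If the job fails between CK success and this call, the instant is never
        // committed and the flushed data is lost; this is the consistency gap.
        commitHudiInstant(checkpointId);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // Recovery logic omitted in this sketch.
    }

    private void flushToHudiFiles(List<T> records) { /* placeholder */ }

    private void commitHudiInstant(long checkpointId) { /* placeholder */ }
}
```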
Optimized solution:
3.4 Append mode support and optimization
Background: Append mode serves data sets that never need updates. Indexing, merging, and other unnecessary steps can be skipped, which greatly improves write efficiency.
Main modifications (a hedged Flink SQL sketch follows this list):
- Support FlushBucket writing a new file on every flush, to avoid read and write amplification;
- Add a parameter to disable the internal rate limiting of BoundedInMemoryQueue; in Flink Append mode it is enough to set the Queue size and the bucket buffer to the same size;
- Develop a custom compaction plan for the small files generated by each checkpoint;
- With the above development and optimizations, performance in the pure Insert scenario reaches 5 times that of the original COW writer.
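A minimal Flink SQL sketch of an append-only Hudi write, assuming the Hudi Flink bundle is on the classpath; the schema, path, and source table are placeholders, and exact option names can vary slightly across Hudi versions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiAppendSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Placeholder upstream source; in practice this is the cleaned ODS stream.
        tEnv.executeSql(
                "CREATE TABLE ods_log_source (" +
                "  uid BIGINT," +
                "  event STRING," +
                "  ts TIMESTAMP(3)," +
                "  dt STRING" +
                ") WITH ('connector' = 'datagen')");

        // Append-only Hudi sink: pure inserts skip index lookup and merging.
        tEnv.executeSql(
                "CREATE TABLE ods_log_hudi (" +
                "  uid BIGINT," +
                "  event STRING," +
                "  ts TIMESTAMP(3)," +
                "  dt STRING" +
                ") PARTITIONED BY (dt) WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///warehouse/ods_log_hudi'," +
                "  'table.type' = 'COPY_ON_WRITE'," +
                // append mode: write.operation = insert instead of the default upsert
                "  'write.operation' = 'insert'" +
                ")");

        tEnv.executeSql("INSERT INTO ods_log_hudi SELECT uid, event, ts, dt FROM ods_log_source");
    }
}
```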
3. Hudi Job Stability Guarantees
1. Integrating Flink metrics into Hudi
By reporting metrics at key points in the pipeline, the running status of the whole job can be observed clearly:
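As a generic illustration of how such metrics can be reported from a Flink operator (the metric name and wrapper class below are placeholders, not Hudi's actual reporting code):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Hypothetical wrapper operator that counts records flowing into the Hudi writer.
public class WriteMetricsSketch<T> extends RichMapFunction<T, T> {

    private transient Counter writeRecords;

    @Override
    public void open(Configuration parameters) {
        // Registered metrics are picked up by whichever reporter the cluster configures
        // (e.g. Prometheus), so key stages of the ingestion job become observable.
        writeRecords = getRuntimeContext().getMetricGroup().counter("hudiWriteRecords");
    }

    @Override
    public T map(T value) {
        writeRecords.inc();
        return value;
    }
}
```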
2. Data verification in the system
3. Data verification outside the system
4. Data Lake Ingestion Practice
1. CDC data ingestion
1.1 TiDB ingestion solution
Because current open source solutions cannot directly export data from TiDB, and straight SELECT queries would affect database stability, ingestion is split into a full plus incremental approach (a hedged sketch of the full-load step follows this list):
- Start TiCDC and write TiDB's CDC data to the corresponding Kafka topic;
- Use the Dumpling component provided by TiDB, with some source-code changes, to write the full dump directly to HDFS;
- Start Flink and write the full data into Hudi via Bulk Insert;
- Consume the incremental CDC data and write it to Hudi through Flink in MOR mode.
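A minimal sketch of the full-load step, under the assumptions that the Dumpling output on HDFS is in a format Flink's filesystem connector can read (CSV here) and that all schemas and paths are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TidbFullLoadSketch {
    public static void main(String[] args) {
        // Batch mode for the one-off full load.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Full dump exported by Dumpling onto HDFS (format assumed to be CSV here).
        tEnv.executeSql(
                "CREATE TABLE tidb_full_dump (" +
                "  id BIGINT," +
                "  name STRING," +
                "  update_time TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 'hdfs:///dump/tidb/demo_table'," +
                "  'format' = 'csv'" +
                ")");

        // Hudi MOR table; the later streaming job upserts TiCDC data into the same path.
        tEnv.executeSql(
                "CREATE TABLE hudi_demo_table (" +
                "  id BIGINT," +
                "  name STRING," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///warehouse/hudi_demo_table'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'write.operation' = 'bulk_insert'" +
                ")");

        tEnv.executeSql("INSERT INTO hudi_demo_table SELECT * FROM tidb_full_dump");
    }
}
```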
1.2 MySQL ingestion solution
The MySQL ingestion solution uses the open source Flink CDC directly, writing both full and incremental data to a Kafka topic through one Flink job (a hedged sketch follows this list):
- Start the Flink CDC job to import both the full data and the CDC data into the Kafka topic;
- Start a Flink batch job to read the full data and write it to Hudi via Bulk Insert;
- Switch to a Flink streaming job to write the incremental CDC data to Hudi via MOR.
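For illustration only, a simplified single-job variant that reads the MySQL snapshot plus binlog with the flink-cdc connector and upserts into a Hudi MOR table; the production pipeline described above stages the data through Kafka instead, and all hostnames, credentials, and schemas below are placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MysqlCdcToHudiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // MySQL CDC source: snapshot (full) phase followed by binlog (incremental) phase.
        tEnv.executeSql(
                "CREATE TABLE mysql_orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql.example.com'," +
                "  'port' = '3306'," +
                "  'username' = 'reader'," +
                "  'password' = '******'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

        // Hudi MOR sink: upserts merge the changelog; async compaction bounds read cost.
        tEnv.executeSql(
                "CREATE TABLE hudi_orders (" +
                "  id BIGINT," +
                "  status STRING," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///warehouse/hudi_orders'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'write.operation' = 'upsert'," +
                "  'compaction.async.enabled' = 'true'" +
                ")");

        tEnv.executeSql("INSERT INTO hudi_orders SELECT * FROM mysql_orders");
    }
}
```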
2. Incremental ingestion of log data
- Implement HDFSStreamingSource and ReaderOperator to synchronize ODS data files incrementally, and reduce list requests against HDFS by writing ODS partition index information (a generic file-source sketch follows this list);
- Support a transform SQL configuration that lets users apply custom transformation logic, including but not limited to dimension table joins, custom UDFs, and splitting streams by field;
- Implement the Append mode of Flink on Hudi, which greatly increases the write rate for data that does not need merging.
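HDFSStreamingSource and ReaderOperator are in-house components and are not shown here; as a rough stand-in for the idea of incrementally picking up newly landed ODS files, vanilla Flink's FileSource can monitor a directory continuously (the path and polling interval are placeholders).

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalOdsFilesSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously discover new files under the ODS path instead of waiting
        // for the partition to be archived; each discovered file is read once.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(),
                        new Path("hdfs:///ods/app_log/"))
                .monitorContinuously(Duration.ofMinutes(1))
                .build();

        DataStream<String> odsLines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "ods-file-source");

        // Downstream: cleaning / enrichment / splitting, then an append-only Hudi sink.
        odsLines.print();

        env.execute("incremental-ods-files-sketch");
    }
}
```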
5. Benefits of the Incremental Data Lake Platform
- Flink incremental synchronization greatly improves the timeliness of data synchronization: partition ready time moves up from 2:00~5:00 to 00:30;
- With Hudi as the storage engine, users get multiple COW- and MOR-based query modes and can pick the one that fits their scenario instead of only querying after the partition is archived;
- Compared with the previous T+1 Binlog merge approach in the data warehouse, Hudi-based automatic compaction lets users query Hive as if it were a MySQL snapshot;
- Resources are saved significantly: stream-splitting jobs that previously read the same data repeatedly now read it only once, saving about 18,000 CPU cores.
6. Community Contributions
The above optimizations have been merged into the Hudi community. Station B will keep investing in Hudi and grow together with the community.
Some of the core PRs:
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1923
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1924
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1954
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2019
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2052
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2084
https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2342
7. Future Development and Thinking
- Support stream-batch unification on the platform, with unified real-time and offline logic;
- Push the data warehouse toward incremental processing, realizing the full Hudi ODS -> Flink -> Hudi DW -> Flink -> Hudi ADS pipeline;
- Support Hudi Clustering on Flink to exploit Hudi's strength in data organization, and explore Z-Order and other techniques for accelerating multi-dimensional queries;
- Support inline clustering.