
The author of this article, Yu Zhaojing, explains why Bilibili (Station B) chose a Flink + Hudi data lake solution and the optimizations made to it. The main contents are:

  1. Pain points of the traditional offline data warehouse
  2. Data lake technical solution
  3. Hudi job stability guarantees
  4. Data lake ingestion in practice
  5. Benefits of the incremental data lake platform
  6. Community contribution
  7. Future development and thinking

1. Pain Points of Traditional Offline Data Warehouse

1. Pain points

Previously, data ingestion into Bilibili's data warehouse worked roughly as follows:

img

Under this architecture, the following core pain points have arisen:

  1. After large volumes of data land on HDFS, they can only be queried and processed once the partition has been archived in the early morning;
  2. Large RDS data synchronizations must also wait for the early-morning partition archive before processing, and producing the current day's data requires sorting, deduplicating, and joining against the previous day's partition;
  3. Data can only be read at partition granularity, which causes a lot of redundant IO in scenarios such as stream splitting.

In summary:

  • Scheduling starts late;
  • Merging is slow;
  • Data is read repeatedly.

2. Thinking about pain points

  • Scheduling starts late

    Idea: Since Flink writes to ODS in near real time, there is a clear notion of file increments. File-based incremental synchronization can run the cleaning, enrichment, and splitting logic incrementally, so data can be processed before the ODS partition is archived. In theory, the data delay then depends only on the processing time of the last batch of files.

  • Merging is slow

    Idea: Now that reads can be incremental, merges can be incremental too. Incremental reads can be combined with the capabilities of the data lake to perform incremental merges.

  • Data is read repeatedly

    Idea: The main cause of repeated reads is that partition granularity is too coarse: it is only accurate to the hour or day level. We need to try finer-grained data organization schemes so that Data Skipping can work at the field level and queries become efficient.
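
To make the field-level Data Skipping idea concrete, here is a minimal sketch. It assumes each data file carries min/max statistics for one column; the `FileStats` type, the file names, and the value ranges are hypothetical and are not taken from Magneto or Hudi.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical layout)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(stats, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [s.path for s in stats if s.max_value >= lower and s.min_value <= upper]


stats = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A query for values in [150, 160] has to open only one of the three files.
print(files_to_scan(stats, 150, 160))
```

The more tightly the data is organized around the queried field, the narrower each file's range becomes and the more files a query can skip.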

3. Solution: Magneto, a Hudi-based incremental data lake platform

The following is the ingestion process based on Magneto:

img

  • Flow

    • Use a streaming (flow) approach to unify offline and real-time ETL pipelines
  • Organizer

    • Data reorganization to speed up query
    • Supports compaction of incremental data
  • Engine

    • The computing layer uses Flink, and the storage layer uses Hudi
  • Metadata

    • Distill the SQL logic for table computation
    • Standardize the Table Format computation paradigm

2. Data Lake Technical Solution

1. The choice between Iceberg and Hudi

1.1 Comparison of technical details

img

1.2 Comparison of community activity

Statistics as of 2021-08-09

img

1.3 Summary

The comparison can be roughly divided into the following dimensions:

  • Support for Append

    Append was the primary scenario Iceberg was designed for, and Iceberg is well optimized for it. Hudi added Append mode in version 0.9, and in most scenarios the gap with Iceberg is now small. The current 0.10 version is still being optimized, and its performance is already very close to Iceberg's.

  • Support for Upsert

    Upsert was the primary scenario Hudi was designed for, and its design has clear advantages over Iceberg's in both performance and file count. The Compaction process and logic are exposed through highly abstract interfaces. Iceberg's support for Upsert started late, and the community solution lags significantly behind Hudi in performance and small-file handling.

  • Community activity

    Hudi's community is significantly more active than Iceberg's. Thanks to that, Hudi has opened up a noticeable lead over Iceberg in feature richness.

After weighing these comparisons, we chose Hudi as our data lake component and continue to optimize the features we need on top of it (better Flink integration, Clustering support, and so on).

2. Choosing Flink + Hudi as the write path

There are three main reasons why we chose Flink as the engine for writing to Hudi:

  1. We maintain part of the Flink engine ourselves, and it supports real-time computation for the entire company. For cost reasons, we do not want to maintain two compute engines at the same time, especially since our internal Spark version also carries many in-house changes.
  2. The Spark + Hudi integration offers two main Index solutions, and both have drawbacks:

    • Bloom Index: with the Bloom Index, Spark lists all files in every task when writing and reads the Bloom filter data stored in each file's footer. This would put enormous additional pressure on our HDFS, which is already under heavy load internally.
    • HBase Index: this achieves O(1) index lookups, but it introduces an external dependency, which makes the whole solution heavier.
  3. We need to interface with Flink's incremental processing framework.
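
The trade-off between the two Index styles can be illustrated with a small sketch. It is not Hudi's implementation: the `BloomFilter` class, the file names, and the plain dictionary standing in for an external index such as HBase are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_file_filters(files):
    """files: {file_name: [record keys]} -> {file_name: BloomFilter}."""
    filters = {}
    for name, keys in files.items():
        bloom = BloomFilter()
        for key in keys:
            bloom.add(key)
        filters[name] = bloom
    return filters


def candidate_files(filters, key):
    """Bloom-style lookup: every file's filter has to be consulted."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]


files = {"f1": ["user-1", "user-2"], "f2": ["user-3"], "f3": ["user-4", "user-5"]}
filters = build_file_filters(files)

# Direct key -> file map, standing in for an external index such as HBase.
direct_index = {key: name for name, keys in files.items() for key in keys}

print(candidate_files(filters, "user-3"))  # contains "f2", possibly plus false positives
print(direct_index["user-3"])              # a single lookup
```

The Bloom-style lookup touches metadata for every file, which is where the listing and footer reads come from, while the direct map answers with one lookup but has to be stored and maintained somewhere outside the table.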

3. Optimization of Flink + Hudi integration

3.1 The Flink integration in Hudi 0.8

img

To address the problems exposed by the Hudi 0.8 integration, Bilibili and the community worked together on optimizations and improvements.

3.2 Bootstrap State cold start

Background: allow a Flink write job to start on an existing Hudi table, so that a pipeline can switch from Spark on Hudi to Flink on Hudi.

Original solution:

img

Problem: each task processes the full data set, then selects the HoodieKeys that belong to the current task and stores them in state.

Optimized solution:

img

  • When each Bootstrap Operator is initialized, it loads only the BaseFile and logFile belonging to the fileIds assigned to the current Task;
  • The recordKeys in the BaseFile and logFile are assembled into HoodieKeys, sent to BucketAssignFunction via Key By, and each HoodieKey is stored as an index entry in the state of BucketAssignFunction.

Effect: the Bootstrap function is split out into a separate Operator, which makes index loading scalable. Loading speed improves by a factor of N (depending on the parallelism).
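
The two-stage distribution described above can be sketched as a plain simulation. The function names, the CRC32-based routing, and the sample files are assumptions for illustration only; they do not reproduce how Flink or Hudi actually assign files and keys to tasks.

```python
import zlib
from collections import defaultdict


def stable_hash(value):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(value.encode("utf-8"))


def bootstrap_index(files, load_parallelism, assign_parallelism):
    """files: {file_id: [record keys]}.

    Returns (files_per_loader, state_per_assigner), where each assigner's
    state maps record key -> file_id for the keys routed to it.
    """
    files_per_loader = defaultdict(list)
    state_per_assigner = defaultdict(dict)
    for file_id, record_keys in files.items():
        # Each loader task reads only its own files, not the whole table.
        loader = stable_hash(file_id) % load_parallelism
        files_per_loader[loader].append(file_id)
        for record_key in record_keys:
            # "Key By": the same record key always reaches the same assigner.
            assigner = stable_hash(record_key) % assign_parallelism
            state_per_assigner[assigner][record_key] = file_id
    return dict(files_per_loader), dict(state_per_assigner)


files = {"file-a": ["k1", "k2"], "file-b": ["k3"], "file-c": ["k4", "k5", "k6"]}
loaders, states = bootstrap_index(files, load_parallelism=2, assign_parallelism=3)
print(loaders)
print(states)
```

Because every file is read by exactly one loader, adding loader tasks shrinks the work per task, which is the source of the N-fold speedup.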

3.3 Checkpoint consistency optimization

Background: data consistency problems can occur in extreme cases.

Original solution:

img

Problem: CheckpointComplete is not part of the checkpoint (CK) life cycle. A CK can succeed while the instant is never committed, which leads to data loss.

Optimized solution:

img
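
The optimized design itself is shown only in the figure and is not reproduced here. The sketch below illustrates the failure window and one generic remedy: record the pending instant in checkpoint state and commit it again on restore. The `Timeline` and `Writer` classes and their methods are hypothetical and are not Hudi's or Flink's API.

```python
class Timeline:
    """Stands in for the table's commit timeline."""

    def __init__(self):
        self.committed = []

    def commit(self, instant):
        if instant not in self.committed:  # committing twice is harmless
            self.committed.append(instant)


class Writer:
    def __init__(self, timeline, restored_state=None):
        self.timeline = timeline
        self.pending = None
        if restored_state and restored_state.get("pending"):
            # Remedy: an instant recorded by a successful checkpoint but
            # missing from the timeline is committed again on restore.
            self.timeline.commit(restored_state["pending"])

    def snapshot_state(self, instant):
        """Part of the checkpoint: data for `instant` is flushed and recorded."""
        self.pending = instant
        return {"pending": instant}

    def notify_checkpoint_complete(self):
        """Callback after the checkpoint succeeded; never runs if we crash first."""
        self.timeline.commit(self.pending)
        self.pending = None


timeline = Timeline()
writer = Writer(timeline)
state = writer.snapshot_state("instant-001")  # the checkpoint succeeds ...
# ... but the process crashes before notify_checkpoint_complete() is called.
print(timeline.committed)

recovered = Writer(timeline, restored_state=state)
print(timeline.committed)
```

Without the restore step, the data flushed for `instant-001` would stay uncommitted even though the checkpoint that covers it succeeded.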

3.4 Append mode support and optimization

Background: Append mode supports data sets that never need updates. Unnecessary steps such as indexing and merging can be skipped, which greatly improves write efficiency.

img

Main changes:

  • Support having FlushBucket write a new file every time, to avoid read and write amplification;
  • Add a parameter to turn off the internal rate limiting mechanism of BoundedInMemoryQueue. In Flink Append mode, you only need to set the Queue and the bucket buffer to the same size;
  • Develop a custom Compaction plan for the small files generated by each CK;
  • With the above development and optimization, performance in pure Insert scenarios reaches 5 times that of the original COW.
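
As an illustration of the third change, a compaction plan for the small files that each CK leaves behind could group them by size. This is a minimal sketch under assumed inputs; the function name, the file names, and the target size are hypothetical, and the real plan in Hudi may use different criteria.

```python
def plan_small_file_compaction(file_sizes, target_size):
    """Group small files into batches whose total size stays within target_size.

    file_sizes: {file_name: size}. Files at or above target_size are left
    alone. Returns a list of groups (lists of file names) to rewrite together.
    """
    small = sorted(
        (name for name, size in file_sizes.items() if size < target_size),
        key=lambda name: (file_sizes[name], name),
    )
    groups, current, current_size = [], [], 0
    for name in small:
        size = file_sizes[name]
        if current and current_size + size > target_size:
            if len(current) > 1:  # rewriting a single file gains nothing
                groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups


file_sizes = {"ck1-a": 10, "ck1-b": 20, "ck2-a": 30, "ck2-b": 50, "big": 200}
print(plan_small_file_compaction(file_sizes, target_size=100))
```

Here the three smallest files are merged into one group, while `ck2-b` and `big` are left untouched because merging them would exceed the target or is unnecessary.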

3. Hudi job stability guarantees

1. Hudi integrates Flink Metrics

By reporting metrics at key points in the pipeline, the running status of the whole job can be tracked clearly:

img

img

2. Data verification in the system

img

3. Data verification outside the system

img

4. Data lake ingestion in practice

1. Ingesting CDC data into the lake

1.1 TiDB ingestion solution

No current open source solution directly supports exporting data from TiDB, and running Select queries directly would affect database stability, so the process is split into a full load plus an incremental load:

  1. Start TiCDC and write TiDB's CDC data to the corresponding Kafka topic;
  2. Use the Dumpling component provided by TiDB, with part of its source code modified to support writing directly to HDFS;
  3. Start Flink and write the full data set into Hudi via Bulk Insert;
  4. Consume the incremental CDC data and write it to Hudi through Flink in MOR mode.
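
The full + incremental split works because the change stream is started before the dump, so the two overlap and nothing is missed. A minimal sketch of replaying change events on top of a snapshot follows. It assumes every row and event carries a monotonically increasing version `ts`; the field names and the `upsert`/`delete` event shape are hypothetical, not the TiCDC format.

```python
def build_table(snapshot_rows, cdc_events):
    """Load a full snapshot, then replay CDC events on top of it.

    snapshot_rows: iterable of {"id", "ts", ...}
    cdc_events: iterable of {"op": "upsert" | "delete", "id", "ts", ...}
    """
    table = {row["id"]: row for row in snapshot_rows}  # the bulk insert step
    for event in sorted(cdc_events, key=lambda e: e["ts"]):
        current = table.get(event["id"])
        if current is not None and current["ts"] >= event["ts"]:
            continue  # already reflected in the snapshot, safe to skip
        if event["op"] == "delete":
            table.pop(event["id"], None)
        else:
            table[event["id"]] = {k: v for k, v in event.items() if k != "op"}
    return table


snapshot = [
    {"id": 1, "ts": 5, "name": "a"},
    {"id": 2, "ts": 3, "name": "b"},
]
events = [
    {"op": "upsert", "id": 1, "ts": 4, "name": "old"},  # older than the snapshot
    {"op": "upsert", "id": 2, "ts": 7, "name": "b2"},
    {"op": "upsert", "id": 3, "ts": 8, "name": "c"},
    {"op": "delete", "id": 1, "ts": 9},
]
print(build_table(snapshot, events))
```

Events that predate the snapshot are skipped by the version check, so replaying the overlapping part of the stream does not corrupt the table.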

1.2 MySQL ingestion solution

The MySQL ingestion solution uses the open source Flink-CDC directly, writing both full and incremental data to a Kafka topic through a single Flink job:

  1. Start the Flink-CDC job to import the full data set and the CDC data into the Kafka topic;
  2. Start a Flink Batch job to read the full data set and write it to Hudi through Bulk Insert;
  3. Switch to a Flink Streaming job to write the incremental CDC data to Hudi in MOR mode.

img

2. Incremental ingestion of log data

  • Implement HDFSStreamingSource and ReaderOperator to synchronize ODS data files incrementally, and reduce list requests to HDFS by writing ODS partition index information;
  • Support transform SQL configuration, so users can apply custom transformation logic, including but not limited to dimension table joins, custom UDFs, and stream splitting by field;
  • Implement the Append mode of Flink on Hudi, which greatly increases the write rate for data that does not need to be merged.
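
The incremental file synchronization in the first bullet can be pictured as follows. This is only a sketch: the shape of the partition index, the paths, and the function name are assumptions, and it does not reflect the actual HDFSStreamingSource code.

```python
def discover_new_files(partition_index, processed):
    """Return files listed in the partition index that were not processed yet.

    partition_index: {partition: [file paths in write order]}, a hypothetical
    index written alongside the ODS data so no directory listing is needed.
    processed: set of file paths already handled; updated in place.
    """
    new_files = []
    for partition in sorted(partition_index):
        for path in partition_index[partition]:
            if path not in processed:
                new_files.append(path)
                processed.add(path)
    return new_files


processed = set()
index = {"log_hour=00": ["/ods/00/f1", "/ods/00/f2"]}
print(discover_new_files(index, processed))  # first round: both files

index["log_hour=00"].append("/ods/00/f3")
index["log_hour=01"] = ["/ods/01/f1"]
print(discover_new_files(index, processed))  # second round: only the new files
```

Each round hands downstream operators only the files that arrived since the previous round, so processing can start before the partition is archived.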

img

5. Benefits of the incremental data lake platform

  • With Flink incremental synchronization, the timeliness of data synchronization improves greatly: partition ready time moves forward from 2:00~5:00 to 00:30;
  • The storage engine uses Hudi, which offers users several query modes based on COW and MOR, so different users can choose the mode that suits their scenario instead of simply waiting for the partition to be archived before querying;
  • Compared with the T+1 Binlog merge of the previous data warehouse, Hudi's automatic Compaction lets users query the Hive table as if it were a MySQL snapshot;
  • Resources are saved significantly: stream-splitting jobs that used to require repeated queries now run only once, saving about 18,000 cores.
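
The choice of query mode mentioned in the second bullet can be pictured with a simplified model of a MOR table: a compacted base file plus a log of newer changes. The row layout and function names below are hypothetical and greatly simplify Hudi's real file formats.

```python
def read_optimized(base_rows):
    """Read only the compacted base file: cheapest, but may lag behind."""
    return {row["id"]: row for row in base_rows}


def snapshot(base_rows, log_rows):
    """Merge base file rows with newer log rows at query time."""
    merged = read_optimized(base_rows)
    for row in log_rows:  # log rows are applied in write order
        if row.get("deleted"):
            merged.pop(row["id"], None)
        else:
            merged[row["id"]] = row
    return merged


base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
log = [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}, {"id": 1, "deleted": True}]

print(read_optimized(base))  # state as of the last compaction
print(snapshot(base, log))   # latest state, at the cost of merging on read
```

A user who needs the freshest data pays the merge cost at query time, while a user who can tolerate some lag reads only the base files.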

6. Community contribution

The above optimizations have been merged into the Hudi community. Station B will further strengthen the construction of Hudi in the future and grow together with the community.

Some of the core PRs:

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1923

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1924

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-1954

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2019

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2052

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2084

https://issues.apache.org/jira/projects/Hudi/issues/Hudi-2342

7. Future development and thinking

  • The platform will support stream-batch unification, with unified real-time and offline logic;
  • Promote incremental processing throughout the data warehouse, achieving the full Hudi ODS -> Flink -> Hudi DW -> Flink -> Hudi ADS pipeline;
  • Support Hudi's Clustering on Flink to bring out Hudi's advantages in data organization, and explore how techniques such as Z-Order can accelerate multi-dimensional queries;
  • Support inline clustering.
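
For readers unfamiliar with Z-Order, the core idea is to interleave the bits of several column values into one sort key, so rows that are close in every dimension end up close together on disk. The following is a minimal two-dimensional sketch for non-negative integers; it is a generic illustration, not Hudi's Clustering implementation.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


points = [(3, 3), (0, 1), (1, 0), (2, 2), (0, 0)]
# Sorting by the interleaved code keeps nearby points in the same region of the file.
print(sorted(points, key=lambda p: z_order(*p)))
```

Combined with per-file min/max statistics, data sorted this way lets a filter on either column skip a large share of the files.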
