
Abstract: This article is compiled from the presentations of Alibaba technical expert Chen Yuzhao (Yu Zhao) and Alibaba development engineer Liu Dalong (Feng Li) at Flink Forward Asia 2021. The main contents include:

  1. Apache Hudi 101
  2. Flink Hudi Integration
  3. Flink Hudi Use Case
  4. Apache Hudi Roadmap

FFA 2021 Live Replay & Presentation PDF Download

1. Apache Hudi 101

When it comes to data lakes, people always ask: what is a data lake, and why have data lakes become so popular in the last two years? The data lake is actually not a new concept; the earliest notion of a data lake was proposed in the 1980s. At that time it was defined as the raw data layer, able to store all kinds of structured, semi-structured and even unstructured data. In many scenarios, such as machine learning and real-time analytics, the schema of the data is only determined at query time.


Lake storage is low-cost and highly flexible, which makes it well suited as centralized storage for query scenarios. With the rise of cloud services in recent years, especially the maturing of object storage, more and more enterprises choose to build their storage services on the cloud. The storage-compute separation architecture of the data lake fits the current cloud service architecture very well. Lake storage provides basic ACID transactions through snapshot isolation and supports connecting to multiple analysis engines to adapt to different query scenarios, so it has great advantages in both cost-effectiveness and openness.


Today lake storage has begun to take on functions of the data warehouse, realizing a lakehouse architecture by connecting to compute engines. Lake storage is a table format that encapsulates high-level table semantics on top of the raw data formats. Hudi has been putting the data lake into practice since 2016; at that time the goal was to solve the problem of updating data on the file system in big data scenarios. Hudi's LSM-like table format is currently unique among the lake formats: it is friendly to near-real-time updates, and its semantics are also relatively complete.

Table format is the basic attribute shared by the three popular data lake formats. From the beginning of the project, Hudi has been evolving toward a platform, and it has relatively complete data governance and table services. For example, users can optimize the file layout concurrently while writing, and the metadata table maintained at write time can greatly improve the efficiency of file lookups on the query side.

The following introduces some basic concepts of Hudi.


The timeline service is the core abstraction of Hudi's transaction layer. All data operations in Hudi are carried out around the timeline service. Each operation is bound to a specific timestamp through the instant abstraction; a series of instants makes up the timeline service, and each instant records the corresponding action and state. Through the timeline service, Hudi knows the state of the current table operations, and through a file system view abstraction combined with the timeline service, it can expose the file layout view at a specific timestamp to the table's current readers and writers.


The file group is Hudi's core abstraction at the file layout layer. Each file group is equivalent to a bucket, divided by file size. Every write to a file group generates a new version, and a version is abstracted as a file slice, which internally maintains the data files of the corresponding version. When a file group has been written up to the specified file size, a new file group is switched to.

Hudi's write behavior on file slices can be abstracted into two semantics: copy on write and merge on read.


Copy on write rewrites the full data each time: the new data is merged with the data of the previous file slice, and a new file slice is then written, generating a new bucket file.


Merge on read is more complicated. Its semantics is append writing, that is, only the incremental data is written each time, so it does not rewrite a whole new file slice. It first tries to append to the previous file slice, and only rolls over to a new file slice when the file slice being written is included in a compaction plan.

2. Flink Hudi Integration


The write pipeline of Flink Hudi consists of several operators. The first operator converts the RowData of the table layer into HoodieRecord, Hudi's message format. The records then go through a bucket assigner, which is mainly responsible for assigning the converted HoodieRecords to specific file groups; the records assigned to each file group then flow into the writer operator, which performs the actual file writing. Finally, there is a coordinator, which is responsible for scheduling the table services of the Hudi table layer and for starting and committing new transactions. In addition, there are some background cleaning roles responsible for cleaning up old versions of data.


In the current design, each bucket assign task holds a bucket assigner, which independently maintains its own set of file groups. When writing new data or non-update INSERT data, the bucket assign task scans the file view and preferentially writes the new batch of data into the file groups that are judged to be small buckets.

For example, with the default target file group size of 120 MB, task1 first writes to file group1 and file group2. Note that file group3 is not written: it already holds 100 MB of data, which is close to the target threshold, and buckets close to the threshold are no longer written to, in order to avoid excessive write amplification. task2 writes directly to a new file group and does not append to the larger file groups that have already been written.
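
To make the 120 MB target concrete, here is a minimal, hedged sketch of setting the target file size on a Hudi table in Flink SQL. The table name, schema and path are made up for illustration, and the option key `write.parquet.max.file.size` should be verified against the Hudi release in use.

```sql
-- Illustrative only: tune the target file group size (in MB) for a Hudi table.
CREATE TABLE hudi_events (
  id BIGINT,
  msg STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://bucket/warehouse/hudi_events',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  -- size (MB) at which a file group is considered full and a new one is switched to
  'write.parquet.max.file.size' = '120'
);
```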


Next, the state transition mechanism of the Flink Hudi write process. When the job starts, the coordinator first tries to create the table on the file system; if the table does not exist, it writes some meta information into the table directory, that is, it builds the table. After receiving the initialization meta information from all tasks, the coordinator starts a new transaction, and once a write task sees that a transaction has been started, it unlocks the flushing of the current data.

A write task first accumulates a batch of data. There are two flush strategies: one triggers when the current data buffer reaches the specified size and flushes the in-memory data out; the other triggers when the upstream checkpoint barrier arrives and a snapshot needs to be taken, at which point all in-memory data is flushed to disk. After each flush, the meta information is sent to the coordinator. When the coordinator receives the checkpoint success event, it commits the corresponding transaction and starts the next new transaction. After the write tasks see the new transaction, they unlock the writing of the next round. In this way, the whole write process is chained together.
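
As a hedged sketch of the two flush triggers: the checkpoint interval drives the snapshot-time flush and the transaction commit, while buffer-size options bound the size-based flush. The keys `write.batch.size` and `write.task.max.size` come from Hudi's Flink configuration, but their exact names and defaults should be double-checked for your version; the table and path are hypothetical.

```sql
-- Standard Flink setting: checkpoints trigger the snapshot-time flush and the commit.
SET 'execution.checkpointing.interval' = '60s';

-- Hypothetical Hudi sink; the two write.* options bound the size-based flush.
CREATE TABLE hudi_sink (
  id BIGINT,
  msg STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://bucket/warehouse/hudi_sink',  -- placeholder path
  'write.batch.size' = '256',     -- flush the in-memory buffer once it reaches this size (MB); verify key
  'write.task.max.size' = '1024'  -- memory budget of a single write task (MB); verify key
);
```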


Flink Hudi Write provides very rich write scenarios. It currently supports writing log data types, that is, non-update data types, with small file merging. In addition, Hudi's core write scenarios, such as update streams and CDC data, are also supported. Flink Hudi also supports efficient batch import of historical data: in bulk insert mode, offline data such as Hive tables or offline database dumps can be efficiently imported into the Hudi format through a batch query. In addition, Flink Hudi provides full-plus-incremental index loading: users can efficiently import batch data into the lake format in one go, and then achieve full-plus-incremental ingestion by connecting the streaming writer.
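
A hedged sketch of that batch-then-streaming path: bulk-load the historical data first, then start the streaming writer with index bootstrap enabled so the incremental stream follows on. The Hive source, Kafka source and `hudi_orders` target tables are assumed to be registered elsewhere; `write.operation` and `index.bootstrap.enabled` are Hudi Flink options and should be checked against your version.

```sql
-- 1) Batch job: bulk-load historical data from a hypothetical Hive table.
INSERT INTO hudi_orders /*+ OPTIONS('write.operation' = 'bulk_insert') */
SELECT * FROM hive_catalog.ods.orders_his;

-- 2) Streaming job: continue with incremental data, bootstrapping the index from the existing files.
INSERT INTO hudi_orders /*+ OPTIONS('index.bootstrap.enabled' = 'true') */
SELECT * FROM kafka_orders;
```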


The Flink Hudi read side also supports a very rich set of query views. Currently it mainly supports full snapshot reads, incremental reads over a historical time range, and streaming reads.
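
A hedged sketch of the three views on a hypothetical `hudi_orders` table, using dynamic table options. `read.streaming.enabled`, `read.start-commit` and `read.streaming.check-interval` are Hudi Flink read options; the commit timestamp is a placeholder.

```sql
-- Full snapshot read (batch).
SELECT * FROM hudi_orders;

-- Incremental read starting from a given commit time (placeholder timestamp).
SELECT * FROM hudi_orders /*+ OPTIONS('read.start-commit' = '20211201000000') */;

-- Streaming read: continuously poll for new commits every 4 seconds.
SELECT * FROM hudi_orders /*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.start-commit' = '20211201000000',
  'read.streaming.check-interval' = '4'
) */;
```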


As an example of writing to Hudi through Flink SQL, Hudi supports a great many use cases while keeping the parameters users need to configure to a minimum: by simply setting the table path, the write concurrency and the operation type, users can easily write upstream data into the Hudi format.
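
A minimal, hedged sketch of such a statement follows. The datagen source, the schema and the path are hypothetical; the options shown (`path`, `write.tasks`, `write.operation`) correspond to the table path, concurrency and operation type named above.

```sql
-- Hypothetical upstream source generated with the datagen connector.
CREATE TABLE source_tbl (
  uuid STRING,
  name STRING,
  age INT,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10'
);

CREATE TABLE hudi_tbl (
  uuid STRING,
  name STRING,
  age INT,
  ts TIMESTAMP(3),
  PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://bucket/warehouse/hudi_tbl',  -- table path (placeholder)
  'write.tasks' = '4',                         -- write concurrency
  'write.operation' = 'upsert',                -- operation type
  'table.type' = 'MERGE_ON_READ'
);

INSERT INTO hudi_tbl SELECT uuid, name, age, ts FROM source_tbl;
```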

3. Flink Hudi Use Case

The following introduces the classic application scenarios of Flink Hudi.


The first classic scenario is importing a database (DB) into the data lake. There are currently two ways to import DB data into the data lake: you can import both the full and the incremental data into the Hudi format with one job through the Flink CDC connector, or you can consume the CDC changelog on Kafka and write it into the Hudi format through Flink's CDC formats.
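
A hedged sketch of the first path, using the Flink MySQL CDC connector to write both the full snapshot and the subsequent binlog changes into Hudi. Host, credentials, table names and the path are placeholders.

```sql
-- MySQL CDC source: reads the full snapshot first, then the binlog (placeholder connection info).
CREATE TABLE mysql_users (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'demo',
  'table-name' = 'users'
);

-- Hudi sink receiving the upsert stream (placeholder path).
CREATE TABLE hudi_users (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://bucket/warehouse/hudi_users',
  'table.type' = 'MERGE_ON_READ'
);

INSERT INTO hudi_users SELECT * FROM mysql_users;
```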


The second classic scenario is ETL for stream computing (near-real-time OLAP analysis). Some simple ETL is computed on the upstream stream, such as a dual-stream join or a dual-stream join followed by an aggregation, and the change stream is written directly into the Hudi format; the downstream read side can then be connected to traditional classic OLAP engines such as Presto and Spark for end-to-end near-real-time queries.
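
A hedged sketch of this scenario: a simple dual-stream join whose change stream is written straight into Hudi for downstream OLAP queries. The `orders` and `payments` sources and the `hudi_order_payments` sink are hypothetical tables assumed to be declared in the same way as the earlier sketches.

```sql
-- Dual-stream join written directly into a Hudi table (all tables hypothetical).
INSERT INTO hudi_order_payments
SELECT o.order_id, o.user_id, o.amount, p.pay_time
FROM orders AS o
JOIN payments AS p ON o.order_id = p.order_id;
```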


The third classic scenario is somewhat similar to the second. Hudi supports a native changelog, that is, it can save the row-level changes produced by Flink computations. Based on this capability, end-to-end near-real-time ETL production can be realized by streaming reads that consume these changes.
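
A hedged sketch of the changelog path: with `changelog.enabled` the Hudi table retains the row-level changes of an aggregation instead of only the merged result, and a downstream streaming read consumes them for the next ETL layer. The option keys are Hudi Flink options; the `user_events` source, table names and path are made up.

```sql
-- Write an aggregation's changelog into Hudi, keeping the intermediate row-level changes.
CREATE TABLE hudi_uv (
  dt STRING,
  uv BIGINT,
  PRIMARY KEY (dt) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://bucket/warehouse/hudi_uv',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true'  -- keep row-level changes instead of only the merged result
);

INSERT INTO hudi_uv
SELECT dt, COUNT(DISTINCT user_id) AS uv
FROM user_events
GROUP BY dt;

-- Downstream job: streaming-read the changes for the next ETL layer.
SELECT * FROM hudi_uv /*+ OPTIONS('read.streaming.enabled' = 'true') */;
```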


4. Apache Hudi Roadmap

In the next two major versions, the community will focus on streaming reads and streaming writes and will strengthen the semantics of streaming reads; in addition, self-management capabilities will be built around the catalog and metadata; we will also launch native Trino connector support in the near future, replacing the current way of reading through Hive and improving efficiency.

Below is a demo of ingesting thousands of MySQL tables into Hudi.

First of all, we prepared two databases as the data source, benchmark1 and benchmark2. There are 100 tables under benchmark1 and 1000 tables under benchmark2. Because thousand-table ingestion depends strongly on catalogs, we first need to create the catalogs: a MySQL catalog for the data source and a Hudi catalog for the target. The MySQL catalog is used to obtain all the information about the source tables, including table schemas and table data; the Hudi catalog is used to create the targets.
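
The demo runs on Alibaba Cloud's commercial Flink platform, so the exact catalog options are platform-specific; the following is only an illustrative sketch of the two CREATE CATALOG statements, with connection details and the warehouse path as placeholders and the option keys to be checked against the platform documentation.

```sql
-- Illustrative sketch only: option keys depend on the platform version.
-- MySQL catalog: exposes the schemas and data of the source tables.
CREATE CATALOG mysql_catalog WITH (
  'type' = 'mysql',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'default-database' = 'benchmark1'
);

-- Hudi catalog: used to create the target databases and tables.
CREATE CATALOG hudi_catalog WITH (
  'type' = 'hudi',
  'catalog.path' = 'oss://bucket/warehouse'
);
```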


After executing the two SQL statements, the two catalogs are created successfully.


Next, go to the job development page and create the thousand-table ingestion job. Only 9 simple lines of SQL are required. The first syntax is CREATE DATABASE AS DATABASE (CDAS). Its function is to synchronize all the table schemas and table data under the MySQL benchmark1 database into the Hudi cdas_demo database with one click; the tables are mapped one to one. The second syntax is CREATE TABLE AS TABLE (CTAS). Its function is to synchronize all tables in the MySQL benchmark2 database that match the sbtest.* regular expression into the ctas_demo table under Hudi's DB1; this is a many-to-one mapping that performs the merging of sharded databases and tables.
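
As a hedged sketch of roughly what those lines look like: CDAS and CTAS are syntax of Alibaba Cloud's commercial Flink, so the exact form (IF NOT EXISTS clauses, hints, regular-expression handling) should be verified against the platform documentation; the catalog, database and table names follow the demo.

```sql
-- CDAS: one-to-one sync of every table under benchmark1 into the Hudi database cdas_demo.
CREATE DATABASE IF NOT EXISTS hudi_catalog.cdas_demo
AS DATABASE mysql_catalog.benchmark1 INCLUDING ALL TABLES;

-- CTAS: many-to-one sync, merging all benchmark2 tables matching sbtest.* into one Hudi table.
CREATE TABLE IF NOT EXISTS hudi_catalog.DB1.ctas_demo
AS TABLE mysql_catalog.benchmark2.`sbtest.*`;
```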

Then we deploy the job online, go to the job operation and maintenance page, and start it. You can see that the configuration information has been updated, indicating that the new version is online. Then click the Start button to run the job, and go to the job overview page to view the job's status information.


The topology of this job is very complex, with 1100 source tables and 101 target tables. Here we made an optimization, source merging, which merges all the source tables into one node, so that the binlog only needs to be pulled once during the incremental pull phase, reducing the pressure on MySQL.


Next, refresh the OSS page. You can see that a new cdas_demo path has appeared; entering the sbtest1 path, you can see that metadata is already being written, which shows that the data is really being written.


Then go back to the job development page and write a simple SQL query against one of the tables to verify whether the data is really being written. Executing that SQL statement shows that the data can be queried and is consistent with the data that was inserted.

Using the metadata capabilities provided by the catalogs, combined with the CDAS and CTAS syntax, we can easily ingest thousands of tables into the lake with a few simple lines of SQL, which greatly simplifies the ingestion process and reduces the development and operations workload.

