Contents:
1. Background of the data warehouse architecture upgrade
2. Practice of an integrated lake-warehouse architecture based on Iceberg
3. Summary and benefits
4. Follow-up planning
GitHub address: https://github.com/apache/flink
Welcome to like and star Flink!
1. Background of the data warehouse architecture upgrade
1. Pain points of the Hive-based data warehouse
The original data warehouse was built entirely on Hive and had three main pain points:
Pain point 1: no ACID support
1) Upsert scenarios are not supported;
2) Row-level delete is not supported, so the cost of correcting data is high.
Pain point 2: timeliness is hard to improve
1) Data is difficult to make visible in near real time;
2) Incremental reads are not possible, so stream and batch cannot be unified at the storage level;
3) Analysis scenarios with minute-level latency cannot be supported.
Pain point 3: table evolution
1) Write-oriented schema, with poor support for schema changes;
2) Partition spec changes are not well supported.
2. Key features of Iceberg
Iceberg has four key features: ACID semantics, an incremental snapshot mechanism, an open table format, and support for both streaming and batch interfaces.
ACID semantics
- Incomplete commits are never read;
- Concurrent commits are supported through optimistic locking;
- Row-level delete is supported, enabling upsert.
Incremental snapshot mechanism
- Data becomes visible after each commit (minute-level latency);
- Historical snapshots can be traced back.
Open table format
- Data formats: Parquet, ORC, Avro
- Compute engines: Spark, Flink, Hive, Trino/Presto
Streaming and batch interface support
- Supports both streaming and batch writes;
- Supports both streaming and batch reads (a minimal sketch follows below).
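To make the stream/batch interface concrete, here is a minimal sketch driven through the Java Table API. The catalog and table names are made up, an Iceberg catalog is assumed to already be registered, and the streaming-read OPTIONS hint may require a newer Iceberg release than the 0.11 used in this article.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergStreamBatchRead {
    public static void main(String[] args) {
        // Batch read: a bounded scan of the current table snapshot.
        TableEnvironment batchEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
        batchEnv.executeSql("SELECT COUNT(*) FROM iceberg_catalog.db.user_log").print();

        // Streaming read: continuously consume newly committed snapshots (incremental read).
        TableEnvironment streamEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // OPTIONS hints require dynamic table options to be enabled.
        streamEnv.getConfig().getConfiguration()
                .setString("table.dynamic-table-options.enabled", "true");
        streamEnv.executeSql(
                "SELECT * FROM iceberg_catalog.db.user_log "
                        + "/*+ OPTIONS('streaming'='true', 'monitor-interval'='60s') */");
    }
}
```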
2. Practice of an integrated lake-warehouse architecture based on Iceberg
"The point of lake-warehouse integration is that I no longer need to see the lake and the warehouse as separate things. The data has an open metadata format, it can flow freely, and it can connect to the diverse computing ecosystem on top of it."
(Jia Yangqing, Senior Researcher of the Alibaba Cloud computing platform)
1. Append-only data into the lake
The figure above shows the pipeline for log data entering the lake. Log data, including client-side and server-side logs, is written to Kafka in real time, then written into Iceberg by Flink jobs, and finally stored on HDFS.
2. Flink SQL into the lake
Our Flink SQL ingestion pipeline is built on "Flink 1.11 + Iceberg 0.11". To integrate with the Iceberg catalog, we mainly did the following:
1) Added Iceberg Catalog support to the Meta Server;
2) Added Iceberg Catalog support to the SQL SDK.
On this basis, the platform exposes Iceberg table management so that users can create SQL tables on the platform themselves.
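For reference, registering an Iceberg catalog backed by the Hive Metastore in Flink SQL looks roughly like the sketch below. On our platform this happens through the Meta Server and SQL SDK rather than by hand, and the metastore URI and warehouse path are placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RegisterIcebergCatalog {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Hive-Metastore-backed Iceberg catalog (URI and warehouse are placeholders).
        tEnv.executeSql(
                "CREATE CATALOG iceberg_catalog WITH ("
                        + " 'type' = 'iceberg',"
                        + " 'catalog-type' = 'hive',"
                        + " 'uri' = 'thrift://metastore-host:9083',"
                        + " 'warehouse' = 'hdfs://nameservice1/warehouse/iceberg'"
                        + ")");

        // Tables created under this catalog are visible to other engines through the Hive Metastore.
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS iceberg_catalog.db");
    }
}
```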
3. Into the lake: proxy user support
The second step was internal work to integrate with the existing budget and permission systems.
Previously, real-time jobs on the platform all ran as the default Flink user. Since the earlier storage did not involve HDFS, this was not a problem, and budget attribution was not a concern.
Writing to Iceberg now raises such issues. For example, the data warehouse team has its own data mart: the data should be written under their catalog, charged against their budget, and connected to the permissions of the offline team's accounts.
As shown above, this is implemented as a proxy-user feature on the platform: the user can specify which account should be used to write the data into Iceberg. The implementation consists of the following three parts.
Add a table-level configuration: 'iceberg.user.proxy' = 'targetUser'
1) Enable superuser
2) Team account authentication
Enable the proxy user when accessing HDFS (a minimal sketch follows below);
When accessing the Hive Metastore:
1) Refer to the related implementation in Spark: org.apache.spark.deploy.security.HiveDelegationTokenProvider
2) Dynamically proxy HiveMetaStoreClient so that the Hive Metastore is accessed as the proxy user.
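As a rough illustration of the HDFS side of the proxy-user mechanism (not our actual platform code), Hadoop's UserGroupInformation can impersonate the team account configured through 'iceberg.user.proxy'. The account name below is made up, and the NameNode must allow the real user to impersonate it via the hadoop.proxyuser.* settings.

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserHdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The real user the Flink job runs as (the platform's default account).
        UserGroupInformation realUser = UserGroupInformation.getLoginUser();

        // Impersonate the team account specified by 'iceberg.user.proxy' (name is illustrative).
        UserGroupInformation proxyUser =
                UserGroupInformation.createProxyUser("dw_team_account", realUser);

        proxyUser.doAs((PrivilegedExceptionAction<Void>) () -> {
            // HDFS operations inside doAs() are attributed to the proxied team account,
            // so storage quota, budget and permissions follow the warehouse team.
            FileSystem fs = FileSystem.get(conf);
            fs.mkdirs(new Path("/warehouse/iceberg/db/table"));
            return null;
        });
    }
}
```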
4. Example: Flink SQL into the lake
The example consists of DDL that creates the Iceberg table plus DML that writes into it.
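The original example is shown as a figure, so the following is a hedged reconstruction of what such DDL + DML look like when run through the Java Table API; the schema, the Kafka source table, and all names are made up for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlIntoTheLake {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // DDL: create a partitioned Iceberg table under the registered catalog.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS iceberg_catalog.db.user_log ("
                        + "  user_id BIGINT,"
                        + "  event   STRING,"
                        + "  ts      TIMESTAMP(3),"
                        + "  dt      STRING"
                        + ") PARTITIONED BY (dt)");

        // DML: continuously write from a Kafka source table (defined elsewhere) into Iceberg.
        tEnv.executeSql(
                "INSERT INTO iceberg_catalog.db.user_log "
                        + "SELECT user_id, event, ts, DATE_FORMAT(ts, 'yyyy-MM-dd') "
                        + "FROM default_catalog.default_database.kafka_user_log");
    }
}
```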
5. CDC data into the lake
As shown above, our AutoDTS platform is responsible for real-time access to the data of business databases. The data from these business databases is connected to Kafka, and the platform also supports configuring distribution tasks, which distribute the data in Kafka to different storage engines; in this scenario, it is distributed to Iceberg.
6. Flink SQL CDC into the lake
The following are the changes we made on top of "Flink 1.11 + Iceberg 0.11" to support CDC ingestion:
Improved the Iceberg sink:
In Flink 1.11 the sink is an AppendStreamTableSink, which cannot handle CDC streams, so we modified and adapted it.
Table management:
1) Support primary keys (PR-1978);
2) Enable the V2 format: 'iceberg.format.version' = '2' (see the DDL sketch below).
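A hedged sketch of what creating such a CDC table might look like on the internal build described above. Primary-key support here relies on the PR-1978 change, 'iceberg.format.version' is the platform's own switch (vanilla Iceberg sets the table property 'format-version' = '2' instead), and the schema is made up.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcTableDdl {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // The PRIMARY KEY declaration relies on the internal PR-1978 change;
        // 'iceberg.format.version' = '2' enables the V2 (row-level delete) format.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS iceberg_catalog.ods.orders ("
                        + "  id          BIGINT,"
                        + "  status      STRING,"
                        + "  update_time TIMESTAMP(3),"
                        + "  PRIMARY KEY (id) NOT ENFORCED"
                        + ") WITH ("
                        + "  'iceberg.format.version' = '2'"
                        + ")");
    }
}
```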
7. CDC data into the lake
1. Bucket support
In the upsert scenario, the same row must always be written to the same bucket. How is this achieved?
Flink SQL syntax currently has no way to declare bucket partitions, so buckets are declared through configuration (a routing sketch follows below):
'partition.bucket.source' = 'id',  // the bucket field
'partition.bucket.num' = '10',  // the number of buckets
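For illustration only (this is not the actual sink code), the routing idea behind these two options can be sketched as follows: records are shuffled by a bucket id derived from the configured bucket field, so rows with the same key always land in the same bucket and can be upserted consistently.

```java
/**
 * Sketch of the bucket routing idea: bucket id = hash(bucket field) % bucket number.
 * The writer side would shuffle (e.g. keyBy) records on this bucket id before writing.
 */
public class BucketRouter {
    private final int numBuckets;

    public BucketRouter(int numBuckets) {            // value of 'partition.bucket.num'
        this.numBuckets = numBuckets;
    }

    public int bucketFor(Object bucketSourceValue) { // value of the 'partition.bucket.source' column, e.g. id
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(bucketSourceValue.hashCode(), numBuckets);
    }
}
```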
2. Copy-on-Write sink
We implemented a Copy-on-Write sink because the community's Merge-on-Read did not support merging small files at the time, so we temporarily implemented Copy-on-Write ourselves. The business side has been testing it with good results.
The Copy-on-Write implementation above is structurally similar to the original Merge-on-Read: a StreamWriter writes with multiple parallelism, and a FileCommitter commits sequentially with a single parallelism.
With Copy-on-Write, the number of buckets must be set reasonably according to the table's data volume, and no separate small-file compaction is needed.
StreamWriter writes with multiple parallelism; in the snapshotState phase it:
1) Adds a buffer;
2) Checks, before writing, that the previous checkpoint has been committed successfully;
3) Groups and merges the data by bucket, and writes bucket by bucket.
FileCommitter commits sequentially with a single parallelism (a minimal sketch follows below), using:
1) table.newOverwrite()
2) flink.last.committed.checkpoint.id
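A minimal sketch of the committer side using only the Iceberg core API. This is not our actual internal implementation: how data files are collected from the StreamWriter is omitted, and only the two elements listed above (newOverwrite and the last-committed-checkpoint property) are shown.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.OverwriteFiles;
import org.apache.iceberg.Table;

/**
 * Sketch: a single-parallelism committer that overwrites rewritten bucket files and
 * records the last committed checkpoint id in the snapshot summary, so that after a
 * failover an already-committed checkpoint is not committed twice.
 */
public class CopyOnWriteCommitterSketch {
    // Property name taken from the list above.
    private static final String LAST_COMMITTED_CHECKPOINT_ID = "flink.last.committed.checkpoint.id";

    public static void commit(Table table, long checkpointId,
                              Iterable<DataFile> addedFiles, Iterable<DataFile> replacedFiles) {
        if (checkpointId <= lastCommittedCheckpointId(table)) {
            return; // this checkpoint was already committed before a restart, skip it
        }
        OverwriteFiles overwrite = table.newOverwrite();
        for (DataFile file : replacedFiles) {
            overwrite.deleteFile(file); // old bucket files being rewritten
        }
        for (DataFile file : addedFiles) {
            overwrite.addFile(file);    // new, fully rewritten bucket files
        }
        overwrite.set(LAST_COMMITTED_CHECKPOINT_ID, String.valueOf(checkpointId));
        overwrite.commit();
    }

    private static long lastCommittedCheckpointId(Table table) {
        if (table.currentSnapshot() == null) {
            return -1L;
        }
        String value = table.currentSnapshot().summary().get(LAST_COMMITTED_CHECKPOINT_ID);
        return value == null ? -1L : Long.parseLong(value);
    }
}
```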
8. Example: configuring CDC data into the lake
As shown in the figure above, in practice the business side can create or configure distribution tasks on the DTS platform.
Choose Iceberg as the instance type, then select the target database and table to indicate which table's data should be synchronized to Iceberg, and configure the field mapping between the source table and the target table. Once configured, the distribution task can be started. On startup, a real-time job is submitted to the Flink-based real-time computing platform, and the Copy-on-Write sink then writes the data into the Iceberg table in real time.
9. Other ingestion practices
Practice 1: reduce empty commits
Problem:
When the upstream Kafka topic has no data for a long time, each checkpoint still generates a new snapshot, resulting in a large number of empty files and unnecessary snapshots.
Solution (PR-2042):
Add a flink.max-continuous-empty-commits configuration: when checkpoints contain no data, a commit (and hence a snapshot) is only triggered after the configured number of consecutive empty checkpoints (a sketch of the idea follows below).
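The idea can be sketched as a small gate in the commit path (this is only an illustration of the behavior, not the code of PR-2042 itself):

```java
/**
 * Sketch of the 'flink.max-continuous-empty-commits' behavior: commits with data always
 * go through; empty checkpoints are skipped until the configured number of consecutive
 * empty checkpoints is reached, which keeps the number of empty snapshots bounded.
 */
class EmptyCommitGate {
    private final int maxContinuousEmptyCommits;
    private int continuousEmptyCheckpoints = 0;

    EmptyCommitGate(int maxContinuousEmptyCommits) {
        this.maxContinuousEmptyCommits = maxContinuousEmptyCommits;
    }

    /** Returns true if this checkpoint should produce a commit (and therefore a snapshot). */
    boolean shouldCommit(boolean checkpointHasData) {
        if (checkpointHasData) {
            continuousEmptyCheckpoints = 0;
            return true;
        }
        continuousEmptyCheckpoints++;
        if (continuousEmptyCheckpoints >= maxContinuousEmptyCommits) {
            continuousEmptyCheckpoints = 0;
            return true; // commit occasionally so snapshot bookkeeping still moves forward
        }
        return false;
    }
}
```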
Practice 2: record the watermark
Problem:
The Iceberg table itself cannot directly reflect how far data writing has progressed, so offline scheduling has difficulty triggering downstream tasks accurately.
Solution (PR-2109):
In the commit phase, record the Flink watermark in the properties of the Iceberg table. This directly reflects the end-to-end latency and can also be used to judge whether a partition's data is complete, in order to schedule and trigger downstream tasks (a minimal sketch follows below).
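A minimal sketch of the idea using the Iceberg table API (this is not the merged PR-2109 code, and the property name "flink.watermark" is illustrative):

```java
import org.apache.iceberg.Table;

/**
 * Sketch: after a successful commit, store the current Flink watermark in the Iceberg
 * table's properties so offline schedulers can check partition completeness and
 * end-to-end latency.
 */
public class WatermarkRecorder {
    public static void record(Table table, long watermarkMillis) {
        table.updateProperties()
                .set("flink.watermark", String.valueOf(watermarkMillis))
                .commit();
    }
}
```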
Practice 3: optimize table deletion
Problem:
Deleting an Iceberg table can be slow and cause the platform API to time out, because Iceberg abstracts the IO layer in an object-store-oriented way and has no fast way to clear a directory.
Solution:
Extend FileIO and add a deleteDir method to quickly delete a table's data on HDFS (see the sketch below).
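A sketch of what such an extension could look like, assuming HDFS and Iceberg's HadoopFileIO (class and method names other than the Hadoop/Iceberg APIs are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.hadoop.HadoopFileIO;

/**
 * Sketch: a HadoopFileIO with an extra deleteDir() so an entire table directory on HDFS
 * can be dropped recursively in one call, instead of deleting files one by one through
 * the generic FileIO abstraction.
 */
public class DirDeletingFileIO extends HadoopFileIO {
    private final Configuration conf;

    public DirDeletingFileIO(Configuration conf) {
        super(conf);
        this.conf = conf;
    }

    public void deleteDir(String location) {
        try {
            Path path = new Path(location);
            FileSystem fs = path.getFileSystem(conf);
            fs.delete(path, true /* recursive */);
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to delete directory: " + location, e);
        }
    }
}
```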
10. Small file merging and data cleanup
A batch job (Spark 3) is run periodically for each table, in the following three steps (a combined sketch follows below):
1. Periodically merge the small files of newly added partitions:
rewriteDataFilesAction.execute(); // only merges small files, does not delete old files
2. Delete expired snapshots and clean up metadata and data files:
table.expireSnapshots().expireOlderThan(timestamp).commit();
3. Clean up orphan files; by default, unreachable files older than 3 days are cleaned up:
removeOrphanFilesAction.olderThan(timestamp).execute();
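Putting the three steps together, a sketch of such a maintenance job might look like the following. It uses the Spark action API roughly as it existed around Iceberg 0.11 (Actions.forTable; newer releases moved this under SparkActions), assumes an active SparkSession with the Iceberg runtime on the classpath, and the catalog handling and retention windows are illustrative.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class IcebergTableMaintenance {
    public static void maintain(HiveCatalog catalog, String db, String tableName) {
        Table table = catalog.loadTable(TableIdentifier.of(db, tableName));
        long expireBefore = System.currentTimeMillis() - 24L * 60 * 60 * 1000;      // keep 1 day of snapshots
        long orphanBefore = System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000;  // default: 3 days

        // 1. Merge small files in newly written partitions; old files are not deleted here.
        Actions.forTable(table).rewriteDataFiles().execute();

        // 2. Expire old snapshots, cleaning up metadata and data files that are no longer referenced.
        table.expireSnapshots().expireOlderThan(expireBefore).commit();

        // 3. Remove orphan files older than 3 days that are not referenced by any snapshot.
        Actions.forTable(table).removeOrphanFiles().olderThan(orphanBefore).execute();
    }
}
```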
11. Compute engine – Flink
Flink is the core compute engine of the real-time platform. At present it mainly supports the data-ingestion scenario, with the following characteristics.
Real-time data into the lake:
Flink and Iceberg have the deepest integration for data ingestion, and the Flink community actively embraces data lake technology.
Platform integration:
AutoStream introduces IcebergCatalog and supports creating tables and ingesting data through SQL; AutoDTS supports configuring MySQL, SQL Server, and TiDB tables for ingestion.
Stream-batch unification:
Under the concept of stream-batch unification, Flink's advantages will gradually become apparent.
12. Compute engine – Hive
Hive is integrated with Iceberg and Spark 3 mainly at the SQL batch level, and provides the following three capabilities.
Regular small-file merging and metadata queries:
SELECT * FROM prod.db.table.history; snapshots, files, and manifests can be viewed in the same way.
Offline data writing (a sketch follows below):
1) INSERT INTO  2) INSERT OVERWRITE  3) MERGE INTO
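A sketch of what these three write paths look like when issued through Spark 3 SQL. The catalog name, schema, and tables are illustrative; the catalog is assumed to be configured via spark.sql.catalog.* settings, and MERGE INTO requires the Iceberg SQL extensions.

```java
import org.apache.spark.sql.SparkSession;

public class OfflineWriteExamples {
    public static void main(String[] args) {
        // MERGE INTO needs the Iceberg SQL extensions enabled on the session.
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-offline-write")
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .getOrCreate();

        // 1) Plain append.
        spark.sql("INSERT INTO iceberg.db.target SELECT id, status, dt FROM iceberg.db.staging");

        // 2) Overwrite a static partition.
        spark.sql("INSERT OVERWRITE iceberg.db.target PARTITION (dt = '2021-06-01') "
                + "SELECT id, status FROM iceberg.db.staging WHERE dt = '2021-06-01'");

        // 3) Upsert via MERGE INTO.
        spark.sql("MERGE INTO iceberg.db.target t USING iceberg.db.staging s ON t.id = s.id "
                + "WHEN MATCHED THEN UPDATE SET t.status = s.status "
                + "WHEN NOT MATCHED THEN INSERT *");
    }
}
```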
Analysis queries:
Mainly supports daily near-real-time analysis and query scenarios.
13. Compute engine – Trino/Presto
AutoBI is already integrated with Presto and used for reporting and analytical query scenarios.
Trino
1) Iceberg tables are used directly as the report data source;
2) A metadata caching mechanism needs to be added: https://github.com/trinodb/trino/issues/7551
Presto
Community integration is in progress: https://github.com/prestodb/presto/pull/15836
14. Pitfalls encountered
1. Abnormal access to the Hive Metastore
Problem: The HiveConf constructor was misused, causing the configuration declared on the Hive client to be overwritten, which led to exceptions when accessing the Hive Metastore.
Solution (PR-2075): Fix the construction of HiveConf and explicitly call addResource to ensure the configuration is not overwritten: hiveConf.addResource(conf);
2. Hive Metastore lock not released
Problem: "CommitFailedException: Timed out after 181138 ms waiting for lock xxx." The cause is that hiveMetastoreClient.lock must explicitly unlock when the lock is not acquired; otherwise the exception above occurs.
Solution (PR-2263): Optimize the HiveTableOperations#acquireLock method so that unlock is explicitly called to release the lock when acquisition fails (a sketch of the idea follows below).
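A sketch of the fix direction using the Hive Metastore client API (this is not the merged PR-2263 code, and the error handling is simplified):

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.LockRequest;
import org.apache.hadoop.hive.metastore.api.LockResponse;
import org.apache.hadoop.hive.metastore.api.LockState;

public class MetastoreLockSketch {
    static long acquireLock(IMetaStoreClient client, LockRequest request) throws Exception {
        LockResponse response = client.lock(request);
        if (response.getState() != LockState.ACQUIRED) {
            // Without this explicit unlock, the pending lock lingers in the metastore and
            // later commits fail with "Timed out ... waiting for lock".
            client.unlock(response.getLockid());
            throw new RuntimeException("Could not acquire Hive Metastore lock");
        }
        return response.getLockid();
    }
}
```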
3. Missing metadata file
Problem: The Iceberg table cannot be accessed, reporting "NotFoundException Failed to open input stream for file: xxx.metadata.json".
Solution (PR-2328): When the call to the Hive Metastore that updates the Iceberg table's metadata_location times out, add a check mechanism to confirm that the metadata really was not saved before deleting the metadata file.
3. Summary and benefits
1. Summary
Having explored lake-warehouse integration and stream-batch unification, we summarize each below.
Lake-warehouse integration
1) Iceberg supports the Hive Metastore;
2) Overall usage is similar to Hive tables: the same data formats and the same compute engines.
Stream-batch unification
Stream and batch are unified in near-real-time scenarios: the same source, the same computation, and the same storage.
2. Business benefits
Improved data timeliness:
Warehouse ingestion latency dropped from more than 2 hours to less than 10 minutes; the SLA for core algorithm tasks is met 2 hours earlier.
Near-real-time analytical queries:
Combined with Spark 3 and Trino, near-real-time multi-dimensional analysis and queries are supported.
More efficient feature engineering:
Near-real-time sample data improves the timeliness of model training.
Near-real-time warehousing of CDC data:
Business tables can be analyzed and queried in the data warehouse in near real time.
3. Architecture benefits: a near-real-time data warehouse
As described above, we now support near-real-time ingestion and analysis, which effectively validates the basic architecture for building a near-real-time data warehouse later on. The advantages of a near-real-time data warehouse are develop-once, unified metric definitions, and unified storage; it is a genuine batch-stream unification. The disadvantage is weaker real-time performance: latency that used to be seconds or milliseconds becomes minute-level data visibility.
At the architecture level, however, this is of great significance. Going forward, we can see the prospect of turning the original "T+1" data warehouse into a near-real-time data warehouse, improving the warehouse's overall data timeliness and better supporting upstream and downstream business.
4. Follow-up planning
1. Follow new Iceberg versions
Fully adopt the V2 format and support Merge-on-Read for CDC data ingestion.
2. Build a near-real-time data warehouse
Based on Flink, use a data-pipeline model to comprehensively speed up every layer of the data warehouse.
3. Stream-batch unification
As the upsert functionality gradually matures, continue to explore stream-batch unification at the storage level.
4. Multi-dimensional analysis
Deliver near-real-time multi-dimensional analysis based on Presto/Spark 3.