Flink + Iceberg 0.11 Practices at the Qunar Data Platform

Author: Yu Dong

Abstract: This article introduces some practices of the Qunar Data Platform using Flink + Iceberg 0.11. The content includes:

  • Background and pain points
  • Iceberg architecture
  • Pain point 1: Kafka data loss
  • Pain point 2: Near real-time Hive is under great pressure
  • Iceberg optimization practice
  • Summary

GitHub address: https://github.com/apache/flink
Everyone is welcome to give Flink a like and a star~

1. Background and pain points

1. Background

In the process of using Flink for real-time data warehousing and data transmission, we ran into some problems, such as Kafka data loss and the performance of the near-real-time data warehouse built with Flink and Hive. The new features of Iceberg 0.11 solve the problems we encountered in these business scenarios. Compared with Kafka, Iceberg has its own advantages in certain scenarios, and here we share some of our practice based on Iceberg.

2. Original architecture scheme

The original architecture uses Kafka to store real-time data, including logs, orders, and ticket data. Flink SQL or the Flink DataStream API then consumes the data and moves it along the pipeline. The platform for submitting SQL and DataStream jobs is developed in-house, and real-time jobs are submitted through it.

3. Pain points

  • Kafka has high storage costs and a large amount of data. Because of the pressure, the Kafka data expiration time is set relatively short; when there is back pressure or backlog, data that is not consumed within that period expires, which causes data loss.
  • Flink supports near-real-time reads and writes on Hive. To relieve the pressure on Kafka, we put data with low real-time requirements into Hive and let Hive use minute-level partitions. However, as metadata keeps growing, the pressure on the Hive metadata becomes increasingly significant, queries become slower, and the pressure on the database that stores the Hive metadata also increases.

2. Iceberg architecture

1. Iceberg architecture analysis

img

Term Analysis

  • Data files

    The files in which an Iceberg table actually stores data; they are generally kept in the data directory and end with ".parquet".

  • Manifest file

    Each row is a detailed description of a data file, including its status, file path, partition information, and column-level statistics (such as the maximum and minimum values of each column and the number of null values). With this file, irrelevant data can be filtered out and retrieval becomes faster.

  • Snapshot

    A snapshot represents the state of a table at a certain moment. Each snapshot version contains the list of all data files at that time. Data files are recorded in different manifest files, the manifest files are listed in a manifest list file, and one manifest list file represents one snapshot.

2. Iceberg query plan

The query plan is the process of finding the files required for a query in the table.

  • metadata filtering

    The manifest file includes the partition data tuple and column-level statistics of each data file. During planning, query predicates are automatically converted into predicates on the partition data and applied first to filter data files. Next, the column-level value counts, null counts, and lower and upper bounds are used to eliminate files that cannot match the query predicate.

  • Snapshot ID

    Each snapshot ID is associated with a group of manifest files, and each group contains many manifest files.

  • Manifest file list

    Each manifest file records the metadata of the current data block, including the maximum and minimum values of the file's columns. Based on this metadata, the query is indexed to the specific data files, so the data can be queried faster, as sketched below.
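
As a hedged illustration (the catalog, table, and values here are hypothetical and simply reuse the names from the demo later in this article), a query whose predicates fall on the partition and sorted columns lets the planner skip every data file whose partition tuple or lower/upper bounds cannot match the predicate:

    -- hypothetical query; days is the partition column, province_id a sorted column
    select *
        from Iceberg_catalog.Iceberg_db.tbl1
        where days = '2021-02-28' and province_id = 110000;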

3. Pain point 1: Kafka data loss

1. Introduction to pain points

We usually choose Kafka for the real-time data warehouse and log transmission. Kafka itself has high storage costs, and data is only retained for a limited time; once consumption lags and the data passes its expiration time, it is lost without ever being consumed.

2. Solution

Bring real-time business data that is not latency-critical into the lake, for example, business that can accept a delay of 1 to 10 minutes. Iceberg 0.11 supports real-time reads via SQL and can also retain historical data. This both reduces the pressure on online Kafka and ensures that data can be read in near real time without being lost.

3. Why can Iceberg only go into the lake in near real time?

img

  1. Iceberg commits transactions at file granularity. Committing a transaction every second is not feasible, otherwise the number of files would explode;
  2. There is no online service node. For real-time, high-throughput, low-latency writes, a purely real-time response cannot be achieved;
  3. Flink writes in units of checkpoints. After the physical data has been written to Iceberg, it cannot be queried directly; only when a checkpoint is triggered is the metadata file written, and the data then changes from invisible to visible. Each checkpoint execution takes a certain amount of time.

4. Analysis of how Flink writes into the lake

img

Component introduction

  • IcebergStreamWriter

    Mainly used to write records into the corresponding avro, parquet, or orc files, generate the corresponding Iceberg DataFiles, and send them to the downstream operator.

  • IcebergFilesCommitter

    Mainly used to collect all DataFile files when a checkpoint arrives and commit the transaction to Apache Iceberg, completing the data write for that checkpoint.

    A DataFile list is maintained for each checkpointId, that is, map<Long, List<DataFile>>, so that even if the transaction of one checkpoint fails to commit, its DataFile files are still kept in state and the data can still be committed to the Iceberg table by a subsequent checkpoint.

5. Flink SQL Demo

Flink + Iceberg real-time ingestion: consume Kafka data, write it to Iceberg, and read data from Iceberg in near real time.

img

5.1 Preliminary work

  • Enable the streaming execution mode for real-time reads and writes

    set execution.type = streaming

  • Turn on the table SQL hint feature so that the OPTIONS attribute can be used

    set table.dynamic-table-options.enabled=true

  • Register an Iceberg catalog for operating on Iceberg tables

    CREATE CATALOG Iceberg_catalog WITH (
      'type'='iceberg',
      'catalog-type'='hive',
      'uri'='thrift://localhost:9083'
    );
  • Ingest Kafka real-time data into the lake (a sketch of possible Kafka_tbl and tbl1 definitions follows this list)

    insert into Iceberg_catalog.Iceberg_db.tbl1
        select * from Kafka_tbl;
  • Real-time transfer between lake tables: tbl1 -> tbl2

      insert into Iceberg_catalog.Iceberg_db.tbl2
        select * from Iceberg_catalog.Iceberg_db.tbl1
        /*+ OPTIONS('streaming'='true',
            'monitor-interval'='10s',
            'start-snapshot-id'='3821550127947089987') */;
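
The demo above assumes that the Kafka source table Kafka_tbl and the Iceberg table tbl1 already exist. As a minimal sketch of what those definitions could look like (the schema, topic name, and Kafka address are assumptions for illustration, not the actual production tables):

    -- Kafka source table in the default catalog (assumed schema and connection settings)
    CREATE TABLE Kafka_tbl (
        id BIGINT,
        days STRING,
        province_id BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ods_tbl1',                            -- assumed topic name
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'iceberg_ingest',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    );

    -- Iceberg target table managed by the catalog registered above
    CREATE DATABASE IF NOT EXISTS Iceberg_catalog.Iceberg_db;
    CREATE TABLE Iceberg_catalog.Iceberg_db.tbl1 (
        id BIGINT,
        days STRING,
        province_id BIGINT
    ) PARTITIONED BY (days);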

5.2 Parameter explanation

  • monitor-interval

    The time interval for continuously monitoring newly submitted data files (default value: 1s).

  • start-snapshot-id

    Read data starting from the specified snapshot ID. Each snapshot ID is associated with a group of manifest metadata files, and each metadata file maps to its own real data files; through the snapshot ID, the data of a certain version can be read.

6. Pitfalls we ran into

We once wrote data into Iceberg through the SQL Client. The files in the data directory kept being updated, but there was no metadata, so queries returned no rows, because Iceberg queries need the metadata to index the real data files. The SQL Client does not enable checkpointing by default and it has to be enabled through a configuration file, so the data directory was written while the metadata directory was not.

PS: Whether ingesting into the lake through SQL or through the DataStream API, checkpointing must be enabled.
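
For jobs submitted through the SQL Client, the checkpoint settings come from the Flink configuration, for example flink-conf.yaml. Below is a minimal sketch with assumed values (the interval and checkpoint path are placeholders, not our production settings); in a DataStream job, calling enableCheckpointing on the StreamExecutionEnvironment serves the same purpose.

    # flink-conf.yaml excerpt (assumed values)
    execution.checkpointing.interval: 60s
    execution.checkpointing.mode: EXACTLY_ONCE
    state.checkpoints.dir: hdfs:///flink/checkpoints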

7. Data sample

The following two pictures show the effect of querying Iceberg in real time; they were taken one second apart and the data has changed in between.

  • one second ago

img

  • Data refreshed after one second

img

4. Pain point 2: the near-real-time Flink + Hive pipeline gets slower and slower

1. Introduction to pain points

Although the near-real-time Flink + Hive architecture supports real-time reads and writes, it brings the following problems as the number of tables and partitions grows:

  • Too much metadata

    Hive partitions were changed to hour or minute level. Although this improves data freshness, the pressure on the metastore becomes obvious: too much metadata slows down the generation of query plans and also affects the stability of other online services.

  • Database pressure increases

    As metadata grows, so does the pressure on the database that stores the Hive metadata; after a while the database has to be expanded, for example its storage space.

img

img

2. Solution

Migrate the original near-real-time Hive tables to Iceberg. Why can Iceberg handle a large amount of metadata, while Hive easily becomes a bottleneck when its metadata grows large?

  • Iceberg maintains its metadata on a scalable distributed file system and has no centralized metadata system;
  • Hive maintains partition-level metadata in the metastore (too many partitions put huge pressure on MySQL), while the metadata within a partition is actually maintained in files (starting a job requires listing a large number of files to determine whether each one needs to be scanned, and the whole process is very time-consuming).

img

5. Optimization practice

1. Small file handling

  • Before Iceberg 0.11, small files were merged by periodically triggering the batch API. The files could be merged, but a set of Actions code had to be maintained, and the merging was not real-time.

    Table table = findTable(options, conf);   // findTable is an in-house helper that loads the Iceberg table
    Actions.forTable(table)
            .rewriteDataFiles()
            .targetSizeInBytes(10 * 1024)      // 10 KB target size, demo value
            .execute();
  • Iceberg 0.11 has a new feature that supports merging small files in streaming mode.

    Data is written with hash shuffling on the partition/bucket key, so files are merged directly at the source. The benefit is that one task processes the data of a given partition and commits its own DataFiles; for example, a task only handles the data of its corresponding partition. This avoids the problem of many tasks each processing and committing lots of small files, and no extra maintenance code is required. You only need to specify the attribute write.distribution-mode when creating the table; the parameter is shared with other engines, such as Spark. A usage sketch follows the table definition below.

    CREATE TABLE city_table (
        province BIGINT,
        city STRING
    ) PARTITIONED BY (province, city) WITH (
        'write.distribution-mode'='hash'
    );
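
As a minimal usage sketch (Kafka_city_tbl is a hypothetical source table with matching columns), a streaming insert into this table shuffles rows by the partition key, so each writer task only produces files for its own partitions:

    insert into city_table
        select province, city from Kafka_city_tbl;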

2. Iceberg 0.11 sort

2.1 Introduction to Sorting

Before Iceberg 0.11, Flink did not support Iceberg's sorting feature, so it could previously only be combined with Spark to sort in batch mode. Version 0.11 adds support for sorting, which means we can also enjoy this benefit in real time.

The essence of sorting is to scan faster: after the data is clustered by the sort key, it is ordered from small to large, and max-min filtering can skip a lot of irrelevant data.

img

2.2 Sorting demo

insert into Iceberg_table select days, province_id from Kafka_tbl order by days, province_id;

3. Detailed explanation of the manifest after sorting

img

Parameter explanation

  • file_path: physical file location.
  • partition: the partition the file belongs to.
  • lower_bounds: the minimum values of the sort fields in this file; in the figure, these are the minimums of days and province_id.
  • upper_bounds: the maximum values of the sort fields in this file; in the figure, these are the maximums of days and province_id.

Whether a file_path file needs to be read is decided from the partition and the column upper and lower bounds. After the data is sorted, the column-level information is also recorded in the metadata, and the query plan locates the files from the manifest; there is no need to record this information in the Hive metadata, which reduces the pressure on the Hive metadata and improves query efficiency.

Using the sorting feature of Iceberg 0.11, with day as the partition and the data sorted by day, hour, and minute, the manifest files record this sort order, which improves query efficiency when retrieving data. This achieves the retrieval benefits of Hive partitions while avoiding the pressure of excessive Hive metadata.

6. Summary

Compared with previous versions, Iceberg 0.11 adds many practical features, summarized as follows:

  • Flink + Iceberg sorting function

    Before Iceberg 0.11, the sorting feature was integrated with Spark but not with Flink. At that time we migrated a batch of Hive tables with Spark + Iceberg 0.10. The benefit for BI: originally, BI built multi-level partitions to speed up Hive queries, which caused too many small files and too much metadata. During ingestion into the lake, we used Spark to sort on the conditions that BI queries most often, combined with implicit partitioning. This improved BI retrieval speed without the small-file problem, and because Iceberg has its own metadata, it also reduces the pressure on the Hive metadata.

    Iceberg 0.11 supports sorting from Flink, which is a very useful feature. We can move the original Flink + Hive partitioning scheme to Iceberg sorting, which achieves the effect of Hive partitions while also reducing small files and improving query efficiency.

  • Read data in real time

    Data can be read in real time through SQL. The advantage is that data with low real-time requirements, for example business that can accept a delay of 1 to 10 minutes, can be put into Iceberg. This reduces the pressure on Kafka while still allowing near-real-time reads and retaining historical data.

  • Merge small files in real time

    Before Iceberg 0.11, Iceberg's merge API had to be used to maintain small-file merging; the API required table information and timing settings, and the merging ran in batch rather than in real time. In terms of code it adds maintenance and development costs; in terms of timeliness it is not real-time. In 0.11, the hash mode merges data at the source in real time: you only need to specify the ('write.distribution-mode'='hash') attribute when creating the table in SQL, without manual maintenance.
