This article is compiled from the talk "Building a Data Lake Solution with Iceberg and Object Storage," given by Sun Wei, Senior Software R&D Manager at Dell Technologies, at the Flink Meetup in Shanghai on April 17. It covers:
- Introduction to Data Lake and Iceberg
- Object storage supports the Iceberg data lake
- Demonstration plan
- Some thoughts on storage optimization
1. Introduction to Data Lake and Iceberg
1. Data Lake Ecology
As shown in the figure above, a mature data lake ecosystem needs the following:
- First, it should be able to store massive amounts of data; common choices are object storage, public cloud storage, and HDFS;
- On top of that, it needs to support rich data types, including unstructured images and videos, semi-structured CSV, XML, and logs, and structured database tables;
- In addition, it needs efficient and unified metadata management, so that compute engines can easily index the various kinds of data for analysis;
- Finally, it needs to support a rich set of compute engines, including Flink, Spark, Hive, Presto, and others, so that the enterprise's existing application architectures can be connected easily.
2. Application scenarios of structured data on the data lake
The picture above shows a typical application scenario on a data lake.
The data sources vary widely in type and format: transaction data, logs, event-tracking data, IoT data, and so on. This data passes through streaming pipelines into the computing platform, which then needs a structured solution to organize the data on a storage platform so that back-end data applications can query it in real time or on a schedule.
What characteristics does such a solution need?
- First, as can be seen, the data sources are of many kinds, so the solution needs to support organizing data with rich schemas;
- Second, data must be queryable in real time while it is still being ingested, so ACID guarantees are needed to ensure that dirty, half-written intermediate data is never read;
- Finally, a log format may need to change on the fly, or a column may be added. In such cases, we want to avoid rewriting all the data and re-ingesting it into storage as a traditional data warehouse would require; a lightweight solution is needed instead.
Iceberg is positioned to provide exactly this capability, connecting to compute platforms above and to storage platforms below.
3. Typical solutions of structured data on the data lake
For organizing structured data, the typical solution is to follow the traditional organization of a database.
As shown in the figure above, at the top is a namespace, which provides isolation between database tables; in the middle are multiple tables, each able to store data with its own schema; at the bottom sits the data itself. The tables need to provide ACID properties and also support partial schema evolution.
4. Iceberg table data organization structure
- Snapshot Metadata: table schema, partition, partition spec, Manifest List path, current snapshot, etc.
- Manifest List: Manifest File paths, their partitions, and data file statistics.
- Manifest File: Data File paths and the upper and lower bounds of each column.
- Data File: the actual table content data, organized in Parquet, ORC, Avro, or other formats.
Next, let's take a look at how Iceberg organizes data, as shown in the figure:
- Starting from the right, the Data Files store the table content data, generally in Parquet, ORC, Avro, or other formats;
- Above them are the Manifest Files, which record the paths of the data files below as well as the upper and lower bounds of each column, making it convenient to filter files during queries;
- Above that is the Manifest List, which links multiple Manifest Files below it and also records the partition range corresponding to each Manifest File, again to make later filtering and querying convenient.
The Manifest List in effect represents a snapshot: it references all the data of the current table and is also a key guarantee for Iceberg's ACID support.
With snapshots, a read only sees data reachable from a snapshot; data that is still being written is not yet referenced by any snapshot, so no dirty data is read. Multiple snapshots share earlier data files by sharing these Manifest Files.
At the top is the snapshot metadata, which records the current and historical table schema, the partition configuration, the Manifest List paths of all snapshots, and which snapshot is the current one.
In addition, Iceberg provides namespace and table abstractions on top of this for complete data organization and management.
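To make this hierarchy concrete, the sketch below walks the metadata chain through Iceberg's Java API. It assumes a Hadoop catalog over an s3a warehouse and a hypothetical table db.events; method names such as allManifests(FileIO) vary slightly across Iceberg versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class InspectMetadata {
    public static void main(String[] args) {
        // Hypothetical warehouse path and table name, for illustration only.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://warehouse/");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // The current snapshot points at exactly one Manifest List file.
        Snapshot current = table.currentSnapshot();
        System.out.println("snapshot id:        " + current.snapshotId());
        System.out.println("manifest list path: " + current.manifestListLocation());

        // Each Manifest File in turn tracks a set of Data Files and their partition ranges.
        for (ManifestFile manifest : current.allManifests(table.io())) {
            System.out.println("manifest file: " + manifest.path()
                + ", partition spec id: " + manifest.partitionSpecId());
        }
    }
}
```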
5. Iceberg writing process
The figure above shows the flow of writing data into Iceberg, using Flink as the compute engine.
- First, Data Workers read the incoming data and parse it, then hand the records to Iceberg for storage;
- Like common databases, Iceberg has predefined partitions; the records are written into the corresponding partitions, forming new data files;
- Flink has a checkpoint mechanism. When a checkpoint arrives, Flink finishes writing the current batch of files, generates the list of files in this batch, and hands it to the Commit Worker;
- The Commit Worker reads the information of the current snapshot, merges it with the file list generated in this round to produce a new Manifest List and the subsequent table metadata file, and then commits. On success, a new snapshot is formed (see the sketch below).
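As a rough illustration of this write path, here is a minimal Flink SQL sketch driven from Java. Checkpointing is enabled because Iceberg commits are tied to Flink checkpoints; the warehouse path, catalog properties, and the upstream table source_events are placeholders, not part of the original talk.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class WriteToIceberg {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Iceberg snapshots are committed on Flink checkpoints, so checkpointing must be on.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical warehouse location; properties follow the Flink-Iceberg connector docs.
        tEnv.executeSql(
            "CREATE CATALOG lake WITH ("
            + " 'type'='iceberg',"
            + " 'catalog-type'='hadoop',"
            + " 'warehouse'='s3a://demo-bucket/warehouse')");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS lake.db");

        // Records land in partitions of this table as new data files.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS lake.db.events ("
            + " id BIGINT, category STRING, ts TIMESTAMP(3)"
            + ") PARTITIONED BY (category)");

        // 'source_events' stands in for the upstream data; each checkpoint turns the
        // files written so far into a new committed snapshot.
        tEnv.executeSql("INSERT INTO lake.db.events SELECT id, category, ts FROM source_events");
    }
}
```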
6. Iceberg query process
Above is the Iceberg data query process.
- First, the Flink Table scan worker performs a scan. The scan can proceed like a tree walk: start from the root, find the current snapshot or a user-specified historical snapshot, take the Manifest List file out of that snapshot, and use it to filter out the Manifest Files that satisfy the query conditions;
- Then, using the information recorded in each Manifest File, filter out the Data Files that are actually needed. Once the files are determined, they are handed to record reader workers, which read the records that satisfy the conditions from the files and return them to the upper-layer caller.
One feature stands out here: no List (directory listing) operation is used anywhere in the query process. Because Iceberg records the entire file tree completely, every file is reached through a direct path and no listing is required. Query performance therefore never pays for time-consuming List operations, which is especially friendly to object storage, where List is an expensive operation.
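The sketch below illustrates this metadata-only planning with Iceberg's Java scan API: planFiles() resolves the data files a query needs purely from snapshot metadata, without listing any directories. The column name category is a hypothetical example.

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class PlanScan {
    // Plans which data files a query needs using only Iceberg metadata,
    // following snapshot -> manifest list -> manifest files -> data files.
    static void planFiles(Table table) throws Exception {
        try (CloseableIterable<FileScanTask> tasks = table.newScan()
                .filter(Expressions.equal("category", "clicks"))  // pruned via manifest column stats
                .planFiles()) {
            for (FileScanTask task : tasks) {
                System.out.println("data file: " + task.file().path());
            }
        }
    }
}
```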
7. The features of Iceberg Catalog at a glance
Iceberg's Catalog provides a good abstraction for connecting data storage and metadata management. Any storage system that implements Iceberg's Catalog abstraction can be plugged into Iceberg and organized into the data lake solution described above.
As shown in the figure above, the Catalog mainly provides several abstractions.
- It can define the various files that play different roles for Iceberg;
- Its File IO can be customized, including read, write, and delete operations;
- Its namespace and table operations (also called metadata operations) can be customized;
- Table reads/scans and table commits can also be customized through the Catalog.
This leaves flexible room for implementation and makes it easy to connect various kinds of underlying storage.
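As a rough sketch of the File IO part of this abstraction, the skeleton below shows the three methods a custom FileIO must implement. The class name and the object-store wiring are hypothetical; a real implementation would route each call to the object store's own client rather than going through file-system semantics.

```java
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Minimal skeleton of a custom FileIO backed by an object store.
public class ObjectStoreFileIO implements FileIO {

    @Override
    public InputFile newInputFile(String path) {
        // Return an InputFile that reads the object at 'path' via the object-store client.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public OutputFile newOutputFile(String path) {
        // Return an OutputFile that streams writes to the object store.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public void deleteFile(String path) {
        // Delete the object at 'path'.
        throw new UnsupportedOperationException("sketch only");
    }
}
```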
2. Object storage supports the Iceberg data lake
1. Current implementation of Iceberg Catalog
The existing Iceberg Catalog implementations in the community can be divided into two parts: the data IO part and the metadata management part.
As shown in the figure above, there is in fact no Catalog implementation built for private object storage. S3A can in principle connect to object storage, but it goes through file-system semantics rather than native object-storage semantics, and simulating those file operations adds extra overhead. What we want is to move all data and metadata management onto the object storage itself, rather than designing the two separately.
2. Comparison of Object Storage and HDFS
There is a question here. Why use object storage when HDFS is available?
As shown below, we compare object storage with HDFS from various angles.
In summary, we believe that:
- Object storage has more advantages in cluster scalability, small file friendliness, multi-site deployment and low storage overhead;
- The advantages of HDFS are append uploads and atomic rename, which are exactly what Iceberg needs.
The following briefly describes the respective advantages of the two storage systems.
1) Comparison: Cluster scalability
- The HDFS architecture stores all metadata on a single NameNode, so its metadata capacity is bounded by a single node and there is no horizontal scalability for metadata.
- Object storage generally shards metadata by hashing and hands each shard to services on different nodes to manage. This naturally gives metadata a higher upper limit, and in extreme cases the shards can even be rehashed into finer pieces and spread across more nodes, achieving scalability.
2) Comparison: Small file friendly
Nowadays, in big data applications, small files are becoming more and more common and gradually become a pain point.
- HDFS is constrained by its architecture: small-file storage is limited by NameNode resources such as memory. Although HDFS provides an archive mechanism to merge small files and relieve NameNode pressure, this adds extra complexity and is not native.
- Likewise, the TPS for small files is limited by the processing capacity of the single NameNode. In object storage, metadata is stored and managed in a distributed way, so traffic is spread across the nodes and a very large number of small files can be stored and served.
- In addition, many object stores now offer multiple storage media and tiered acceleration, which further improves small-file performance.
3) Comparison: Multi-site deployment
Object storage supports multi-site deployment:
- Global namespace
- Rich rule configuration
The multi-site deployment capability of object storage fits the "two sites, three data centers" and multi-active architectures, while HDFS has no native multi-site deployment capability. Although some commercial versions add multi-site data replication on top of HDFS, the two clusters are essentially independent systems, so they cannot support true multi-active operation under a global namespace.
4) Comparison: low storage overhead
For storage systems, to tolerate random hardware failures, data is generally protected through redundancy.
- The common approach is three-way replication: the data is stored as three copies placed on three different nodes. The storage overhead is 3x, but the simultaneous loss of two copies can be tolerated without losing data.
- The other approach is erasure coding, usually called EC. Taking 10+2 as an example, the data is cut into 10 data blocks and an algorithm computes 2 parity blocks, 12 blocks in total, which are then distributed across four nodes; the storage overhead is 1.2x. It can likewise tolerate two simultaneous block failures, since the remaining 10 blocks are enough to reconstruct all the data. EC thus reduces storage overhead while still tolerating failures.
HDFS uses three-way replication by default, and newer HDFS versions already support EC. From our research, HDFS applies EC per file, which puts small files at a natural disadvantage: if a file is smaller than the stripe size, its overhead is higher than expected because the two parity blocks cannot be amortized. In the extreme case where the file is about the size of a single parity block, the overhead is effectively equivalent to three copies.
Moreover, once an HDFS file is erasure-coded, it can no longer support append, hflush, hsync, and similar operations, which greatly limits the scenarios where EC can be used. Object storage supports EC natively: based on a pre-configured strategy, it internally merges small files into a large block before erasure coding, so the storage overhead stays constant.
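As a quick check of the arithmetic above, the storage overhead of n-way replication and of a k+m erasure-coding scheme can be written as:

```latex
\text{overhead}_{\text{replication}} = n, \qquad
\text{overhead}_{\text{EC}} = \frac{k+m}{k}
\quad\Longrightarrow\quad
3 \text{ copies} = 3\times, \qquad
\text{EC } 10{+}2 = \frac{12}{10} = 1.2\times
```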
3. The challenge of object storage: appending data
In the S3 protocol, an object's size must be provided when it is uploaded.
Taking the S3 standard as an example: when object storage is connected to Iceberg, standard S3 object storage has no interface for appending data to an object, because the protocol requires the object size to be provided at upload time. This is not very friendly to streaming File IO, where data arrives incrementally.
1) Solution 1: S3 Catalog append upload - cache small files locally or in memory
Small files are written to a local or in-memory cache as they are streamed in, and are uploaded to the object storage only after they have been written completely.
2) Solution 2: S3 Catalog append upload - MPU for large files
For large files, the multipart upload (MPU) mechanism defined by the S3 standard is used.
It is generally divided into several steps:
- First, initiate an MPU and obtain an Upload ID; each part is then uploaded with that Upload ID and a part number, and the parts can be uploaded in parallel;
- After all parts are uploaded, a Complete operation is required; this tells the system to assemble all parts sharing the same Upload ID, in ascending part-number order, into one large object;
- Applying this mechanism to data writing, the conventional implementation is to cache the file being written locally. When the cache reaches the required part size, initiate the MPU and upload that part; repeat the same operation for each subsequent part until the last one, and finally call Complete to finish the upload.
MPU has advantages and disadvantages:
- The disadvantage is that MPU has an upper limit on the number of parts, which may be only 10,000 in the S3 standard. To support large files, the part size therefore cannot be too small, so files smaller than a single part still have to be uploaded with the cache-based method above;
- The advantage of MPU is parallel uploads. With asynchronous uploads, once a part is cached, the next one can start being cached without waiting for the previous upload to finish. When data is ingested fast enough at the front end, the asynchronous uploads at the back end effectively run in parallel, giving higher throughput than a single stream (see the sketch below).
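A minimal sketch of this MPU flow with the AWS SDK for Java v2 is shown below. The bucket, key, and the already-buffered parts are assumed inputs; error handling and the parallel/asynchronous part uploads described above are omitted for brevity.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadResponse;
import software.amazon.awssdk.services.s3.model.UploadPartResponse;

import java.util.ArrayList;
import java.util.List;

public class MultipartUploadSketch {
    // Uploads pre-buffered chunks as one object: initiate -> upload numbered parts -> complete.
    static void upload(S3Client s3, String bucket, String key, List<byte[]> parts) {
        // 1. Initiate the MPU and obtain an Upload ID.
        CreateMultipartUploadResponse init = s3.createMultipartUpload(
                b -> b.bucket(bucket).key(key));
        String uploadId = init.uploadId();

        // 2. Upload each buffered chunk with the Upload ID and a part number.
        //    In a streaming writer these calls would be issued asynchronously, in parallel.
        List<CompletedPart> completed = new ArrayList<>();
        int partNumber = 1;
        for (byte[] chunk : parts) {
            final int pn = partNumber++;
            UploadPartResponse resp = s3.uploadPart(
                    b -> b.bucket(bucket).key(key).uploadId(uploadId).partNumber(pn),
                    RequestBody.fromBytes(chunk));
            completed.add(CompletedPart.builder().partNumber(pn).eTag(resp.eTag()).build());
        }

        // 3. Complete: the service stitches the parts together in part-number order.
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(completed).build()));
    }
}
```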
4. The challenge of object storage: atomic commit
The next problem is atomic commit on object storage.
As mentioned earlier, during data ingestion the final commit is actually divided into several steps and forms a linear transaction: first read the current snapshot version, then merge in this round's file list, then commit the new version. This is similar to the familiar "i = i + 1" in programming: it is not an atomic operation, and the object storage standard does not provide such a capability.
The figure above shows a scenario where metadata is committed concurrently.
- Here, Commit Worker 1 reads the v006 version, merges in its own files, and successfully commits v007.
- Meanwhile, Commit Worker 2 has also read v006, done its own merge, and also produced a v007. At this point a mechanism is needed to tell it that its v007 conflicts and cannot be committed, and to make it retry; on retry it reads the new v007, merges against it, and commits v008.
This is a typical conflict scenario, and a mechanism is required to detect it: if the conflict goes undetected, the second v007 commit will overwrite the first, and all the data committed by the first worker will be lost.
As shown in the figure above, we can use a distributed lock mechanism to solve the above problems.
- First, Commit Worker 1 reads v006 and merges its files. Before committing, it must acquire the lock. Once it holds the lock, it checks the current snapshot version; if it is still v006, v007 can be committed successfully, and the lock is released after the commit.
- Commit Worker 2, having merged on top of v006, cannot acquire the lock at first; it only gets it after Commit Worker 1 releases it. When it acquires the lock and checks again, it finds the current version is already v007, which conflicts with its own v007, so the commit fails and it retries.
This solves the concurrent-commit problem by means of a lock.
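The sketch below captures this lock-protected optimistic commit in simplified form. MetadataStore and its methods are hypothetical stand-ins for whatever reads and writes the table's version metadata, and the Lock would be backed by a distributed lock service in practice.

```java
import java.util.List;
import java.util.concurrent.locks.Lock;

public class LockedCommitter {

    // Hypothetical interface over the table's version files (e.g. v006, v007, ...).
    interface MetadataStore {
        int currentVersion();                                   // e.g. returns 6 for v006
        void writeVersion(int version, List<String> manifestList);
    }

    private final Lock lock;            // a distributed lock in a real deployment
    private final MetadataStore store;

    LockedCommitter(Lock lock, MetadataStore store) {
        this.lock = lock;
        this.store = store;
    }

    // Returns true if the commit succeeded, false if the caller must re-read and retry.
    boolean commit(int baseVersion, List<String> mergedManifestList) {
        lock.lock();
        try {
            // Re-check the version under the lock: if someone else already committed
            // baseVersion + 1, this merge is stale and must be redone by the caller.
            if (store.currentVersion() != baseVersion) {
                return false;
            }
            store.writeVersion(baseVersion + 1, mergedManifestList);
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```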
5. Appending data with Dell EMC ECS
The solutions above for S3-standard object storage with Iceberg still have drawbacks, such as performance loss or the need to deploy an extra lock service.
Dell EMC ECS is also an object store, but it answers this question differently: it extends the S3 standard protocol and supports appending data to objects.
The difference between its append upload and MPU is that there is no limit on the part size. Parts can be set smaller, and after uploading they are concatenated internally into what remains a single valid object.
Append upload and MPU each fit certain scenarios.
MPU can accelerate uploads; append upload performs well enough when the write rate is moderate, and it avoids MPU's initialization and completion steps. The two can therefore be chosen per scenario based on performance.
6. Dell EMC ECS solution under concurrent submission
ECS object storage also provides If-Match semantics; a similar interface capability exists on Microsoft's and Google's cloud storage.
- If-Match means that when Commit Worker 1 reads v006, it also obtains the file's ETag and sends that ETag along when it commits. The system checks whether the ETag in the request matches the current ETag of the file about to be overwritten; if they match, the overwrite is allowed and v007 is committed successfully;
- In the other case, Commit Worker 2 also holds the ETag of v006, but when it uploads, that ETag no longer matches the file currently in the system, so the request fails and a retry is triggered.
This implementation has the same effect as the lock mechanism, but no external lock service needs to be deployed to guarantee atomic commits.
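The sketch below shows the shape of such an ETag-guarded overwrite using the standard HTTP If-Match header. Request signing/authentication and the exact endpoint are omitted, and whether the store honors If-Match on PUT depends on its S3 extensions (ECS in this talk); the URL and 412 handling are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalPut {
    // Overwrites an object only if its ETag still matches the one read earlier.
    static boolean putIfMatch(HttpClient client, String objectUrl,
                              String expectedEtag, byte[] body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(objectUrl))
                .header("If-Match", expectedEtag)               // guard against concurrent commits
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() == 412) {                     // Precondition Failed: someone committed first
            return false;                                       // caller re-reads the new version and retries
        }
        return response.statusCode() / 100 == 2;
    }
}
```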
7. S3 Catalog: unified data storage
To recap, we have now solved the append-upload problem in data File IO and the atomic-commit problem for table metadata.
With these solved, all data and metadata management can be moved onto the object storage; no additional metadata service needs to be deployed, and truly unified data storage is achieved.
3. Demonstration plan
As shown above, the demonstration uses Pravega, which can be loosely understood as an alternative to Kafka with further performance optimizations.
In this example, data is injected into a Pravega stream, Flink reads the data from Pravega for analysis and writes it out in Iceberg's organization, and Iceberg uses the ECS Catalog to connect directly to the object storage without any other deployment. Finally, Flink reads the data back.
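A condensed sketch of this demo pipeline in Flink SQL might look like the following. The EcsCatalog class name, the warehouse URI, and the pravega_source table are placeholders; the actual connector and catalog options depend on the Pravega connector and ECS catalog versions used.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DemoPipeline {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Catalog backed directly by object storage; 'catalog-impl' and its properties
        // are hypothetical stand-ins for the ECS catalog described in the talk.
        tEnv.executeSql(
            "CREATE CATALOG ecs_lake WITH ("
            + " 'type'='iceberg',"
            + " 'catalog-impl'='com.example.iceberg.EcsCatalog',"   // hypothetical class name
            + " 'warehouse'='ecs://demo-bucket/warehouse')");

        // 'pravega_source' stands in for a table defined with the Pravega Flink connector;
        // its connector options are omitted because they vary by connector version.
        tEnv.executeSql("INSERT INTO ecs_lake.db.events SELECT * FROM pravega_source");

        // Finally, Flink reads the data back from the Iceberg table.
        tEnv.executeSql("SELECT count(*) FROM ecs_lake.db.events").print();
    }
}
```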
4. Some thoughts on storage optimization
The above picture shows the current data organization structure supported by Iceberg, and you can see that it directly stores Parquet files in the storage.
Our thought is this: if this data lake and the lake holding the original source files are in fact the same lake, the generated Parquet files and the source files may contain a great deal of redundant data, so can that redundancy be reduced?
For example, in the most extreme case, Iceberg records only a reference to the source file and no Parquet data file is stored at all. At query time, a custom File IO generates Parquet-like data in memory on the fly from the original file and hands it to the upper-layer application, achieving the same effect.
However, this approach only suits situations where storage cost matters a great deal and query-performance demands are modest. It is feasible thanks to Iceberg's clean abstractions: both file metadata and File IO are abstracted, so a source file can be presented to Iceberg as if it were a Parquet file.
Going a step further, can we optimize query performance while still saving storage space?
For example, through pre-computation: extract some commonly used columns from the source file, store the statistics in Iceberg, and read from the source file together with the pre-computed files. This allows the needed information to be queried quickly while saving the storage space of infrequently used columns.
This is still a fairly preliminary idea. If it can be realized, Iceberg could index not only structured Parquet files but also semi-structured and structured data in other formats, answering upper-layer query tasks through on-the-fly computation and becoming a more complete Data Catalog.