This article is based on the talk Hu Zheng shared at the April 17 Shanghai Meetup: what challenges arise when ingesting data into the data lake, and how Flink + Iceberg solves them.
GitHub: https://github.com/apache/flink
Everyone is welcome to like and star the Flink project.
1. Core challenges of ingesting data into the lake
A real-time lake-ingestion link can be divided into three parts: the data source, the data pipeline, and the data lake (data warehouse). This article focuses on these three parts.
1. Case #1: A program bug interrupts data transmission
- First, when data flows from the source through the pipeline into the data lake (data warehouse), a bug in the job may interrupt the transfer halfway, leaving partially written data that affects the business;
- Second, when this happens, how do we restart the job and guarantee that the data is neither duplicated nor lost, and is fully synchronized to the data lake (data warehouse)?
2. Case #2: Data changes are too painful
Schema change
When data changes occur, they put considerable pressure on the whole link. Take the figure below as an example: a table originally defines two fields, ID and NAME, and the business team then asks for an ADDRESS column in order to better mine user value.
First we add the Address column to the source table, then propagate the change through Kafka in the middle of the link, then modify the job and restart it. The change has to walk through the entire link: add the new column, modify and restart jobs, and finally update all the data in the data lake (data warehouse) to include the new column. This process is not only time-consuming, it also raises a question: during the change, how do we keep the data isolated so that running analysis jobs are not affected?
Partition change
As shown in the figure below, a table in the data warehouse is partitioned by month, and we now want to switch to partitioning by day. This may require rewriting the data in many systems before the new partitioning can be used, which is very time-consuming.
3. Case #3: Near-real-time reports keep getting slower
When the business needs more near-real-time reports, the data import cycle has to shrink from days to hours or even minutes, which triggers a series of problems.
As shown in the figure above, the first problem is that the number of files grows at a rate visible to the naked eye, putting more and more pressure on external systems. The pressure shows up in two places.
The first pressure is that analysis jobs start more and more slowly, and the Hive Metastore faces scaling difficulties, as shown in the figure below.
- As small files accumulate, the centralized Metastore becomes an increasingly serious bottleneck, so starting an analysis job gets slower and slower, because job startup has to scan the metadata of all those small files.
- Second, because the Metastore is a centralized system, it easily runs into scaling problems. For example, Hive may have to find a way to scale out the MySQL database behind it, which brings significant maintenance cost and overhead.
The second pressure is that the scan phase of analysis jobs gets slower and slower.
As the number of small files grows, scans slow down correspondingly. The root cause is that the explosion of small files forces the scan job to switch frequently between many DataNodes.
4. Case #4: It is difficult to analyze CDC data in real time
After investigating the various systems in the Hadoop ecosystem, we found that it is not easy to make the whole link run fast, reliably, and with good concurrency.
First, on the source side: for example, to synchronize MySQL data into the data lake for analysis, we face the problem that MySQL holds existing full data while incremental data keeps being produced. How do we synchronize both the full and the incremental data into the data lake seamlessly, guaranteeing that no data is missing and none is duplicated?
In addition, even assuming the full-to-incremental switch on the source side is solved, if an exception occurs during synchronization, for example an upstream schema change interrupts the job, how do we guarantee that the CDC data reaches the downstream with not a single row more or less?
Building the entire link involves the full-to-incremental switch on the source side, chaining the intermediate data streams together, and the process of writing into the data lake (data warehouse). Putting the whole link together requires a lot of code, and the development threshold is high.
The last and most critical problem is that it is hard to find, in the open-source ecosystem, a storage system that can analyze CDC change data efficiently and with high concurrency.
5. Summary: core challenges of ingesting data into the lake
Data synchronization tasks get interrupted
- Unable to effectively isolate the impact of writes on analysis;
- Synchronization tasks cannot guarantee exactly-once semantics.
End-to-end data change
- DDL leads to complex updates and upgrades across the entire link;
- It is difficult to modify existing data already in the lake/warehouse.
Near-real-time reports get slower and slower
- Frequent writes generate a large number of small files;
- The metadata system is under heavy pressure, and job startup is slow;
- A large number of small files causes slow data scanning.
CDC data cannot be analyzed in near real time
- It is difficult to complete the switch from full to incremental synchronization;
- Involving end-to-end code development, the threshold is high;
- The open-source ecosystem lacks an efficient storage system for this.
2. Introduction to Apache Iceberg
1. Netflix: pain points of running Hive in the cloud
The most important reason Netflix created Iceberg was to solve the pain points of migrating Hive to the cloud. These pain points fall into three areas:
1.1 Pain point 1: data changes and historical backtracking are difficult
- No ACID semantics are provided, so when data changes it is hard to isolate the impact on analysis tasks; typical operations include INSERT OVERWRITE, modifying a data partition, and modifying the schema;
- Concurrent data changes cannot be handled, which leads to conflicts;
- It is not possible to roll back to a historical version effectively.
1.2 Pain point 2: It is difficult to replace HDFS with S3
- The data access interface depends directly on the HDFS API;
- It relies on the atomicity of the RENAME operation, which is hard to reproduce with the same semantics on object stores such as S3;
- It relies heavily on listing file directories, which is very inefficient on object storage systems.
1.3 Pain point 3: Too many details
- When the schema changes, different file formats behave inconsistently, and even the supported data types differ between formats;
- The Metastore only maintains partition-level statistics, which makes task planning costly, and the Hive Metastore is hard to scale;
- Non-partition fields cannot be used for partition pruning.
2. Apache Iceberg core features
Universal standard design
- Fully decoupled from compute engines
- Schema standardization
- Open data format
- Support Java and Python
Perfect Table Semantics
- Schema definition and changes
- Flexible Partition Strategy
- ACID semantics
- Snapshot semantics
Rich data management
- Unified storage for streaming and batch
- Scalable metadata design
- Batch update and CDC
- Support file encryption
Cost-effective
- Compute pushdown design
- Low-cost metadata management
- Vectorized computing
- Lightweight index
3. Apache Iceberg File Layout
The figure above shows the standard Iceberg table-format structure. Its core is divided into two parts: data and metadata, both of which are maintained on S3 or HDFS.
4. Apache Iceberg Snapshot View
The picture above shows the general flow of Iceberg's writing and reading.
You can see that there are three levels:
- The yellow one at the top is a snapshot;
- The blue one in the middle is Manifest;
- At the bottom are the data files.
Each write generates a batch of data files, one or more manifests, and a snapshot.
For example, the first write forms snapshot Snap-0, the second forms snapshot Snap-1, and so on. The metadata, however, is maintained incrementally, appended step by step on top of what already exists.
In this way, users can run batch analysis over the whole table on a unified storage layer, and can also run incremental analysis between snapshots on the same storage. This is the basis on which Iceberg supports both streaming and batch reads and writes.
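As a rough illustration of snapshot-based batch and incremental reads, the sketch below uses Iceberg's Java scan API. The table path and snapshot IDs are placeholders, and the exact scan API varies between Iceberg versions, so treat this as a hedged sketch rather than the method described in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.hadoop.HadoopTables;

public class IncrementalScanExample {
    public static void main(String[] args) {
        // Load a table stored directly on HDFS/S3 (the path is a placeholder).
        Table table = new HadoopTables(new Configuration())
                .load("hdfs:///warehouse/db/user_events");

        // Full batch scan: reads the table as of the current snapshot.
        TableScan fullScan = table.newScan();

        // Incremental scan: only the data appended between two snapshots,
        // e.g. between Snap-0 and Snap-1 (IDs below are placeholders).
        long snap0 = 1L;
        long snap1 = 2L;
        TableScan incrementalScan = table.newScan().appendsBetween(snap0, snap1);

        incrementalScan.planFiles().forEach(task ->
                System.out.println("incremental file: " + task.file().path()));
    }
}
```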
5. Companies that have chosen Apache Iceberg
The picture above shows some of the companies currently using Apache Iceberg. The domestic adopters are familiar to everyone, so here is a brief overview of how companies abroad use it.
- Netflix now has hundreds of petabytes of data on top of Apache Iceberg, and the daily data increment written through Flink is in the hundreds of terabytes.
- Adobe's daily data increment is several terabytes, and the total data size is roughly in the tens of petabytes.
- AWS uses Iceberg as the base of the data lake.
- Cloudera builds its entire public cloud platform on Iceberg. The trend of on-premises HDFS/Hadoop deployments is weakening while the move to the cloud keeps growing, and Iceberg plays a key role in this cloud-bound stage of Cloudera's data architecture.
Apple has two teams using it:
- First, the entire iCloud data platform is built on Iceberg;
- Second, the AI voice service Siri also builds its entire data ecosystem on Flink and Iceberg.
3. How Flink and Iceberg solve the problem
Returning to the key topic: the following explains how Flink and Iceberg solve the series of problems described in the first part.
1. Case #1: A program bug interrupts data transmission
First, using Flink for the synchronization link guarantees exactly-once semantics: when a job fails, it can perform strict recovery and keep the data consistent.
Second, Iceberg provides rigorous ACID semantics, which makes it easy to isolate the effects of writes from analysis tasks.
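A minimal sketch of this combination, assuming the iceberg-flink connector is on the classpath; the table path, checkpoint interval, and source builder are placeholders, and the exact sink API differs slightly across Flink and Iceberg versions.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class ExactlyOnceIngestJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what gives the link its exactly-once guarantee:
        // the Iceberg sink commits a snapshot per successful checkpoint.
        env.enableCheckpointing(60_000);

        DataStream<RowData> source = buildSource(env);  // hypothetical source builder

        // The Iceberg sink commits data atomically (ACID), so readers never
        // observe a half-written batch even if the job fails mid-way.
        FlinkSink.forRowData(source)
                .tableLoader(TableLoader.fromHadoopTable("hdfs:///warehouse/db/events"))
                .append();

        env.execute("exactly-once ingestion into Iceberg");
    }

    private static DataStream<RowData> buildSource(StreamExecutionEnvironment env) {
        throw new UnsupportedOperationException("replace with a real source");
    }
}
```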
2. Case #2: Data changes are too painful
As shown above, when data changes occur, Flink plus Iceberg can solve the problem.
Flink can capture the upstream schema-change event and synchronize it downstream; the downstream Flink job then forwards the data straight through into storage, and Iceberg can change the schema instantly.
When a schema DDL is executed, Iceberg simply maintains multiple versions of the schema: the old data remains completely intact, new data is written with the new schema, and the change is isolated with one click.
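To make the "one-click" schema change concrete, here is a small sketch using Iceberg's Java schema-evolution API; the table path and column name are illustrative, not from the original talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class SchemaEvolutionExample {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs:///warehouse/db/users");  // placeholder path

        // Adding the ADDRESS column is a pure metadata operation: existing data
        // files are untouched, and readers of the old schema keep working.
        table.updateSchema()
                .addColumn("address", Types.StringType.get())
                .commit();
    }
}
```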
Another example is the problem of partition changes. Iceberg's approach is shown in the figure above.
For data previously partitioned by month (the yellow blocks above), switching to day-based partitioning is a one-click change of the partition spec: the existing data stays unchanged, all newly written data is partitioned by day, and ACID semantics keep the change isolated.
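As a hedged illustration of this partition evolution, the sketch below uses Iceberg's Java API for updating a partition spec; the timestamp column name "ts" is assumed, and only relatively recent Iceberg versions expose this API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

import static org.apache.iceberg.expressions.Expressions.day;
import static org.apache.iceberg.expressions.Expressions.month;

public class PartitionEvolutionExample {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs:///warehouse/db/events");  // placeholder path

        // Switch from month-based to day-based partitioning. Old files keep
        // their monthly layout; only newly written data uses the daily spec.
        table.updateSpec()
                .removeField(month("ts"))   // "ts" is an assumed timestamp column
                .addField(day("ts"))
                .commit();
    }
}
```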
3. Case #3: Near-real-time reports keep getting slower
The third problem is the pressure on the Metastore caused by small files.
First, for the Metastore: Iceberg stores its metadata in the file system itself and maintains it in the form of metadata files. This removes the centralized Metastore from the picture and relies only on the file system to scale, so scalability is much better.
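To make this concrete, here is a brief sketch of a file-system-backed catalog with no Hive Metastore involved, using Iceberg's HadoopCatalog; the warehouse path, database, and table name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class FileSystemCatalogExample {
    public static void main(String[] args) {
        // All table metadata lives under this warehouse path on HDFS/S3;
        // there is no centralized Metastore service to scale or maintain.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "hdfs:///warehouse");

        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.optional(2, "name", Types.StringType.get()));

        Table table = catalog.createTable(TableIdentifier.of("db", "users"), schema);
        System.out.println("table location: " + table.location());
    }
}
```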
The other problem is that more and more small files make data scanning slower and slower. For this, Flink and Iceberg provide a series of solutions:
- The first solution optimizes small files at write time: data is shuffled by bucket before writing, so each bucket is written by a single task and the number of files produced stays naturally small.
- The second is to run a batch job that periodically merges (compacts) the small files, as in the sketch after this list.
- The third, smarter solution is to merge small files automatically and incrementally.
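For the periodic-compaction option, a rough sketch using the Iceberg Flink actions API follows; the table path and target file size are placeholders, and the exact action API differs between Iceberg versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.hadoop.HadoopTables;

public class CompactSmallFilesJob {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs:///warehouse/db/events");  // placeholder path

        // Rewrite many small data files into fewer files close to the target size.
        // Thanks to snapshot isolation, readers running against older snapshots
        // are not affected while the rewrite commits.
        Actions.forTable(table)
                .rewriteDataFiles()
                .targetSizeInBytes(128 * 1024 * 1024L)  // ~128 MB target files
                .execute();
    }
}
```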
4. Case #4: It is difficult to analyze CDC data in real time
- The first problem is synchronizing the full data with the incremental data. The community already has the Flink CDC Connector solution: the connector automatically stitches the full snapshot and the incremental changes together.
- The second problem is how to guarantee, even if an exception occurs mid-way, that the binlog rows are synchronized into the lake with not a single row more or less.
For this, Flink can recognize the different types of change events at the engine level, and with Flink's exactly-once semantics the job can automatically recover from failures and handle them correctly.
- The third problem is that building the entire link requires a lot of code, and the development threshold is too high.
With the Flink plus data lake solution, you only need to define a source table and a sink table and then run a single INSERT INTO; the whole link is wired up without writing any business code.
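A hedged end-to-end sketch of such a link, expressed as Flink SQL submitted from Java: it assumes the flink-connector-mysql-cdc and iceberg-flink connectors are available, and all host names, credentials, and table names are placeholders (CDC-style upserts into Iceberg may additionally require format v2 and upsert options depending on version).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlCdcToIcebergJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source: the Flink CDC connector reads the MySQL snapshot first,
        // then switches to the binlog automatically (full + incremental).
        tEnv.executeSql(
                "CREATE TABLE users_cdc (" +
                "  id BIGINT, name STRING, address STRING, PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql.example.com'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = '******'," +
                "  'database-name' = 'app'," +
                "  'table-name' = 'users')");

        // Sink: an Iceberg table backed by a Hadoop (file-system) catalog.
        tEnv.executeSql(
                "CREATE TABLE users_lake (" +
                "  id BIGINT, name STRING, address STRING, PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'iceberg'," +
                "  'catalog-name' = 'lake'," +
                "  'catalog-type' = 'hadoop'," +
                "  'warehouse' = 'hdfs:///warehouse')");

        // One INSERT INTO wires up the whole link; no business code is needed.
        tEnv.executeSql("INSERT INTO users_lake SELECT id, name, address FROM users_cdc");
    }
}
```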
- Finally, there is the question of how the storage layer supports near-real-time analysis of CDC data.
4. Community Roadmap
The picture above shows Iceberg's roadmap. Iceberg released only one version in 2019 but released three versions in 2020, becoming an Apache top-level project at version 0.9.0.
The picture above shows the roadmap for Flink and Iceberg, which can be divided into four stages.
- The first stage is to establish a connection between Flink and Iceberg.
- The second stage is using Iceberg to replace Hive. Many companies have already started putting this scenario into production.
- The third stage is to solve more complex technical problems through Flink and Iceberg.
- The fourth stage is to evolve this from a purely technical solution into a more complete product-level solution.