This article was first published on 泊浮目's Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
Version | Date      | Remark
1.0     | 2021.6.20 | Article first published

Iceberg has been gaining some popularity recently, so here I am writing up my notes based on the material I have read.

I won't go over the definition of a data lake here. If you are not familiar with it, see my earlier notes: Big Data Study Notes 1: Data Warehouse, Data Lake, and Data Center.

1. Current status of data lake development

  • In a broad sense, a data lake system mainly consists of data lake storage and data lake analytics
  • Existing data lake technologies are mainly driven by cloud vendors and include object-storage-based data lake storage plus the analytics suites built on top of it

    • Data lake storage built on object storage (S3, WASB), such as Azure ADLS, AWS Lake Formation, etc.
    • The analytics tools running on top of it, such as AWS EMR, Azure HDInsight, RStudio, and more

2. Industry trends

  • Building unified, efficient data storage that serves different data processing scenarios has become a trend

    • ETL jobs and OLAP analysis - high-performance structured storage and distributed capabilities
    • Machine learning training and inference - massive unstructured storage and container-mount capabilities
  • General-purpose engines (Hive, Spark) are extending toward data lake analytics, while data warehouses are evolving toward high-performance architectures

3. Capability Requirements for Modern Data Lakes

  • Support for both streaming and batch computation
  • Data mutation
  • Transaction support
  • Compute engine abstraction
  • Storage engine abstraction
  • Data quality
  • Extensible metadata support

4. Common modern data lake technologies

  • Iceberg
  • Apache Hudi
  • Delta Lake

In general, these data lake technologies provide some or all of the following capabilities:

  1. Data organization built on top of the storage format
  2. ACID guarantees, with certain transaction and concurrency capabilities
  3. Row-level data modification
  4. Schema correctness guarantees, plus a certain degree of schema modification (see the sketch below)
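
As a concrete illustration of point 4, here is a minimal sketch of schema evolution through Iceberg's Java API. The table location and column names are made up for illustration:

```java
// A minimal sketch of schema evolution on an existing Iceberg table
// (Java API; the table path and column names are hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class SchemaEvolution {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Each change set is committed atomically as a new metadata version;
        // existing data files are not rewritten
        table.updateSchema()
                .addColumn("country", Types.StringType.get())
                .renameColumn("payload", "body")   // assumes a "payload" column exists
                .commit();
    }
}
```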

Some specific comparisons can be seen in this picture:

5. Iceberg

Let's take a look at how Iceberg's official website introduces it:

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.

My understanding is that Iceberg organizes the underlying data files as tables and provides high-performance, table-level semantics on top of them.

Its core idea is to track every change to the table over time:

  • A snapshot represents a complete set of the table's data files
  • Each update operation generates a new snapshot (see the sketch below)
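
A minimal sketch of what this snapshot timeline looks like through Iceberg's Java API; a Hadoop-backed table is assumed and the table location is hypothetical:

```java
// Walk the snapshot timeline of an Iceberg table (Java API; path is hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotInspect {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Every committed write produced exactly one snapshot; walk the timeline
        for (Snapshot s : table.snapshots()) {
            System.out.printf("snapshot=%d parent=%s at=%d op=%s%n",
                    s.snapshotId(), s.parentId(), s.timestampMillis(), s.operation());
        }

        // The current snapshot is the latest complete view of the table's data files
        // (null if nothing has been committed yet)
        Snapshot current = table.currentSnapshot();
        System.out.println("current snapshot: " + current.snapshotId());
    }
}
```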

Major companies currently known to be using Iceberg:

  • Abroad: Netflix, Apple, LinkedIn, Adobe, Dremio
  • In China: Tencent, NetEase, Alibaba Cloud

5.1 Advantages of Iceberg

  • Write: supports transactions, with data visible as soon as it is written; provides update and merge-into capabilities
  • Read: supports reading incremental data as a stream, via both a Flink Table Source and Spark Structured Streaming; robust to schema changes
  • Compute: not bound to any engine thanks to its abstraction layer; provides a native Java API; the ecosystem currently supports Spark, Flink, Presto, and Hive
  • Storage: the underlying storage is abstracted and not tied to any particular system; supports hidden partitioning and partition evolution, which simplifies business partitioning strategies (see the sketch after this list); supports Parquet, ORC, Avro, and other formats, covering both row-oriented and column-oriented storage
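
Below is a minimal sketch of the hidden partitioning and partition evolution mentioned above, using Iceberg's Java API; the schema, field names, and table location are made up for illustration:

```java
// Hidden partitioning and partition evolution (Iceberg Java API; names are hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class HiddenPartitioning {
    public static void main(String[] args) {
        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.required(2, "ts", Types.TimestampType.withZone()),
                Types.NestedField.optional(3, "payload", Types.StringType.get()));

        // Hidden partitioning: partition by day(ts); readers and writers never
        // manage the derived partition column themselves
        PartitionSpec spec = PartitionSpec.builderFor(schema).day("ts").build();

        Table table = new HadoopTables(new Configuration())
                .create(schema, spec, "hdfs://nn:8020/warehouse/db/events");

        // Partition evolution: later switch to hourly granularity; old data keeps
        // its old layout, new writes use the new spec, and queries span both.
        // "ts_day" is the default name Iceberg derives for the day(ts) field.
        table.updateSpec()
                .removeField("ts_day")
                .addField(Expressions.hour("ts"))
                .commit();
    }
}
```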

5.2 Features

5.2.1 Snapshot-Based Design

  • Implements snapshot-based change tracking

    • Records table structure, partition information, parameters, etc.
    • Tracks old snapshots so that their data can eventually be reclaimed
  • Table metadata is immutable and always moves forward
  • The table can be rolled back to a previous snapshot (see the sketch below)
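
A minimal sketch of snapshot rollback and old-snapshot reclamation via Iceberg's Java API; the table location is hypothetical and the example assumes the table already has at least two snapshots:

```java
// Snapshot maintenance: rollback and expiry (Iceberg Java API; path is hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotMaintenance {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Roll the table back to the previous snapshot
        // (parentId() is non-null only if more than one snapshot exists)
        long olderSnapshotId = table.currentSnapshot().parentId();
        table.manageSnapshots().rollbackTo(olderSnapshotId).commit();

        // Expire snapshots older than 7 days so their unreferenced
        // data files can eventually be reclaimed
        long sevenDaysAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
        table.expireSnapshots().expireOlderThan(sevenDaysAgo).commit();
    }
}
```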

5.2.2 Metadata Organization and Transactional Commit

  • A write operation must:

    • Record the version of the current metadata - the base version
    • Create new metadata and manifest files
    • Atomically replace the base version with the new version
  • Atomic replacement guarantees a linear history
  • Atomic replacement relies on one of the following:

    • Capabilities provided by the metadata manager
    • The atomic rename capability of HDFS or a local file system
  • Conflict resolution - optimistic locking (see the sketch below)

    • Assume there are no other concurrent writers
    • If a conflict is detected, retry against the latest metadata
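
The following is a purely conceptual sketch of that optimistic commit loop, not Iceberg's actual implementation; an AtomicReference stands in for whatever atomic swap the metadata manager or file system provides:

```java
// Conceptual optimistic-locking commit loop (not Iceberg's real code)
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class OptimisticCommit {
    // Pointer to the current metadata file; in Iceberg this swap is made
    // all-or-nothing by the catalog/metastore or an atomic rename on HDFS
    private final AtomicReference<String> currentMetadata =
            new AtomicReference<>("v1.metadata.json");

    String commit(UnaryOperator<String> writeNewMetadata) {
        while (true) {
            String base = currentMetadata.get();              // 1. record the base version
            String candidate = writeNewMetadata.apply(base);  // 2. write new metadata + manifests
            if (currentMetadata.compareAndSet(base, candidate)) {
                return candidate;                             // 3. atomic swap keeps history linear
            }
            // Another writer committed first: retry against the latest metadata
        }
    }
}
```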

5.3 Scenarios

5.3.1 Real-time ingestion of CDC data

What we want to discuss here is how to analyze the binlog of a relational database. The Hadoop ecosystem has traditionally not handled this scenario very well.

The most common approach is to write the binlog records into Hive, marking each record with its operation type (I, U, D), and then periodically run a batch job to merge them into the base table. But Hive can only partition down to the hour, whereas Iceberg can make changes visible within about one minute, as sketched below.
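
Here is a minimal sketch of that merge step using Spark SQL on an Iceberg table. It assumes Iceberg's Spark SQL extensions are enabled; the catalog, table, and column names are made up, and `updates` is a temporary view holding the latest binlog batch:

```java
// Merge a batch of CDC records (op column: I / U / D) into an Iceberg table
import org.apache.spark.sql.SparkSession;

public class CdcMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-cdc-merge")
                .getOrCreate();

        // `updates` is assumed to be a temp view of the latest binlog batch
        spark.sql(
            "MERGE INTO demo.db.users t " +
            "USING updates u " +
            "ON t.id = u.id " +
            "WHEN MATCHED AND u.op = 'D' THEN DELETE " +
            "WHEN MATCHED THEN UPDATE SET * " +
            "WHEN NOT MATCHED AND u.op != 'D' THEN INSERT *");
    }
}
```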

5.3.2 Stream-batch integration in near real-time scenarios

The Lambda architecture maintains both a real-time link and an offline link, which makes the overall technology stack very complex. If a near-real-time latency (30s~1min) is acceptable, Iceberg alone can serve both, as sketched below.
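
A minimal sketch of such a near-real-time write path with Spark Structured Streaming into an Iceberg table; the source, table identifier, checkpoint path, and one-minute trigger are illustrative assumptions:

```java
// Near-real-time ingestion: stream into an Iceberg table, commit ~once a minute
import java.util.concurrent.TimeUnit;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class NearRealTimeIngest {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-streaming-ingest")
                .getOrCreate();

        // Any streaming source works; a rate source keeps the sketch self-contained
        Dataset<Row> events = spark.readStream().format("rate").load();

        // Each trigger commits a new Iceberg snapshot; downstream batch and
        // streaming readers both see the same table
        events.writeStream()
                .format("iceberg")
                .outputMode("append")
                .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
                .option("checkpointLocation", "hdfs://nn:8020/checkpoints/events")
                .toTable("demo.db.events")
                .awaitTermination();
    }
}
```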

