This article was first published on 泊浮目's Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
Version | Date      | Remark
1.0     | 2021.6.20 | Article first published

Iceberg has been gaining some popularity recently, so here I am writing up my notes based on the material I have read.

I won't go over the definition of a data lake here. If you are not familiar with it, see my earlier notes: Big Data Study Notes 1: Data Warehouse, Data Lake, and Data Center.

1. Current status of data lake development

  • In a broad sense, a data lake system mainly consists of data lake storage and data lake analytics
  • Existing data lake technologies are mainly driven by cloud vendors and include object-storage-based data lake storage plus the analytics suites built on top of it

    • Data lake storage built on object storage (S3, WASB), such as Azure ADLS, AWS Lake Formation, etc.
    • The analytics tools running on top of it, such as AWS EMR, Azure HDInsight, RStudio, and more

2. Industry trends

  • Building unified, efficient data storage that serves different data processing scenarios has become a trend

    • ETL jobs and OLAP analysis - high-performance structured storage and distributed capabilities
    • Machine learning training and inference - massive unstructured storage and container-mount capabilities
  • General-purpose engines (Hive, Spark) are extending toward data lake analytics, while data warehouses are evolving toward high-performance architectures

3. Capability Requirements for Modern Data Lakes

  • Support for both streaming and batch computation
  • Data mutation
  • Transaction support
  • Compute engine abstraction
  • Storage engine abstraction
  • Data quality
  • Extensible metadata support

4. Common modern data lake technologies

  • Iceberg
  • Apache Hudi
  • Delta Lake

In general, these data lake technologies provide some or all of the following capabilities:

  1. Data organization built on top of the storage format
  2. ACID guarantees, with certain transaction and concurrency capabilities
  3. Row-level data modification
  4. Schema correctness guarantees, plus a certain degree of schema modification (see the sketch below)
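
As a concrete illustration of point 4, here is a minimal sketch of schema evolution through Iceberg's Java API. The table location and column names are made up for illustration:

```java
// A minimal sketch of schema evolution on an existing Iceberg table
// (Java API; the table path and column names are hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class SchemaEvolution {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Each change set is committed atomically as a new metadata version;
        // existing data files are not rewritten
        table.updateSchema()
                .addColumn("country", Types.StringType.get())
                .renameColumn("payload", "body")   // assumes a "payload" column exists
                .commit();
    }
}
```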

Some specific comparisons can be seen in this picture:

5. Iceberg

Let's take a look at how Iceberg's official website introduces it:

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.

My understanding is that Iceberg organizes the underlying data files as tables and provides high-performance, table-level semantics on top of them.

Its core idea is to track every change to the table over time:

  • A snapshot represents a complete set of the table's data files
  • Each update operation generates a new snapshot (see the sketch below)
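
A minimal sketch of what this snapshot timeline looks like through Iceberg's Java API; a Hadoop-backed table is assumed and the table location is hypothetical:

```java
// Walk the snapshot timeline of an Iceberg table (Java API; path is hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotInspect {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Every committed write produced exactly one snapshot; walk the timeline
        for (Snapshot s : table.snapshots()) {
            System.out.printf("snapshot=%d parent=%s at=%d op=%s%n",
                    s.snapshotId(), s.parentId(), s.timestampMillis(), s.operation());
        }

        // The current snapshot is the latest complete view of the table's data files
        // (null if nothing has been committed yet)
        Snapshot current = table.currentSnapshot();
        System.out.println("current snapshot: " + current.snapshotId());
    }
}
```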

Major companies currently known to be using Iceberg:

  • Abroad: Netflix, Apple, LinkedIn, Adobe, Dremio
  • In China: Tencent, NetEase, Alibaba Cloud

5.1 Advantages of Iceberg

  • Write: supports transactions, with data visible as soon as it is written; provides update and merge-into capabilities
  • Read: supports reading incremental data as a stream, via both a Flink Table Source and Spark Structured Streaming; robust to schema changes
  • Compute: not bound to any engine thanks to its abstraction layer; provides a native Java API; the ecosystem currently supports Spark, Flink, Presto, and Hive
  • Storage: the underlying storage is abstracted and not tied to any particular system; supports hidden partitioning and partition evolution, which simplifies business partitioning strategies (see the sketch after this list); supports Parquet, ORC, Avro, and other formats, covering both row-oriented and column-oriented storage
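
Below is a minimal sketch of the hidden partitioning and partition evolution mentioned above, using Iceberg's Java API; the schema, field names, and table location are made up for illustration:

```java
// Hidden partitioning and partition evolution (Iceberg Java API; names are hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class HiddenPartitioning {
    public static void main(String[] args) {
        Schema schema = new Schema(
                Types.NestedField.required(1, "id", Types.LongType.get()),
                Types.NestedField.required(2, "ts", Types.TimestampType.withZone()),
                Types.NestedField.optional(3, "payload", Types.StringType.get()));

        // Hidden partitioning: partition by day(ts); readers and writers never
        // manage the derived partition column themselves
        PartitionSpec spec = PartitionSpec.builderFor(schema).day("ts").build();

        Table table = new HadoopTables(new Configuration())
                .create(schema, spec, "hdfs://nn:8020/warehouse/db/events");

        // Partition evolution: later switch to hourly granularity; old data keeps
        // its old layout, new writes use the new spec, and queries span both.
        // "ts_day" is the default name Iceberg derives for the day(ts) field.
        table.updateSpec()
                .removeField("ts_day")
                .addField(Expressions.hour("ts"))
                .commit();
    }
}
```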

5.2 Features

5.2.1 Snapshot-Based Design

  • Implements snapshot-based change tracking

    • Records table structure, partition information, parameters, etc.
    • Tracks old snapshots so that their data can eventually be reclaimed
  • Table metadata is immutable and always moves forward
  • The table can be rolled back to a previous snapshot (see the sketch below)
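
A minimal sketch of snapshot rollback and old-snapshot reclamation via Iceberg's Java API; the table location is hypothetical and the example assumes the table already has at least two snapshots:

```java
// Snapshot maintenance: rollback and expiry (Iceberg Java API; path is hypothetical)
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotMaintenance {
    public static void main(String[] args) {
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nn:8020/warehouse/db/events");

        // Roll the table back to the previous snapshot
        // (parentId() is non-null only if more than one snapshot exists)
        long olderSnapshotId = table.currentSnapshot().parentId();
        table.manageSnapshots().rollbackTo(olderSnapshotId).commit();

        // Expire snapshots older than 7 days so their unreferenced
        // data files can eventually be reclaimed
        long sevenDaysAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
        table.expireSnapshots().expireOlderThan(sevenDaysAgo).commit();
    }
}
```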

5.2.2 Metadata Organization and Transactional Commit

  • A write operation must:

    • Record the version of the current metadata - the base version
    • Create new metadata and manifest files
    • Atomically replace the base version with the new version
  • Atomic replacement guarantees a linear history
  • Atomic replacement relies on one of the following:

    • Capabilities provided by the metadata manager
    • The atomic rename capability of HDFS or a local file system
  • Conflict resolution - optimistic locking (see the sketch below)

    • Assume there are no other concurrent writers
    • If a conflict is detected, retry against the latest metadata
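
The following is a purely conceptual sketch of that optimistic commit loop, not Iceberg's actual implementation; an AtomicReference stands in for whatever atomic swap the metadata manager or file system provides:

```java
// Conceptual optimistic-locking commit loop (not Iceberg's real code)
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class OptimisticCommit {
    // Pointer to the current metadata file; in Iceberg this swap is made
    // all-or-nothing by the catalog/metastore or an atomic rename on HDFS
    private final AtomicReference<String> currentMetadata =
            new AtomicReference<>("v1.metadata.json");

    String commit(UnaryOperator<String> writeNewMetadata) {
        while (true) {
            String base = currentMetadata.get();              // 1. record the base version
            String candidate = writeNewMetadata.apply(base);  // 2. write new metadata + manifests
            if (currentMetadata.compareAndSet(base, candidate)) {
                return candidate;                             // 3. atomic swap keeps history linear
            }
            // Another writer committed first: retry against the latest metadata
        }
    }
}
```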

5.3 Scenarios

5.3.1 Real-time ingestion of CDC data

What we want to discuss here is how to analyze the binlog of a relational database. The Hadoop ecosystem has traditionally not handled this scenario very well.

The most common approach is to write the binlog records into Hive, marking each record with its operation type (I, U, D), and then periodically run a batch job to merge them into the base table. But Hive can only partition down to the hour, whereas Iceberg can make changes visible within about one minute, as sketched below.
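
Here is a minimal sketch of that merge step using Spark SQL on an Iceberg table. It assumes Iceberg's Spark SQL extensions are enabled; the catalog, table, and column names are made up, and `updates` is a temporary view holding the latest binlog batch:

```java
// Merge a batch of CDC records (op column: I / U / D) into an Iceberg table
import org.apache.spark.sql.SparkSession;

public class CdcMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-cdc-merge")
                .getOrCreate();

        // `updates` is assumed to be a temp view of the latest binlog batch
        spark.sql(
            "MERGE INTO demo.db.users t " +
            "USING updates u " +
            "ON t.id = u.id " +
            "WHEN MATCHED AND u.op = 'D' THEN DELETE " +
            "WHEN MATCHED THEN UPDATE SET * " +
            "WHEN NOT MATCHED AND u.op != 'D' THEN INSERT *");
    }
}
```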

5.3.2 Stream-batch integration in near real-time scenarios

The Lambda architecture maintains both a real-time link and an offline link, which makes the overall technology stack very complex. If a near-real-time latency (30s~1min) is acceptable, Iceberg alone can serve both, as sketched below.
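
A minimal sketch of such a near-real-time write path with Spark Structured Streaming into an Iceberg table; the source, table identifier, checkpoint path, and one-minute trigger are illustrative assumptions:

```java
// Near-real-time ingestion: stream into an Iceberg table, commit ~once a minute
import java.util.concurrent.TimeUnit;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class NearRealTimeIngest {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-streaming-ingest")
                .getOrCreate();

        // Any streaming source works; a rate source keeps the sketch self-contained
        Dataset<Row> events = spark.readStream().format("rate").load();

        // Each trigger commits a new Iceberg snapshot; downstream batch and
        // streaming readers both see the same table
        events.writeStream()
                .format("iceberg")
                .outputMode("append")
                .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
                .option("checkpointLocation", "hdfs://nn:8020/checkpoints/events")
                .toTable("demo.db.events")
                .awaitTermination();
    }
}
```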

