Understanding Apache Paimon's Consistency Model Part 1 - Jack Vanlightly

Introduction: Apache Paimon is an open-source table format that originated in the Apache Flink project. It has a number of features that set it apart from Apache Iceberg, Delta Lake, and Apache Hudi. This post focuses on its core mechanics rather than implementation details.
- Catalog, Databases, and Tables: Paimon organizes data into a catalog, databases, and tables. This analysis focuses on primary key tables; Paimon also has additional concepts such as append-only tables and changelogs.
The logical model of Paimon:
- Metadata layer: Organized as a tree of files for each point in time, with a snapshot file as the tree root, followed by manifest-list files, an index manifest file, and ultimately the data files. Snapshot files form a log, with each snapshot identified by a monotonically increasing integer. Each snapshot file contains the version, the schema ID, the manifest lists, and other information. The base manifest list of a snapshot is the result of merging the base and delta manifest lists of the previous snapshot.
- Data layer: Tables are stored as a set of data files organized into partitions and buckets. Partitioning allows efficient querying because irrelevant partitions can be pruned. Data within each partition is distributed over buckets based on a bucket key or, by default, the primary key. The data and metadata files are organized into directories. A sorted run is a set of sorted string table (SSTable) files.
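The two mechanisms above can be illustrated with a minimal sketch. This is a toy model, not Paimon's implementation: the names `bucket_for`, `Snapshot`, and `commit` are hypothetical, CRC32 stands in for whatever hash function Paimon actually uses, and the fixed bucket count is an assumption. It shows a bucket key being mapped to a bucket, and each new snapshot deriving its base manifest list from the previous snapshot's base plus delta.

```python
import zlib
from dataclasses import dataclass

NUM_BUCKETS = 4  # assumption: a fixed number of buckets per partition


def bucket_for(bucket_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a bucket key to a bucket id (CRC32 is a stand-in hash)."""
    return zlib.crc32(bucket_key.encode("utf-8")) % num_buckets


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int        # monotonically increasing integer
    schema_id: int
    base_manifests: tuple   # everything committed before this snapshot
    delta_manifests: tuple  # what this commit added


def commit(prev, new_manifests, schema_id=0):
    """Create the next snapshot in the log from the previous one."""
    if prev is None:
        # First snapshot: nothing to carry over.
        return Snapshot(1, schema_id, (), tuple(new_manifests))
    # New base = previous base merged with previous delta.
    base = prev.base_manifests + prev.delta_manifests
    return Snapshot(prev.snapshot_id + 1, schema_id, base, tuple(new_manifests))
```

For example, committing `["m-0"]`, then `["m-1"]`, then `["m-2"]` yields a third snapshot whose base is `("m-0", "m-1")` and whose delta is `("m-2",)`. The same bucket key always lands in the same bucket, which is what lets each bucket be treated as an independent unit.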
Paimon’s LSM tree approach:
- Similar to ClickHouse but with some differences. Data is buffered in memory and flushed as level 0 sorted runs. Compaction is time-aligned, and different merge engines are supported. Each sorted run can consist of one or more sorted data files, and the data for a given key can exist in multiple sorted runs.
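To make the buffering and flushing step concrete, here is a small sketch under stated assumptions. The `Bucket` class, its `flush_threshold`, and the per-write sequence number are all hypothetical simplifications; real write buffers flush on memory pressure or at checkpoints rather than on a row count. The point is only that each flush produces a new key-sorted run at level 0, so the same key can end up in more than one run.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    flush_threshold: int = 3                     # hypothetical: flush after N distinct keys
    buffer: dict = field(default_factory=dict)   # key -> (seq, value)
    level0: list = field(default_factory=list)   # sorted runs, newest last
    next_seq: int = 0

    def write(self, key, value):
        # Later writes to the same key replace earlier ones in the buffer.
        self.buffer[key] = (self.next_seq, value)
        self.next_seq += 1
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # A sorted run: rows ordered by key, each tagged with its sequence number.
        run = sorted((k, seq, v) for k, (seq, v) in self.buffer.items())
        self.level0.append(run)
        self.buffer.clear()
```

Writing keys `b`, `a`, then `a`, `c` with a threshold of 2 leaves two level 0 runs, and key `a` is present in both, with the higher sequence number in the newer run.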
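Merge-on-read: Because the data of a given key can exist in multiple sorted runs, readers merge the data of multiple sorted runs at read time. Compaction reduces this work: the Paimon writer can perform full compactions itself, or a dedicated compaction job can be used. The size of compacted files depends on various factors.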
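A reader-side merge can be sketched as a k-way merge over the sorted runs. This is an illustration, not Paimon's reader: the function name `merge_on_read` and the `(key, seq, value)` row shape are assumptions, and the "keep the row with the highest sequence number" rule models only a deduplicating merge engine, whereas Paimon supports several merge engines.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_on_read(sorted_runs):
    """Merge runs of (key, seq, value) rows, each run sorted by key.

    Returns one (key, value) per key, taking the row with the highest
    sequence number (a deduplicating merge; an assumption of this sketch).
    """
    # k-way merge ordered by (key, seq); values are never compared.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0, 1))
    result = []
    for key, versions in groupby(merged, key=itemgetter(0)):
        latest = max(versions, key=itemgetter(1))
        result.append((key, latest[2]))
    return result
```

Given one run holding `("a", 1, "a-old")` and `("b", 0, "b-only")`, and a newer run holding `("a", 2, "a-new")` and `("c", 3, "c-only")`, the merge returns `a-new`, `b-only`, and `c-only`. Note that every run of the bucket has to take part in the merge, which is why the merge is a single sequential pass.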
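Deletion vectors: Improve read performance by marking specific rows in data files as invalid. Without deletion vectors, read parallelism is limited, because the sorted runs of a bucket must be merged together by a reader. With deletion vectors, data files can be read in parallel and the merge is avoided, since invalidated rows are simply skipped. Deletion vectors are maintained by compactions and stored in a single DV file per bucket.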
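The effect of deletion vectors on reads can be sketched as follows. This is a hedged illustration: `read_file`, `read_bucket`, and the representation of a deletion vector as a Python set of row positions are all assumptions (real implementations typically use compressed bitmaps), and the file names are made up. Each file is filtered independently against its own deletion vector, so the files can be read in parallel with no cross-file merge.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(rows, deleted_positions):
    """Return the rows of one data file, skipping positions marked deleted."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


def read_bucket(files, deletion_vectors):
    """Read all files of a bucket in parallel.

    files: dict of file name -> list of rows
    deletion_vectors: dict of file name -> set of deleted row positions
    """
    names = sorted(files)
    with ThreadPoolExecutor() as pool:
        # Each file is an independent task; map() keeps the input order.
        parts = pool.map(
            lambda name: read_file(files[name], deletion_vectors.get(name, set())),
            names,
        )
        return [row for part in parts for row in part]
```

If file `f1` holds a stale version of key `a` at position 0 and file `f2` holds the current version, a deletion vector of `{"f1": {0}}` makes the stale row invisible, and the reader returns exactly one row per key without comparing the files to each other.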
Support for Copy-on-write (COW) and merge-on-read (MOR): Iceberg, Delta, and Hudi support both. Paimon is naturally a merge-on-read design, but COW can be emulated to some extent.
Next: In part 2, the consistency model will be explored. Analysis parts include Part 1 - The basic mechanics, Part 2 - The consistency model, and Part 3 - Formal verification with Fizzbee.