About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight functional computation, and adopts an architecture that separates compute from storage. It supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication, and offers the characteristics of streaming data storage: strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
Reposted from ApacheHudi. Author: Sijie Guo, CEO of StreamNative and member of the Apache Pulsar PMC.
Motivation
Lakehouse was first proposed by Databricks. A Lakehouse is a low-cost data management system that accesses cloud storage directly while providing the capabilities of a traditional DBMS: ACID transactions, data versioning, auditing, indexing, caching, and query optimization. It combines the advantages of data lakes and data warehouses: the low-cost storage and open data formats of the former, and the powerful management and optimization capabilities of the latter. Delta Lake, Apache Hudi, and Apache Iceberg are the three main technologies for building Lakehouses.
Meanwhile, Pulsar provides a series of features, including tiered storage, streaming offload, and columnar offload, that make it a storage layer capable of unifying batch processing and event streaming. The tiered storage feature in particular makes Pulsar a lightweight data lake. However, Pulsar still lacks some performance optimizations that are common in traditional DBMSes, such as indexes and data versioning. The columnar offloader was introduced to narrow that performance gap, but it is not enough.
This proposal explores using Apache Pulsar as a Lakehouse. It presents only the top-level design; the detailed design and implementation will be addressed in follow-up sub-proposals.
Analysis
This section analyzes the key features needed to build a Lakehouse, then examines whether Pulsar meets those requirements and identifies any remaining gaps.
Lakehouse has the following key features:
- Transaction support: many data pipelines in an enterprise Lakehouse read and write data concurrently, and ACID transaction support guarantees the consistency of those concurrent reads and writes, especially under SQL. Delta Lake, Iceberg, and Hudi all implement a transaction layer on top of low-cost object storage and support transactions. Pulsar introduced transaction support in version 2.7.0 and supports cross-topic transactions (see the sketch after this list).
- Schema constraints and governance: a Lakehouse needs to support schema constraints and evolution, including data warehouse schema paradigms such as star/snowflake schemas. The system should also be able to reason about data integrity and should have robust governance and auditing mechanisms. All three frameworks provide these capabilities. Pulsar has a built-in schema registry service that meets the basic requirements of schema constraints and governance, though there is still room for improvement.
- BI support: a Lakehouse lets BI tools work directly on the source data, which reduces staleness, improves freshness, lowers latency, and avoids the cost of operating two copies of the data in both a data lake and a warehouse. The three data lake frameworks integrate very well with Apache Spark and also allow Redshift and Presto/Athena to query the source data; the Hudi community has additionally completed support for engines such as Flink. Pulsar exposes the segments in tiered storage for direct access, so it can integrate tightly with popular data processing engines. However, Pulsar's tiered storage still has performance gaps when serving BI workloads, and this proposal addresses those gaps.
- Separation of storage and compute: storage and compute run as separate clusters, so each can be scaled out independently. All three frameworks support storage-compute separation. Pulsar uses a multi-layer deployment architecture that separates storage from compute.
- Openness: use open, standardized data formats such as Parquet, and provide APIs so that a variety of tools and engines, including machine learning and Python/R libraries, can access the data directly and efficiently. All three frameworks support the Parquet format; Iceberg also supports ORC, and the Hudi community is working on ORC support. Pulsar does not yet support any open format, though the columnar offloader adds support for Parquet.
- Support for diverse data types, from unstructured to structured: a Lakehouse can store, optimize, analyze, and access the data types needed by many new data applications, including images, video, audio, semi-structured data, and text. It is unclear how Delta, Iceberg, and Hudi support this. Pulsar supports data of all types.
- Support for diverse workloads: including data science, machine learning, SQL, and analytics. Several tools may be needed to cover all of these workloads, but they all rely on the same data repository. The three frameworks are closely integrated with Spark, which provides a wide range of tooling; Pulsar is also closely integrated with Spark.
- End-to-end streaming: real-time reporting is the norm for many enterprises. Streaming support removes the need for a separate system dedicated to serving real-time data applications. Delta Lake and Hudi provide streaming capabilities through change logs, but that is not true streaming. Pulsar is a true streaming system.
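To make the transaction capability concrete, here is a minimal sketch of Pulsar's transaction API (available since 2.7.0), atomically consuming from one topic and producing to another. The service URL, topic names, and subscription name are illustrative, and transactions must also be enabled on the broker side (`transactionCoordinatorEnabled=true` in broker.conf).

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.transaction.Transaction;

public class TxnSketch {
    public static void main(String[] args) throws Exception {
        // The client must be built with transactions enabled.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .enableTransaction(true)
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/in")
                .subscriptionName("txn-sub")
                .subscribe();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/out")
                .sendTimeout(0, TimeUnit.SECONDS) // transactional producers require no send timeout
                .create();

        Transaction txn = client.newTransaction()
                .withTransactionTimeout(5, TimeUnit.MINUTES)
                .build()
                .get();

        // Consume, transform, and produce atomically across two topics.
        Message<String> msg = consumer.receive();
        producer.newMessage(txn).value("processed: " + msg.getValue()).send();
        consumer.acknowledgeAsync(msg.getMessageId(), txn).get();

        txn.commit().get();
        client.close();
    }
}
```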
As you can see, Pulsar meets all of the conditions for building a Lakehouse. However, today's tiered storage still has significant gaps, for example:
- Pulsar does not store data in an open, standard format such as Parquet;
- Pulsar does not provide any indexing mechanism for offloaded data;
- Pulsar does not support efficient upserts.
This proposal aims to solve these problems in the Pulsar storage layer so that Pulsar can serve as a Lakehouse.
Current solution
Figure 1 shows the storage layout of the current Pulsar stream.
- Pulsar stores segment metadata in ZooKeeper;
- The latest segments are stored in Apache BookKeeper (the faster storage layer);
- Older segments are offloaded from Apache BookKeeper to tiered storage (the cheaper storage layer). The metadata of an offloaded segment remains in ZooKeeper and points to the offloaded objects in tiered storage (a sketch of triggering offload through the admin API follows this list).
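For reference, an offload of old segments can be triggered manually through the Java admin client. A minimal sketch, assuming a broker admin endpoint at localhost:8080 with an offloader (e.g. S3) already configured; the topic name is illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class OffloadSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        String topic = "persistent://public/default/my-topic";
        // Offload everything up to the current tail of the topic
        // to the configured tiered storage.
        admin.topics().triggerOffload(topic, MessageId.latest);
        System.out.println(admin.topics().offloadStatus(topic));
        admin.close();
    }
}
```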
The current solution has some disadvantages:
- It does not use any open storage format for the offloaded data, which makes it difficult to integrate with the wider ecosystem.
- It keeps all metadata information in ZooKeeper, which may limit scalability.
New Lakehouse storage solution
The new solution proposes using a Lakehouse to store the offloaded data in tiered storage, and recommends Apache Hudi as that Lakehouse storage, for the following reasons:
- Cloud providers offer good support for Apache Hudi.
- Apache Hudi has graduated as an Apache top-level project.
- Apache Hudi supports multiple engines, including both Spark and Flink, and has a very active community in China.
New storage layout
Figure 2 shows the new layout of Pulsar topic.
- The metadata of the latest segments (not yet offloaded) is stored in ZooKeeper.
- The data of the latest segments (not yet offloaded) is stored in BookKeeper.
- The metadata and data of offloaded segments are stored directly in tiered storage. Because a topic is an append-only stream, we do not strictly need a Lakehouse repository such as Apache Hudi to store the data; but if we also store the metadata in tiered storage, it makes more sense to use a Lakehouse repository to guarantee ACID (see the sketch after this list).
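As a sketch of what the segment list could look like as a Hudi table, the snippet below upserts hypothetical segment-metadata records with Spark. The field names (segmentId, objectPath, version), the table path, and the Hudi options are assumptions for illustration, not a finalized design.

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SegmentTableSketch {
    // Hypothetical segment-metadata record; field names are illustrative.
    public static class SegmentMeta implements Serializable {
        private String segmentId;
        private String objectPath;
        private long version;
        public SegmentMeta() {}
        public SegmentMeta(String segmentId, String objectPath, long version) {
            this.segmentId = segmentId; this.objectPath = objectPath; this.version = version;
        }
        public String getSegmentId() { return segmentId; }
        public void setSegmentId(String v) { this.segmentId = v; }
        public String getObjectPath() { return objectPath; }
        public void setObjectPath(String v) { this.objectPath = v; }
        public long getVersion() { return version; }
        public void setVersion(long v) { this.version = v; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("segments-table-sketch")
                .master("local[*]")
                // Hudi requires the Kryo serializer.
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        Dataset<Row> segments = spark.createDataFrame(
                Arrays.asList(new SegmentMeta(
                        "segment-0001",
                        "s3://bucket/tenant/namespace/topic/segments/segment-0001",
                        1L)),
                SegmentMeta.class);

        // Upsert the segment list into a non-partitioned Hudi table.
        segments.write().format("hudi")
                .option("hoodie.table.name", "segments")
                .option("hoodie.datasource.write.recordkey.field", "segmentId")
                .option("hoodie.datasource.write.precombine.field", "version")
                .option("hoodie.datasource.write.operation", "upsert")
                .option("hoodie.datasource.write.keygenerator.class",
                        "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
                .mode(SaveMode.Append)
                .save("s3://bucket/tenant/namespace/topic/segments");

        spark.stop();
    }
}
```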
Support efficient Upserts
Pulsar does not support upserts directly; it supports them through topic compaction. But the current topic compaction approach is neither scalable nor efficient (the snippet after this list shows how compaction is triggered today):
- Topic compaction runs inside the broker, so it cannot handle upserting large volumes of data, especially when the data set is large.
- Topic compaction does not support storing the compacted data in tiered storage.
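For contrast, this is how today's broker-side compaction is triggered through the Java admin client; a minimal sketch with an illustrative topic name.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class CompactionTriggerSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        String topic = "persistent://public/default/my-topic";
        // Compaction runs inside the broker that owns the topic.
        admin.topics().triggerCompaction(topic);
        System.out.println(admin.topics().compactionStatus(topic));
        admin.close();
    }
}
```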
To support efficient and scalable upserts, this proposal recommends using Apache Hudi to store the compacted data in tiered storage. Figure 3 shows how Apache Hudi supports effective upserts in topic compaction.
The idea is to implement a topic compaction service. The topic compaction service can run as a separate service (e.g. a Pulsar Function) to compact topics.
- The broker sends a topic compaction request to the compaction service.
- The compaction service receives the request, reads the messages, and upserts them into the Hudi table.
- After completing the upsert, it advances the topic compaction cursor to the last message it compacted.
The topic compaction cursor stores in its metadata a reference to the location in tiered storage of the Hudi table that holds the compacted data (the read side of such a service is sketched below).
Treating the Hudi table as a Pulsar topic
Hudi maintains a timeline of all the actions performed on a table at different instants of time, which helps provide instantaneous views of the table and also efficiently supports retrieval of data in _arrival_ order. Hudi supports incrementally pulling changes from a table, so we can support a _ReadOnly_ topic backed by a Hudi table. This allows applications to stream the changes of a Hudi table from Pulsar brokers. Figure 4 shows this idea.
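The incremental pull that a ReadOnly topic would be built on is already exposed by Hudi's Spark datasource. A minimal sketch; the table path and begin instant are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPullSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-incremental-sketch")
                .master("local[*]")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // Pull only the records committed after the given instant.
        Dataset<Row> changes = spark.read().format("hudi")
                .option("hoodie.datasource.query.type", "incremental")
                .option("hoodie.datasource.read.begin.instanttime", "20210301000000")
                .load("s3://bucket/tenant/namespace/topic/cursors/cursorA");

        changes.show();
        spark.stop();
    }
}
```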
Scalable metadata management
Once we start storing all of the data in tiered storage, the proposal suggests no longer storing the metadata of offloaded or compacted data in ZooKeeper, and instead relying on tiered storage alone to store that metadata.
The proposal organizes the offloaded and compacted data in the following directory layout:
- <tenant>/
  - <namespace>/
    - <topic>/
      - segments/ <= Use Hudi to store the list of segments to guarantee ACID
        - segment_<segment-id>
        - ...
      - cursors/
        - <cursor A>/ <= Use Hudi to store the compacted table for cursor A.
        - <cursor B>/ <= ...
References
[1] Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
[2] What is a Lakehouse? https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
[3] Diving Deep into the inner workings of the Lakehouse and Delta Lake. https://databricks.com/blog/2020/09/10/diving-deep-into-the-inner-workings-of-the-lakehouse-and-delta-lake.html