Introduction: Starting at 13:30 on September 3, let's discuss data lakehouse solutions together.


At 13:30 on September 3, the Feitian Club and StreamNative jointly held a Lakehouse Meetup, inviting four technical experts from Alibaba and StreamNative to discuss data lakehouse solutions. The agenda was as follows:

01

Bi Yan (Pathfinding)|Alibaba Technical Expert

"Building a Lakehouse Architecture Based on Data Lake Formats"

  • Analyze the key features of the lakehouse architecture and briefly introduce the three major data lake formats.
  • Share Alibaba Cloud EMR use cases in classic data warehouse scenarios built on Delta Lake and Hudi.
  • Introduce the overall lakehouse solution provided by Alibaba Cloud EMR + DLF.

02

Chen Hang| StreamNative Senior Engineer

"Apache Pulsar's Lakehouse Integration: A Detailed Explanation of Pulsar's Lakehouse Tiered Storage Integration"

Apache Pulsar is a message bus that caches data and decouples different systems. To support long-term topic data storage, Pulsar introduced tiered storage, which offloads cold data to external storage such as GCS, S3, or HDFS. However, the offloaded data is kept in Pulsar's own non-open format, so only Pulsar can access it, which makes it difficult to integrate with other big data components such as Presto, Flink SQL, and Spark SQL. To solve this problem, we introduce Lakehouse to manage the offloaded data and integrate it with the current cold-data offload mechanism for topics. This lets us use all the features Lakehouse provides, such as transaction support, schema enforcement, and BI support. Streaming reads are served from BookKeeper or from tiered storage, depending on where the data is located. Thanks to Lakehouse's open storage format, the data can be read by the various ecosystems Lakehouse supports. To support streaming offload and make the offload mechanism more scalable, we introduce a reader-based offload mechanism that reads data from topics and writes it to tiered storage. In addition, the offloader can back a compaction service that treats topics as tables: every update operation on a key is translated into an upsert operation on the table.
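For reference, the offload path described in this abstract builds on Pulsar's standard tiered-storage configuration. A minimal `broker.conf` sketch for the existing S3 offloader looks roughly like this (bucket, region, and threshold values are illustrative; the Lakehouse offloader discussed in the talk plugs into this same general mechanism but changes the storage format):

```properties
# broker.conf — enable Pulsar's S3 tiered-storage offloader (values illustrative)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
# automatically offload a topic's ledgers once they exceed ~1 GB
managedLedgerOffloadAutoTriggerSizeThresholdBytes=1073741824
```

Offload can also be triggered manually per topic, e.g. with `pulsar-admin topics offload --size-threshold 10M <topic>`.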

03

Chen Yuzhao (Yuzhao)| Alibaba Technical Expert

"Apache Hudi Real-Time Lakehouse Solution"

  • Data warehouse solutions based on Hudi
  • Hudi's core scenarios
  • Building Pulsar tiered storage with Hudi
  • Recent roadmap

04

Zhang Yong| StreamNative Software Engineer

"Integrating Pulsar and Lakehouse Data: Using a Connector to Sink Data from Pulsar Topics to Lakehouse Storage"

Different application scenarios may use different systems to process streaming data, and integrating data across these systems can be problematic. This talk focuses on the Lakehouse Connector and discusses how to use it to sink data from a Pulsar topic to a Lakehouse.
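As a rough illustration of the sink idea, independent of any particular connector API, the sketch below simulates draining keyed messages from a topic into a table-like store, where each new message for a key upserts the previous row. All names here are hypothetical; a real connector would consume via the Pulsar client and write to a Lakehouse table format such as Delta Lake or Hudi.

```python
# Hypothetical sketch: sinking keyed topic messages into a table-like store.
# A real connector would use the Pulsar consumer API and a Lakehouse writer.

def sink_to_table(messages, table):
    """Apply each (key, value) message as an upsert on the table."""
    for key, value in messages:
        table[key] = value  # last write per key wins
    return table

topic_messages = [
    ("user-1", {"clicks": 1}),
    ("user-2", {"clicks": 5}),
    ("user-1", {"clicks": 2}),  # update: upserts over the earlier row
]

table = sink_to_table(topic_messages, {})
print(table)  # user-1 holds only its latest value
```

The upsert-per-key behavior mirrors how a topic can be materialized as a table: the table always reflects the latest value observed for each key.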



Alibaba Cloud Developers

Alibaba's official technical account, sharing technical innovation, hands-on experience, and career growth insights from across the Alibaba ecosystem.