Introduction: Starting at 13:30 on September 3, let's discuss data lakehouse solutions together.


At 13:30 on September 3, the Feitian Club and StreamNative jointly held a Lakehouse Meetup, inviting four technical experts from Alibaba and StreamNative to discuss data lakehouse solutions. The agenda was as follows:

01

Bi Yan (Pathfinding)|Alibaba Technical Expert

"Building a Lakehouse Architecture Based on Data Lake Formats"

  • Analyze the key features of the lakehouse architecture and briefly introduce the three major data lake formats.
  • Share Alibaba Cloud EMR use cases in classic data warehouse scenarios built on Delta Lake and Hudi.
  • Introduce the overall lakehouse solution provided by Alibaba Cloud EMR + DLF.

02

Chen Hang| StreamNative Senior Engineer

"Apache Pulsar's Lakehouse Integration: A Detailed Explanation of Pulsar's Lakehouse Tiered Storage Integration"

Apache Pulsar is a message bus that caches data and decouples different systems. To support long-term topic data storage, Pulsar introduced tiered storage, which offloads cold data to external storage such as GCS, S3, or HDFS. However, the offloaded data is kept in Pulsar's own non-open format, so only Pulsar can access it, which makes it difficult to integrate with other big data components such as Presto, Flink SQL, and Spark SQL. To solve this problem, we introduce Lakehouse to manage the offloaded data and integrate it with the current cold-data offload mechanism for topics. This lets us use all the features Lakehouse provides, such as transaction support, schema enforcement, and BI support. Streaming reads are served from BookKeeper or from tiered storage, depending on where the data is located. Thanks to Lakehouse's open storage format, the data can be read by the various ecosystems Lakehouse supports. To support streaming offload and make the offload mechanism more scalable, we introduce a reader-based offload mechanism that reads data from topics and writes it to tiered storage. In addition, the offloader can back a compaction service that treats topics as tables: every update operation on a key is translated into an upsert operation on the table.
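For reference, the offload path described in this abstract builds on Pulsar's standard tiered-storage configuration. A minimal `broker.conf` sketch for the existing S3 offloader looks roughly like this (bucket, region, and threshold values are illustrative; the Lakehouse offloader discussed in the talk plugs into this same general mechanism but changes the storage format):

```properties
# broker.conf — enable Pulsar's S3 tiered-storage offloader (values illustrative)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
# automatically offload a topic's ledgers once they exceed ~1 GB
managedLedgerOffloadAutoTriggerSizeThresholdBytes=1073741824
```

Offload can also be triggered manually per topic, e.g. with `pulsar-admin topics offload --size-threshold 10M <topic>`.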

03

Chen Yuzhao (Yuzhao)| Alibaba Technical Expert

"Apache Hudi Real-Time Lakehouse Solution"

  • Data warehouse solutions based on Hudi
  • Hudi's core scenarios
  • Building Pulsar tiered storage with Hudi
  • Recent roadmap

04

Zhang Yong| StreamNative Software Engineer

"Integrating Pulsar and Lakehouse Data: Using a Connector to Sink Data from Pulsar Topics to Lakehouse Storage"

Different application scenarios may use different systems to process streaming data, and integrating data across these systems can be problematic. This talk focuses on the Lakehouse Connector and discusses how to use it to sink data from a Pulsar topic to a Lakehouse.
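As a rough illustration of the sink idea, independent of any particular connector API, the sketch below simulates draining keyed messages from a topic into a table-like store, where each new message for a key upserts the previous row. All names here are hypothetical; a real connector would consume via the Pulsar client and write to a Lakehouse table format such as Delta Lake or Hudi.

```python
# Hypothetical sketch: sinking keyed topic messages into a table-like store.
# A real connector would use the Pulsar consumer API and a Lakehouse writer.

def sink_to_table(messages, table):
    """Apply each (key, value) message as an upsert on the table."""
    for key, value in messages:
        table[key] = value  # last write per key wins
    return table

topic_messages = [
    ("user-1", {"clicks": 1}),
    ("user-2", {"clicks": 5}),
    ("user-1", {"clicks": 2}),  # update: upserts over the earlier row
]

table = sink_to_table(topic_messages, {})
print(table)  # user-1 holds only its latest value
```

The upsert-per-key behavior mirrors how a topic can be materialized as a table: the table always reflects the latest value observed for each key.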



Alibaba Cloud Developers

Alibaba's official technical account, sharing technical innovation, hands-on experience, and career growth insights from across the Alibaba ecosystem.