Introduction | This article comes from the Tencent Cloud Developer Community column [Technical Thinking Guangyi · Tencent Technician Original Collection], a sharing and exchange space created by the Tencent Cloud Developer Community for Tencent engineers and the wider developer community. The column invites Tencent engineers to share original technical experience and to learn and grow together with developers. The author of this article is Ye Qiangsheng, a back-end development engineer at Tencent.
Introduction
Big data technology has flourished over the past decade. Judging by market performance, data storage and computing built on big data are extremely valuable. Snowflake, whose core business is the cloud data warehouse, currently has the highest market capitalization (about $44.9 billion as of this writing); Databricks, which focuses on the lakehouse, is reportedly valued at around US$38 billion; and the major cloud vendors have all launched their own data lake, cloud data warehouse, and lakehouse products.
The big data field is full of concepts and terms. Most of the time we shoot the arrow first and draw the target afterwards: practitioners work on certain requirements for a while, and only later do authoritative figures coin terms to describe them, so we cannot draw strict, mathematical boundaries around these concepts. Moreover, a term often spreads because it is vivid rather than precise: vivid means easy to understand, while precise carries a huge amount of information (think of a mathematical definition). It is therefore better to understand these big data concepts and technologies from the perspective of requirements, without chasing exact definitions.
Whether it is a data lake or a data warehouse, the goal is ultimately to solve users' problems; what users really want is the information in the data. The ingestion, storage, and computing capabilities of lakes and warehouses exist mainly because data is massive and diverse. If a user's data were small and simple enough, importing it into a local Excel file would suffice for all kinds of effective analysis. The discussion of data lakes, data warehouses, and the lakehouse below assumes that the user's data is massive and complex.
Data flow
As shown in the figure above, in a complex scenario, data analysts perform business modeling and data modeling; engineers design, develop, and maintain the data architecture; users generate business value through the business and data models; and applications provide features and recommendations based on algorithms, models, user profiles, and so on.
What: What are a data lake and a data warehouse?
Note: current mainstream data lake technology is not friendly to binary data (images, audio, etc.), so this article deals only with analytical (structured and semi-structured) data.
Once the data in a business scenario becomes complex and diverse, you have to store all kinds of data on some storage framework, and you need a compute engine to process it; at the same time, business requirements often demand real-time analysis. Data lake technology integrates and standardizes this process. As data enters the lake, it is organized according to a specified standard, with support for unified stream and batch processing. Different frameworks (e.g., Delta, Iceberg, Hudi) organize data differently, each optimized for particular scenarios, but all provide a standardized way to read the data after ingestion, so that various MPP engines can compute on it. Because the data is organized up front, write performance drops while query performance improves. So you may well have been using a data lake all along without using data lake technology.
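The organize-on-write, standardized-read idea above can be sketched in a few lines of Python. This is a toy model for illustration only, not the API of any real table format; `TinyLakeTable` and its methods are invented names.

```python
from collections import defaultdict

class TinyLakeTable:
    """Toy table that partitions rows by a column on ingest,
    mimicking how lake table formats organize data up front."""

    def __init__(self, partition_col):
        self.partition_col = partition_col
        self.partitions = defaultdict(list)   # partition value -> rows

    def write(self, rows):
        # Organizing data at write time costs a little on ingest...
        for row in rows:
            self.partitions[row[self.partition_col]].append(row)

    def scan(self, partition_filter=None):
        # ...but pays off on read: any engine can prune whole
        # partitions using the filter before touching row data.
        for key, rows in self.partitions.items():
            if partition_filter is None or partition_filter(key):
                yield from rows

table = TinyLakeTable("dt")
table.write([
    {"dt": "2022-01-01", "uid": 1},
    {"dt": "2022-01-02", "uid": 2},
])
# Only the "2022-01-02" partition is scanned.
recent = list(table.scan(lambda dt: dt >= "2022-01-02"))
```

A real table format also tracks file-level statistics and snapshots, but the trade-off is the same: slower writes in exchange for faster, engine-agnostic reads.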
A data warehouse generally requires data modeling before data is stored; the data is then normalized according to the table schema and organized by the storage engine the table specifies, at which point some information may be lost. The data layout is optimized to deliver the best possible query experience. In day-to-day big data architecture design and implementation, we usually do more than what the strict scope of a data warehouse covers, yet we still call it a data warehouse; so again, do not chase exact definitions too hard.
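The schema enforcement and the information loss it implies can be illustrated with a minimal Python sketch. The `SCHEMA` and `normalize` names are hypothetical, not part of any real warehouse:

```python
# Assumed example schema declared at modeling time.
SCHEMA = {"uid": int, "event": str}

def normalize(record):
    """Coerce a raw record into the declared table schema."""
    row = {}
    for col, typ in SCHEMA.items():
        if col not in record:
            raise ValueError(f"missing column: {col}")
        row[col] = typ(record[col])   # cast to the declared type
    # Any fields outside the schema are silently dropped here --
    # the "information loss" traded for an optimizable layout.
    return row

row = normalize({"uid": "42", "event": "click", "raw_ua": "Mozilla/5.0"})
```

After normalization, `row` keeps only `uid` (cast to an integer) and `event`; the `raw_ua` field is lost at ingest.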
Lake warehouse comparison
(The above pictures are from Alibaba Cloud)
Why: Why does the industry want to integrate lakes and warehouses?
To put it vividly: the lakehouse combines the advantages of both, a data lake managed like a data warehouse and a data warehouse as open as a data lake.
As the What section suggests, the big data architectures commonly used in the industry are already basically lakehouses: an extended data warehouse that also actively governs how data is stored and used. According to what the industry has shared so far, the lakehouse mainly aims to replace the older Lambda and Kappa architectures, reducing cost and improving efficiency through a comparatively simple architecture.
Intersection of lake and warehouse value
(The above pictures are from Alibaba Cloud)
How: How does the industry integrate lakes and warehouses?
The lakehouse architectures in the industry today are generally described as a lakehouse built around some data warehouse. Users put hot data (queried frequently) in the data warehouse, where both storage and compute are heavily optimized: queries are fast and costs are high. Cold data goes into the data lake, where computation is slower and costs are low. When querying, users can access data in the lake format directly through the warehouse's compute layer. Many architectures also temporarily scale out elastic compute nodes to process cold data, so that efficient queries on hot data are not affected.
Lake and warehouse integrated cold and hot storage architecture
As shown in the figure above, hot data from the past N days is queried by the resident MPP compute layer. Once data cools down, it is converted into the data lake storage format and written to the lake, and is subsequently processed by the elastic MPP compute layer. Cold data is generally queried at low frequency.
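The N-day routing rule above can be sketched as a small Python function. The 7-day threshold and the tier names are assumptions for illustration, not values from any specific product:

```python
from datetime import date, timedelta

HOT_DAYS = 7   # assumed value of N: data newer than this stays in the warehouse

def route_query(data_date, today):
    """Send queries on hot data to the resident MPP layer,
    queries on cold data to elastic compute nodes."""
    if today - data_date <= timedelta(days=HOT_DAYS):
        return "resident_mpp"   # fast queries, expensive warehouse storage
    return "elastic_mpp"        # slower queries, cheap lake storage

today = date(2022, 6, 30)
tier_hot = route_query(date(2022, 6, 28), today)   # within the hot window
tier_cold = route_query(date(2022, 1, 1), today)   # long since cooled down
```

Keeping the routing decision in one place means the hot window can be tuned per table without touching either compute tier.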
Lake and warehouse integrated storage and computing separation architecture
As shown in the figure above, all data enters the lake asynchronously and the warehouse's metadata is updated accordingly. When a user queries, the raw data that needs to be scanned is cached, and data that is computed infrequently is cleaned up by a cache eviction mechanism.
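Here is a minimal sketch of that cache-with-eviction idea, assuming an LRU policy (the article does not specify which eviction policy is used); `ScanCache` is an invented name:

```python
from collections import OrderedDict

class ScanCache:
    """Toy LRU cache for raw data scanned from the lake."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # file path -> cached bytes

    def get(self, path, fetch):
        if path in self.entries:
            self.entries.move_to_end(path)      # cache hit: mark recently used
            return self.entries[path]
        data = fetch(path)                      # cache miss: cold read from the lake
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return data

cache = ScanCache(capacity=2)
cache.get("a.parquet", lambda p: b"A")
cache.get("b.parquet", lambda p: b"B")
cache.get("a.parquet", lambda p: b"A")   # hit: refreshes "a.parquet"
cache.get("c.parquet", lambda p: b"C")   # over capacity: evicts "b.parquet"
```

Infrequently queried files naturally fall out of the cache, while hot scan results stay close to the compute layer, which is the effect the architecture above relies on.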
In real business scenarios, both implementations may be supported within the same architecture. Some lakehouse architectures do not include a data warehouse product at all and use only Presto for query acceleration (Volcano Engine, Bilibili), but the overall architecture is roughly the same.
The following is a list of solutions implemented by the industry:
Alibaba Cloud MaxCompute+Hologres
Alibaba Cloud EMR+StarRocks
Huawei cloud and lake warehouse integration
ByteDance’s Doris-based Lake and Warehouse Integrated Exploration
ByteDance-Volcano Engine Lake and Warehouse Integrated Cloud Service
bilibili lake and warehouse integrated architecture
Google BigLake
Amazon Lake House
Azure Lake House
SnowFlake Data Lake
Summary
At present, the lakehouse mainly targets scenarios where user data is extremely large and diverse. The warehouse's role is acceleration; the lake's role is supporting massive concurrent writes and massive storage. Designers hope to minimize architectural complexity while improving efficiency.
The following personal assessments are for reference only:
- Snowflake is essentially a natural lakehouse for analytical data scenarios and has a huge advantage.
- The Doris/StarRocks architectures will likely evolve in Snowflake's direction and have strong potential.
- Lakehouses built on Spark/Presto will have lower query efficiency than the two above, but can complement them in some scenarios.
About the Author
Ye Qiangsheng
Tencent Cloud Developer Community Author
A back-end development engineer at Tencent, currently responsible for R&D of the big data OLAP engine for Tencent Tianqi, with rich experience in big data framework development.