Introduction by Alibaba Cloud Intelligence researcher Lin Wei: Alibaba's evolution from data lake to data warehouse led us to the integration of lake and warehouse, which combines the flexibility and rich data types of the lake with the growth potential and enterprise-level management of the warehouse. It is a valuable asset drawn from Alibaba's best practices and a new generation of big data architecture.
**Lin Wei, Alibaba Cloud Intelligence researcher, technical lead of the Alibaba Cloud Intelligence general computing platform MaxCompute and the machine learning platform PAI**
This article covers, in three parts, the continuous evolution of a cloud-native big data platform that integrates the lake with the warehouse and the offline data warehouse with the real-time one. By looking back at the history from data lake to data warehouse, we reflect on why the integration of lake and warehouse is necessary, and why, at this stage, the integrated lake and warehouse has begun to bring the offline and real-time data warehouses together.
- Lake and warehouse in one
- Offline and online data warehouse integration
- Smart warehouse
I hope this talk gives you a better understanding of why we built an integrated lake and warehouse.
One, lake and warehouse integration
(1) Alibaba's journey from data lake to data warehouse
The Ningbo Strategy Conference in 2007 set the goal of building an open, collaborative and prosperous e-commerce ecosystem, with data at its core. At that time, however, each business department was building its data capabilities vertically and using data to support its own business decisions. These data centers supported the growth of the individual departments. But once we reached a certain stage, we wanted to dig further into the correlations between the data of different departments and use this higher-level analysis to unlock more business value. We ran into many difficulties because the data came from different departments: different people hand you different data sets, and without clear data quality monitoring you cannot tell whether the data is complete. You have to spend a great deal of time repeatedly calibrating the data. The process takes too long, and in most cases much of the work is wasted, which lowers the efficiency of the company as a whole.
So in 2012 we decided to connect the data of all business departments and committed to "One Data, One Service". This was in fact a typical upgrade from a data lake to a data warehouse, but because we lacked a mature integrated lake and warehouse system, it was very hard. We called the project "moon landing", and the name tells you how difficult it was. During this period, teams even had to pause day-to-day business development to help organize the data and to move all existing data analysis processes onto a unified data warehouse system. In the end, after 18 months and at a very high cost, we completed the unified big data warehouse platform in December 2015. This is Alibaba's MaxCompute. With this unified data warehouse platform, business teams, merchants, logistics and other parts of the chain can uncover business opportunities more conveniently and quickly. After the unified big data platform was completed, Alibaba's business growth also entered the fast lane, precisely because better data support lets merchants and customers make business decisions quickly.
(2) The relationship between data warehouse and data lake
From the developer's point of view, the data lake is more flexible, and developers prefer this freewheeling mode. Any engine can read and write the data without constraints, and it is very easy to get started.
From the data manager's point of view, the data lake works as a starting point, but once it reaches a certain scale, when data is treated as an asset or larger business decisions depend on it, a well-governed data warehouse becomes desirable.
(3) Growth curve of data warehouse and data lake system
The growth curve in the figure above is essentially the curve of Alibaba's own development. At the beginning we were also in the data lake stage: each business department developed independently, which meant a fast start and great flexibility. But once a certain scale was reached, the data was unmanaged, and the departments defined their data in inconsistent ways that were hard to align. At that time, 50% to 80% of the time was spent unproductively on verifying data. As the scale kept growing, these losses grew with it, which forced us to push for a unified company-wide data warehouse.
(4) Lake and warehouse in one
Precisely because we went through pain comparable to a "moon landing", we do not want MaxCompute's future enterprise customers to go through the same thing, so we built a development platform that integrates the lake and the warehouse. When a company is small, it can use the data lake capabilities to customize its analysis quickly. When the company has grown to the stage where it needs better data management and governance, the integrated lake and warehouse platform can seamlessly upgrade how data and data analysis are managed, making the company's data management more standardized. This is the core idea behind the overall design of the integrated lake and warehouse.
We organically integrate the lake system and the warehouse system. At the beginning there is no metadata. When you want to build a data warehouse, we can extract metadata from the data on the lake. This metadata is kept together with the warehouse's metadata on an integrated metadata analysis platform, and many data warehouse management platforms can be built on top of it.
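The idea of extracting metadata from data that already sits on the lake can be sketched as follows. This is a minimal illustration, assuming the lake holds schemaless JSON records; the function `infer_schema` and the sample records are hypothetical and do not represent how MaxCompute actually does it.

```python
import json

def infer_schema(raw_lines):
    """Infer a column -> sorted list of observed type names from JSON records."""
    schema = {}
    for line in raw_lines:
        record = json.loads(line)
        for column, value in record.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in schema.items()}

# Hypothetical raw records as they might sit on a data lake.
lake_records = [
    '{"order_id": 1, "amount": 9.5, "city": "Hangzhou"}',
    '{"order_id": 2, "amount": 12, "city": "Ningbo"}',
]

# The inferred mapping plays the role of a catalog entry for the warehouse.
catalog_entry = infer_schema(lake_records)
print(catalog_entry)
```

A column that shows more than one type, such as `amount` here, is the kind of inconsistency that warehouse-side governance would want to surface before the data is treated as an asset.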
At the same time, the integrated lake and warehouse platform supports many analysis engines. These include task-based computing engines, such as MaxCompute for batch processing, Flink for stream processing, and machine learning engines, as well as open source components, all of which can analyze our data. There are also serving engines that support interactive queries and present our data in closer to real time, so that users can build their own data service applications on top of them.
On top of the engines, we build rich data management tools that let business departments carry out efficient, end-to-end data governance. This is possible because we have connected the data of the lake and the warehouse, which is the core of the whole integrated lake and warehouse design.
Two, offline and online data warehouse integration
Society is moving faster and becoming more convenient, and customers need to make business decisions more quickly. Examples include the real-time GMV dashboard for Double Eleven, the real-time dashboard for the Spring Festival Gala broadcast, and the shift in machine learning from offline models to online models. These demands have driven the development of real-time data warehouses.
In fact, the real-time data warehouse has followed a development path similar to that of the offline data warehouse. In the early stage of real-time systems, the first thing we considered was the engine, because real-time analysis is only possible once you have an engine, so Alibaba invested its research and development effort in stream computing engines such as Flink. But with only a stream computing engine, much like the data lake stage, we lacked management of the analysis results. So in the second stage we used our offline data warehouse products to manage those results and bring them into our overall data warehouse and data management. However, putting the results of real-time analysis into an offline data warehouse is clearly not timely enough for real-time business decisions. So we are now developing the third stage: the real-time data warehouse.
We write the results of the streaming engine into the real-time data warehouse Hologres as they are produced, so that they can themselves be analyzed in real time and effectively support customers' real-time business decisions.
This is the integrated design of offline and online data warehouses.
To sum up, before offline and online data warehouses were integrated, analysis was a very complicated process involving offline systems, online systems and many different engines. We have now simplified it into the architecture shown in the figure above. The real-time engine does the preprocessing; the preprocessed data is then written to the MaxCompute offline data warehouse and, at the same time, to the Hologres real-time data warehouse, which enables more timely BI analysis. MaxCompute's offline data warehouse, for its part, has lower storage costs and better throughput, and can handle large volumes of offline analysis. This is the integrated design of offline and online data warehouses.
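The flow described above (preprocess once, then write to both an offline store and a real-time store) can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the classes `OfflineWarehouse` and `RealtimeWarehouse`, the `preprocess` rules and the event fields are all hypothetical stand-ins, not the MaxCompute, Hologres or Flink APIs.

```python
from collections import defaultdict

class OfflineWarehouse:
    """Stand-in for a batch store: keeps every row for later bulk analysis."""
    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)

class RealtimeWarehouse:
    """Stand-in for a serving store: keeps a running total per city."""
    def __init__(self):
        self.gmv_by_city = defaultdict(float)

    def write(self, row):
        self.gmv_by_city[row["city"]] += row["amount"]

def preprocess(event):
    """Streaming preprocessing: drop invalid events and normalize fields."""
    if event.get("amount", 0) <= 0:
        return None
    return {"city": event["city"].strip().title(), "amount": float(event["amount"])}

def run_pipeline(events, sinks):
    """Preprocess each event once, then write the same row to every sink."""
    for event in events:
        row = preprocess(event)
        if row is None:
            continue
        for sink in sinks:
            sink.write(row)

offline, realtime = OfflineWarehouse(), RealtimeWarehouse()
events = [
    {"city": " hangzhou", "amount": 10},
    {"city": "Hangzhou", "amount": 5.5},
    {"city": "ningbo", "amount": -1},  # invalid, dropped during preprocessing
    {"city": "ningbo", "amount": 3},
]
run_pipeline(events, [offline, realtime])
print(len(offline.rows), dict(realtime.gmv_by_city))
```

The point of the sketch is the shape of the design: a single preprocessing step feeds both sinks, so the offline and real-time views are derived from the same cleaned rows rather than from two separately maintained pipelines.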
An integrated design gives customers a well-balanced system. Depending on the data scenario or business scenario, you can use batch processing. And with data compression and cold storage, data is placed in different storage tiers according to how hot or cold it is, so offline analysis can be done at lower cost.
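Tiering data by how hot or cold it is can be illustrated with a small sketch. The thresholds `HOT_DAYS` and `WARM_DAYS`, the tier names and the function `storage_tier` are assumptions chosen for illustration; the article does not specify how tiers are actually assigned.

```python
from datetime import date

# Hypothetical thresholds; real values depend on access patterns and cost.
HOT_DAYS = 7
WARM_DAYS = 90

def storage_tier(partition_date, today):
    """Pick a storage tier from the age of a date partition."""
    age_days = (today - partition_date).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    return "cold"

today = date(2021, 6, 30)
partitions = [date(2021, 6, 28), date(2021, 5, 1), date(2020, 11, 11)]
tiers = {p.isoformat(): storage_tier(p, today) for p in partitions}
print(tiers)
```

Here age is used as a simple proxy for temperature; a production policy would more likely look at actual access frequency as well.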
When the real-time value of data matters more, a stream computing engine can do the work. When you also want fast interaction, quickly viewing generated reports in different ways, dimensions and angles, you can use the interactive engine to explore highly refined data across those dimensions.
The goal is for the integrated lake and warehouse platform to strike a good balance, finding the right point for each customer's actual business volume, requirements and cost.
In general, we want the integrated lake and warehouse system, whether offline or online, to support many types of analysis through different analysis engines. At the same time, BI can run in real time through the online serving engines. The result is low cost, customizability, and a range of trade-offs between real-time and online services, so that customers can choose according to their actual business scenarios.
Three, smart data warehouse
With a unified data warehouse platform, we can build a powerful data governance and analysis platform on top of it. This is our DataWorks. The platform offers many tools for data modeling, data quality and standards, data lineage analysis, programming assistance, and so on. It is precisely because of this foundation, which integrates lake with warehouse and online with offline, that a more intelligent big data development and governance platform becomes possible, allowing us to share more proven and effective data governance experience with our enterprise customers.