In technology trend outlooks published at the beginning of 2021, the integration of data lakes and data warehouses was a focal point in the big data field, and the discussion was still lively at the end of the year. The industry's main disagreements concern how data lakes and data warehouses control storage system access, permission management, and similar matters. The main point of consensus is that combining the two should reduce the cost of big data analysis and improve ease of use.
This debate reflects a core demand in big data processing: how to design data lakes and data warehouses so that they effectively meet the data architecture requirements of modern applications. As a leading cloud vendor, Amazon Cloud Technology has launched its own approach to integrating data lakes and data warehouses, the "smart lake warehouse". Why can the "smart lake warehouse" integrate data lakes, data warehouses, and other data processing services more intelligently? What does the attention paid to this architecture signify? At the 2021 Amazon Cloud Technology re:Invent conference, a bellwether for the technology industry, we saw the present state and future direction of the "smart lake warehouse" architecture.
The widely watched "smart lake warehouse" architecture
To understand the present and future of the "smart lake warehouse" architecture, one needs to understand its past. The architecture began to take shape as early as 2017. That year, Amazon Cloud Technology released Amazon Redshift Spectrum, which lets Amazon Redshift bridge the data warehouse and the data lake and query data across both.
This release laid the groundwork for the "smart lake warehouse" architecture. At the 2020 Amazon Cloud Technology re:Invent conference, Amazon Cloud Technology officially released the "smart lake warehouse". Counting from that early technical exploration, the serverless capabilities released at the 2021 Amazon Cloud Technology re:Invent conference represent the eighth round of technical evolution of the architecture. Today, the "smart lake warehouse" builds a data lake on Amazon S3 and integrates data warehouse, big data processing, log analysis, and machine learning services around the lake. Tools such as Amazon Lake Formation and Amazon Glue provide free movement of data and unified governance.
Specifically, the "smart lake warehouse" architecture has five requirements. First, break down data silos to form a data lake. Second, provide users with analysis tools suited to different application scenarios around the data lake. Third, ensure that data can move freely among the lake, the warehouse, and purpose-built services. Fourth, ensure that security, access control, and auditing of the data in the lake are managed in a unified way. Finally, make cost-effective use of the respective strengths of the lake and the warehouse, and innovate with artificial intelligence and other new techniques.
Just as the release of Amazon Redshift in 2012 set the direction for cloud-native data warehouses, the "smart lake warehouse" architecture attracted widespread industry attention as soon as it was released. This is partly because of Amazon Cloud Technology's standing as a leading cloud vendor, and partly because the architecture's technical innovations offer the industry new ways of thinking.
The "smart lake warehouse" emphasizes "architecture" rather than "product". It emphasizes the free movement of data and unified governance, as well as "smart innovation" built on the lake and the warehouse. Today, the architecture does not simply connect the lake to the warehouse; it connects the lake, the warehouse, and purpose-built data services into a whole, allowing data to move seamlessly among them. As data grows to the terabyte, petabyte, or even exabyte scale, "how to store" and "how to use" are no longer separate topics. The "smart lake warehouse" sends the industry a signal: companies need to unify their data analysis tools so that data can flow freely across the entire data platform.
From the perspective of both enterprise data management and technology, the wide attention paid to the "smart lake warehouse" architecture also means that, as the boundary between data lakes and data warehouses blurs, the big data processing architecture built on both is being restructured.
Under the "smart lake warehouse" architecture, big data infrastructure is being restructured
This restructuring can be understood along several dimensions. The most important are stronger data security, governance, and data sharing capabilities; a more agile way to build; and smarter ways to innovate.
Data security, governance, and sharing focus on moving and governing data across lakes, warehouses, and even enterprises, with the aim of genuine cross-domain data interoperability. A more agile way to build takes the enterprise's pursuit of agility to its limit, and serverless capabilities are the key. Smarter innovation brings AI/ML capabilities and big data governance into one unified scope, avoiding the trap of "big data for big data's sake".
When the integration of data lakes and data warehouses is discussed again in 2022, a "smart lake warehouse" architecture with these key features is likely to be one of the industry's main design approaches.
Stronger data security, governance and data sharing capabilities
Data security, governance, and sharing have always been core tasks in big data. But when data reaches the PB or even EB scale and must be shared or exchanged across multiple regions, organizations, and accounts, the problem is often not that companies do not want to manage data at a fine granularity, but that they cannot. Fine-grained permission control of this kind is often far more complex than in a single-machine design or a single distributed system. Data governance has therefore become a major focus of the "smart lake warehouse".
At the 2021 Amazon Cloud Technology re:Invent conference, Amazon Lake Formation, the "smart lake warehouse" component that supports unified data governance and free movement of data, released a number of new features. In addition to the table- and column-level security it already supported, Amazon Lake Formation now supports row- and cell-level permissions, which makes it easier to protect sensitive information by limiting each user to a subset of the data.
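To make the idea of row- and cell-level permissions concrete, the following is a minimal conceptual sketch in plain Python. It is not the Lake Formation API: the `DataFilter` class, the `apply_filter` function, and the sample `patients` table are all hypothetical names invented for illustration. The sketch only shows how a row rule combined with a column rule determines which cells a given user can see.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class DataFilter:
    """Hypothetical filter: the rows and columns one principal may read."""
    row_predicate: Callable[[Row], bool]  # row-level rule
    allowed_columns: FrozenSet[str]       # column-level rule


def apply_filter(rows: List[Row], data_filter: DataFilter) -> List[Row]:
    """Return only the cells that pass both the row rule and the column rule."""
    visible = []
    for row in rows:
        if not data_filter.row_predicate(row):
            continue  # the whole row is hidden from this principal
        visible.append(
            {col: val for col, val in row.items()
             if col in data_filter.allowed_columns}
        )
    return visible


# Made-up sample table with one sensitive column ("national_id").
patients = [
    {"id": 1, "country": "DE", "diagnosis": "A", "national_id": "111"},
    {"id": 2, "country": "US", "diagnosis": "B", "national_id": "222"},
    {"id": 3, "country": "DE", "diagnosis": "C", "national_id": "333"},
]

# An analyst who may see German records only, and never the national ID.
german_analyst = DataFilter(
    row_predicate=lambda row: row["country"] == "DE",
    allowed_columns=frozenset({"id", "country", "diagnosis"}),
)

result = apply_filter(patients, german_analyst)
```

In a managed service the rule would be declared once by an administrator and enforced by every query engine that reads the table; here the filtering is applied by hand simply to show the effect.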
In addition, the data mesh concept was discussed at the 2021 Amazon Cloud Technology re:Invent conference. Data mesh is also one of the top ten data technology trends identified by Gartner. In a data mesh model, the "smart lake warehouse" can turn domain data into products, enable fine-grained authorization, make data easier to use, provide cross-enterprise visibility into data access, and support federated data control and compliance. This means that under the "smart lake warehouse" architecture, a data mesh can share and compute on data across data lakes. Drawing on its data lake security, tag-based access control, and sharing capabilities, Amazon Cloud Technology provides the means to implement a data mesh, turning the concept into practice.
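Tag-based access control can be illustrated with a small conceptual sketch. Everything in it is an assumption made for illustration, not the Lake Formation API: the `CATALOG` and `GRANTS` dictionaries, the tag keys, and the principal names are hypothetical. The idea shown is that domain owners tag their tables, and a grant is written against tags rather than against individual tables.

```python
from typing import Dict, List, Set

# Hypothetical catalog: each table carries tags set by its owning domain.
CATALOG: Dict[str, Dict[str, str]] = {
    "sales.orders": {"domain": "sales", "sensitivity": "public"},
    "sales.customers": {"domain": "sales", "sensitivity": "confidential"},
    "hr.salaries": {"domain": "hr", "sensitivity": "confidential"},
}

# Hypothetical grants: a principal may read a table when, for every tag key
# in its expression, the table's tag value is one of the listed values.
GRANTS: Dict[str, Dict[str, Set[str]]] = {
    "marketing-analyst": {"domain": {"sales"}, "sensitivity": {"public"}},
    "sales-steward": {"domain": {"sales"}},
}


def tags_match(table_tags: Dict[str, str],
               expression: Dict[str, Set[str]]) -> bool:
    """True when the table satisfies every key of the tag expression."""
    return all(table_tags.get(key) in allowed
               for key, allowed in expression.items())


def readable_tables(principal: str) -> List[str]:
    """List the tables a principal may read, in alphabetical order."""
    expression = GRANTS.get(principal)
    if expression is None:
        return []  # no grant means no access
    return sorted(name for name, tags in CATALOG.items()
                  if tags_match(tags, expression))
```

With this sample data, `readable_tables("marketing-analyst")` yields only `sales.orders`, while `readable_tables("sales-steward")` yields both sales tables. A newly tagged table becomes visible to the right principals without any grant being edited, which is what makes the approach attractive when many domains publish data products.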
A more agile way to build
In addition to stronger data security, governance, and data sharing capabilities, a more agile way to build is one of the technical innovations that most companies are now focusing on. Enterprises increasingly recognize and apply agile practices, and the "smart lake warehouse" is agile by design. In the "smart lake warehouse" architecture, Amazon Lake Formation can shorten the time needed to set up a data lake from months to days. Users can bring data into the lake quickly with serverless data integration tools such as Amazon Glue, and run SQL queries and analysis directly on the lake with serverless query engines such as Amazon Athena. Whether a very large company or a small studio, any organization can benefit quickly from this agile approach and extract value from its data.
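As a rough illustration of SQL directly on the lake, the sketch below submits a query through the Athena API and waits for it to finish. `start_query_execution` and `get_query_execution` are real operations of the boto3 Athena client, but the database name, table, and S3 bucket are hypothetical, and the `StubAthena` class is only an offline stand-in so the sketch runs without AWS credentials.

```python
import time
from typing import Any, Dict


def run_athena_query(client: Any, sql: str, database: str,
                     output_location: str, poll_seconds: float = 1.0) -> str:
    """Submit a SQL query and block until it finishes; return the query ID."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


class StubAthena:
    """Offline stand-in for the Athena client; succeeds on the second poll."""

    def __init__(self) -> None:
        self.submitted: Dict[str, Any] = {}
        self.polls = 0

    def start_query_execution(self, **kwargs: Any) -> Dict[str, str]:
        self.submitted = kwargs
        return {"QueryExecutionId": "stub-query-1"}

    def get_query_execution(self, QueryExecutionId: str) -> Dict[str, Any]:
        self.polls += 1
        state = "RUNNING" if self.polls < 2 else "SUCCEEDED"
        return {"QueryExecution": {"Status": {"State": state}}}


stub = StubAthena()
query_id = run_athena_query(
    stub,
    sql="SELECT region, COUNT(*) AS orders FROM orders GROUP BY region",
    database="lake_db",  # hypothetical catalog database
    output_location="s3://example-bucket/athena-results/",  # hypothetical
    poll_seconds=0.0,
)
```

Against a real account, the stub would be replaced by `boto3.client("athena")`. No cluster is created or sized anywhere in the code, which is the point of a serverless query engine.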
To make building even more agile, Amazon Cloud Technology announced serverless versions of more data analysis services at the 2021 Amazon Cloud Technology re:Invent conference. With serverless capabilities, users can build their own data storage, analysis, and intelligent application solutions with greater agility.
- Amazon Redshift Serverless makes the data warehouse more agile. It provisions and scales resources automatically within seconds, so users can run high-performance analytics workloads on PB-scale data without managing data warehouse clusters;
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless makes streaming data ingestion and processing more agile. It scales resources quickly, simplifies real-time data ingestion and streaming, fully manages partitions (including monitoring them and moving them to even out load across a cluster), and automatically provisions and scales compute and storage resources, so users can use Kafka on demand;
- Amazon EMR Serverless makes big data processing more agile. Users can run analytics applications with open-source big data frameworks (such as Apache Spark, Hive, and Presto) without deploying, managing, or scaling the underlying infrastructure;
- Amazon Kinesis Data Streams on Demand makes streaming data analysis and real-time data scenarios more agile. It can handle gigabytes of write and read throughput per minute without the need to provision and manage servers and storage, which makes balancing cost and performance simpler.
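Taking the last item as an example, the sketch below shows what "no capacity to provision" looks like from the caller's side. `create_stream` with `StreamModeDetails={"StreamMode": "ON_DEMAND"}` and `put_record` are real operations of the boto3 Kinesis client. The stream name and event fields are hypothetical, and `StubKinesis` is an offline stand-in that only records calls so the sketch runs without AWS credentials.

```python
import json
from typing import Any, Dict, List, Tuple


def create_on_demand_stream(client: Any, stream_name: str) -> None:
    """Create a stream in on-demand mode, so no shard count is specified."""
    client.create_stream(
        StreamName=stream_name,
        StreamModeDetails={"StreamMode": "ON_DEMAND"},
    )


def put_event(client: Any, stream_name: str, event: Dict[str, Any],
              partition_key: str) -> Dict[str, Any]:
    """Serialize one event as JSON and write it to the stream."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )


class StubKinesis:
    """Offline stand-in for the Kinesis client; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def create_stream(self, **kwargs: Any) -> None:
        self.calls.append(("create_stream", kwargs))

    def put_record(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(("put_record", kwargs))
        return {"ShardId": "shardId-000000000000",
                "SequenceNumber": str(len(self.calls))}


stub = StubKinesis()
create_on_demand_stream(stub, "clickstream")  # hypothetical stream name
response = put_event(stub, "clickstream",
                     {"user": "u1", "action": "view"}, partition_key="u1")
```

Against a real account, the stub would be replaced by `boto3.client("kinesis")`, and the caller would need to wait until the new stream is active before writing to it. The only capacity-related input is the stream mode; there is no shard count to estimate or adjust.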
Data from Amazon Cloud Technology shows that tens of thousands of users are using Amazon Redshift to process more than 2EB of data every day. Dr. Yannick Misteli, chief cloud platform and machine learning engineer at Roche, one of the world’s largest pharmaceutical companies, said: “Amazon Redshift Serverless can reduce operational burdens, reduce costs, and help Roche Pharmaceuticals implement the Go-to-Market strategy on a large scale. This minimalist approach changes the rules of the game, helps us get started quickly and supports various arduous analysis scenarios."
Smarter innovation
As Yannick Misteli's remarks suggest, in recent years innovation in the underlying technology has driven change at the business layer, and demands from the business layer have in turn pushed the underlying technology forward. Technological upgrades are changing the rules of the game. Today, "intelligence" is the goal toward which most technologies are evolving, and in Amazon Cloud Technology's "smart lake warehouse" architecture, "smart" holds a very important position.
Under the "smart lake warehouse" architecture, database services are deeply integrated with artificial intelligence and machine learning. In terms of specific products, Amazon Cloud Technology offers several machine learning capabilities built into its databases, such as Amazon Aurora ML, Amazon Neptune ML, and Amazon Redshift ML.
The "smart lake warehouse" architecture also includes Amazon SageMaker, a cloud-native artificial intelligence platform that provides many kinds of machine learning libraries and development kits to help users build artificial intelligence applications quickly. When users face scenarios that require processing large amounts of data, they can use Amazon SageMaker's built-in tools to connect quickly and easily to an Amazon EMR cluster for big data processing. Amazon EMR Serverless, in turn, helps make AI-related data processing and analysis agile.
In the report "Magic Quadrant for Cloud Database Management Systems" released by Gartner in 2021, Amazon Cloud Technology was rated a "Leader" for the seventh consecutive year. The report gives a panoramic evaluation of the cloud databases and cloud data analysis tools offered by major vendors and assigns each vendor a final position in the quadrant, which shows how much the rating is worth. The Amazon Cloud Technology products evaluated are all representative products of the "smart lake warehouse" architecture, and the technical maturity behind this "Leader" rating speaks for itself.
We can see that each iteration of the services and tools in the "smart lake warehouse" moves toward a more agile, more secure, and smarter data architecture. As the foundation of enterprise digital transformation, data architecture is also the underlying driver of application modernization. The changes in data management brought by the "smart lake warehouse" also carry Amazon Cloud Technology's vision for application modernization.
Closing thoughts
Returning to the question raised at the beginning of the article: the industry agrees that integrating data lakes and data warehouses will reduce the cost of big data analysis, while the main disagreements concern how data lakes and data warehouses control storage system access and permission management. Amazon Cloud Technology's "smart lake warehouse" architecture provides tools and services that address these issues.
Whether in data infrastructure, unified analytics, or business innovation, and from connecting data lakes and data warehouses to sharing across databases and domains, the "smart lake warehouse" does not exist in isolation in real business scenarios; it is closely tied to applications.
The modernization of the underlying data architecture will also bring greater value to enterprises and to the industry as a whole. The importance of data, the "fifth factor of production" alongside land, labor, capital, and technology, is self-evident. Today, the practical use of Amazon Cloud Technology's "smart lake warehouse" architecture in enterprises offers a path that others can follow to build a modern data platform.