With the spread of cloud technology, enterprises' demand for comprehensive cross-company, cross-industry, and cross-domain data has become increasingly apparent. Correlations between data of different types and formats are growing stronger, stimulating innovation in data technology and giving rise to a big data ecosystem. The complexity and interdisciplinarity of today's problems have driven the cost of using data ever higher. Enterprises urgently need a hybrid data platform that can break down data silos, address data sovereignty, and unify data aggregation and sharing; such platforms have thus come into being.
What is a data lake
As early as 2011, a Forbes article introduced the concept of the data lake, which addressed shortcomings of the data warehouse such as long development cycles, high development and maintenance costs, and loss of detailed data. A data lake is a large data repository and processing engine: it can store vast amounts of data of many types, and offers powerful information-processing capabilities and the ability to handle nearly unlimited concurrent tasks or jobs. Wikipedia describes a data lake as a system or repository that stores data in its natural format, usually as object blobs or files, accommodating data in various schemas and structures.
The lake metaphor describes the platform visually. Water flowing into the lake represents unprocessed raw data, including tables, text, audio, images, and so on. The water in the lake represents the various data stored there; processing, analysis, and modeling can all be carried out in the lake, and processed data can remain in it. Water flowing out represents the data needed downstream after analysis, which reaches end users who draw conclusions from it.
The main idea of the data lake is to store raw data of different types and domains, including structured, semi-structured, and binary data, in a unified way, forming a centralized store that accommodates all forms of data. Such a store has enormous capacity and PB-scale computing power, and supports cross-analysis of diverse data as well as large-capacity, high-speed data pipelines.
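The idea of one centralized store accepting every form of data can be sketched minimally. This is an illustrative model only, assuming a flat path-to-bytes namespace; the paths and payloads are hypothetical.

```python
import json

# Minimal sketch of a centralized raw store holding all forms of data:
# structured, semi-structured, and binary payloads land side by side.
store = {}

def ingest(path: str, payload: bytes) -> None:
    # No format restriction at write time; the lake accepts any bytes.
    store[path] = payload

ingest("structured/users.csv", b"id,name\n1,alice\n")
ingest("semi/events.json", json.dumps({"type": "click"}).encode())
ingest("binary/logo.png", bytes([0x89, 0x50, 0x4E, 0x47]))
```

A real lake would back this namespace with a distributed file system or object store, but the contract is the same: one namespace, any format.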
Advantages of Data Lake
Easily collect data: The big difference between a data lake and a data warehouse is schema-on-read versus schema-on-write. Schema-on-read means the schema is needed only when the data is used; schema-on-write means the schema must be designed when the data is stored. Because the data lake places no restrictions on writes, it can collect data much more easily.
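The schema-on-read idea above can be sketched in a few lines. This is a minimal illustration, assuming a lake that stores raw JSON lines; the function names and schema format are hypothetical, not a real API.

```python
import json

def write_raw(lake: list, record: dict) -> None:
    # Schema-on-write would validate here; the lake accepts anything.
    lake.append(json.dumps(record))

def read_with_schema(lake: list, schema: dict) -> list:
    # The schema (field name -> type) is applied only at read time.
    rows = []
    for line in lake:
        raw = json.loads(line)
        rows.append({k: t(raw[k]) for k, t in schema.items() if k in raw})
    return rows

lake = []
write_raw(lake, {"id": "1", "name": "alice", "extra": "ignored"})
write_raw(lake, {"id": "2", "name": "bob"})

# Different consumers can apply different schemas to the same raw data.
users = read_with_schema(lake, {"id": int, "name": str})
```

Note that the `extra` field is simply retained in the raw store; a later consumer with a different schema could still recover it, which is exactly what schema-on-write would have thrown away.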
Discover more value from the data: Data warehouses and data marts can only answer pre-defined questions because they use only some of the attributes in the data, whereas the data lake stores all of the raw, detailed data and can therefore answer more questions. The data lake also lets various roles in the organization analyze data through self-service tools and apply AI and machine learning to dig more value out of the data.
Eliminate data silos: Data from the various systems is collected in the data lake, which eliminates the problem of data silos.
Better scalability and agility: A data lake can use a distributed file system to store data, so it is highly scalable. Using open-source technology also reduces storage costs. The structure of a data lake is less strict, so it is inherently more flexible, which increases agility.
The difference between a data lake and a data warehouse
The data warehouse is a database optimized for analyzing relational data from transaction systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, and the results are typically used for operational reporting and analysis. The data has been cleaned, enriched, and transformed, so it can serve as the trusted "single source of truth" for users.
The concept of the data lake was proposed in 2011. Initially it was a supplement to the data warehouse, intended to solve the problems of long development cycles, high development and maintenance costs, and loss of detailed data. Data lakes and data warehouses are similar in that both store data. The main differences between the two are shown in the following figure:
The data warehouse is an optimized database whose data structure must be defined before data is stored; the data lake is a storage platform that can freely store different types of data without defining them first. When loading data, the warehouse requires a pre-defined schema (schema-on-write); the lake defines the data only when it is about to be used (schema-on-read). The data lake therefore offers more flexibility in defining the data model and can better meet the needs of different businesses.
The lakehouse: lake and warehouse in one
As more and more companies see the advantages of the data lake, they are beginning to integrate the data lake and the data warehouse into a single platform, one that provides the functions of a data warehouse along with the processing of various data types, data science, and advanced capabilities for discovering new patterns. This is the so-called lakehouse.
Manageability: The lakehouse provides comprehensive data management capabilities. A data lake holds two kinds of data, raw and processed, and the data in it continuously accumulates and evolves. The lakehouse therefore manages data sources, data connections, data formats, and data schemas (databases/tables/columns/rows). And since the data lake is the unified data store for an enterprise, it also needs certain permission-management capabilities.
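The management metadata described above (source, format, schema, permissions) can be sketched as a tiny catalog. All class and field names here are hypothetical illustrations, not any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    source: str                                 # originating system
    fmt: str                                    # e.g. "parquet", "json"
    schema: dict                                # column name -> type name
    readers: set = field(default_factory=set)   # permission list

class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def grant_read(self, name: str, user: str) -> None:
        self._entries[name].readers.add(user)

    def can_read(self, name: str, user: str) -> bool:
        return user in self._entries[name].readers

catalog = Catalog()
catalog.register(DatasetEntry("sales", "crm", "parquet",
                              {"id": "int", "amount": "float"}))
catalog.grant_read("sales", "analyst1")
```

A production catalog would persist these entries and integrate with the enterprise's identity system, but the shape of the metadata is the same.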
Traceability: As the store for the full volume of an enterprise's data, the lakehouse must manage the entire data life cycle, covering definition, access, storage, processing, analysis, and application. A well-built data lake requires that the access, storage, processing, and consumption of any piece of data in it be traceable, so that the complete process by which the data was produced and flowed can be clearly reconstructed.
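Life-cycle traceability reduces to recording every operation on a dataset so its history can be replayed. A minimal sketch, assuming an append-only event log (names are illustrative):

```python
import time

class LineageLog:
    def __init__(self):
        self._events = []

    def record(self, dataset: str, op: str, actor: str) -> None:
        # Append-only: every access, transform, and read is kept.
        self._events.append({"dataset": dataset, "op": op,
                             "actor": actor, "ts": time.time()})

    def history(self, dataset: str) -> list:
        # Reconstruct one dataset's full life cycle, in order.
        return [e["op"] for e in self._events if e["dataset"] == dataset]

log = LineageLog()
log.record("orders", "ingest", "etl-job")
log.record("orders", "transform", "spark-job")
log.record("orders", "query", "analyst1")
```

Real systems attach far richer context (input datasets, job versions, row counts), but the reconstruction principle is the same: the history is the ordered event stream.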
Rich computing engines: The lakehouse provides a range of computing engines, from batch processing and stream computing to interactive analysis and machine learning. In general, batch engines handle data loading, transformation, and processing; streaming engines handle the parts that require real-time computation; and interactive analysis engines may be introduced for exploratory analysis scenarios. As big data technology and artificial intelligence grow ever closer, machine learning and deep learning algorithms are continuously being introduced that can read sample data from HDFS/S3 for training. The lakehouse therefore makes the computing engine extensible and pluggable.
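Engine pluggability means new engines register against a common interface without touching the dispatch code. A hedged sketch (the interface and engine names are invented for illustration):

```python
class Engine:
    def run(self, job: str) -> str:
        raise NotImplementedError

class BatchEngine(Engine):
    # Stand-in for a batch engine used for loading and transformation.
    def run(self, job: str) -> str:
        return f"batch:{job}"

class StreamEngine(Engine):
    # Stand-in for a streaming engine used for real-time computation.
    def run(self, job: str) -> str:
        return f"stream:{job}"

ENGINES = {}

def register_engine(name: str, engine: Engine) -> None:
    # New engines plug in here; dispatch below never changes.
    ENGINES[name] = engine

def submit(engine_name: str, job: str) -> str:
    return ENGINES[engine_name].run(job)

register_engine("batch", BatchEngine())
register_engine("stream", StreamEngine())
```

An interactive-analysis or ML engine would be added the same way: implement `run`, call `register_engine`, and existing job-submission code picks it up by name.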
Multi-modal storage engine: The lakehouse has a built-in multi-modal storage engine to meet the data-access requirements of different applications (considering response time, concurrency, access frequency, cost, and other factors). In practice, to achieve acceptable cost-effectiveness, the lakehouse provides a pluggable storage framework that supports HDFS, S3, and so on, and can also work with an external storage engine when necessary to meet diverse application needs.
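A pluggable storage framework typically dispatches on the URL scheme to a backend behind a shared interface. A minimal sketch, where an in-memory backend stands in for HDFS or S3 (all names are illustrative):

```python
class StorageBackend:
    def put(self, path: str, data: bytes) -> None:
        raise NotImplementedError
    def get(self, path: str) -> bytes:
        raise NotImplementedError

class MemoryBackend(StorageBackend):
    # Stand-in for a real backend such as HDFS or S3.
    def __init__(self):
        self._blobs = {}
    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data
    def get(self, path: str) -> bytes:
        return self._blobs[path]

BACKENDS = {"mem": MemoryBackend()}

def open_backend(url: str) -> StorageBackend:
    # Dispatch on the URL scheme, e.g. "mem://", "hdfs://", "s3://".
    scheme = url.split("://", 1)[0]
    return BACKENDS[scheme]

store = open_backend("mem://lake")
store.put("raw/orders.json", b'{"id": 1}')
```

Adding an external storage engine then means registering one more scheme in `BACKENDS`; application code keeps addressing data by URL.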
Oushu Data Cloud is a new generation of data infrastructure that makes it easy to implement an integrated lakehouse. Relying on a series of underlying technical changes, including cloud-native design, a separated compute-and-storage architecture, strong ACID guarantees, strong SQL-standard support, and high-performance parallel execution, it delivers upper-layer capabilities such as high elasticity, strong scalability, strong sharing, strong compatibility, strong complex-query support, and automated machine learning, ultimately helping companies respond to the increasingly evident digitalization trends toward large scale, strong sensitivity, high timeliness, and intelligence.