This article was first published on Jianshu: https://www.jianshu.com/u/204b8aaab8ba
Version | Date | Remarks
1.0 | 2021.5.9 | First published
1.1 | 2021.5.11 | Added section headings
1.2 | 2021.6.6 | Expanded the summary and introduced Lake House

These are my study notes. They draw heavily on excerpts from the Internet and from books, organized to present the content I consider most relevant.

1. Data Warehouse

Business Intelligence (BI) was born in the 1990s. It converts a company's existing data into knowledge and helps the company make analytical business decisions.

For example, in retail store management, to maximize the profit of a single store we need to analyze the sales and inventory data of every product and formulate a reasonable sales and purchasing plan for each one. Slow-moving products need markdowns and promotions, while popular products need advance purchasing based on forecasts of future sales. None of this is possible without analyzing large amounts of data.

Data analysis needs to aggregate data from multiple business systems, for example data from the trading system, data from the warehousing system, and so on. At the same time, it needs to retain historical data and run range queries over large data volumes. Traditional databases are oriented toward a single business system and mainly implement transaction-oriented create, read, update, and delete operations, so they can no longer satisfy data analysis scenarios. This prompted the emergence of the data warehouse concept.

Take e-commerce as an example. In an e-commerce scenario, one database is dedicated to storing order data while another stores member-related data. To build a data warehouse, we first synchronize the data of the different business systems into a unified data warehouse, and then organize the data by subject domain.

A subject domain is a high-level abstraction of a business process. Commodities, transactions, users, and traffic can all be regarded as subject domains; you can think of them as the catalog of the data warehouse. Data in the warehouse is generally partitioned by time and retained for more than five years. Data in each time partition is written append-only, and existing records are not updated.

In summary, compared with the relational databases we commonly use, a data warehouse differs in the following ways:

  • Workload: a data warehouse is analysis-oriented (batch, append-only writes, trading minimal IO for maximum throughput), while a database is transaction-oriented (continuous small writes and reads, with a focus on consistency).
  • Data format: a data warehouse uses a denormalized schema, while a database uses a highly normalized, static schema.

1.1 Real-time data warehouse
A real-time data warehouse is very similar to an offline data warehouse. It arose mainly because enterprises' demand for real-time data services has kept growing in recent years. Its internal data model is also divided into several layers, just like the offline warehouse: ODS, CDM, ADS. However, the latency requirements are extremely demanding, so storage generally uses a log-based MQ such as Kafka, and computation uses a stream processing engine such as Flink or Storm.
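To make the layering concrete, here is a minimal, self-contained Python sketch of what such a stream job conceptually does: raw events arrive (the ODS layer), get cleaned and standardized (CDM), and feed a continuously updated aggregate (ADS). This is only an illustration under invented field names; in production the stream would typically be Kafka topics and the computation a Flink job, not an in-process generator.

```python
import random
import time
from collections import defaultdict

def ods_click_stream(n=20):
    """Simulated ODS layer: raw events as they would arrive from a log-based MQ."""
    for _ in range(n):
        yield {"user_id": random.randint(1, 5),
               "item_id": random.choice(["A", "B", "C"]),
               "ts": time.time()}

def cdm_clean(event):
    """Simulated CDM layer: light cleaning and standardization of each event."""
    return {"item_id": event["item_id"],
            "dt": time.strftime("%Y-%m-%d", time.localtime(event["ts"]))}

# Simulated ADS layer: a continuously updated aggregate that a data product could read.
ads_item_clicks = defaultdict(int)

for raw in ods_click_stream():
    clean = cdm_clean(raw)
    ads_item_clicks[clean["item_id"]] += 1

print(dict(ads_item_clicks))   # e.g. {'B': 7, 'A': 6, 'C': 7}
```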

2. Data Lake

The Internet era brought two major changes.

  • One is that the scale of data is unprecedented. A successful Internet product can easily exceed 100 million daily active users, as with the well-known Toutiao, Douyin, Kuaishou, and NetEase Cloud Music, which generate hundreds of billions of user behavior events every day. Traditional data warehouses are difficult to scale out and simply cannot carry such data volumes.
  • The other is that data types have become heterogeneous. In the Internet era, besides structured data from business databases, there is front-end tracking data from apps and the Web, as well as back-end logs from business servers. This data is generally semi-structured or even unstructured. Traditional data warehouses place strict requirements on the data model: before data is imported, the model must be defined in advance, and the data must be stored according to that design.

Therefore, the limitation of data scale and data type makes traditional data warehouses unable to support business intelligence in the Internet era.

In 2005, Hadoop was born. Compared with traditional data warehouses, Hadoop has two main advantages:

  • It is fully distributed and easy to scale: a cluster of low-cost machines can provide strong computing and storage capacity to meet the processing needs of massive data;
  • It weakens the data format requirement. After data is integrated into Hadoop, no particular format needs to be enforced; the data model is separated from the data storage. When the data (including the raw data) is used, it can be read according to different models, satisfying flexible analysis of heterogeneous data. A data warehouse, by contrast, pays more attention to data that can serve as a basis for facts.

With the maturing of Hadoop and object storage, the concept of the data lake was put forward around 2010: a data lake is a repository or system that stores data in its original format (which implies that the lake's underlying layer should not be coupled to any particular storage engine).
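As a rough illustration of "store first, model later", the sketch below lands raw JSON events in date-partitioned paths exactly as they arrived, and only applies a schema when the data is read back. All paths and field names are invented; a real lake would sit on HDFS or object storage rather than a local directory.

```python
import json
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("./datalake/raw/app_log")   # stand-in for an object store bucket/prefix

def land_raw(events):
    """Write events to the lake in their original format, partitioned by ingestion date."""
    partition = LAKE_ROOT / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "part-0000.json", "a", encoding="utf-8") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

def read_with_model(fields):
    """Schema-on-read: keep only the fields a given analysis cares about, ignore the rest."""
    for path in LAKE_ROOT.rglob("*.json"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield {k: record.get(k) for k in fields}

land_raw([{"uid": 1, "event": "click", "page": "home", "extra": {"ab": "b1"}}])
print(list(read_with_model(["uid", "event"])))
```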

Correspondingly, if a data lake is not well managed (no metadata, no defined data sources, no data access and security policies, no data movement, no data catalog), it turns into a data swamp.

In terms of product form, a data warehouse is usually an independent, standardized product, whereas a data lake is more like an architectural guideline: a series of surrounding tools is needed to realize the data lake the business requires.

3. Big data platform

For a data developer, a common process for completing a requirement is: first import the data into the big data platform, then develop against the requirement; after development, verify and compare the data to confirm it meets expectations; next, publish the data online and submit it to the scheduler; finally comes daily task operations and maintenance to ensure the task produces data normally every day.

At this point, the industry proposed the concept of the big data platform to improve the efficiency of data development, lower its threshold, and let data be processed quickly on a single pipeline.

4. Data middle platform

The application of large-scale data has gradually exposed some problems.

In the early stage of business development, in order to deliver business requirements quickly, chimney-style development led to data fragmentation between business lines and even between different applications of the same business line. The same indicator shown in two data applications produced inconsistent results, which eroded operations staff's trust in the data. If you work in operations and want to look at product sales, and you find two different values on two reports, both called the sales indicator, how do you feel? Your first reaction is that the data is wrong, and you no longer dare to use it.

Another problem caused by data fragmentation is waste: large amounts of repeated computation and development waste R&D effort as well as computing and storage resources, and the cost of big data applications keeps rising.

  • If you work in operations and ask for a piece of data, and development tells you it will take at least a week, you must wonder whether this is too slow and whether it could be faster.
  • If you are a data developer facing a flood of requirements, you are surely complaining: too many requirements, too few people, and the work never ends.
  • If you are the owner of a business and see your monthly bill growing exponentially, you must be thinking it is too expensive and wondering whether it could be cheaper.

The root of these problems is that data cannot be shared. In 2016, Alibaba took the lead in proposing the idea of the data middle platform. Its core is to avoid repeated computation of data, improve data sharing capability by turning data into services, and empower data applications. Previously, intermediate data was hard to share and could not be accumulated. With a data middle platform in place, the speed of building data applications is no longer limited by the speed of data development; overnight, many scenario-based data applications can be incubated, and these applications make the data generate value.

4.1 Data middle platform model

In the process of building a data middle platform, the following key points are generally emphasized:

  • Efficiency, quality, and cost are the keys that determine whether data can support the business well. The goal of building a data middle platform is high efficiency, high quality, and low cost.
  • Processing data only once is the core of building a data middle platform; in essence it means sinking common computation logic to a lower layer and reusing it.
  • If your company has more than three data application scenarios and data products are still being developed and updated, you should seriously consider building a data middle platform.

Now let's take a look at Alibaba's data middle platform practice.

As mentioned above, processing data only once is the core of building a data middle platform, which essentially means sinking and reusing common computation logic. Alibaba's practice summarizes this as a family of "One" methodologies, such as:

  • OneData: only one copy of common data is kept
  • OneService: data is exposed through a unified service interface

4.1.2 Data Service

The main purpose of the data service layer is to expose data. Without a data service layer, data would be exported directly to the consumer, which is inefficient and insecure.

In long-term practice, Alibaba's data services went through four stages:

  1. DWSOA
  2. OpenAPI
  3. SmartDQ
  4. OneService
4.1.2.1 DWSOA

The approach is simple and crude: expose the business side's data requirements through SOA services. Driven by requirements, each requirement leads to one or several interfaces; interface documentation is written and the interfaces are opened to the business parties.

Business requirements are of course very important, but without taking the technical side into account, the long-term maintenance cost becomes extremely high: there are many interfaces with a low reuse rate, and the whole process from developing an interface through testing to going online takes at least a day. Sometimes the requirement is merely to add one or two field attributes, yet the full process must still be followed, which is inefficient.

4.1.2.2 OpenAPI

The obvious problem of the DWSOA stage is chimney-style development, which produces many interfaces that are difficult to maintain, so ways had to be found to reduce the number of interfaces. Alibaba analyzed these requirements internally and found that the implementation logic was basically to fetch data from a DB and then wrap the result as a service; many interfaces could actually be merged.

OpenAPI is the second stage of data services. The specific method is to aggregate data by its statistical granularity: data of the same dimension forms a logical table served by the same interface definition. Take the member dimension as an example: all member-centric data is made into one logical wide table, and any query at member granularity simply calls the member interface. After a period of implementation, the results showed that this method effectively converged the number of interfaces.
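A minimal sketch of the idea, with invented field names: all member-centric data is served by a single member-dimension interface over one logical wide table, instead of one bespoke interface per requirement.

```python
# Hypothetical member-dimension "logical wide table", keyed by member_id.
MEMBER_WIDE_TABLE = {
    101: {"name": "Alice", "level": "gold", "order_cnt_30d": 12, "gmv_30d": 3400.0},
    102: {"name": "Bob", "level": "silver", "order_cnt_30d": 3, "gmv_30d": 280.0},
}

def query_member(member_id, fields):
    """The single member-granularity interface: callers only say which columns they need."""
    row = MEMBER_WIDE_TABLE.get(member_id)
    if row is None:
        return None
    return {f: row[f] for f in fields}

# Two different business requirements, one interface.
print(query_member(101, ["level", "gmv_30d"]))
print(query_member(102, ["order_cnt_30d"]))
```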

4.1.2.3 SmartDQ


However, data dimensions were not as controllable as the developers imagined. As time went by, people used the data more deeply and analyzed it along more and more dimensions. OpenAPI eventually produced nearly a hundred interfaces, and with them a heavy maintenance workload for object-relational mapping.

So Alibaba's engineers added a SQL-like DSL layer on top of OpenAPI, turning all simple query services into a single interface. Since then, all simple query services have converged to only one interface, which greatly reduces the maintenance cost of data services. The traditional way of troubleshooting requires going through source code and confirming the logic; with SmartDQ, only the SQL needs to be checked. The interface can also be opened to the business side, which provides external services by writing SQL and maintains that SQL itself; this can be regarded as a milestone on the service's road toward DevOps.

Although logical tables already existed in the OpenAPI stage, they only truly came into their own in the SmartDQ stage, because SmartDQ really exploits the logical table. The SQL provider only needs to care about the structure of the logical table, not how many physical tables make up the underlying layer, nor whether those physical tables live in HBase or MySQL, in a single table or sharded across databases and tables, because SmartDQ already encapsulates access to heterogeneous data sources and distributed query capability. In addition, fields in the data layer change relatively frequently, and this kind of underlying change is among the worst for the application layer. The design of the logical table layer avoids this pain point well: only the mapping of physical fields in the logical table needs to change, the change takes effect immediately, and the caller is completely unaware of it.

Interfaces are easy to create but hard to retire, and each interface binds a group of people (the business side, the interface developers and maintainers, and the callers). Therefore the data service interfaces provided externally must be as abstract as possible, the number of interfaces should converge as much as possible, and the maintenance workload should be reduced as much as possible while guaranteeing service quality. SmartDQ now provides more than 300 SQL templates, each SQL serving the needs of multiple interfaces, yet only one engineer is needed to maintain SmartDQ.

4.1.2.4 OneService

The fourth stage is the unified data service layer (that is, OneService).

You may be wondering: SQL cannot express complex business logic. That is true; SmartDQ only meets simple query service needs. The scenarios encountered fall into several types: personalized vertical business scenarios, real-time data push services, and scheduled task services. Therefore OneService mainly provides several service types to meet user needs, namely OneService-SmartDQ, OneService-Lego, OneService-iPush, and OneService-uTiming.

  1. OneService-SmartDQ: meets simple query service needs.
  2. OneService-Lego: a plug-in approach; each type of requirement is developed as a plug-in and packaged as a microservice, with Docker used for isolation so that plug-ins do not affect each other.
  3. OneService-iPush: provides WebSocket and long-polling modes; its main application scenario is real-time data broadcasting to the merchant side.
  4. OneService-uTiming: provides immediate-task and scheduled-task modes; its main application scenario is users who need to run tasks over large data volumes.
4.1.3 Technical details
4.1.3.1 SmartDQ


The metadata model of SmartDQ is, in simple terms, the mapping from logical tables to physical tables. From bottom to top it consists of the following (a configuration sketch follows the list):

  1. Data source: SmartDQ supports cross-data source query, and the bottom layer supports access to multiple data sources, such as MySQL, HBase, OpenSearch, etc.
  2. Physical table: A physical table is a table in a specific data source. Each physical table needs to specify which columns the primary key consists of, and the statistical granularity of the table can be known after the primary key is determined.
  3. Logical table: A logical table can be understood as a view in the database, a virtual table, or a large wide table composed of several physical tables with the same primary key. SmartDQ displays only logical tables to users, thereby shielding the storage details of the underlying physical tables.
  4. Subject: The logical table is usually mounted under a subject for management and search.
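The following configuration sketch (all names invented) shows the bottom-to-top mapping described above: data sources hold physical tables, physical tables sharing a primary key are combined into a logical table, and the logical table is mounted under a subject.

```python
# A hypothetical SmartDQ-style metadata model expressed as plain data.
metadata = {
    "data_sources": {
        "hbase_rt": {"type": "HBase"},
        "mysql_01": {"type": "MySQL"},
    },
    "physical_tables": {
        "rt_member_stats": {"source": "hbase_rt", "primary_key": ["member_id"],
                            "columns": ["member_id", "gmv_30d"]},
        "dim_member":      {"source": "mysql_01", "primary_key": ["member_id"],
                            "columns": ["member_id", "name", "level"]},
    },
    # Physical tables sharing the same primary key form one logical (virtual) wide table.
    "logical_tables": {
        "member": {"primary_key": ["member_id"],
                   "physical_tables": ["dim_member", "rt_member_stats"]},
    },
    # Logical tables are mounted under a subject for management and discovery.
    "subjects": {"membership": ["member"]},
}

def columns_of(logical_table):
    """List all columns a caller can query on a logical table, regardless of where they live."""
    cols = []
    for pt in metadata["logical_tables"][logical_table]["physical_tables"]:
        cols.extend(metadata["physical_tables"][pt]["columns"])
    return sorted(set(cols))

print(columns_of("member"))
```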

  1. Query databases

The bottom layer of SmartDQ supports a variety of data sources, and the data comes mainly from two places:

  • Real-time public-layer computation writes its results directly to HBase.
  • Offline public-layer data is synchronized to the corresponding query databases by synchronization jobs.

  2. Service layer

  • Metadata configuration. Data publishers configure metadata in the metadata center and establish the mapping between physical tables and logical tables; the service layer loads this metadata into a local cache for subsequent model parsing.
  • Main processing module. From request to result, a query usually goes through the following steps (a simplified pipeline sketch follows this list):
    • DSL parsing: parse the user's query DSL and construct a complete query tree.
    • Logical query construction: traverse the query tree and, by looking up the metadata model, transform it into a logical query.
    • Physical query construction: by looking up the mapping between logical and physical tables in the metadata model, transform the logical query into a physical query.
    • Query splitting: if the query involves multiple physical tables and splitting is allowed in the scenario, split the Query into multiple SubQueries.
    • SQL execution: assemble the split SubQueries into SQL statements and hand them to the corresponding DBs for execution.
    • Result merging: merge the results returned by the DBs and return them to the caller.
  • Other modules. Besides necessary functions (such as logging and permission verification), some modules in the service layer are dedicated to performance and stability optimizations.
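As a very rough sketch of the pipeline above (not Alibaba's implementation; the DSL, metadata, and executor are all simplified stand-ins), a query walks through parsing, logical and physical query construction, splitting, per-source execution, and result merging:

```python
# Simplified stand-ins: the "DSL" is a dict, metadata maps logical columns to physical tables.
METADATA = {
    "member": {  # logical table -> where each logical column physically lives
        "name": ("mysql_01.dim_member", "name"),
        "gmv_30d": ("hbase_rt.rt_member_stats", "gmv_30d"),
    }
}

def parse_dsl(dsl):
    """DSL parsing: turn the caller's request into a small logical-query structure."""
    return {"table": dsl["from"], "fields": dsl["select"], "pk": dsl["where"]}

def build_physical_queries(logical):
    """Physical query construction + splitting: group requested fields by backing table."""
    split = {}
    for field in logical["fields"]:
        physical_table, column = METADATA[logical["table"]][field]
        split.setdefault(physical_table, []).append(column)
    return [{"table": t, "columns": cols, "pk": logical["pk"]} for t, cols in split.items()]

def execute(subquery):
    """Pretend to run one SubQuery against its backing store and return a row fragment."""
    fake_storage = {"mysql_01.dim_member": {"name": "Alice"},
                    "hbase_rt.rt_member_stats": {"gmv_30d": 3400.0}}
    return {c: fake_storage[subquery["table"]][c] for c in subquery["columns"]}

def query(dsl):
    logical = parse_dsl(dsl)
    subqueries = build_physical_queries(logical)
    merged = {}
    for sq in subqueries:          # in a real system these would run in parallel
        merged.update(execute(sq))
    return merged                  # result merging

print(query({"select": ["name", "gmv_30d"], "from": "member", "where": {"member_id": 101}}))
```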
4.1.3.2 iPush

The iPush application is a middleware platform that takes messages from different sources such as TT and MetaQ and, according to customized filtering rules, pushes them to Web, wireless, and other terminals. The iPush core server is built on Netty 4, a high-performance asynchronous event-driven network framework, and uses Guava caches to store local registration information. Communication between Filter and Server is implemented with asynchronous Thrift calls for efficient service invocation, messages are queued through Disruptor, a high-performance asynchronous processing framework (arguably among the fastest messaging frameworks), ZooKeeper monitors server status in real time while the server runs, and Diamond serves as the unified control and trigger center.

4.1.3.3 Lego


Lego is designed as a service container that supports moderately to highly customized data query requirements through a plug-in mechanism. It only provides a set of infrastructure such as logging, service registration, Diamond configuration monitoring, authentication, and data source management; the concrete data services are provided by service plug-ins. Based on the Lego plug-in framework, personalized requirements can be implemented and released online quickly.

Lego is implemented with a lightweight Node.js technology stack, which suits high-concurrency, low-latency, IO-intensive scenarios. It currently supports online services such as user identification, user ID encoding, user portraits, crowd perspective, and crowd selection, and the underlying storage is chosen among Tair, HBase, and ADS according to the characteristics of each requirement.
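A toy illustration of the plug-in idea (in Python rather than Node.js, with invented plug-in names): the container only provides registration and dispatch, each personalized requirement ships as its own plug-in, and a failure in one plug-in does not take the others down. Real Lego isolates plug-ins with Docker; here the isolation is only a try/except.

```python
# Minimal plug-in container: registration, dispatch, and per-plug-in error isolation.
PLUGINS = {}

def register(name):
    def decorator(fn):
        PLUGINS[name] = fn
        return fn
    return decorator

@register("user_profile")
def user_profile(params):
    return {"uid": params["uid"], "tags": ["new_customer"]}   # invented logic

@register("crowd_selection")
def crowd_selection(params):
    raise RuntimeError("backing store unavailable")           # simulate one plug-in failing

def serve(name, params):
    try:
        return {"ok": True, "data": PLUGINS[name](params)}
    except Exception as exc:                                  # contain the failure
        return {"ok": False, "error": str(exc)}

print(serve("user_profile", {"uid": 101}))
print(serve("crowd_selection", {"rule": "gmv_30d > 1000"}))
```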

4.1.3.4 uTiming

uTiming is a cloud-based task scheduling application that provides batch data processing services. uTiming-scheduler schedules offline tasks that execute SQL or specific configurations, but it does not expose the task scheduling interface to users directly; users create tasks through the data supermarket tools or the Lego API.
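A small sketch of the two task modes (immediate versus scheduled) using only the Python standard library; the task payload is just a SQL string, and none of this reflects uTiming's real interface.

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def run_task(sql):
    print(f"[{time.strftime('%H:%M:%S')}] executing batch task: {sql}")

def submit(sql, run_at=None):
    """Immediate mode runs the task right away; timed mode schedules it for a later timestamp."""
    if run_at is None:
        run_task(sql)
    else:
        scheduler.enterabs(run_at, 1, run_task, argument=(sql,))

submit("INSERT OVERWRITE ads_report SELECT ...")          # immediate task
submit("INSERT OVERWRITE ads_report_daily SELECT ...",    # timed task, two seconds from now
       run_at=time.time() + 2)
scheduler.run()
```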

4.1.4 Data Management

Faced with explosive data growth, how to build efficient data models and systems, organize and store the data in an orderly and structured way, avoid duplicated construction and data inconsistency, and ensure data standardization has always been a direction that big data system construction keeps pursuing.

OneData is Alibaba's internal method system and tooling for data integration and management. Under this system, Alibaba's big data engineers build a unified, standardized, and shareable global data system, avoid data redundancy and duplicated construction, avoid data chimneys and inconsistency, and give full play to the unique advantages of Alibaba's massive and diverse big data. With the help of this unified method system for data integration and management, Alibaba built its public data layer, and the same approach can help similar big data projects land quickly. For reasons of space, the following focuses on OneData's model design.

4.1.4.1 Guiding theory

The design of Alibaba Group's public data layer follows the dimensional modeling idea; see Star Schema: The Complete Reference and The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. The dimensional design of the data model is mainly based on dimensional modeling theory and the dimensional data model bus architecture, building conformed dimensions and facts.

4.1.4.2 Model hierarchy

Alibaba's data team divides the table data model into three layers:

  • Operational Data Layer (ODS)
  • Common dimensional model layer (CDM): including detailed data layer (DWD) and summary data layer (DWS)
  • Application Data Layer (ADS)

Operational Data Store (ODS): stores data from the operational systems in the data warehouse system almost without processing.

  • Synchronization: structured data is synchronized to MaxCompute incrementally or in full.
  • Structuring: unstructured data (logs) is structured and stored in MaxCompute.
  • History accumulation and cleansing: historical data is retained and data is cleansed according to business requirements as well as auditing and inspection requirements.

Common Data Model layer (CDM): stores detailed fact data, dimension table data, and common summarized indicator data. Detailed fact data and dimension table data are generally produced by processing ODS-layer data; common summarized indicators are generally produced from dimension tables and detailed fact data.

The CDM layer is further subdivided into the Data Warehouse Detail (DWD) layer and the Data Warehouse Summary (DWS) layer. It takes the dimensional modeling method as its theoretical basis and makes heavy use of dimension degeneration, degenerating dimensions into the fact table to reduce joins between fact tables and dimension tables and improve the usability of the detail tables. At the same time, in the summary data layer, indicator dimensions are further degenerated and wide tables are used more to build the public indicator data layer. Its main functions are as follows:

  • Combine related and similar data: use detailed wide tables and reuse associated computation to reduce data scanning.
  • Process public indicators uniformly: based on the OneData system, build statistical indicators with standardized naming, consistent calibers, and unified algorithms, providing public indicators for upper-layer data products, applications, and services; establish logical summary wide tables.
  • Establish conformed dimensions: build consistent data-analysis dimension tables to reduce the risk of inconsistent calculation calibers and algorithms.

Application Data Store (ADS): stores the personalized statistical indicator data of data products, generated from the CDM layer and the ODS layer.

  • Personalized indicator processing: non-public, complex indicators (index-type, ratio-type, ranking-type indicators).
  • Application-oriented data assembly: wide-table marts, horizontal-to-vertical table conversion, trend indicator series, and so on.

Data services should preferentially use data from the common dimensional model layer (CDM). When the common layer does not have the data, evaluate whether common-layer data needs to be built; when there is no need to build a common layer, the operational data layer (ODS) can be used directly. The application data layer (ADS), as product-specific personalized data, generally does not provide data services externally, but ADS as the served party still needs to abide by this agreement.
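To make the layering tangible, here is a tiny end-to-end sketch with made-up order data: ODS holds the raw records almost as synchronized, DWD is the cleaned detail, DWS is a reusable daily summary wide table, and ADS is a product-specific indicator derived from DWS.

```python
from collections import defaultdict

# ODS: raw order records, synchronized nearly as-is from the business database.
ods_orders = [
    {"order_id": 1, "buyer_id": 101, "amount": "30.0", "dt": "2021-06-06", "status": "PAID"},
    {"order_id": 2, "buyer_id": 102, "amount": "20.0", "dt": "2021-06-06", "status": "CANCELLED"},
    {"order_id": 3, "buyer_id": 101, "amount": "50.0", "dt": "2021-06-06", "status": "PAID"},
]

# DWD: cleaned detail facts; types fixed, cancelled orders filtered out.
dwd_order_detail = [
    {"order_id": o["order_id"], "buyer_id": o["buyer_id"],
     "amount": float(o["amount"]), "dt": o["dt"]}
    for o in ods_orders if o["status"] == "PAID"
]

# DWS: public summary wide table at (dt, buyer_id) granularity, reusable by many applications.
dws_buyer_1d = defaultdict(lambda: {"order_cnt": 0, "gmv": 0.0})
for row in dwd_order_detail:
    key = (row["dt"], row["buyer_id"])
    dws_buyer_1d[key]["order_cnt"] += 1
    dws_buyer_1d[key]["gmv"] += row["amount"]

# ADS: a product-specific indicator derived from the public summary layer.
ads_daily_gmv = sum(v["gmv"] for v in dws_buyer_1d.values())
print(dict(dws_buyer_1d))
print("daily GMV:", ads_daily_gmv)
```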

4.1.4.3 Basic principles
  1. High cohesion and low coupling: which records and fields a logical or physical model contains should follow the most basic software design principles of high cohesion and low coupling. Consider this mainly from the two angles of business characteristics and access characteristics: design data with similar or related business and the same granularity into one logical or physical model; store together data that is likely to be accessed at the same time, and store separately data that is rarely accessed together.
  2. Separate the core model from the extended model: establish a core model and an extended model system. Fields in the core model support the commonly used core business; fields in the extended model support personalized needs or a small number of applications. Fields of the extended model must not intrude excessively into the core model, so as not to damage the core model's simplicity and maintainability.
  3. Sink and unify common processing logic: the lower-level and more common a piece of processing logic is, the more it should be encapsulated and implemented at the lower layers of the data scheduling dependency. Do not expose common processing logic to the application layer, and do not allow multiple copies of the same common logic to exist at the same time.
  4. Balance cost and performance: appropriate data redundancy may be traded for query or refresh performance, but excessive redundancy and data replication should be avoided.
  5. Data can be rolled back: with the processing logic unchanged, running a job multiple times at different times must produce the same, deterministic result (see the sketch after this list).
  6. Consistency: fields with the same meaning must be named identically in different tables, using the names defined in the specification.
  7. Clear and understandable naming: table names should be clear and consistent, making them easy for consumers to understand and use.
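Principle 5 ("data can be rolled back") is easiest to see in code. The sketch below rebuilds a date partition by recomputing it from the source and overwriting it, so re-running the job for the same date any number of times yields the same result; appending instead would not be rerunnable. Table and field names are invented.

```python
# A tiny in-memory "warehouse": table -> partition (dt) -> aggregated rows.
warehouse = {"dws_buyer_1d": {}}

source_rows = [
    {"buyer_id": 101, "amount": 30.0, "dt": "2021-06-06"},
    {"buyer_id": 101, "amount": 50.0, "dt": "2021-06-06"},
]

def rebuild_partition(dt):
    """Idempotent: always recompute the partition from source and overwrite it (INSERT OVERWRITE style)."""
    agg = {}
    for r in (r for r in source_rows if r["dt"] == dt):
        agg[r["buyer_id"]] = agg.get(r["buyer_id"], 0.0) + r["amount"]
    warehouse["dws_buyer_1d"][dt] = agg          # overwrite, never append

rebuild_partition("2021-06-06")
first = dict(warehouse["dws_buyer_1d"]["2021-06-06"])
rebuild_partition("2021-06-06")                  # re-running produces exactly the same data
assert warehouse["dws_buyer_1d"]["2021-06-06"] == first
print(warehouse)
```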

5. A converged foundation: integrating the lake and the warehouse

Abroad this is called the Lake House, and it too is an architectural guideline. The main goal is to connect the data in the lake and the warehouse so that it can flow freely: important data in the data lake can flow into the data warehouse and be used by it directly, while less important data in the data warehouse can be exported to the data lake for low-cost long-term storage and future data mining.
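As a rough sketch of the two directions of flow (paths and table names are invented, and local directories stand in for warehouse and object storage): cold warehouse partitions are exported to cheap lake storage, and selected lake data is registered so the warehouse can query it without copying.

```python
import shutil
from pathlib import Path

WAREHOUSE = Path("./warehouse/dws_buyer_1d")   # expensive, query-optimized storage (stand-in)
LAKE = Path("./lake/archive/dws_buyer_1d")     # cheap object storage (stand-in)
external_tables = {}                           # warehouse catalog entries pointing at lake data

def archive_to_lake(partition):
    """Warehouse -> lake: move a cold partition to low-cost storage, keep only a pointer."""
    src, dst = WAREHOUSE / partition, LAKE / partition
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    external_tables[f"dws_buyer_1d/{partition}"] = str(dst)

def register_lake_data(name, lake_path):
    """Lake -> warehouse: expose important lake data to the warehouse without copying it."""
    external_tables[name] = str(lake_path)

# Example usage (creates the warehouse partition directory so the move succeeds):
(WAREHOUSE / "dt=2016-06-06").mkdir(parents=True, exist_ok=True)
archive_to_lake("dt=2016-06-06")
register_lake_data("raw_app_log", LAKE.parent / "raw/app_log")
print(external_tables)
```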

Data flow was mentioned above; this involves data entering the lake, leaving the lake, and the services around the lake.


The more data you accumulate, the more troublesome it becomes to move it; this is known as data gravity. To address this problem, AWS proposed the smart lake house.

I think this has something in common with the data middle platform. Interested readers can take a look at the references below.

6. References

