【clickhouse column】The difference and connection between database and data warehouse

Starting from this article, the author intends to write a series of "clickhouse column", its full name is Click Stream, Data WareHouse , referred to as ClickHouse. From the "Data WareHouse" in its full name, we can see that clickhouse is positioned as a data warehouse. So what's the difference between a "data warehouse" and a "database"? It's important to understand this. By understanding the difference between the two, you can properly use clickhouse for its appropriate use case.

1. OLTP and OLAP

Before understanding the difference between "data warehouse" and "database", we need to explain two terms, namely: OLTP and OLAP.

OLTP (on-line transaction processing): usually refers to a relational database oriented to traditional application services. Users operate the real-time "add, delete, modify and query" data in the database through the web interface. Including the core basic transaction processing logic, users have high performance requirements. After the user clicks on the interface, the minimum response time is required to be within 5 seconds (usually within 3 seconds), and it needs to support a relatively high degree of user concurrency. OLTP data operations are usually oriented to one or several small pieces of data, such as a small amount of data such as a user placing an order to operate the user's shopping cart, payment records, and points records.
OLAP (On-Line Analytical Processing): The application is mainly to perform complex data analysis operations, focus on decision support, and display intuitive and easy-to-move data analysis results through graphical reports. The requirements for response time are relatively loose, and the data analysis process usually does not support high user concurrency, but the data analysis results support high user concurrent access. OLAP is usually oriented to batch data operations. Data is imported and analyzed in batches. OLAP systems are usually used in conjunction with ETL (extract, transform, and load) systems.

Understand the above two data, the rest is much simpler, the database is usually oriented to OLTP operations, and the data warehouse is usually oriented to OLAP operations. OLTP focuses on preserving and changing the current state of data, while data warehouse focuses on preserving historical archives of data. For example: user bank transfer, OLTP database focuses on managing the remaining amount in the user's current account, and the data consistency of the amount in the other party's account during the transfer process; while the OLAP data warehouse focuses on recording who made the transfer, how much money was transferred, and the money transferred. where did it go. When did the user accustomed to transfer money in history, at the beginning of the month or at the end of the month? How many times do you transfer money a month?

Second, the characteristics of data warehouse

The following are some typical characteristics of a data warehouse:

Focus on the process of recording data changes, not the current state of the data.
Read more and write less
large wide table
Data batch operations, no or very few updates
Transactions are not supported

Some friends with relatively little work experience will read these articles and say, "What are the characteristics of this, these are all shortcomings!". Not updating or rarely updating, reading more and writing less are all scenario limitations, large-width tables destroy the database design paradigm, and what kind of database is it that does not support transactions? In fact, in the OLAP scenario, these are precisely the characteristics of its special design to ensure the performance of data analysis. I will give you a few examples:

For example, a cloud vendor collects server operating indicators on a periodic basis, such as memory usage, CPU usage, and so on. These indicators are collected in batches and stored in batches, and once stored, they will not be modified. Usually, a table is not created for memory indicators and a table for CPU usage, but a table is created for servers in the same computer room. This table contains various indicators in the time dimension. For example, when querying a server with memory usage > 80 and CPU usage > 70, there will be no associated query between two tables, and only one wide table will be queried, and the performance of data analysis will be improved by leaps and bounds. Transactions are not supported, usually OLAP systems do not support transactions, because transactions will affect the performance of data operations to a certain extent. After the data is stored in the database, it is necessary to continuously analyze and mine these indicators, that is, read more and write less, and basically write in batches once, followed by data read operations.
Another example: real-time stock trading data, focusing on the process of recording data changes, rather than the current state of the data. Once all historical data of all stocks is entered into the data warehouse, it will not be modified. Can carry out quantitative trading analysis of stocks.
Another example: user product click data, user operation behavior data, user web page browsing time data, etc. These data are all data required for user analysis and will not be modified once stored. It can analyze the user's buying and selling willingness behavior.

In fact, there are many data of this type. The characteristics of this kind of data are: the amount of data is large and will not change after it is generated (the data on that time scale will not change). Therefore, data warehouses are usually oriented to the data scenario where historical data with high throughput is archived and no update and deletion operations are performed. After data archiving, it is usually only oriented to data query analysis.

Third, the combination of database and data warehouse

Usually a larger application service system has both a database and a data warehouse. The database performs online transaction processing for users and handles real-time operations of the user interface. The data of the data warehouse is oriented to the decision-making management layer, providing data and graphical reports, and providing a variety of data analysis and decision-making.

The above figure is a typical application service scenario where the database and data warehouse coexist

Internet users generate user behavior through application services, and perform OLTP operations on the database
The application service sends the behavior of the user's operation to the message queue, and the message queue imports the data into the data warehouse
The data of the database can be extracted, processed, transformed and integrated into the data warehouse through ETL
The decision analysis system is mainly for data analysis of data warehouses, the data analysis results can be fed back to the database, and the ability to view data analysis results is provided to Internet users through application services
The decision analysis system also provides data analysis and decision support capabilities for decision managers of application services.

【clickhouse column】The difference and connection between database and data warehouse

1. OLTP and OLAP

Second, the characteristics of data warehouse

Third, the combination of database and data warehouse

Recommended reading

字母哥博客

引用和评论

高频时序数据的储存与统计方案

Java8的新特性

Java11的新特性

Java5的新特性

Java9的新特性

Java13的新特性

Java7的新特性