Real-Time Data Warehouse Introductory Training Camp: Real-Time Data Warehouses Power Internet Real-Time Decision-Making and Precision Marketing
Introduction: The "Real-Time Data Warehouse Introductory Training Camp" was put together by Alibaba Cloud researcher Wang Feng, Alibaba Cloud senior product expert Liu Yiming, and other front-line technology and product experts on Realtime Compute for Apache Flink and Hologres. The team built the curriculum and carefully polished the course content to address the pain points learners actually run into, analyzing the architecture, scenarios, and practical applications of real-time data warehouses step by step. Seven high-quality lessons will help you grow from a beginner to an expert in five days!
This article is compiled from the live broadcast "Real-Time Data Warehouses Power Real-Time Decision-Making and Precision Marketing", presented by Oneness.
Video link: https://developer.aliyun.com/learning/course/807/detail/13886
In recent years, the real-time data warehouse has been a hot topic. Its main application scenarios are real-time decision-making and marketing, fields that place high demands on both the precision and the freshness of data analysis and that sit at the forefront of the technology.
Online business and refined operations are pushing data warehouses toward real-time, interactive analysis
Let's take a look at some trends in the development of data analysis in the past few years, as shown in the figure above.
The basic trend of data analysis has been an evolution from batch processing toward interactive analysis and stream computing. Ten years ago, big data was mostly about scale: processing massive volumes of data with parallel computing technology. At that time we spent most of our effort on data cleaning and data model design, and did not do much analysis on top.
Today, big data teams have essentially become data analysis teams. The demands keep growing: consolidating data models, supporting interactive analysis, keeping query response latency low, and sustaining high QPS. Data analysis is no longer just "store the data first, analyze it later"; in many scenarios the computation comes first, before the data lands. A typical example is the Double 11 GMV counter, which reports how much has been transacted within the first however-many seconds. That number is not produced by storing the transactions and then computing a total afterwards; the result must be computed in real time, as each transaction occurs.
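The idea of computing results as transactions occur can be sketched in a few lines. This is a minimal, hypothetical illustration (the event fields are invented), not a Flink program: the running GMV is updated per event rather than computed by scanning stored transactions afterwards.

```python
# A minimal sketch of event-driven aggregation: the running GMV is
# updated as each order event arrives, Flink-style, rather than
# computed by scanning all stored transactions after the fact.

def running_gmv(order_events):
    """Yield the cumulative GMV after every order event."""
    total = 0.0
    for event in order_events:
        total += event["amount"]  # state is updated per event
        yield total

events = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 80.0},
    {"order_id": 3, "amount": 50.0},
]
print(list(running_gmv(events)))  # [120.0, 200.0, 250.0]
```

A real stream processor adds distribution, fault-tolerant state, and event-time handling on top of this pattern, but the core is the same: the aggregate exists at every moment, not only after a batch job runs.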
Interactive analysis and stream computing have therefore developed almost in parallel, and these two new scenarios place very different requirements on the technology underneath. Batch processing is mostly about parallelism, while interactive analysis has brought in a lot of pre-computation, in-memory computing, and indexing techniques, which has pushed the whole big data stack to evolve.
In summary, data analysis now supports more and more online business. Every time we unlock our phones, the products recommended on the screen and the advertisements displayed must be returned within a few milliseconds, and they rely on data-driven, intelligent recommendation; without it, click-through and conversion rates would be very low.
Our data services are therefore supporting more and more online scenarios, which place very high requirements on query latency, QPS, and precision. This, too, pushes data warehouses to evolve in a real-time, interactive direction.
Alibaba's typical real-time data warehouse scenario
Data warehouses have many usage scenarios inside Alibaba, such as the real-time GMV big screen on Double 11. GMV itself is just a single concluding number; for data analysts, the work is only beginning at that point. We need to drill down: which products, which channels, which audiences, and which promotion methods produced this conversion effect, and which did not, and so on. These analyses are very fine-grained, the output of refined operations.
After the analysis, we label all the products and people. Through these labels we can guide online applications in recommendation, analysis, audience selection, and so on, which is why there is so much data middle-platform business built on top.
There is also a class of monitoring-oriented workloads: sudden jitters or spikes in orders, fluctuations in network quality, and the monitoring of live video streams are all typical application scenarios for a real-time data warehouse.
The "complexity" of big data real-time data warehouse system
We have consulted with many companies on building real-time data warehouses, and Alibaba itself has gone down a very complicated road.
Above is an architecture diagram I drew. When I first looked at it I was quite excited; as a big data architect at the time, being able to draw that many arrows felt like an achievement. But once we actually built such a system, we found that its development efficiency and its operations and maintenance efficiency were maddening.
The system evolves from the upper-left corner. Messages are collected from message middleware and go through an offline processing pipeline; at the time we did not have much real-time technology, and the first problem to solve was scale. Offline processing turns the data into a small result set, which is stored in MySQL or Redis. This is a typical dual-system split between processing and serving.
Once the big data has been reduced to small data, it can be served to upper-layer applications: reports, recommendations, and so on. Later, data analysis needed to become real-time, because T+1 offline processing alone could no longer meet demand: the fresher the data, the more context it carries and the more value it has. This is where compute-first technologies such as Flink came in. Flink consumes events directly from Kafka and computes in an event-driven way. Because Flink is distributed, it scales very well, and with pre-computation and event-driven processing, end-to-end latency can be pushed to the extreme.
The result set is then stored in some medium. What Flink produces is a very compact structure, generally a key-value (KV) structure, placed on systems such as HBase or Cassandra. Such systems serve big screens and reports the fastest. For the Double 11 big screen, you cannot wait until tens of millions or hundreds of millions of transaction records have accumulated and then run statistics over them; query performance would never keep up. So from the start, we pre-process the data into a result set at the granularity of one record per channel per second. When the big screen analyzes this set, a huge dataset has already been reduced to a very small one, and performance is pushed to the extreme.
At this point we have both scale and speed, and on the surface both needs seem to be met, but the cost is considerable. To compute fast enough you must pre-compute, and pre-computation reduces the flexibility of the data model, because all the data has been aggregated by Flink into a fixed result set.
For example, if some business logic was not defined at the beginning, say the initial job aggregates over three dimensions, and you later want to analyze along a fourth dimension, the analysis simply cannot be performed because it was never computed in advance. Flexibility is sacrificed here.
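The flexibility cost is easy to see in a toy sketch. The data and dimensions below are hypothetical: once events are rolled up by three dimensions, a later question about a fourth dimension can no longer be answered from the result set.

```python
# Illustrates the flexibility cost of pre-aggregation (hypothetical data):
# events are rolled up by (channel, category, region) only.
from collections import defaultdict

raw_events = [
    {"channel": "app", "category": "shoes", "region": "east", "device": "ios", "amount": 10.0},
    {"channel": "app", "category": "shoes", "region": "east", "device": "android", "amount": 20.0},
    {"channel": "web", "category": "bags", "region": "west", "device": "pc", "amount": 30.0},
]

# Pre-computed result set over three dimensions.
agg = defaultdict(float)
for e in raw_events:
    agg[(e["channel"], e["category"], e["region"])] += e["amount"]

print(agg[("app", "shoes", "east")])  # 30.0 -- the two devices are merged
# "device" was dropped at aggregation time, so a per-device breakdown
# can only be obtained by reprocessing the raw events end to end.
```

This is exactly the trade-off described above: the aggregate answers the pre-defined questions quickly, but any new dimension requires re-running the whole pipeline over the raw data.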
This is where technologies that are more flexible than HBase and more real-time than Hive come in, such as ClickHouse and Druid. Data can be written in real time, and once written, a certain degree of interactive and self-service analysis is available.
We then find that the same piece of data has been scattered across three different media: an offline processing medium, a near-real-time medium, and a fully real-time medium. Three media often mean three teams to maintain them. As people on those teams change over time, the processing logic inevitably drifts, and eventually the same indicator computed through the three different channels yields inconsistent results. This happens in almost every company.
That is only the surface problem; it is even more painful for the analysts, who need to use different data, access different systems, learn different interfaces, and sometimes navigate different access-control mechanisms. This is very inconvenient. Many companies therefore end up building a so-called data middle platform to shield the differences between the underlying physical engines, typically adopting federated computing technologies such as Presto and Drill.
Federated computing has a development history of more than twenty years, from early data information system integration to data virtualization. The idea is to move computation rather than data, which sounds great, but in production you find that the query experience of such a system is unpredictable. You cannot tell whether a query will be fast or slow, because it is pushed down to different engines: pushed down to Hive it will not be fast, pushed down to ClickHouse it may be. An analyst therefore never knows what to expect. A report that used to open in 5 seconds may take 5 minutes the next time, and the analyst cannot tell whether 5 minutes is normal or abnormal: running on Hive it is normal, running on Drill it is abnormal. This unpredictability makes it hard for such systems to enter production, so again we fall back to collecting the data and condensing it into a small result set for analysis.
The whole pipeline is composed of multiple components nested layer upon layer, each depending on the next. Besides the inconsistent calibers caused by different teams maintaining different pieces, the cost of maintaining the data itself becomes very high.
We often run into situations where a number on a report looks wrong, say an indicator suddenly jumps or drops sharply one day, and no one can tell whether it is a data quality problem, a processing logic problem, or a broken data synchronization link, because the pipeline is simply too long. If a field is modified at the data source, or a new field is added, every downstream link has to be re-run. When the architecture is running normally there is no problem, but as soon as there is a data quality or scheduling issue, the chain of dependencies makes the operational cost enormous.
At the same time, people who understand all of these technologies are hard to find and expensive. It often happens that a company painstakingly trains such engineers only to have them poached by larger companies. Such talent is very difficult to recruit and to develop.
The above is the current status of this architecture.
The complexity of big data today is reminiscent of the 1960s
The architecture above reminds me of the 1960s, when databases had essentially not yet been born; the relational database only emerged in the late 1970s.
How did we manage the state of data in the 1960s?
In the 1960s, every programmer had to write their own state maintenance. Some used file systems, but files alone are discrete and hard to maintain, because data is inter-related. Then came systems with a network model, where one record could point to another, but the network model is complicated to manage because of loops, nesting, and so on. There was also the hierarchical model, another way of describing data.
So programmers in the 1960s managed data state themselves, at great cost.
By the 1980s, we essentially no longer did this: we learned to keep as much state as possible in the database, and relational databases made that much easier. Although there are many relational databases, they are basically all based on the SQL interface, which drove the cost of storing, querying, and managing data down sharply.
This has gone through many changes over the past twenty years: from the big data era to NoSQL to NewSQL, technologies such as Hadoop, Spark, and Redis were born, and the data field has flourished.
But I have always felt that this implies a problem, and that the future may not continue this way. Big data today is developing vigorously but not uniformly, so I wonder whether the technology can eventually converge, solving problems in a more declarative way, instead of asking every programmer to learn different components, learn dozens of different interfaces, and configure thousands of parameters. To improve development efficiency, these technologies may need to be further consolidated and simplified.
Real-time data warehouse core requirements: timeliness
Setting the technical components aside, let's go back and look at what the business actually needs from a real-time data warehouse, and what kind of data architecture those needs call for.
What is real time? Many people think real-time analysis means that the time from when data is generated to when it can be analyzed is short enough. That is not entirely accurate. I think the real-time data warehouse involves two kinds of "real time": one is end-to-end latency; the other might be called punctuality, meaning that when we actually need to analyze the data, we can get valid data and reach a conclusion in time.
The first kind of real time is a rather technical notion, but does the business really need such low latency?
Under normal circumstances, the business does not make a decision every second. When we open a report, what matters is that moment; the data we care about may be from the past day, the last hour, five minutes ago, or one second ago. Experience tells us that if 99% of data analysis can be delivered within a 5-minute delay, it basically meets the analytical needs of the business. That holds for analysis scenarios; for online serving scenarios it is certainly not enough.
So most companies may not need that degree of real time, because the costs behind it are very different.
Once you demand low end-to-end latency, a great deal of the business logic must be pre-computed. You cannot wait until the data is stored and then query it; for the latency to be short enough, the data must be aggregated, processed, and widened in advance so that the amount of computation at analysis time is small enough.
So pursuing end-to-end latency requires a lot of pre-computation, and as mentioned earlier, if everything is computed in advance, flexibility is lost; there is a cost to this. By contrast, if you pursue a punctual system, you can defer part of the computation logic in exchange for flexible analysis scenarios.
For example, if Double 11 were only about latency, we would store just the final GMV number and be done. But that does not meet the company's requirements: the company needs detailed reports, and needs to know when things sold well and when they did not. In that case a purely pre-computed approach will not do; more detailed data must be retained so that we can do interactive analysis, correlation analysis, and exploratory analysis. The requirements behind the two kinds of systems are different.
Relatively speaking, I think what most companies actually want is a punctual system: one with the ability to defer computation, to write in real time and analyze after writing, and, even if analysis efficiency is not quite as high, to always retain the ability to analyze flexibly.
Real-time data warehouse core requirements: data quality
Data quality is a very important part of building a real-time data warehouse. As mentioned earlier, if you pursue only timeliness and not data quality, you start by pre-computing a result set. When that result set says GMV has reached 10 billion, most bosses dare not believe it, because behind the 10 billion there may be a data quality problem or a mistake in the computation logic. So the system must be able to guarantee data quality.
Data quality has two aspects: how quickly a quality problem can be discovered, and how quickly it can be corrected. The two call for different solutions.
To discover data quality problems, the state of the computation process must be persisted: we want the warehouse engine to keep both detailed and summary data stored and queryable. Then, when the boss asks why an indicator rose so much and which customer caused the rise, you can use the detailed data to trace the cause. And when you find an error during analysis, whether it can be corrected is an equally critical question.
Some systems are read-only: you can look but not change. This is a common problem with big data systems, which are very good at handling scale but particularly bad at small operations such as correcting data, because every correction happens in units of large data blocks, or, with no primary key, by replacing whole files. Updating is genuinely difficult.
Beyond finding problems, I hope a system can also have the ability to correct them.
Fixing a problem means simply updating the data's state once the quality issue is discovered: single-row updates, single-column updates, batch updates, and so on, with a very simple way to refresh data. Refreshes happen all the time, for example when the upstream data has quality problems or the processing logic was written incorrectly and everything has to be recomputed.
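The correction operations described above are just plain SQL updates when the warehouse supports them. Below is a small sketch using SQLite as a stand-in for such a warehouse; the table and columns are hypothetical.

```python
import sqlite3

# Sketch of the correction operations described above, with SQLite
# standing in for a warehouse that supports updates (table is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, channel TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "app", 100.0), (2, "web", 250.0), (3, "app", 70.0)])

# Single-row correction: one order's amount was recorded incorrectly.
conn.execute("UPDATE orders SET amount = 105.0 WHERE id = 1")

# Batch / single-column correction: an upstream bug mislabeled a channel.
conn.execute("UPDATE orders SET channel = 'mobile' WHERE channel = 'app'")

print(conn.execute(
    "SELECT channel, SUM(amount) FROM orders GROUP BY channel ORDER BY channel"
).fetchall())
# [('mobile', 175.0), ('web', 250.0)]
```

The point is not SQLite itself but the contrast with block-oriented or schema-free stores: with a primary key and an UPDATE statement, a correction touches one row or one predicate, not a whole file.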
Second, when correcting data, I hope only one system needs to be corrected.
As we just saw, once an error enters at the data source, the data circulates repeatedly through four or five downstream links, which means a single error requires corrections in four or five places. The data is full of redundancy and inconsistency, and every link must be fixed; that is extremely complicated. We therefore want the warehouse state to be stored in one place as far as possible, so that a correction happens in one place only.
Real-time data warehouse core requirements: cost optimization
Cost breaks down into three parts: development cost, operations and maintenance cost, and personnel cost.
Development cost determines how quickly a business requirement can go online. Does the same task take a team or one person, a week or a day? Development cost is something many companies care deeply about; if requirements cannot be developed in time, many business needs end up suppressed.
We hope IT engineers no longer have to exhaust themselves responding to the business team's requests. It often happens that by the time the data is finally processed, the business reports that it was processed incorrectly; by the time IT has corrected it, the business says the campaign is over and the data they wanted no longer matters.
A good data warehouse system must therefore decouple technology from business: the technical team guarantees the stable and reliable operation of the data platform, while the business team accesses data in a self-service way, producing the reports they care about by drag-and-drop. That is what we consider a good system. Only then is the development threshold low enough for more people to retrieve data themselves, making data assets reusable and business development self-service.
At the same time, to speed up development, the pipeline must be short enough.
As in the architecture diagram we saw at the beginning, if there are four or five links in the middle, each must be configured, scheduled, and monitored, and any error in any of them must be chased down, so development efficiency will inevitably suffer. A short enough pipeline with less data movement makes development much more efficient. So to improve development efficiency, we want technology decoupled from business and a sufficiently short data link.
Operations and maintenance cost translates simply: in the past there were too many clusters, and too much money was spent.
We had to run four or five clusters and duplicate scheduling and monitoring across them. So if we get the chance to choose a new data warehouse in the future, we should consider how to reduce cost: fewer clusters and less operational effort providing more capability, including elasticity. At month end or during promotions, when the company's demand for computation and analysis spikes, the system should scale out and back in elastically to adapt to changing load. That ability saves the company real operations cost.
The third cost is personnel cost, including recruiting cost and learning cost.
Big data is a complex stack. Anyone who has run Hadoop knows it: thousands of parameters, more than a dozen interdependent components, and any node going down may cause problems throughout the rest of the system.
The learning and operating costs are both high. The database example above is a good template: to reduce development cost, use a declarative language, namely SQL. SQL drastically lowers the development threshold. Most students have already taken database courses as undergraduates, so working in SQL is far more efficient than learning each system's APIs and SDKs. Talent becomes easier to find in the market, and the company can shift its energy from operating the underlying platform to mining value from the data above it.
Standard SQL also makes it easier to integrate with other systems, whether development tools or BI and visualization tools.
The above are some technical requirements derived from business requirements.
The first generation of real-time data warehouse: database stage
Next, let's look at whether past generations of real-time data warehouse technology can meet the needs above.
The first generation of real-time data warehouse technology can be called the database stage; it is a typical Lambda architecture.
In this architecture, each business requirement gets a data pipeline, and the pipeline is split into two parts, real-time and offline. Data is collected into message middleware; one part is processed in real time into a result set stored in MySQL/HBase, while the other part is processed offline through Hive/Flink and likewise stored in MySQL/HBase. Big data is turned into small data, which is then served to applications.
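The dual pipeline can be sketched in a few lines. This is a toy illustration with invented events, not a real Lambda deployment: the same metric is computed once incrementally (streaming path) and once by full recomputation (batch path), and the architecture's classic pain point is keeping the two in agreement.

```python
# Sketch of the Lambda pattern: the same events flow through a streaming
# path (incremental state) and a batch path (periodic full recompute).
events = [{"amount": a} for a in (10.0, 20.0, 30.0)]

# Streaming path: maintain the aggregate incrementally, event by event.
stream_total = 0.0
for e in events:
    stream_total += e["amount"]

# Batch path: recompute from the full dataset.
batch_total = sum(e["amount"] for e in events)

print(stream_total == batch_total)  # True -- but only while both paths
# implement exactly the same logic; any drift between the two codebases
# yields the inconsistent reports described below.
```

In practice the two paths are written against different engines by different teams, which is exactly why the redundancy and inconsistency discussed next arise.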
Many people have analyzed the problems with this architecture: the two pipelines hold redundant data that drifts into inconsistency. That much is straightforward; the more important problem here is chimney-style, siloed development.
When the business side raises a new requirement, a programmer has to figure out which data source the data comes from, which third-party sources to join against, and what aggregation and processing is needed, and then generates a result set. In the end we find that among the hundreds of reports generated this way, 80% overlap heavily with each other. One business unit looks at three indicators, another looks at two of them, perhaps differing by a single field. The raw data is the same, but because one report has one more statistical field than another, we re-develop end to end, which seriously reduces development efficiency. Operations suffer too: thousands of reports have been built, nobody knows which are still used, and nobody dares retire them.
In addition, whenever any data source adds or removes a field, everything must be adjusted, operated, and modified, which is almost a disaster. With a single developer it is not a big problem, but we have seen many teams of four or five engineers writing scripts every day; some leave, some join, and in the end nobody dares delete a former colleague's code. Eventually there are thousands of scheduled scripts and errors to chase every day, sometimes script errors, sometimes data quality errors, making operations extremely expensive.
Chimney-style development is quick to start with but unsustainable to operate. Our solution to the chimney problem is to consolidate the shared parts, which brings us to the second stage of the real-time data warehouse: the traditional data warehouse stage.
The second generation of real-time data warehouse: traditional data warehouse stage
The data warehouse is a very good concept: precipitate the computed indicators that can be reused. A warehouse therefore has layers, including DWD, DWS, and ADS. Through layering, the shared parts sink downward and the differing parts move upward, reducing repeated construction. This is the basic methodology that data warehousing has distilled over decades.
In this model, Kafka drives Flink, and Flink performs dimension-table joins during computation: it joins against a key-value system to widen the records. The widened result is written back into another Kafka topic, then aggregated and summarized a second time to produce the DWS or ADS layers, and the final results are stored in an OLAP/HBase system.
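The layered flow above can be sketched in miniature. This toy example uses invented table names and fields: detail records are widened with a dimension table (the DWD layer), then rolled up into a reusable summary (the DWS layer).

```python
# Toy sketch of warehouse layering: widen detail records with a
# dimension table (DWD), then roll them up into a summary layer (DWS).
dim_product = {"p1": {"category": "shoes"}, "p2": {"category": "bags"}}

# DWD: widened detail layer, one row per event, joined to the dimension.
dwd = [
    {"product_id": pid, "amount": amt, **dim_product[pid]}
    for pid, amt in [("p1", 10.0), ("p1", 15.0), ("p2", 30.0)]
]

# DWS: reusable aggregate by category, shared by downstream reports.
dws = {}
for row in dwd:
    dws[row["category"]] = dws.get(row["category"], 0.0) + row["amount"]

print(dws)  # {'shoes': 25.0, 'bags': 30.0}
```

In the real architecture each arrow between layers is a Flink job reading one Kafka topic and writing another; the sketch only shows why the shared DWS aggregate spares each report from re-joining and re-aggregating the raw events.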
Why is part of the result stored in an OLAP system and another part in HBase?
An OLAP system has a table structure very well suited to warehouse layering, but it cannot support online applications. Online applications need tens of thousands of queries per second; the query pattern is simple, but the QPS requirement is very high, and most warehouse systems cannot meet it. So part of the data has to be stored in HBase to provide millisecond-level response.
This system solves the earlier chimney problem, but data redundancy remains across the OLAP and HBase systems: the same data and the same logic are duplicated between the warehouse and the key-value store. Business teams choose according to their SLAs: latency-sensitive results go into HBase; latency-tolerant workloads that need query flexibility go into OLAP.
On the other hand, HBase is still not developer-friendly, because it is a key-value system with a relatively simple structure: every query must go through the key, and the value part has no schema. When a business unit finds a data quality problem, there is no simple way to inspect the value of a given row or column, let alone update it ad hoc. These are the limitations of a schema-free system, and metadata management is also inconvenient.
The third generation of real-time data warehouse: analysis service integration stage
The third stage of the real-time data warehouse is the analysis/serving unification stage. Alibaba has realized this stage internally, while most companies outside are still exploring the road toward it.
This stage differs from the previous one in two ways. First, the serving side is unified: OLAP queries and point lookups are served by one system, which reduces data fragmentation, reduces interface inconsistency, and removes the shuttling of the same data back and forth between systems. Unified storage makes inspecting and correcting data state much easier, and unifying the interfaces into one system also unifies security, access control, and application development, with further optimizations on the server side.
Second, the processing link is also streamlined: Kafka is gone from the middle. Kafka is now optional, because some systems have event-driven capabilities of their own. Hologres, for example, has built-in Binlog event-driven capability, so it can serve in Kafka's place: DWD, DWS, and ADS can be rolled up layer by layer, in real time, through Binlog and Flink. This is a very nice capability.
With such an architecture, the only components are Flink and Hologres, with no other external systems in between, and the pipeline can still be driven in real time. The data is not split; everything is stored in the database.
The key benefits: the development link is shorter, which cuts debugging cost; the detailed state of the data is preserved (something HBase struggles with), so when the details live in the serving system, the cost of finding and repairing data errors becomes very low; and with fewer components, operations and maintenance costs fall accordingly.
Alibaba double 11 real-time business practice
Alibaba's Double 11 runs in exactly this way, and the throughput in the Double 11 scenario is about the largest in the world.
Data enters the message middleware through real-time channels. It is usually clickstream data first, and Flink then performs dimension-table widening: a click record contains IDs, such as which SKU was clicked, and widening translates it into which product, which category, which audience, and which channel. The widened records are stored in Hologres, which plays the role of the real-time data warehouse. Another part of the data is periodically synchronized into the offline data warehouse, MaxCompute, which plays the role of the historical, global dataset. We often compute changes in consumer behavior over a long period; timeliness matters less for that data, but the volume is very large, so it runs on MaxCompute.
During analysis, Hologres and MaxCompute are linked together through federated queries, so there is no need to put data in one place, and users can still reduce data redundancy by doing federated queries in real time and offline.
Apache Flink: the de facto industry standard for real-time computing
Apache Flink has become the de facto industry standard for real-time computing and is used across industries, with Alibaba leading the way. Successful open-source technology usually has a strong commercial company behind it, and the company behind Flink is Alibaba.
The real-time computing part of the stack is straightforward: it is Flink. Choosing the warehouse system is harder, because there are so many options. They fall into three categories: transaction systems, analysis systems, and serving systems.
The transaction system is where data is generated. It consists of many business systems, and it is unwise to analyze directly on it: analytical load can threaten the stability of the online system, and performance cannot be guaranteed, because these systems are optimized for random reads and writes and are basically row-oriented.
For statistical analysis scenarios the IO overhead is very large, and DBAs mainly tune these systems for online stability. So the first thing we do is separate the transaction system from the analysis system and synchronize the data across.
The analysis system is heavily optimized for analytics. Techniques such as column storage, distribution, pre-aggregation, and a semantic layer simplify and enrich the data model and improve analysis performance. This is the analysis-oriented system.
The third kind is the serving system. Historically these were NoSQL systems such as HBase, Cassandra, and Redis. I consider this a form of analysis too, just a very simple one: data is accessed by primary key. These systems are data-driven, support online applications, and offer high throughput and strong update capability.
After data is generated at the source, it generally has two destinations: the analysis system, or the serving system that supports online applications. Data thus becomes fragmented across systems, which means data movement, data relocation, data dependencies, and inconsistent interfaces. That is when we started looking for ways to innovate.
The first innovation sits at the boundary between the transaction system and the analysis system: can one system handle both transactions and analysis? The idea is appealing, and it does fit some scenarios.
However, such a system has limitations. First, it must guarantee transactions, and transactions are very expensive in terms of computing power. Most analysis systems do not support transactions, because transactions bring many locks, and distributed locks make the overhead even greater. Such a system gains one capability but sacrifices part of its performance, so it is not suitable for high-concurrency scenarios.
The second limitation is the data model. A TP workload may produce hundreds of tables, but analysts do not want to look at hundreds of tables; they prefer those tables aggregated into a few large fact and dimension tables, which better matches the semantic layer needed for analysis.
In summary, it is hard for one data model to serve both the TP side and the AP side, so I think HTAP systems have fairly significant limitations.
The other innovation is at the other end: can one system support analysis scenarios with flexible queries, and at the same time support online serving scenarios with high concurrency, fast queries, and data updates? This, too, is a very important scenario.
If one system can support both, the cost of data migration and data development drops substantially.
These two workloads in fact have a lot in common: the data is essentially read-only, there is almost no need for locks, and all data needs to be processed and abstracted. The overlap between the analysis system and the serving system defined this way is very large, which is where the opportunity for innovation lies.
Common real-time data warehouse architecture selection reference
Next, let's look at how Flink combines with other common warehouse technologies on the market to build real-time data warehouses, comparing them along several dimensions.
Take MySQL: it offers a friendly interface, is convenient to develop against, and supports many functions, but its scalability and data flexibility are much weaker. Flink plus MySQL simply turns the data into a result set for consumption; a lot of intermediate state is lost, the data has to be moved around, and the cost of correcting it is very high.
HBase, by contrast, scales very well and handles large data volumes, but its data refresh capability is relatively weak: updating an arbitrary row or column at any time is inconvenient. Its model flexibility is also limited, because it is essentially a KV system, and any query that is not by key is awkward.
Then there is ClickHouse, which has become very popular in recent years. It queries very fast and suits wide-table scenarios well. If a single wide table is enough, it works, but real analysis rarely relies on one wide table: it needs many table joins, nesting, and window calculations, and for those ClickHouse is somewhat strained. It does not support full standard SQL, many functions and join operations are missing, and in join-heavy scenarios its performance degrades noticeably. ClickHouse also has significant limitations in metadata management: it lacks a distributed metadata management system, so its operation and maintenance costs are relatively high.
There are also data lake technologies. Flink plus Hudi/Iceberg gives the big data platform a degree of data update capability while still handling large data scale, so it performs well for data correction. Query performance, however, is not something this class of systems has optimized much: good performance requires indexes and deep query-engine work. Hudi/Iceberg mostly stop at updatability of the storage layer plus metadata management, so query performance is fairly ordinary.
Finally, Hologres. Relatively speaking, it performs well on data-model flexibility, self-service analysis, scalability, data correction, and operability, making it a good system. It supports complete SQL: users can join and nest arbitrary tables, with second-level query response.
For online systems that need tens or hundreds of thousands of QPS, Hologres can support that too. It is a fully distributed, scalable architecture that can scale elastically as load changes. Data supports flexible updates: SQL UPDATE and DELETE are used directly to modify data. As a hosted service, elastic scaling is done through a web console, so operation and maintenance are relatively simple; development is against a standard SQL interface, so any program that can speak JDBC or ODBC can be used for development.
In summary, I believe Hologres combines the advantages of the other systems while making up for some of their shortcomings, and it is currently our most recommended architecture choice for a real-time data warehouse.
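The "flexible update" point above is just standard SQL UPDATE/DELETE. The sketch below runs the same statements against SQLite purely so it is self-contained and runnable; Hologres itself is accessed through its PostgreSQL-compatible JDBC/ODBC interface, and the table and column names here are invented for the example.

```python
import sqlite3

# In-memory database standing in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dws_sales (channel TEXT PRIMARY KEY, gmv REAL)")
conn.execute("INSERT INTO dws_sales VALUES ('app', 120.0), ('web', 80.0)")

# Correct a wrong aggregate in place: no need to rebuild the whole table.
conn.execute("UPDATE dws_sales SET gmv = 95.0 WHERE channel = 'web'")
conn.execute("DELETE FROM dws_sales WHERE channel = 'app'")

rows = conn.execute("SELECT channel, gmv FROM dws_sales ORDER BY channel").fetchall()
```

The design point is that correction is expressed as ordinary DML, rather than as replaying a stream or rewriting files.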
The next-generation big data warehouse concept HSAP: integration of analysis and service
HSAP: Hybrid Serving & Analytical Processing
The design philosophy behind Hologres is called HSAP: supporting Hybrid Serving and Analytical Processing workloads at the same time. It aims for unified storage: whether data is written in batch or in real time, write efficiency must be high enough.
Second, the serving interface is unified: whether for internal interactive analysis or online point queries, data is served through the same interface. Fusing these two scenarios together improves development efficiency.
- Hologres = Better OLAP + Better Serving + Cost Reduced
Hologres, under Alibaba's self-developed big data brand MaxCompute, is a cloud-native distributed analysis engine that supports high-concurrency, low-latency SQL queries over PB-scale data, serving real-time data warehousing and interactive big data analysis.
In our definition, Hologres is a Better OLAP: everything a traditional OLAP system offers, fast enough and flexible enough, this system must provide as well.
Second, it is a Better Serving system: the high-throughput writes, updates, and 100,000+ QPS point queries that past serving systems offered, it supports too.
Cost Reduced does not mean the system itself must be very cheap, but that learning costs and operation and maintenance costs drop sharply with this system. These are the benefits it brings.
Cloud native real-time data warehouse solution: real-time computing Flink version + Hologres
At the architectural level, this is a fairly flexible architecture.
After data is ingested in real time, the processing layer splits into two links: detail-layer processing and aggregation-layer processing.
In detail-layer processing, cleaning, association, and conversion are done, with the computation in Flink. After processing, the detailed results can be stored directly. If you are dealing with order data, this is usually enough, because tens of millions of orders is already a fairly large volume for most companies.
Such detail data can serve reports directly without much secondary aggregation. During cleaning and processing, dimension-table association (widening) is common, and this lookup scenario also suits Hologres well: what used to be done with HBase/Redis can be done by creating a table in Hologres. In the detail-processing stage, a row table can serve the widening lookups while another table stores the detail data; both scenarios, one read-heavy and one write-heavy, are supported.
For behavioral or clickstream data, the volume is usually larger and the per-record value lower; storing all the details makes analysis inefficient. Here a second, pre-aggregating pass helps: light processing into a DWS summary layer, which can be stored, or further processing into an ADS result set for a specific business scenario, which can also be stored. The data gets smaller and can therefore support higher QPS.
So this warehouse design gives us more flexibility: for interactive analysis, keep the details as much as possible; for online point lookups, process the data into a table queryable by primary key. This brings a lot of flexibility to development.
Processing need not all go through stream computing. In some cases detail data can also be reprocessed in batch inside the database, and with batch scheduling you can still do a second round of pre-aggregation.
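The "light summary" step above can be sketched as a simple roll-up: raw click events become a per-channel count, so the serving layer answers from a much smaller table. This is an illustrative sketch with invented event fields; in production this aggregation runs in Flink.

```python
from collections import defaultdict

# Raw DWD-level click events (invented for the example).
events = [
    {"channel": "app", "sku": "s1"},
    {"channel": "app", "sku": "s2"},
    {"channel": "web", "sku": "s1"},
]

dws_clicks = defaultdict(int)   # DWS light summary: channel -> click count
for e in events:                # in production, a Flink streaming aggregation
    dws_clicks[e["channel"]] += 1
```

Three detail rows collapse into two summary rows here; at clickstream scale the reduction is what makes high-QPS serving feasible.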
Real-time data warehouse scenario 1: ad hoc query
Hologres defines three ways to realize real-time data warehouse, the first is called ad hoc query.
Ad hoc query covers scenarios where you do not yet know what the query pattern will look like: store the data first, and preserve as much flexibility as possible.
So here we suggest that data in the operational (ODS) layer receive only simple cleaning and association for basic data quality, and then be stored as detail data, without much secondary processing or summarization. Since it is not yet clear how the data will be analyzed, you can build many views to encapsulate logic and serve queries.
A view is a good encapsulation of logic: joins and nesting over the base tables are encapsulated in advance but not solidified, so if the raw data is updated, the change is visible everywhere at once. You only need to update data in one place; there are no aggregation tables above to refresh, because the "aggregations" are views.
The flexibility of this approach is very high, but query performance is not necessarily the best: every query against a view re-joins and re-nests the underlying raw data, so the computation is heavy. It suits scenarios that do not need high QPS but do need high flexibility.
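The view-versus-materialization trade-off just described can be made concrete in a few lines. This is a sketch, not Hologres internals: a "view" recomputes from detail rows at read time (flexible, pays the cost per query), while a materialized table stores the result once (fast, but must be refreshed when details change). The detail rows are invented.

```python
detail = [("app", 10), ("web", 5), ("app", 7)]   # invented detail rows

def view_total(channel):
    """View-style query: aggregate the detail table at read time."""
    return sum(v for c, v in detail if c == channel)

# Materialization: compute once, then serve lookups from the result table.
materialized = {}
for c, v in detail:
    materialized[c] = materialized.get(c, 0) + v

assert view_total("app") == materialized["app"]  # same answer, different cost
```

Scenario 1 chooses the `view_total` side of this trade-off; scenario 2, below, chooses the `materialized` side.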
Real-time data warehouse scenario 2: minute-level quasi real-time
The second scenario is minute-level quasi real-time. Compared with the previous one, it requires higher QPS, and the queries are relatively more fixed.
Here we take the views from the previous scenario and materialize them into tables. The logic is the same, but once it becomes a table, the amount of data scanned per query is much smaller, which is a relatively simple way to improve query performance. Using the DataWorks scheduler with minute-level scheduling, generating a batch every 5 or 10 minutes meets the business requirements of most companies.
This approach also makes development very simple, and when any link or any piece of data goes wrong, you just have DataWorks re-run the schedule, so operation and maintenance become very simple too.
Real-time data warehouse scenario 3: Real-time statistics of incremental data
Some scenarios are very sensitive to data latency: data must be processed the moment it is generated. Here incremental computation applies: Flink aggregates the detail layer, summary layer, and other layers in advance, the aggregated result sets are stored, and services are provided externally from them.
The difference from traditional incremental computation is that we recommend persisting the intermediate state. The benefit is that the flexibility of later analysis improves, and the cost of correcting data errors drops sharply. Many companies previously chained data in real time through Kafka topics; once intermediate data quality goes wrong, correcting data inside Kafka is hard, and so is locating where the fault occurred, making the cost of errors very high.
After each topic's data is also synchronized to Hologres, when a problem appears you correct the affected table in the database and refresh the downstream data there, and the correction is done.
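The incremental pattern with persisted state can be sketched as follows. This is illustrative only (names and numbers invented): each event updates a running aggregate, and because the intermediate state is a queryable table rather than an opaque stream position, a bad record can be fixed with a compensating delta instead of replaying the whole stream.

```python
state = {}   # persisted intermediate state: channel -> running GMV

def apply(event):
    """Incrementally fold one event into the persisted aggregate."""
    state[event["channel"]] = state.get(event["channel"], 0) + event["gmv"]

for e in [{"channel": "app", "gmv": 100}, {"channel": "app", "gmv": 50}]:
    apply(e)

# Data correction: the 50 turns out to be wrong, it should have been 40.
# With persisted, addressable state we apply a delta rather than replaying.
state["app"] += 40 - 50
```

This is exactly what Kafka-only pipelines make hard: there, the 50 is buried in an immutable log and the only fix is a full reprocess.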
Selection principles for the three Hologres real-time data warehouse scenarios
In actual business, we choose according to the situation. The selection principles for the three Hologres real-time data warehouse scenarios are as follows:
- For pure offline computing, MaxCompute is preferred;
- If real-time requirements are simple, data volume is not large, and only incremental data needs to be counted into results, choose scenario 3: real-time statistics on incremental data;
- If there are real-time requirements but they are not extreme, and development efficiency is the priority, prefer the minute-level quasi-real-time solution, scenario 2;
- If real-time requirements are very high and ad hoc queries are needed, with relatively ample resources, choose scenario 1: ad hoc query.
Alibaba customer experience system (CCO) data warehouse real-time transformation
CCO Service Operation System: The digital operation capability determines the service experience of consumers and businesses.
Alibaba customer experience system (CCO) real-time data warehouse transformation
Real-time data warehouse has experienced three stages: database -> traditional data warehouse -> real-time data warehouse.
- Business difficulties:
1) High data complexity: omni-channel purchasing, ordering, payment, and after-sales, with 90% of the data real-time;
2) Large data volume: logs (10 million/second), transactions (millions/second), consultations (10,000/second);
3) Rich scenarios: real-time monitoring dashboards, real-time interactive analysis in data products, and ToC online applications.
- Technology Architecture:
DataHub + real-time computing Flink version + Hologres + MaxCompute
1) Overall hardware resource cost reduced by more than 30%;
2) High-throughput real-time writes: supports writes of 10 million rows per second, and hundreds of thousands of rows per second for column storage;
3) Simplified real-time links: the public layer is developed once and reused;
4) Unified serving: simultaneously supports multi-dimensional analysis and high-QPS service-style query scenarios;
5) MC-Hologres query service: average query latency of 142 ms on Double 11 2020, with 99.99% of queries under 200 ms;
6) Supported the construction of 200+ real-time data dashboards and provided stable data query services to nearly 300 "Xiao Er" (Alibaba's internal operations staff).
Analysis of e-commerce marketing activities
Next, give an example of marketing.
In the past, marketing campaigns were all planned a month in advance: where to place ads, what kind of ads, to whom, which vouchers, and so on. On the Internet the requirements are higher. In scenarios like Double 11, the demand for real-time strategy adjustment keeps growing, and a campaign may last only one day. We need to understand, in real time, the transaction rankings, transaction composition, inventory situation, and conversion rate of each traffic channel. For example, a web page may initially recommend a certain product, but over time you find its conversion rate is very low; at that point the marketing strategy must be adjusted.
So in this real-time marketing scenario, the requirements on real-time data analysis are very high: we must understand the conversion and transaction rate of each channel, each audience, and each product category in real time, adjust the marketing strategy accordingly, and ultimately improve conversion and transaction rates.
The general architecture inside Alibaba is shown above: products are built with QuickBI, interactive analysis uses Hologres, and data processing uses Flink and MaxCompute. This is a fairly classic architecture.
Real-time calculation of Flink version + RDS/KV/OLAP solution
The real-time compute Flink version + RDS/KV/OLAP architecture is an early solution: all computation logic is processed through Kafka streams and aggregated into result sets for storage. Its limitation is that development effort and resource consumption are both very large.
Data analysis may involve N dimensions, such as time, region, product, population, and category, each of which can be further subdivided: categories into first-, second-, and third-level categories; population portraits into consumption power, education level, region, and so on. Correlation analysis over any combination of these dimensions can yield conclusions.
That is why the computation used to be so heavy: to store in advance every angle that might be analyzed, 2^n combinations must be computed, so that whatever combination of dimensions an analyst picks in a report, a precomputed result exists. But that amount of computation is astronomical; no programmer can write that many jobs, and without the computation there is no result set.
Moreover, suppose we initially compute combinations of three dimensions, and the boss suddenly decides three are not enough to judge what happened today and wants a fourth. That logic was not written in advance, the online data has already been consumed, and recomputation is either impossible or very expensive. So this approach consumes enormous resources. And after development we end up with thousands of intermediate tables without knowing whether anyone actually uses those result sets; we can only compute first and park temporary tables in the database, which is very costly.
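The 2^n explosion above is easy to verify. The sketch below enumerates every possible grouping set over a handful of dimensions (dimension names invented for the example) using the standard library:

```python
from itertools import combinations

# With n dimensions there are 2**n possible grouping sets to precompute.
dims = ["time", "region", "product", "population", "category"]

grouping_sets = [c for r in range(len(dims) + 1)
                 for c in combinations(dims, r)]

# 5 dimensions already mean 32 precomputed aggregates; 10 would mean 1024.
n_sets = len(grouping_sets)
```

Adding the boss's fourth (or sixth) dimension doubles the count again, which is exactly why precomputing every cut does not scale.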
Real-time calculation of Flink version + Hologres interactive query plan
As shown above, the structure after the transformation has not fundamentally changed; what changed is the processing logic. The biggest change is that many intermediate processing steps are replaced by views: views compute inside the database, moving part of the pre-computation to query time.
The raw data still flows through message middleware, is parsed and deduplicated by Flink, and lands as detail data. The detail data no longer goes through heavy secondary summarization; instead it is covered by many logical views, so no matter which dimensions you want to see, you can write a SQL statement at any time, and a view can go online at any time without affecting the original data. Whatever we want to query, no analysis work is wasted, and development efficiency becomes very high: instead of computing 2^n combinations, only one copy of the detail is stored, a few light summary layers are built for specific business scenarios, and DWD and DWS are encapsulated logically as views, which greatly improves development efficiency.
Operation and maintenance demand good elastic scalability from the architecture, which is also a Hologres capability: in a Double 11 scenario with ten times the traffic, it can be scaled out at any time, which is very convenient.
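The view-based alternative can be sketched as one detail table plus an ad hoc grouping function: any combination of dimensions is answered at query time, so nothing needs to be precomputed. This is an illustrative sketch (rows invented); in Hologres the same thing is a SQL view with GROUP BY.

```python
from collections import Counter

# One copy of detail data (invented for the example).
detail = [
    {"region": "east", "category": "shoes", "gmv": 10},
    {"region": "east", "category": "home",  "gmv": 5},
    {"region": "west", "category": "shoes", "gmv": 8},
]

def group_by(rows, *dims):
    """Ad hoc aggregation over the detail table, like a SQL view."""
    out = Counter()
    for r in rows:
        out[tuple(r[d] for d in dims)] += r["gmv"]
    return dict(out)

by_region = group_by(detail, "region")   # any of the 2**n cuts, on demand
```

When the boss asks for a new dimension, it is one more call (`group_by(detail, "region", "category")`), not a new scheduled job over already-consumed data.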
Facilitate rapid business decision-making and regulation
The benefit is substantial. Previously, complex computation logic meant that getting from data generation to output results could take hours, which means we could only judge what had happened hours ago; adjusting business logic then meant waiting several more hours to see the effect, greatly reducing the flexibility of the whole business.
With the Flink + Hologres architecture, data is analyzed in real time as it is generated, real-time business decisions and adjustments can be made at any time, and analysis flexibility is very high. For example, many customers report that products they initially expected to sell well turned out, by evening, to be completely different from expectations, while unexpected products became hits; such things cannot be seen in reports planned in advance. Real scenarios also have many requirements for flexibly launching new services, and if a new service can go live within minutes or an hour, these flexible scenarios are solved. This is another major gain from the Hologres real-time data warehouse.
Efficient real-time development experience
Consider dashboard development: a large screen carries dozens of different indicators. In the past, developing such a system was complicated, requiring different data sources, different aggregations, and different schedules to produce the data.
With Hologres, the development was completed in two person-days, because there is no longer a pile of scheduling behind it: a few SQL statements suffice, which greatly improves development efficiency.
It supports more than 50 billion records in a single day, a peak write rate of over 1,000,000 records per second (100W+/s), and query response latency under 1 second, performing very well in both write volume and analysis experience.
Three phases of Hologres data warehouse development
Hologres supports more than one way to build a real-time data warehouse, with quasi real-time, fully real-time, and incremental real-time computation modes. At different stages of a company's data warehouse construction, Hologres is used differently: an exploration stage, a rapid development stage, and a mature stage.
The exploration stage features heavy demand for flexible analysis: the data model is not yet fixed, and analysts are not sure how the data will be used. Warehouse construction therefore focuses on gathering the detail layer. First, the company's data must be centralized; otherwise analysis efficiency cannot be guaranteed. Focus on ODS construction, supplemented by DWD: apply basic quality assurance to the ODS layer, do the fundamental cleaning, association, and widening, generate DWD detail-layer data, and provide analysis directly on top of that detail. Keep the ADS layer thin; do not pile scheduling and aggregation on top, because how the data will be analyzed is still uncertain, so lower-layer construction comes first. The detail data can be stored in Hologres in either column or row format.
The second stage is the rapid development stage. Its main feature is that the company starts to consider a data middle platform and begins to staff roles such as data product managers and data analysts. The company's indicator system starts to accumulate, and reusable data assets build up. DWD alone is no longer enough; we must continue to build some reus