A real-time data warehouse aims for end-to-end low latency, standardized SQL, rapid response to change, and unified data. The best practice summarized by the Meituan Waimai Data Intelligence Group is to pair a general-purpose real-time production platform with a general-purpose interactive real-time analysis engine, which together cover real-time and near-real-time business scenarios. The two divide the work sensibly and complement each other, forming a pipeline that is easy to develop, easy to maintain and efficient. This balances development efficiency against production cost and meets diverse business needs with a better input-output ratio.
01 Real-time scenarios
Meituan Waimai has many real-time data scenarios, mainly in the following areas:
- Operations : For example, real-time business changes, real-time marketing effects, the current day's business performance, and time-sliced analysis of the current day's business trends.
- Production : For example, whether the real-time systems are reliable and stable, and real-time monitoring of system health.
- C-end users : For example, search, recommendation and ranking need real-time features, such as users' real-time behavior, to recommend more suitable content.
- Risk control : Real-time risk identification, anti-fraud and abnormal-transaction detection all rely on large amounts of real-time data.
02 Real-time technology and architecture
1. Real-time computing technology selection
There are many open source real-time technologies available today; the most common are Storm, Spark Streaming, and Flink. Engineers should choose among them according to the company's specific business.
Meituan Waimai builds on Meituan's company-wide data infrastructure. In terms of technology maturity, the company mainly used Storm in earlier years, when Storm was irreplaceable in performance stability, reliability and scalability. As Flink has matured, it has surpassed Storm in performance and framework design. Looking at the trend, just as Spark replaced MapReduce (MR), Storm will gradually be replaced by Flink. The migration from Storm to Flink takes time: some older jobs still run on Storm, and we are steadily migrating them.
For a detailed comparison of Storm and Flink, please refer to the table above.
2. Real-time architecture
① Lambda architecture
Lambda is a fairly classic architecture. In the past, real-time scenarios were rare and most processing was offline. When real-time scenarios were added, the difference in timeliness between offline and real time led to different technology stacks. The Lambda architecture effectively attaches a real-time production link alongside the offline one and merges the results at the application layer, so the two production paths run independently of each other. In business applications, this became the natural approach to adopt.
Dual-path production has some problems: the processing logic is duplicated, development and operations work is duplicated, and resources are split across two links. These problems led to the Kappa architecture.
② Kappa architecture
From an architecture-design perspective, Kappa is simpler: production is unified, and a single set of logic serves both offline and real-time production. In practice, however, it has significant limitations. Few cases in the industry run production directly on the Kappa architecture, and those that do cover fairly narrow scenarios. Meituan Waimai runs into the same problems; our own thinking on them is explained in the following sections.
03 Business Pain Points
First, consider the problems and challenges we encountered in the food delivery business. In the early stages, requirements were usually met case by case so that the business need was satisfied first. The business has high real-time requirements, and under that time pressure there was no opportunity to build up a shared intermediate layer. In this situation, the business logic is usually embedded directly in each job, which is the simplest effective method available. This development model is common in the early stages of a business.
As shown in the figure above, after obtaining the data source, each job performs data cleaning, dimension expansion and business logic processing in Storm or Flink, and then outputs directly to the business. Taking this link apart, the jobs repeatedly read the same data sources, and the subsequent cleaning, filtering and dimension expansion are all repeated; only the business logic differs. With few businesses this model is acceptable. As business volume grows, however, each developer ends up operating and maintaining their own jobs, the maintenance workload keeps growing, and the jobs cannot be managed in a unified way. Moreover, everyone applies for resources separately, so resource costs expand rapidly and resources cannot be used intensively and effectively. It therefore becomes necessary to think about how to build real-time data as a whole.
04 Data characteristics and application scenarios
So how do we build a real-time data warehouse? First we need to break the problem down: what data is there, what scenarios are there, and what do these scenarios have in common? In the food delivery setting there are two major categories of data: log data and business data.
- Log data : The data volume is very large, and the data is semi-structured and deeply nested. A key property of log data is that once a log stream is formed it does not change. All platform logs are collected through event tracking (instrumentation points) and then gathered and distributed centrally. The structure resembles a tree with a very large root: serving front-end applications means branching from the root out to the branches (a 1-to-n decomposition). If every business read its data from the root, the path would seem shortest, but the load would be too heavy and data retrieval inefficient. Log data is generally used for production monitoring and user behavior analysis, and its timeliness requirements are high. The time window is typically 5 or 10 minutes, or cumulative up to the current moment. The main applications are real-time dashboards and real-time features, for example the need to perceive each user click immediately.
- Business data : Mainly business transaction data. Business systems are generally self-contained and distribute their data downstream as Binlog. They are transactional and mostly use normalized (normal-form) modeling. The data is structured and the subject of each table is clear, but there are many tables, and several must be combined to express a complete business process, so processing is an n-to-1 integration.
For real-time processing of business data, the main difficulties are as follows:
- Multi-state business : A business process moves through multiple states up to delivery, and the business database updates records in place, so Binlog generates many change logs. Business analysis, however, cares mostly about the final state, which creates the problem of retraction calculation: for example, an order placed at 10:00 is canceled at 13:00, but the cancellation should be subtracted from the 10:00 figures.
- Business integration : Business analysis data generally cannot be expressed by a single subject; several tables usually have to be joined to obtain the desired information. Merging and aligning data in real-time streams often requires large caches and is complicated.
- Analysis is batch while processing is streaming : A single record cannot support analysis, so the object of analysis must be a batch of records, whereas the data is processed record by record.
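The retraction problem in the first bullet can be illustrated with a small sketch. It is not the team's implementation: the record shape `(order_id, placed_hour, status)` and the function name are assumptions made for illustration. The point is that a streaming job has to remember the last state of every order so that a later change can undo an earlier contribution.

```python
from collections import defaultdict

def hourly_order_counts(changes):
    """Count valid orders per placement hour from a stream of change records.

    Each record is (order_id, placed_hour, status). Later records for the
    same order_id supersede earlier ones, as in-place updates in Binlog would.
    """
    counts = defaultdict(int)
    last_seen = {}  # order_id -> (placed_hour, status); this state must be kept
    for order_id, placed_hour, status in changes:
        previous = last_seen.get(order_id)
        if previous is not None and previous[1] != "canceled":
            counts[previous[0]] -= 1  # retract the earlier contribution
        if status != "canceled":
            counts[placed_hour] += 1
        last_seen[order_id] = (placed_hour, status)
    return dict(counts)

changes = [
    ("o1", 10, "placed"),
    ("o2", 10, "placed"),
    ("o1", 10, "paid"),      # state change, still a valid order
    ("o2", 10, "canceled"),  # arrives at 13:00 but corrects the 10:00 bucket
]
counts = hourly_order_counts(changes)
```

After the cancellation arrives, the 10:00 bucket holds one order instead of two. The `last_seen` dictionary grows with the number of open orders, which is the state cost a streaming job pays for retraction.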
Log and business scenarios generally coexist and are intertwined. Whether Lambda or Kappa, applying a single architecture everywhere causes problems. It therefore makes more sense to choose the architecture and practice according to the scenario.
05 Real-time data warehouse architecture design
1. Real-time architecture: exploration of stream-batch combination
Based on the problems above, we developed our own approach: handle different business scenarios by combining stream and batch processing.
As shown in the figure above, data is collected uniformly, from logs into the message queue and then into the ETL process of the data flow, so the construction of the basic data flow is unified. After that, real-time log features and real-time dashboard applications use real-time stream computing, while Binlog-based business analysis uses real-time OLAP batch processing.
What are the pain points of analyzing business data with stream processing? For normalized business data, both Storm and Flink need a large amount of external storage to align records across data streams, which consumes a lot of computing resources. Because external storage is limited, a window restriction strategy must be applied, and some data may end up being discarded. The computed results are generally stored in Redis to serve queries, and KV storage has significant limitations for analytical queries.
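A minimal sketch of why a window restriction leads to discarded data, assuming two hypothetical streams (orders and payments) joined on `order_id` and events arriving in timestamp order. The class and field names are invented for illustration and do not correspond to Storm or Flink APIs.

```python
from collections import OrderedDict

class WindowedJoin:
    """Join order events with payment events on order_id within a time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.pending_orders = OrderedDict()  # order_id -> (ts, order), arrival order
        self.dropped = []                    # orders evicted without a match

    def _evict(self, now):
        # Cached orders older than the window can no longer be joined.
        while self.pending_orders:
            order_id, (ts, _) = next(iter(self.pending_orders.items()))
            if now - ts <= self.window:
                break
            self.pending_orders.popitem(last=False)
            self.dropped.append(order_id)

    def on_order(self, ts, order_id, order):
        self._evict(ts)
        self.pending_orders[order_id] = (ts, order)

    def on_payment(self, ts, order_id, payment):
        self._evict(ts)
        cached = self.pending_orders.pop(order_id, None)
        if cached is None:
            return None  # the order expired or was never seen
        return {**cached[1], **payment}

join = WindowedJoin(window_seconds=60)
join.on_order(0, "o1", {"order_id": "o1", "amount": 30})
join.on_order(10, "o2", {"order_id": "o2", "amount": 12})
matched = join.on_payment(20, "o1", {"paid": True})
late = join.on_payment(100, "o2", {"paid": True})  # 90 s after the order
```

The payment for `o2` arrives after the 60-second window, so the cached order has already been evicted and the join produces nothing for it. A larger window loses less data but needs a larger cache.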
How can real-time OLAP be achieved? Is there a real-time computing engine with its own storage that can compute flexibly over a certain range as real-time data arrives, can hold a reasonable volume of data, and can also answer analytical queries? MPP engines are developing very quickly and their performance is improving rapidly, which opens up a new possibility for this scenario. Here we use the Doris engine.
This idea has also been put into practice elsewhere in the industry and has become an important direction for exploration, for example Alibaba's real-time OLAP solution based on ADB.
2. Real-time data warehouse architecture design
For the overall real-time data warehouse architecture, the first questions are how to manage all real-time data, how to integrate resources effectively, and how to build the data.
In terms of methodology, real-time and offline warehouses are very similar. The offline data warehouse was also built case by case in its early stages; once the data reached a certain scale, the question of how to govern it arose. Layering is a very effective form of data governance, so for the real-time data warehouse the first consideration is likewise layered processing logic. The details are as follows:
- Data source : At the data source level, offline and real-time share the same sources. They are mainly divided into log data and business data; log data includes user logs, DB logs and server logs.
- Real-time detail layer : To avoid repeated construction, the detail layer must be built in a unified way. Following the offline data warehouse model, we build a unified base layer of detail data, managed by subject. The purpose of the detail layer is to provide directly usable data to downstream consumers, so the base layer handles processing such as cleaning, filtering and dimension expansion uniformly.
- Summary layer : The summary layer can compute results directly with the concise operators of Flink or Storm, forming a pool of summary indicators. All indicators are processed in the summary layer, and everyone manages and builds them according to unified standards, producing reusable summary results.
To sum up, when building a real-time data warehouse, first establish the layering of the data: build the framework, then define the specifications, including how far each layer processes the data and which method each layer uses. Once the specifications are defined, standardized processing in production becomes straightforward. Because timeliness must be guaranteed, the design should not have too many layers. Scenarios with high real-time requirements can basically follow the data flow on the left side of the figure. For batch-style requirements, data can be imported from the real-time detail layer into the real-time OLAP engine, which uses its own computing and query capabilities to perform fast retraction calculations, as shown in the data flow on the right side of the figure above.
06 Real-time platform construction
Once the architecture is settled, the next question is how to build the platform. The real-time platform is layered on top of the management of the real-time data warehouse.
First, abstract the functions into components so that production can be standardized and systematic guarantees can be built in greater depth. Basic processing functions such as cleaning, filtering, merging of streams, dimension expansion, conversion, encryption and screening can all be abstracted, and the base layer uses these components to build a directly usable result stream. This raises a problem: user needs are diverse, and satisfying one user while staying compatible with others may introduce redundant processing. From a storage perspective, real-time data keeps no history and does not take up much space, so this redundancy is acceptable. Trading redundancy for productivity is an application of the space-for-time idea.
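One way to picture the componentized base layer is a list of small, reusable functions applied to every record. The following is a sketch under assumptions: the component names, the record shape, and the masking stand-in for encryption are all hypothetical and not taken from the platform itself.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Record = dict
Component = Callable[[Record], Optional[Record]]  # return None to drop a record

def clean_component(required: List[str]) -> Component:
    """Drop records that miss any required field."""
    return lambda r: r if all(k in r for k in required) else None

def filter_component(predicate: Callable[[Record], bool]) -> Component:
    """Keep only records that satisfy the predicate."""
    return lambda r: r if predicate(r) else None

def expand_component(key: str, dim_table: dict, target: str) -> Component:
    """Dimension expansion: look up r[key] in dim_table and store it in r[target]."""
    return lambda r: {**r, target: dim_table.get(r[key])}

def mask_component(field: str) -> Component:
    """Stand-in for encryption: hide all but the last four characters."""
    return lambda r: {**r, field: "*" * (len(r[field]) - 4) + r[field][-4:]}

def run_pipeline(records: Iterable[Record], components: List[Component]) -> Iterator[Record]:
    """Apply the components in order; stop early when one drops the record."""
    for record in records:
        current: Optional[Record] = record
        for component in components:
            current = component(current)
            if current is None:
                break
        if current is not None:
            yield current

city_dim = {"poi_1": "Beijing"}  # hypothetical dimension table
pipeline = [
    clean_component(["poi_id", "phone"]),
    filter_component(lambda r: r.get("is_test") is not True),
    expand_component("poi_id", city_dim, "city"),
    mask_component("phone"),
]
raw = [
    {"poi_id": "poi_1", "phone": "13800001234"},
    {"poi_id": "poi_1"},                                          # fails cleaning
    {"poi_id": "poi_1", "phone": "13800005678", "is_test": True}, # filtered out
]
detail = list(run_pipeline(raw, pipeline))
```

Because the pipeline is just a list, a different downstream need is expressed as a different component list over the same input, which is where the redundant processing mentioned above comes from.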
After base-layer processing, all data is deposited in the IDL layer and written to the base layer of the OLAP engine at the same time. The real-time summary layer is then computed: multi-dimensional summary indicators are produced with Storm, Flink or Doris, forming a unified summary layer for unified storage and distribution.
Once these functions are in place, system capabilities such as metadata management, indicator management, data security, SLA and data quality are built up gradually.
1. Real-time basic layer function
Building the real-time base layer requires solving several problems. The first is repeated reading of the same stream. Binlog is delivered packaged at the DB level, while a user may need only one of the tables. If every user connects to the stream directly, the same stream is read many times. The solution is to split the stream by business, restore it into the basic data flow layer, shape it into a normalized structure according to business needs, and then integrate it by subject following data warehouse modeling methods.
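The splitting step can be sketched as reading the DB-level stream once and fanning it out into per-table streams. The event shape (`table`, `op`, `row`) is an assumption for illustration and is not the actual Binlog format.

```python
from collections import defaultdict

def split_binlog_by_table(binlog_events, wanted_tables=None):
    """Read a DB-level change stream once and fan it out into per-table streams."""
    streams = defaultdict(list)
    for event in binlog_events:
        table = event["table"]
        if wanted_tables is None or table in wanted_tables:
            streams[table].append(event)
    return dict(streams)

binlog = [
    {"table": "orders", "op": "insert", "row": {"order_id": "o1"}},
    {"table": "refunds", "op": "insert", "row": {"refund_id": "r1"}},
    {"table": "orders", "op": "update", "row": {"order_id": "o1"}},
]
streams = split_binlog_by_table(binlog, wanted_tables={"orders"})
```

Downstream users then subscribe to the per-table stream they need instead of each attaching to the full DB-level stream.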
Second, components must be encapsulated. Base-layer functions such as cleaning, filtering and dimension expansion are exposed through a very simple expression entry point in which users write their logic. Data conversion is more flexible, for example converting one value into another; for such custom logic we also provide custom components, which let users develop data-processing scripts in Java or Python.
2. Real-time feature production function
Feature production logic can be expressed in SQL. The underlying layer adapts the logic and passes it through transparently to the computing engine, so users do not depend on a specific engine. As in offline scenarios, large companies now rarely develop such jobs in code except in special cases, so almost everything can be expressed in SQL.
At the functional level, the platform incorporates indicator management. Atomic indicators, derived indicators, standard calculation definitions, dimension selection, window settings and other operations are configured in a unified way, so that production logic can be parsed and packaged uniformly.
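As an illustration of configuring an indicator rather than coding it, the sketch below describes a derived indicator as an atomic measure plus dimensions plus a tumbling window. `IndicatorSpec`, its fields, and the event shape are hypothetical names chosen for this example and do not describe the platform's real configuration model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class IndicatorSpec:
    name: str                        # derived indicator name
    atomic: Callable[[dict], float]  # atomic measure taken from each event
    dimensions: Tuple[str, ...]      # group-by fields
    window_seconds: int              # tumbling window size

def evaluate(spec, events):
    """Return {(window_start, *dimension_values): value} for one indicator."""
    result = defaultdict(float)
    for e in events:
        window_start = e["ts"] - e["ts"] % spec.window_seconds
        key = (window_start,) + tuple(e[d] for d in spec.dimensions)
        result[key] += spec.atomic(e)
    return dict(result)

gmv_by_city_5min = IndicatorSpec(
    name="gmv_by_city_5min",
    atomic=lambda e: e["amount"],
    dimensions=("city",),
    window_seconds=300,
)
events = [
    {"ts": 10, "city": "Beijing", "amount": 20},
    {"ts": 200, "city": "Beijing", "amount": 5},
    {"ts": 310, "city": "Beijing", "amount": 7},
    {"ts": 320, "city": "Shanghai", "amount": 9},
]
values = evaluate(gmv_by_city_5min, events)
```

Changing the window or the dimensions only changes the spec, not the evaluation code, which is the benefit of parsing and packaging production logic in one place.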
There is another problem: many SQL statements may be written against the same source, and if each submission starts its own data stream, resources are wasted. Our solution is to produce indicators dynamically on a single stream, so that indicators can be added without stopping the service.
Therefore, in building the real-time platform, much of the thinking goes into how to use resources more effectively and which links can be made more economical. These are largely engineering considerations.
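A minimal sketch of the single-stream idea, assuming a hypothetical `IndicatorJob` that keeps a registry of indicators and evaluates all of them in one pass over each event. How the real platform registers indicators is not described in the source, so the interface here is invented.

```python
class IndicatorJob:
    """One running consumer that evaluates a changeable set of indicators."""

    def __init__(self):
        self.indicators = {}  # name -> (predicate, value_fn)
        self.totals = {}      # name -> running sum

    def add_indicator(self, name, predicate, value_fn):
        # Registered while the job keeps running; it counts from now on only.
        self.indicators[name] = (predicate, value_fn)
        self.totals.setdefault(name, 0)

    def remove_indicator(self, name):
        self.indicators.pop(name, None)

    def process(self, event):
        # A single pass over the event updates every registered indicator.
        for name, (predicate, value_fn) in self.indicators.items():
            if predicate(event):
                self.totals[name] += value_fn(event)

job = IndicatorJob()
job.add_indicator("order_cnt", lambda e: e["type"] == "order", lambda e: 1)
job.process({"type": "order", "amount": 20})
# A second indicator is added without restarting the consumer.
job.add_indicator("gmv", lambda e: e["type"] == "order", lambda e: e["amount"])
job.process({"type": "order", "amount": 15})
job.process({"type": "click"})
```

Both indicators share one read of the source. The trade-off is that an indicator added later has no history unless the stream is replayed.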
3. SLA construction
SLA construction addresses two problems: the end-to-end SLA, and the SLA of job processing efficiency. We use event tracking plus reporting. Because real-time streams are large, the tracking points should be as simple as possible and carry only the minimum information needed to express the business. Each job reports its output to the SLA monitoring platform through a unified interface, submitting the required information at each job point, and the end-to-end SLA can then be computed.
In real-time production the links are very long and not every link can be controlled, but the efficiency of one's own jobs can be, so the job-level SLA is also indispensable.
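The tracking-plus-reporting approach can be sketched as each job reporting a tiny checkpoint to a shared monitor. `SlaMonitor`, the trace id, and the job names are assumptions for illustration; the actual reporting interface is not given in the source.

```python
import time
from collections import defaultdict

class SlaMonitor:
    """Collects per-job checkpoints keyed by a trace id carried in each record."""

    def __init__(self):
        self.checkpoints = defaultdict(dict)  # trace_id -> {job_name: timestamp}

    def report(self, trace_id, job_name, ts=None):
        # The payload is deliberately tiny: an id, a job name and a timestamp.
        self.checkpoints[trace_id][job_name] = time.time() if ts is None else ts

    def job_latency(self, trace_id, upstream, job):
        """Time spent between an upstream checkpoint and this job's output."""
        points = self.checkpoints[trace_id]
        return points[job] - points[upstream]

    def end_to_end_latency(self, trace_id):
        """Time between the earliest and the latest checkpoint of a trace."""
        points = self.checkpoints[trace_id].values()
        return max(points) - min(points)

monitor = SlaMonitor()
monitor.report("t1", "source", ts=100)
monitor.report("t1", "detail_layer", ts=102)
monitor.report("t1", "summary_layer", ts=105)
end_to_end = monitor.end_to_end_latency("t1")
summary_job = monitor.job_latency("t1", "detail_layer", "summary_layer")
```

The same checkpoints serve both purposes: the spread over all of them gives the end-to-end figure, and the difference between two adjacent ones gives the latency of a single job.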
4. Real-time OLAP solution
Problems
- Binlog business restoration is complicated : The business changes frequently, and analysis needs the state as of a given point in time. This requires sorting and storing the data, which consumes a lot of memory and CPU.
- Binlog business association is complicated : In stream computing, joins between streams make the business logic very hard to express.
Solution
The problem is solved with an OLAP engine that has computing capability. There is no need to express the logic as mappings on a stream; the only problem left is writing the data into storage in real time and reliably.
We use Doris as the high-performance OLAP engine. Because the results produced from business data often need further derived calculation with one another, Doris's Unique model or Aggregate model can be used to restore the business state quickly and, at the same time, aggregate it into the summary layer, which is also designed for reuse. The application layer can be physical tables or logical views.
This mode focuses on business retraction calculations, for example when a business status change requires a value at some point in history to be corrected. Doing this in stream computing is very costly, whereas the OLAP mode handles it well.
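The effect of a unique-key table can be imitated in a few lines. This is plain Python standing in for the engine, not Doris syntax or its API; the row fields are hypothetical. The idea shown is that the latest row per key replaces earlier ones on write, and aggregation runs at query time over the restored state.

```python
def apply_unique_model(change_log, key="order_id"):
    """Keep only the latest row per key, like an upsert into a unique-key table."""
    table = {}
    for row in change_log:  # rows are assumed to arrive in commit order
        table[row[key]] = row
    return table

def gmv_by_hour(table):
    """Aggregate at query time over the restored latest state."""
    result = {}
    for row in table.values():
        if row["status"] != "canceled":
            hour = row["placed_hour"]
            result[hour] = result.get(hour, 0) + row["amount"]
    return result

change_log = [
    {"order_id": "o1", "placed_hour": 10, "amount": 30, "status": "placed"},
    {"order_id": "o2", "placed_hour": 10, "amount": 20, "status": "placed"},
    {"order_id": "o1", "placed_hour": 10, "amount": 30, "status": "paid"},
    {"order_id": "o2", "placed_hour": 10, "amount": 20, "status": "canceled"},
    {"order_id": "o3", "placed_hour": 13, "amount": 15, "status": "placed"},
]
orders = apply_unique_model(change_log)
gmv = gmv_by_hour(orders)
```

The canceled order simply disappears from the 10:00 total at query time. No retraction message has to be computed, because the write path only has to land the latest row.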
07 Real-time application cases
Finally, consider a case. Suppose a merchant wants to offer a user a discount based on the number of orders the user has placed historically. The merchant needs to see how many orders the user has placed, which requires both the historical T+1 data and today's real-time data. This is a typical Lambda-architecture scenario. We can design a partitioned table in Doris with a history partition and a today partition. The history partition is produced offline, today's indicators are computed in real time and written into the today partition, and a query simply sums the two.
Such a scenario looks simple, but the difficulty is that many simple problems become complicated once the number of merchants grows. Going forward, we will draw on more business input to accumulate more scenarios, abstract them into unified production schemes and functions, and support diverse business needs with minimal real-time computing resources. This is the goal we need to achieve.
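The two-partition design can be mimicked with a small in-memory model. The class below is a hypothetical stand-in for the partitioned table, not Doris code; it only shows that the offline load, the real-time write, and the query touch separate parts of the same table.

```python
from collections import defaultdict

class PartitionedOrderCounts:
    """A table with a 'history' partition (T+1 batch) and a 'today' partition."""

    def __init__(self):
        self.partitions = {"history": defaultdict(int), "today": defaultdict(int)}

    def load_history(self, counts):
        # The offline job overwrites the history partition once a day.
        self.partitions["history"] = defaultdict(int, counts)

    def on_realtime_order(self, user_id):
        # The real-time job only ever writes to the today partition.
        self.partitions["today"][user_id] += 1

    def total_orders(self, user_id):
        # The query is a plain sum across both partitions.
        return sum(p.get(user_id, 0) for p in self.partitions.values())

table = PartitionedOrderCounts()
table.load_history({"u1": 12, "u2": 3})
table.on_realtime_order("u1")
table.on_realtime_order("u1")
table.on_realtime_order("u3")
```

A user with 12 historical orders and 2 orders today is reported with 14, and a first-time user is served from the today partition alone.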
This article is produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please indicate "the content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activity, please send an email to tech@meituan.com to apply for authorization.