
Abstract: This article is compiled from a talk given by Yao Dongyang, head of Meituan's real-time data warehouse platform, at the Flink Forward Asia 2021 real-time data warehouse session. The main contents include:

  1. Current status of platform construction
  2. Problems encountered and solutions
  3. Future plans

1. Current Status of Platform Construction

img

Meituan first introduced the Flink real-time computing engine in 2018. At that time, the concept of a real-time data warehouse was not yet widespread, and the platform only provided lifecycle management and monitoring alarms for Flink Jar tasks.

In 2019, we noticed that the main application scenario for real-time computing was to solve the poor timeliness of the offline data warehouse. The offline data warehouse was relatively mature and very simple to develop with SQL, while the real-time part of the data warehouse was developed mainly through the Flink DataStream API, which has a much higher barrier to entry and, compared with offline development, a more fragmented workflow. We therefore began to investigate real-time data warehouse solutions, with the goals of lowering the development threshold and promoting FlinkSQL, and eventually named Meituan's real-time data warehouse platform NAU.

In 2020, the Meituan real-time data warehouse platform officially went live. It provides businesses with an entry point for FlinkSQL job development and is mainly responsible for two things:

  • First, it aligns the common data sources of the real-time data warehouse with offline table concepts and manages them as data models (a DDL sketch follows this list);
  • Second, it provides efficient tooling for FlinkSQL development, such as validation and debugging.
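
For illustration, below is a minimal sketch of what registering a Kafka data source as a table model could look like in Flink SQL; the table name, schema, and connection properties are all hypothetical, standing in for the metadata the platform manages.

```sql
-- Hypothetical DDL mapping a Kafka topic to a table model; the schema and
-- connection info mirror what the integration module manages.
CREATE TABLE ods_order_event (
  order_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_order_event',
  'properties.bootstrap.servers' = 'kafka-broker:9092',
  'properties.group.id' = 'nau-demo',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);
```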

However, during the actual rollout we found that the operation and maintenance threshold of FlinkSQL jobs was still high for businesses, so we focused the next phase of work on the operation and maintenance center.

The pain points of FlinkSQL job operation and maintenance concentrate on two problems: the output interruption caused by deploying stateful SQL jobs, and the difficulty of locating SQL job exceptions. To this end, we address the first by making checkpoint persistence and state generation asynchronous to the online job, and the second by providing automatic job diagnosis. At present, the platform construction for the real-time data warehouse is largely complete. Going forward, we will continue to refine development and operation and maintenance capabilities and keep driving the evolution of the company's business data warehouse architecture, for example toward stream-batch unified production and the integration of production and services.

img

The real-time data warehouse now covers essentially all of the company's business, supporting more than 100 business teams, such as Meituan Select, Meituan grocery shopping, finance, and bike-sharing. It hosts more than 7,000 real-time data models, mainly Kafka table and KV table models, and runs more than 4,000 FlinkSQL jobs online, with newly added real-time SQL jobs accounting for over 70% of new jobs. Judging from these numbers, FlinkSQL can already solve most of the stream-processing problems of Meituan's real-time data warehouse.

Next, we take two real-time data warehouse production links from Meituan's business as examples to share how FlinkSQL is applied in practice.

img

Application scenario 1 is a real-time production link based on FlinkSQL + OLAP. This business link has two real-time data sources: change events from the business DB and log events from the business services. These events are first collected into Kafka; DB events are then distributed to new Kafka topics by table name. The DB and log data are also normalized into a unified format at this layer, completing the ODS layer of the real-time data warehouse. The business then uses FlinkSQL to clean and correlate the ODS-layer data, producing the subject wide table of the real-time data warehouse, which is finally written to the OLAP query engine for real-time analysis. For scenarios with lower timeliness requirements, some businesses also configure minute-level scheduling on the OLAP engine to relieve the pressure of identical queries.
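
As a sketch of this link (all table and field names are illustrative, not Meituan's actual models), a DWD-layer job might clean ODS data and correlate DB change events with service logs into a subject wide table like this:

```sql
-- Hypothetical scenario-1 job: clean and correlate ODS-layer data, then
-- write the subject wide table to an OLAP engine sink (olap_order_wide).
INSERT INTO olap_order_wide
SELECT
  o.order_id,
  o.user_id,
  o.amount,
  l.page_id,
  o.event_time
FROM ods_db_order AS o
JOIN ods_service_log AS l
  ON o.order_id = l.order_id
  -- interval join: only correlate log events near the order event,
  -- which keeps the join state bounded
  AND l.event_time BETWEEN o.event_time - INTERVAL '10' MINUTE
                       AND o.event_time + INTERVAL '10' MINUTE
WHERE o.event_type = 'create';  -- cleaning: keep only order-creation events
```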

img

The difference between application scenario 2 and scenario 1 is that the subject wide-table data of the business's real-time data warehouse is not written directly to the OLAP query engine; instead it continues into Kafka, and FlinkSQL is used to aggregate metrics at the APP layer. The metric data is then written to application-layer storage such as OLAP, DB, or KV. This approach is better suited to serving data services, because it balances data timeliness against high-QPS queries.
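
Assuming Flink 1.13+ windowing TVF syntax and illustrative names, an APP-layer metric job of this kind might look like the following sketch:

```sql
-- Hypothetical scenario-2 job: aggregate the subject wide table (read from
-- Kafka) into per-minute metrics and write them to application-layer storage.
INSERT INTO app_order_metrics
SELECT
  window_start,
  city_id,
  COUNT(*)    AS order_cnt,
  SUM(amount) AS gmv
FROM TABLE(
  TUMBLE(TABLE dwd_order_wide, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, city_id;
```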

img

The figure above shows the architecture of the real-time data warehouse platform, which is divided into five modules: integration, development, operation and maintenance, governance, and security.

The integration module focuses on data model management, covering Kafka and KV models; what it manages includes the schema information and connection information of each data source.

The development module focuses on turning business requirements into FlinkSQL. It provides version management to record the iteration of business requirements; FlinkSQL validation and debugging to ensure the developed SQL correctly expresses the business logic; and support for custom Flink UDFs and custom Format parsers, so that FlinkSQL can be extended to cover more business scenarios.

The operation and maintenance module focuses on the deployment and runtime monitoring of SQL jobs. For monitoring, we provide alerting, exception logs, and job diagnosis for SQL jobs, helping businesses quickly discover and locate job exceptions; for deployment, we provide snapshot management, A/B deployment, and parameter tuning to help businesses handle SQL job changes.

The governance module focuses on the data quality and resource cost of the real-time data warehouse. By building DQC monitoring, it helps businesses discover abnormal values or abnormal fluctuations in upstream or output data; by quantifying the production cost of the real-time data warehouse, it facilitates cost management.

The security module focuses on controlling data flows: it provides read/write permission management for data sources and a restricted-domain mechanism, ensuring the security of the company's business data.

2. Problems Encountered and Solutions

In the process of actually promoting FlinkSQL, we also faced many challenges.

2.1 Large state in dual-stream joins

img

The first is the large-state problem of dual-stream joins. A FlinkSQL dual-stream join retains the historical data of both the left and right streams so that they can be matched against each other. The longer the time interval over which events must be joined, the more historical data is saved and the larger the state grows. For example, to join an order's placement event with its refund event while guaranteeing correct results, the possible interval between the two events must be covered, which may be a month or even longer.
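
The order-refund case corresponds to a regular (non-windowed) dual-stream join like the sketch below, with hypothetical names; without a retention limit, Flink keeps every row of both streams in state so that a refund arriving a month later can still match its order:

```sql
-- Hypothetical order/refund join. As a regular stream join, both input
-- streams are retained in state indefinitely by default.
SELECT
  o.order_id,
  o.amount,
  r.refund_amount,
  r.refund_time
FROM order_event AS o
LEFT JOIN refund_event AS r
  ON o.order_id = r.order_id;
```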

On the left side of the figure above is a stateful dual-stream SQL job. Mem and Disk in the figure make up the TaskManager node of the SQL job; the job's state backend is RocksDB, and state is persisted to the HDFS file system. At first we tried to set the job's state retention to one month, but the job became unstable, with problems such as memory overruns and degraded state-read performance, which could only be alleviated by increasing the number of TaskManagers and their memory.

Even so, two pain points remained for the business. First, it is difficult to initialize the join state: the company's Kafka data sources limit how far history can be backtracked, so the business cannot build complete historical state; even if Kafka supported longer backtracking, the efficiency of state initialization would still be a problem. Second, the memory overhead is high: when multiple SQL jobs join the same data source, each job must be allocated its own memory, the states of different jobs are isolated, and identical join data cannot be shared across jobs.

To address these problems, we propose a solution that separates hot and cold join data. Assuming that data from more than two days ago is joined relatively infrequently and that state rollback will not exceed two days, data older than two days can be defined as cold data and data within two days as hot data.

Solution

img

As shown in the figure above, the SQL job on the left sets a state retention time and keeps only the hot data of the last two days (T+0 and T+1), while cold data (T+2 and older) is synchronized daily from Hive to an external KV store by a batch task. During the join, if the hot data does not exist in state, the cold data is fetched by querying the external KV store. On the right is another SQL job that joins the same data source and shares the cold data in the external KV store with the job on the left.

For the first pain point, because state is limited to two days, the amount of data to initialize when a SQL job goes online is bounded. For the second, because most data older than two days lives in the external KV store, different SQL jobs can all query it, saving a lot of memory.
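
Meituan's implementation modifies the join internally, but the pattern can be roughly approximated in plain Flink SQL. The sketch below assumes the SQL client's SET syntax, hypothetical table names, a processing-time attribute proc_time on the order stream, and a lookup-capable connector backing the cold KV table:

```sql
-- Keep only ~2 days of hot data in the stream-join state.
SET 'table.exec.state.ttl' = '2 d';

INSERT INTO dwd_order_refund
SELECT
  o.order_id,
  o.amount,
  -- prefer the hot-state join result; fall back to the cold KV lookup
  COALESCE(r.refund_amount, c.refund_amount) AS refund_amount
FROM order_event AS o
LEFT JOIN refund_event AS r
  ON o.order_id = r.order_id
LEFT JOIN cold_refund FOR SYSTEM_TIME AS OF o.proc_time AS c
  ON o.order_id = c.order_id;
```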

2.2 State recovery after SQL changes

img

The second question is how to restore state after the logic of a stateful SQL job changes. FlinkSQL supports stateful incremental computation, and the state is the historical accumulation of that computation. In practice there are many situations where the business needs to modify the logic; the right side of the figure above lists some common SQL changes, such as adding aggregate metrics, modifying metric calibers, adding filter conditions, adding stream joins, and adding aggregation dimensions.

For example, when the business adds more service dimensions and the analysis dimensions of data products need to be expanded, the FlinkSQL must be modified to add aggregation dimensions. After such a change, however, the job cannot be restored from its previous state, because the historical state cannot guarantee completeness for the changed SQL; even if restoration succeeds, the correctness of subsequent computation cannot be fully guaranteed. To guarantee data correctness, the business must recompute by backtracking over history, and the backtracking process interrupts the online output, yet the business does not want to sacrifice too much timeliness either.
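
For instance, adding an aggregation dimension changes the key of the aggregate state, so the old state cannot be mapped onto the new job, as this hypothetical before/after pair shows:

```sql
-- Before the change: aggregate state is keyed by city_id only.
SELECT city_id, COUNT(*) AS order_cnt
FROM order_event
GROUP BY city_id;

-- After adding a dimension: state is keyed by (city_id, channel). The old
-- per-city counts cannot be split across channels, so restoring from the
-- previous state cannot guarantee correct results.
SELECT city_id, channel, COUNT(*) AS order_cnt
FROM order_event
GROUP BY city_id, channel;
```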

Solution

For this problem, we give three solutions.

img

Solution 1: dual-link switching. The key to this solution is to build an identical real-time link as a backup. When a stateful SQL job changes, the backup link backtracks and recomputes the historical data; after backtracking completes, the result data of the backup link is verified, and once it is confirmed correct, the table read by the data service layer at the most downstream end of the link is switched, completing the change.

img

Solution 2: bypass state generation. Unlike dual-link switching, what is changed here is a single job on the link. The idea is to temporarily start a bypass job that backtracks and builds the state for the new logic; after the data is verified to be complete, the online job is restarted from that state, switching the SQL and its state at the same time.

img

Solution 3: historical state migration. The first two approaches share the same idea: recompute from historical data to build the new state. This approach instead migrates to a new state based on the historical state. Although a state built this way cannot guarantee completeness, it is acceptable to the business in some cases. Currently, by modifying the State Processor API, we support adding new columns to Join and Agg operators, provided the SQL operators and their upstream-downstream relationships stay unchanged.

img

Each of the three solutions has its own advantages; they can be compared horizontally along four dimensions: universality, resource cost, online interruption, and waiting time.

Universality refers to the range of SQL changes supported while guaranteeing correct data. The first two solutions recompute and build complete state, so they are more universal than solution 3.

Resource cost refers to the additional Flink or Kafka resources required to complete the SQL change. Solution 1 builds an entire duplicate link and needs the most Flink and Kafka resources, so its cost is the highest.

Online interruption refers to how long downstream data is delayed during the change. Solution 1 switches at the data service layer, so there is almost no interruption; for solution 2, the interruption time depends on how quickly the job recovers from state; for solution 3, besides state restoration, the speed of state migration must also be taken into account.

Waiting time refers to the time needed to complete the whole change process. The first two solutions must recompute history, so their waiting time is longer than solution 3's.

img

The figure above shows the platform's automated workflow for solution 2, divided into seven stages. The change process runs for a long time, possibly dozens of minutes; users can follow its progress and status through the progress bar and the execution log of each stage. We also run automatic metric checks for users. For example, in the second stage, bypass data backtracking, we check the Kafka consumption backlog metric of the bypass job to decide whether backtracking is complete, and automatically create the new-logic state once it is. As another example, in the sixth stage, when the original job is started from the bypass job's state, the Kafka offset metrics of the two jobs are compared to verify their consumption progress, ensuring the restarted online job does not drop data.

2.3 Tedious FlinkSQL debugging

img

The third problem is that debugging FlinkSQL is cumbersome and involves many steps. The business has to create extra jobs and Kafka topics and store the output results. In addition, constructing input is complex: to debug a specific input scenario in a targeted way, the business must write code to construct messages and write them to the data source, and may even need to control the arrival order of messages from multiple data sources. As the left side of the figure above shows, to debug FlinkSQL one must manually build a debugging link isolated from production and then write the mock data.

Solution

img

The solution is one-click, file-based debugging. First, the business edits mock data online on the web side; mock data is a bounded message sequence, initialized by sampling from production and then modified by the business. Once built, the mock data of the SQL job is persisted to the S3 file object system shown on the right. When the business clicks debug on the web side, the debugging task is executed as a single process on a server isolated from production; during execution it fetches the previously uploaded mock data from S3 and feeds in the multi-source messages the mock data specifies. After execution completes, the output is also persisted to S3, and finally the web side queries S3 and presents the result to the business.

In most cases, the business does not need to modify the mock data and only needs two steps: sample and execute. We also support advanced debugging features, such as controlling the order and interval of messages.
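
Conceptually, a debug run swaps each Kafka source for a bounded file source over the mock data in S3. A rough sketch of such a substitute source, with a hypothetical path and schema, is:

```sql
-- Hypothetical bounded mock source used only during debugging: it reads the
-- sampled/edited messages from S3 instead of the online Kafka topic.
CREATE TEMPORARY TABLE order_event_mock (
  order_id   BIGINT,
  amount     DECIMAL(10, 2),
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 's3://nau-debug/mock/order_event.json',
  'format' = 'json'
);
```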

img

The figure above shows the debugging tool built on this solution. The business creates multiple test cases for a SQL job, each containing the mock data of the sources and the expected result of the sink. After a debug run, every test case is checked; a case passes only if the table obtained by merging the result stream matches the expected table.

2.4 Locating SQL job exceptions

img

The fourth problem is locating exceptions in FlinkSQL jobs. Here, a job exception means a backlog in the job's Kafka consumption; to resolve it, the cause of the backlog must be located. The attribution path is complicated and the barrier to investigation is high, and because experience along the attribution path has not been systematically accumulated, locating a cause also takes a long time. As the number of SQL jobs keeps growing, relying entirely on manual troubleshooting would mean a huge workload.

Solution

img

The solution is automated exception diagnosis for SQL jobs. The runtime metrics of SQL jobs are reported through the Flink Reporter and persisted to a TSDB for historical queries, and the jobs' run logs are persisted as well. The alarm service monitors the Kafka offset metrics reported by SQL jobs according to rules: when the consumed offset falls behind the produced offset, it determines that the job has a consumption backlog, raises an alarm, and emits an exception event. The diagnosis service listens for these exception events from the alarm service.
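
As an illustration, if the reported offsets were persisted as a queryable table, the backlog rule would amount to a check like the following; the table, fields, and threshold here are hypothetical:

```sql
-- Hypothetical alarm rule: flag a consumption backlog when the consumed
-- offset trails the produced offset by more than a threshold.
SELECT
  job_id,
  topic,
  produced_offset - consumed_offset AS lag
FROM job_kafka_offset_metrics
WHERE produced_offset - consumed_offset > 100000;
```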

When an exception event arrives, the diagnosis service analyzes the cause based on the job logs and job metrics within the exception time window. Rules can be added to the diagnosis service to accumulate manual troubleshooting experience. For example, if a restart occurred, exception information is extracted from the log by keyword; if not, the bottleneck node is located from the backpressure metrics, the cause of the bottleneck is then analyzed from GC metrics, data skew, flame graphs, and so on, and finally tuning suggestions are given.

img

The figure above shows an example of diagnosing dirty data in a business message. The running-overview column gives the diagnosis of the SQL job at each point in time: green means the job ran normally, red means it was abnormal. From this timeline you can clearly see when the exception occurred. The cause, details, and suggestions appear in the diagnosis-result column: in this case the cause is dirty data in a business message, the details show the original message content that triggered the exception, and the suggestion prompts the business to configure a dirty-data handling strategy.

3. Future Plans

Going forward, the plans for Meituan's real-time data warehouse platform focus on two aspects.

  • First, integrated stream-batch development and operations. We will soon integrate data lake storage into the real-time data warehouse platform and open up FlinkSQL batch jobs, unifying stream and batch at both the storage and compute layers to improve work efficiency.
  • Second, automatic job tuning, continuing to improve the accuracy of job diagnosis and the efficiency of job restarts.
