
This article is compiled from the talk "Kuaishou's Scenario-Based Practice of Building a Real-Time Data Warehouse on Flink", shared by Li Tianshuo, a data technology expert at Kuaishou, at the Flink Meetup Beijing Station on May 22. The content includes:

  1. Kuaishou real-time computing scenarios
  2. Kuaishou real-time data warehouse architecture and safeguard measures
  3. Problems and solutions in Kuaishou scenarios
  4. Future plans

1. Kuaishou real-time computing scenarios


The real-time computing scenarios in Kuaishou's business fall into four main parts:

  • Company-level core data: includes the company's operating dashboard, real-time core daily reports, and mobile-version data. In effect, the team maintains company-wide core indicators, and each business line, such as video and live streaming, has its own core real-time dashboard;
  • Real-time indicators for large-scale events: that is, real-time large screens. For example, for Kuaishou's Spring Festival Gala campaign, we have an overall big screen to watch the status of the whole activity. A large-scale event is divided into N different modules, and we provide different real-time data dashboards for the different gameplay of each module;
  • Data operations: operational data covers two aspects, creators and content. For creators and content, on the operations side, for example when a big-V event is launched, we want to see information such as the real-time status of the live room and how much traffic the live room pulls into the overall market. Based on this scenario, we build a variety of multi-dimensional real-time large-screen data as well as some dashboard data.

    In addition, this part also includes support for operational strategies. For example, we may discover hot content, hot creators and current hot trends in real time, and then output strategies based on these hot spots; this is also one of the support capabilities we need to provide;

    Finally, it also includes C-side data display. For example, Kuaishou has a creator center and an anchor center, which contain pages such as the anchor's post-broadcast summary page; part of the real-time data on those pages is also produced by us.

  • Real-time features: includes search/recommendation features and real-time advertising features.

2. Kuaishou real-time data warehouse architecture and safeguard measures

1. Goals and difficulties


1.1 Goals

  • First of all, since we are a data warehouse team, we want every real-time indicator to have a corresponding offline indicator, and the overall difference between real-time and offline indicators must be within 1%. This is the minimum standard.
  • The second is data latency. The SLA is that during an event the data delay of all core report scenarios must not exceed 5 minutes, including the time a job is down and the recovery time; if it exceeds that, the SLA is not met.
  • Finally, stability. In some scenarios, for example after a job restarts, the curve must stay normal, and indicator output must not show obvious anomalies caused by the restart.

1.2 Difficulties

  • The first difficulty is the large data volume. The overall daily ingress traffic is on the order of trillions of records. During events such as the Spring Festival Gala, peak QPS can reach 100 million.
  • The second difficulty is that component dependencies are complicated. Some parts of the link rely on Kafka, some on Flink, and some on KV storage, RPC interfaces, OLAP engines, etc. We need to think about how to distribute work across this link so that all of these components can function normally.
  • The third difficulty is link complexity. We currently have 200+ core business jobs, 50+ core data sources, and more than 1,000 jobs overall.

2. Real-time data warehouse: layered model

Given the above three difficulties, let's take a look at the data warehouse architecture:


As shown above:

  • The bottom layer has three different data sources: client logs, server logs and Binlog logs;
  • The common base layer is divided into two levels: the DWD layer, for detailed data, and the DWS layer, for common aggregated data. DIM is what we usually call the dimension layer. On top of the offline data warehouse we have a set of pre-defined subject areas, which may include traffic, users, devices, video production and consumption, risk control, social, and so on.

    • The core work of the DWD layer is standardized cleaning;
    • The DWS layer joins dimension data to the DWD layer and produces aggregation layers at common granularities (a minimal sketch of these two layers follows this list).
  • Above that is the application layer, including large-screen data, multi-dimensional analysis models and business-specific thematic data;
  • At the top are the application scenarios.
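
To make the division of labor concrete, here is a minimal Flink SQL sketch of the two layers. It only illustrates the idea of "clean in DWD, join dimensions and aggregate in DWS": every table and field name (client_log_raw, dwd_traffic_event, dim_user, dws_traffic_page) is hypothetical, and the real jobs are of course far richer.

```sql
-- DWD: standardized cleaning of a raw client-log stream (hypothetical schemas).
INSERT INTO dwd_traffic_event
SELECT
  LOWER(TRIM(event_type)) AS event_type,          -- normalize event names
  did,
  page_id,
  CAST(event_ts AS TIMESTAMP(3)) AS event_time
FROM client_log_raw
WHERE did IS NOT NULL AND event_type IS NOT NULL; -- drop dirty records

-- DWS: join a dimension table and aggregate at a common granularity.
INSERT INTO dws_traffic_page
SELECT
  e.page_id,
  u.user_level,                                   -- dimension attribute from DIM
  COUNT(DISTINCT e.did) AS uv,
  COUNT(1)              AS pv
FROM dwd_traffic_event AS e
LEFT JOIN dim_user AS u ON e.did = u.did          -- in practice a lookup/temporal join
GROUP BY e.page_id, u.user_level;
```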

The overall process can be divided into three steps:

  • The first step is to do business digitization, which is equivalent to connecting business data;
  • The second step is data assetization, which means cleaning the data thoroughly to form regular, orderly data;
  • The third step is data commercialization (putting data to business use): at the real-time data level, data can feed back into the business and empower the construction of business data value.

3. Real-time data warehouse: safeguard measures

Based on the above layered model, take a look at the overall safeguard measures:


The guarantees are divided into three parts: quality guarantee, timeliness guarantee and stability guarantee.

  • Let's look first at the quality guarantee (the blue part). At the data source stage, we monitor the data sources for out-of-order data, based on our own collection SDK, and check the consistency of the data sources against offline data. The computation process itself has three stages: the development stage, the launch stage and the service stage.

    • In the development stage, a standardized model may be provided; based on this model there are some benchmarks, and offline comparison verification is done to ensure consistent quality;
    • The launch stage focuses more on service monitoring and indicator monitoring;
    • In the service stage, if an abnormal situation occurs, we first recover the Flink job from its state; if some scenarios still do not meet expectations, we repair the data offline as a whole.
  • The second is the timeliness guarantee. For the data sources, we also monitor data source latency. In the development stage there are actually two things:

    • The first is stress testing. Regular tasks are tested against the peak traffic of the last 7 or 14 days to see whether any task delay appears;
    • After the stress test passes, there is a launch and restart performance evaluation, which checks what the restart performance looks like after recovery from a checkpoint.
  • The last one is the stability guarantee, which matters most in large-scale events, for example switchover drills and graded guarantees. We apply rate limiting based on the earlier stress test results, so that a job remains stable even when traffic exceeds the limit, with no instability or checkpoint failures. On top of that we have two different standards, cold-standby dual data centers and hot-standby dual data centers.

    • Cold-standby dual data centers: when one data center goes down, we pull the job up in the other data center;
    • Hot-standby dual data centers: the same logic is deployed once in each of the two data centers.

The above are our overall safeguard measures.

3. Problems and solutions in Kuaishou scenarios

1. PV/UV standardization

1.1 Scenario

The first problem is PV/UV standardization. Here are three screenshots:


The first picture shows the warm-up scene of the Spring Festival Gala campaign, which is one kind of gameplay. The second and third pictures are screenshots of the red envelope activity and a live room on the day of the Spring Festival Gala.

During the activity, we found that 60~70% of the requirements were about computing information on a page, such as:

  • How many people visited this page, or how many people clicked into this page;
  • How many people came to the activity;
  • How many clicks and exposures a certain widget on the page received.

1.2 Solution

This scenario can be abstracted into a simple SQL query.


Simply put, it filters records from a table on some conditions, aggregates them at a dimension level, and then computes some Count or Sum metrics.
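
The actual SQL in the talk is shown only as a slide image; the following is a minimal sketch of that abstract shape, with hypothetical table and column names (activity_event_log, event_type, did, page_id):

```sql
-- Filter, aggregate by dimension, then count / sum (all names are made up).
SELECT
  page_id,                          -- the dimension level being aggregated
  COUNT(DISTINCT did) AS page_uv,   -- de-duplicated devices that reached the page
  COUNT(1)            AS page_pv    -- raw clicks / exposures
FROM activity_event_log
WHERE event_type = 'page_view'      -- the filter condition
GROUP BY page_id;
```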

Based on this scenario, our initial solution is shown on the right side of the figure above.

We used the Early Fire mechanism of Flink SQL: data is read from the source and then bucketed by DID; for example, the purple part at the beginning is bucketed this way. The reason for bucketing first is to prevent hot spots on a particular DID. After bucketing comes a Local Window Agg, which pre-aggregates records of the same type within each bucket. After the Local Window Agg, a Global Window Agg merges the buckets by dimension, which means computing the final result per dimension. The Early Fire mechanism is equivalent to opening a day-level window in the Local Window Agg and emitting output once every minute.
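
As a rough reconstruction of this initial approach (not the exact job at Kuaishou), the sketch below buckets by a hash of the DID, counts distinct devices per bucket inside a day-level window whose partial results are emitted once a minute, and then merges the buckets per dimension. The early-fire keys are experimental, undocumented Flink options, and all table and column names are assumptions:

```sql
-- Experimental emit options: fire partial results of the day window every minute.
SET table.exec.emit.early-fire.enabled = true;
SET table.exec.emit.early-fire.delay = 60 s;

-- Local layer: bucket by a hash of DID so no single key becomes a hot spot,
-- and count distinct devices per (page, bucket) inside a day-level window.
CREATE VIEW local_agg AS
SELECT
  TUMBLE_START(row_time, INTERVAL '1' DAY) AS cur_day,
  page_id,
  MOD(HASH_CODE(did), 1024)                AS bucket_id,
  COUNT(DISTINCT did)                      AS bucket_uv
FROM activity_event_log
GROUP BY TUMBLE(row_time, INTERVAL '1' DAY), page_id, MOD(HASH_CODE(did), 1024);

-- Global layer: merge the buckets back per dimension.
SELECT cur_day, page_id, SUM(bucket_uv) AS page_uv
FROM local_agg
GROUP BY cur_day, page_id;
```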

During this process, we encountered some problems, as shown in the lower left corner of the above picture.

There is no problem when the job runs normally, but if the overall data is delayed, or historical data is being backfilled, the data volume during the backfill is relatively large even with Early Fire once a minute. This may lead to the following: when backfilling from 14:00, the job reads straight through to the 14:02 data, and the output point for 14:01 is lost. What happens after it is lost?


In this scenario, the upper curve in the figure is the result of backfilling historical data with Early Fire. The abscissa is minutes and the ordinate is the cumulative page UV up to the current moment. We found that some points are flat, meaning there is no data output, followed by a sharp jump, then flat again, then another sharp jump. The expected result is actually the smooth curve at the bottom of the figure.

To solve this problem, we adopted the Cumulate Window solution (sketched after the list below), which is also included in Flink 1.13; the principle is the same.


The data opens a large day-level window, and small minute-level windows are opened under the large window. Each record falls into a minute-level window according to its own Row Time.

  • When the watermark advances past the event time marking the end of a window, the window fires once. This solves the backfill problem: the data itself falls into its real window, and the output is triggered by the watermark after the window ends.
  • In addition, this approach solves the out-of-order problem to a certain extent: out-of-order data is not discarded, and the latest accumulated result is recorded.
  • The last point is semantic consistency: because it is based on event time, when the disorder is not severe, consistency with the offline results is quite high.
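
Assuming the same hypothetical activity_event_log table, with row_time declared as its event-time attribute, the Flink 1.13 cumulate window TVF version looks roughly like this:

```sql
-- A day-level cumulative window that emits an updated result every minute.
-- Rows are assigned by their own row time, and each one-minute step is fired
-- when the watermark passes its end.
SELECT
  window_start,
  window_end,
  page_id,
  COUNT(DISTINCT did) AS page_uv
FROM TABLE(
  CUMULATE(
    TABLE activity_event_log,
    DESCRIPTOR(row_time),
    INTERVAL '1' MINUTE,   -- step
    INTERVAL '1' DAY))     -- maximum window size
GROUP BY window_start, window_end, page_id;
```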

The above is a standardized solution for PV/UV.

2. DAU calculation

2.1 Background introduction

The following describes the DAU calculation:


We monitor the active devices, new devices and returning devices of the overall market extensively.

  • Active devices are devices that were active that day;
  • New devices are devices that were active that day and had never been active before;
  • Returning devices are devices that were active that day and had not been active in the previous N days.

However, in the calculation process we may need 5~8 different topics to compute these indicators.

Let's first look at how this logic is computed offline.

First, we count the active devices, merge them together, and do a day-level deduplication under a dimension; then we join a dimension table that records each device's first and last visit time, that is, the time of the device's first visit and of its last visit as of yesterday.

With this information, we can do the logical calculation, and we find that new and returning devices are actually sub-tags of active devices: a new device just needs one piece of extra logic, and a returning device needs the 30-day logic. Based on this, could we simply write one SQL statement to solve the problem? (A rough sketch of what such a statement might look like is shown below.)
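
As an illustration only, such a single offline statement might look like the Hive-style sketch below. All table and field names (topic_a_log, dim_device_first_last, first_visit_date, last_visit_date) are made up, and the 30-day boundary condition is only indicative:

```sql
-- Union the sources, de-duplicate per day, join the device first/last-visit
-- dimension table, then tag new and returning devices.
WITH active AS (
  SELECT DISTINCT dt, did
  FROM (
    SELECT dt, did FROM topic_a_log
    UNION ALL SELECT dt, did FROM topic_b_log
    UNION ALL SELECT dt, did FROM topic_c_log
  ) t
)
SELECT
  a.dt,
  COUNT(1) AS active_devices,
  -- new device: its first ever visit happens today
  COUNT(CASE WHEN d.first_visit_date = a.dt THEN 1 END) AS new_devices,
  -- returning device: seen before, but not within the last 30 days
  COUNT(CASE WHEN d.first_visit_date < a.dt
              AND d.last_visit_date < DATE_SUB(a.dt, 30) THEN 1 END) AS returning_devices
FROM active a
LEFT JOIN dim_device_first_last d ON a.did = d.did
GROUP BY a.dt;
```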

In fact, we did this at the beginning, but we encountered some problems:

  • The first problem: there are 6 to 8 data sources, and the caliber of our overall dashboard is often fine-tuned. With a single job, every fine-tuning requires a change to that job, and the stability of the single job becomes very poor;
  • The second problem: the data volume is in the trillions, which leads to two issues. First, a single job at this magnitude has very poor stability; second, for the KV storage used for real-time dimension table joins, no RPC service interface can guarantee stable service at the scale of trillions of records;
  • The third problem: we have relatively high latency requirements, less than one minute, so the whole link must avoid batch processing. If a single point has performance problems, we also need to ensure high performance and scalability.

2.2 Technical scheme

In response to the above problems, here is how we did it:


As shown in the example above, the first step is to deduplicate the three data sources A, B and C at the minute level by dimension and DID. After deduplication, we get three minute-level deduplicated data sources, which are then unioned together and put through the same downstream logic.

This brings the ingress of our data sources down from trillions to tens of billions. After the minute-level deduplication, a day-level deduplication is performed, which brings the data volume down from tens of billions to billions.

At the billions level, we can join dimension data as a service, which is a feasible state; it is equivalent to calling the user-profile RPC interface. After the RPC join, the result is finally written to the target topic. This target topic is imported into the OLAP engine to serve a number of different use cases, including the mobile-version service, the large-screen service, the indicator dashboard service, and so on. (A sketch of the deduplicate-then-union step follows.)
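
Below is a minimal sketch of the first two steps: minute-level deduplication per source using the Flink SQL deduplication pattern, followed by a UNION ALL of the thinned streams. The source tables and columns are placeholders, and in the real setup each source is a separate job rather than views in one script:

```sql
-- Minute-level de-duplication of one source: keep the first record per
-- (did, dimension, minute). source_b_dedup / source_c_dedup are built the same way.
CREATE VIEW source_a_dedup AS
SELECT did, dim, row_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY did, dim, DATE_FORMAT(row_time, 'yyyy-MM-dd HH:mm')
      ORDER BY row_time ASC) AS rn
  FROM source_a
)
WHERE rn = 1;

-- Union the thinned streams before the day-level de-duplication step.
CREATE VIEW active_device_union AS
SELECT * FROM source_a_dedup
UNION ALL SELECT * FROM source_b_dedup
UNION ALL SELECT * FROM source_c_dedup;
```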

This scheme has three advantages: stability, timeliness and accuracy.

  • The first is stability. Loose coupling can be understood simply as: when the logic of data source A and of data source B needs to be modified, each can be modified separately. Second, the tasks are scalable, because all the logic is split at a very fine granularity; when traffic problems appear in one place, they do not affect the parts after it, so scaling out is relatively simple. In addition, the service calls are pushed to the end of the link and the state is controllable.
  • The second is timeliness: we achieve millisecond-level latency with rich dimensions; overall there are 20+ dimensions for multi-dimensional aggregation.
  • Finally, accuracy: we support data verification, real-time monitoring, unified model export, and so on.

At this point we ran into another problem: out-of-order data. For the three different jobs above, each job restart causes a delay of at least two minutes, and that delay makes the unioned downstream data out of order.

2.3 Delay calculation scheme

How did we deal with the out-of-order situation described above?


We have three solutions in total:

  • The first solution is to deduplicate by "did + dimension + minute", with the Value set to "whether this device has been seen". For example, for the same did, if a record arrives at 04:01 it outputs a result; likewise records at 04:02 and 04:04 also output results. But if another 04:01 record arrives it is discarded, while a 04:00 record still outputs a result.

    This solution has a problem: because we keep state per minute, the state kept for 20 minutes is twice the size of the state kept for 10 minutes, so the state size becomes somewhat uncontrollable. We therefore moved to Solution 2.

  • The second solution rests on an assumed premise: that there is no disorder in the data source. In this case, the key is "did + dimension" and the value is a timestamp; its update method is shown in the figure above.

    A record arrives at 04:01 and the result is output. At 04:02 another record arrives; if it is the same did, the timestamp is updated and the result is still output. 04:04 follows the same logic, and the timestamp is updated to 04:04. If a 04:01 record arrives later and finds that the timestamp has already been updated to 04:04, it discards that record.

    This approach greatly reduces the state needed, but it has zero tolerance for disorder: no out-of-order data is allowed at all. Because we could not guarantee that, we came up with Solution 3.

  • Solution 3 builds on the timestamp of Solution 2 and adds something like a ring buffer, within which a certain amount of disorder is allowed.

    For example, when a record arrives at 04:01, the result is output; when a record arrives at 04:02, the timestamp is updated to 04:02 and it is also recorded that the same device appeared at 04:01. If another record arrives at 04:04, the buffer is shifted according to the corresponding time difference. This logic ensures that a certain amount of disorder can be tolerated.

Comparing these three solutions overall:

  • With Solution 1, under a tolerance of 16 minutes of disorder, the state size of a single job is about 480 GB. Although accuracy is guaranteed, job recovery and stability are completely uncontrollable, so we gave this plan up;
  • Solution 2 has a state size of about 30 GB and tolerates zero disorder, but the data is not accurate. Because our accuracy requirements are very high, we abandoned this solution as well;
  • Solution 3's state is somewhat larger than Solution 2's, but the increase is not much, and overall it achieves the same effect as Solution 1. Solution 3 tolerates 16 minutes of disorder, and 10 minutes is enough to restart a job during a normal update, so Solution 3 was finally selected.

3. Operational scenarios

3.1 Background introduction


Operational scenarios can be divided into four parts:

  • The first is support for large data screens, including analysis data for individual live rooms and for the overall market. Minute-level latency is required, and the update requirements are relatively high;
  • The second is support for live broadcast dashboards. Live dashboard data is analyzed on specific dimensions and serves specific groups of people, with relatively high requirements on dimensional richness;
  • The third is data strategy lists. These lists mainly predict popular works and hot items; they need hour-level data, and the update requirements are relatively low;
  • The fourth is the display of real-time indicators on the C side. The query volume is relatively large, but the query pattern is relatively fixed.

Let's analyze the different requirements generated by these four different scenarios.


The first three are basically similar, except that in query mode some are specific business scenarios and some are generic business scenarios.

The third and fourth types have relatively low update requirements and relatively high throughput requirements, and the curves in between do not need to be consistent. The fourth query mode consists mostly of single-entity queries, such as querying which indicators a piece of content has, and it requires relatively high QPS.

3.2 Technical solution

For the 4 different scenarios above, how did we do it?


  • First look at the basic detail layer (on the left of the figure). The data source has two links, one of which is the consumption stream, such as live consumption information and watch/like/comment events. After a round of basic cleaning, we do dimension management. The upstream dimension information comes from Kafka: Kafka writes some content dimensions, which we put into KV storage, together with some user dimensions.

    After these dimensions are joined, the result is finally written into the DWD fact layer in Kafka. Here, to improve performance, we added a second-level cache.

  • In the upper part of the figure, we read the DWD-layer data and do a basic aggregation. The core is window-based dimensional aggregation, which generates data at four different granularities: the market multi-dimensional summary topic, the live-room multi-dimensional summary topic, the author multi-dimensional summary topic, and the user multi-dimensional summary topic. These are data on common dimensions (a sketch of this kind of windowed aggregation follows this list).
  • In the lower part of the figure, based on these common-dimension data, we then process data for personalized dimensions, which is the ADS layer. On top of this data there is dimension expansion, including content expansion and operational dimension expansion, followed by further aggregation, for example the e-commerce real-time topic, the institutional service real-time topic, and the big-V live broadcast real-time topic.

    Splitting into two links like this has an advantage: one place handles the common dimensions and the other handles the personalized dimensions. The guarantee requirements for the common dimensions are higher, while the personalized dimensions carry a lot of personalized logic. If the two were coupled together, tasks would frequently have problems, it would be unclear which task is responsible for what, and such a stable layer could not be built.

  • In the picture on the right, we ended up using three different engines. Simply put, Redis queries are used for the C-side scenarios, and OLAP queries are used for the large-screen and business dashboard scenarios.
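
To make the windowed dimensional aggregation in the common summary layer concrete, here is a hedged Flink SQL sketch of one of the four topics, a per-minute live-room multi-dimensional summary; dwd_live_consume, dws_live_room_1min and their fields are assumptions, not the actual schemas:

```sql
-- One common-dimension summary topic: per-minute aggregation by live room.
INSERT INTO dws_live_room_1min
SELECT
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  live_room_id,
  author_id,
  COUNT(DISTINCT did)                                     AS watch_uv,
  SUM(CASE WHEN event_type = 'like'    THEN 1 ELSE 0 END) AS like_cnt,
  SUM(CASE WHEN event_type = 'comment' THEN 1 ELSE 0 END) AS comment_cnt
FROM dwd_live_consume
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE), live_room_id, author_id;
```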

4. Future plans

Three scenarios were covered above: the first is standardized PV/UV computation, the second is the overall DAU solution, and the third is how things are solved on the operations side. Based on this, our future plans are divided into four parts.


  • The first part is to improve the real-time guarantee system:

    • One is supporting large-scale activities, including the Spring Festival Gala and subsequent normalized activities; for how to guarantee these activities, we have a set of norms that we want to build into a platform;
    • The second is formulating graded guarantee standards, with a standardized description of which jobs are given which guarantee level/standard;
    • The third is promoting the engine platform's ability to solve problems, including the engines underlying Flink tasks; on top of this we will have a platform on which to promote normalization and standardization.
  • The second part is the construction of real-time data warehouse content:

    • On the one hand, the output of scenario-based solutions, for example generalized solutions for activities, instead of developing a new set of solutions for each activity;
    • On the other hand, consolidating the content data layering. For example, the current content data construction still lacks some scenarios in terms of depth, including how the content can better serve upstream scenarios.
  • The third part is the scenario-based construction of Flink SQL, including continued promotion of SQL, SQL task stability and SQL task resource utilization. When estimating resources, we consider, for example, what kind of QPS the scenario has, which SQL approach is used, and what it can support. Flink SQL can greatly reduce development effort, but in this process we also want to make business development easier.
  • The fourth part is the exploration of batch-stream unification. The real-time data warehouse scenario is essentially accelerating offline ETL computation. We have many hour-level tasks; for these tasks, moving some of the logic into stream processing would be a huge improvement to the SLA system of the offline data warehouse.
