Abstract: This article is compiled from the presentation delivered by Li Tianshuo, a technical expert on the Kuaishou real-time computing data team, at the real-time data warehouse session of Flink Forward Asia 2021. The main contents include:
- Business characteristics and pain points of real-time data warehouse assurance
- Kuaishou's real-time data warehouse assurance system architecture
- Real-time guarantee practice for Spring Festival activities
- Future planning
1. Business characteristics and pain points of real-time data warehouse assurance
- The biggest characteristic of Kuaishou's business is the sheer volume of data: daily ingress traffic reaches the trillion-record level. For an entry point of this scale, the model must be designed carefully to avoid redundant reads and wasted consumption. In addition, when reading and standardizing the data sources, performance has to be squeezed to the limit to keep the ingress traffic processed stably.
- The second characteristic is the diversity of demands. Kuaishou's business requirements cover large activity screens, 2B and 2C applications, internal core dashboards, and real-time search support. Different scenarios have different assurance requirements; without link classification, high-priority and low-priority applications get mixed together, which severely affects link stability. In addition, since the core of Kuaishou's business is content and creators, we need to build common dimensions and common models to avoid repeatedly building siloed "chimney" pipelines, and to support application scenarios quickly through those common models.
- The third characteristic is that activities are frequent and the activities themselves are demanding. The core demands fall into three areas: reflecting how well an activity pulls the company's key market indicators, analyzing real-time participation, and adjusting the gameplay strategy after the activity starts, for example by monitoring red-envelope cost in real time to quickly gauge the activity's effect. An activity usually involves hundreds of indicators but only 2-3 weeks of development time, which demands high stability.
- The last characteristic is Kuaishou's core scenarios. One is core real-time indicators provided to executives; the other is real-time data applications provided to the C side, such as Kuaishou Store and the Creator Center. These require extremely high data accuracy, and any problem must be detected and handled immediately.
The above factors together make the construction and assurance of Kuaishou's real-time data warehouse necessary.
In the initial stage of real-time data warehouse assurance, we drew on the assurance processes and specifications of the offline side and divided the work into three stages according to the life cycle: the R&D stage, the production stage, and the service stage.
- In the R&D stage, model design specifications, model development specifications, and release checklists were established.
- The production stage mainly builds the underlying monitoring capabilities, monitoring timeliness, stability, and accuracy, and carries out SLA optimization and governance improvements based on those capabilities.
- The service stage clarifies the service standards and assurance levels for upstream connections, as well as the value assessment of the overall service.
However, compared with offline, the learning cost of real-time is quite high. Even after completing the above construction, several problems remained at each stage:
- R&D stage: the learning curve of Flink SQL is steeper than that of Hive SQL, and it is easy to introduce hidden risks during development. In addition, in real-time computing it is uncertain whether data can be consumed quickly enough at activity traffic peaks. Finally, repeated consumption of the DWD layer is also a great challenge for real-time resources, so resource constraints must be considered when choosing data sources and dependencies.
- Production stage: state without a cleanup mechanism keeps growing and causes jobs to fail frequently (see the state-TTL sketch after this list). In addition, high-priority and low-priority jobs need to be isolated across data centers, so deployment has to be arranged before going online and adjusted afterwards; the cost is much higher than offline.
- Service stage: for a real-time task, the most unacceptable problem is that the job process fails and restarts, causing duplicated data or a dip in the output curve. Avoiding such problems requires a standardized solution, whereas offline jobs can usually guarantee data consistency after a restart.
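To make the state-cleanup point concrete, here is a minimal sketch of how a Flink keyed-state descriptor can be given a TTL so that stale state is cleaned up instead of growing until the job fails. The function, state name, and one-day TTL are illustrative assumptions, not Kuaishou's actual configuration:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/** Hypothetical per-key PV counter whose state expires one day after the last write. */
public class PvCountFunction extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> pvCount;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("pv-count", Long.class);
        // Expire entries one day after the last write so keyed state does not grow unbounded.
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        desc.enableTimeToLive(ttl);
        pvCount = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(String event, Collector<Long> out) throws Exception {
        long current = pvCount.value() == null ? 0L : pvCount.value();
        pvCount.update(current + 1);
        out.collect(current + 1);
    }
}
```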
Viewed abstractly, compared with offline data warehouses, real-time data warehouses are still harder to guarantee in several respects:
- High timeliness. Compared with offline execution windows, real-time operation and maintenance must intervene within minutes of a delay, which places high demands on timeliness.
- Complexity. This shows in two aspects: on the one hand, data cannot simply be re-imported and checked, so validating data logic is harder; on the other hand, most real-time jobs are stateful, and when a service problem occurs the state may not be fully preserved, leading to many bugs that cannot be reproduced.
- Large data traffic. Overall QPS is high, with ingress traffic at the hundred-million level.
- Randomness of problems. Problems in the real-time data warehouse occur at more random points in time, with no pattern to follow.
- Varying development skills. We need a unified development approach for common scenarios to prevent uncontrollable problems caused by divergent implementations.
2. Kuaishou's real-time data warehouse assurance system architecture
Based on the above difficulties, we designed two approaches to solve the problem:
- On the one hand, a forward guarantee approach based on the development life cycle, ensuring that every life-cycle stage has specifications and solution guidance, and standardizing the 80% of requirements that are routine.
- On the other hand, a reverse guarantee approach based on fault injection and scenario simulation, which verifies that the safeguard measures are truly in place and work as expected.
2.1 Forward guarantee
The overall idea of forward guarantee is as follows:
- In the development stage, we mainly do requirement research and standardize how the base layer and the application layer are developed, which covers 80% of the general requirements; the remaining 20% of personalized requirements go through solution review, and solutions that prove themselves there are distilled into new standards.
- In the testing stage, we mainly perform quality verification, comparison against the offline side, and stress-test-based resource estimation. In the self-test phase, overall accuracy is ensured by comparing offline and real-time results for consistency and by comparing server-side dashboards against real-time results.
- The go-live stage focuses on preparing contingency plans for important tasks: confirming actions before going live, deployment methods during go-live, and inspection mechanisms after go-live.
- In the service stage, monitoring and alarming are oriented toward the targets to ensure the service stays within the SLA.
- The last stage is the offline (decommission) stage, which mainly covers resource recovery and restoring the original deployment.
Kuaishou's real-time data warehouse is divided into three layers:
- First, the DWD layer. Its processing logic is relatively stable and rarely personalized, and it handles three different data formats: client logs, server logs, and Binlog data.
- The first operation is scene splitting. Since the real-time data warehouse has no table-partitioning logic, the purpose of scene splitting is to generate sub-topics and prevent repeated consumption of large topics.
- The second operation is field standardization, which includes normalizing dimension fields, filtering dirty data, and mapping IP addresses to longitude and latitude.
- The third is the dimension association in the processing logic. Association of common dimensions should be completed at the DWD layer as much as possible, to prevent heavy downstream dependence from putting excessive pressure on the dimension tables. The dimension tables are usually served from KV storage with a secondary cache in front (a sketch follows this list).
- Second, the DWS layer. There are two different processing modes: one is a DWS layer aggregated by dimension over minute-level windows, which provides an aggregation layer for reusable downstream scenarios; the other is single-entity-granularity DWS data, such as the original logs aggregated at the granularity of core users and devices, which greatly reduces the association pressure of the large DWD data volume and is more easily reused. DWS-layer data also needs dimension expansion: because the DWD layer's data volume is too large to cover every dimension-association scenario, associations whose QPS would be too high or which tolerate a certain delay are completed at the DWS layer.
- Third, the ADS layer. Its core job is to take DWD-layer and DWS-layer data, perform multi-dimensional aggregation, and output the final results.
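As an illustration of the DWD-layer dimension association described above (KV storage fronted by a local cache), the following is a minimal sketch using Flink's async I/O. `Event`, `KvClient`, and the endpoint are hypothetical placeholders rather than Kuaishou's real components:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/**
 * Hypothetical async dimension lookup for the DWD layer: a local cache sits in front of
 * the remote KV store so that hot keys do not hit the store on every event.
 */
public class DimensionLookup extends RichAsyncFunction<Event, Event> {

    private transient Map<String, String> localCache;   // simplified local cache; a real job would use an expiring cache
    private transient KvClient kvClient;                 // hypothetical async client returning CompletableFuture<String>

    @Override
    public void open(Configuration parameters) {
        localCache = new ConcurrentHashMap<>();
        kvClient = KvClient.connect("kv-cluster");       // placeholder endpoint
    }

    @Override
    public void asyncInvoke(Event event, ResultFuture<Event> resultFuture) {
        String cached = localCache.get(event.userId);
        if (cached != null) {
            resultFuture.complete(Collections.singleton(event.withDimension(cached)));
            return;
        }
        // Cache miss: query the remote KV store and populate the cache on success.
        kvClient.getAsync(event.userId).thenAccept(dim -> {
            localCache.put(event.userId, dim);
            resultFuture.complete(Collections.singleton(event.withDimension(dim)));
        });
    }
}

// Wiring (the capacity argument caps in-flight requests so the dimension store is not overloaded):
// AsyncDataStream.unorderedWait(dwdStream, new DimensionLookup(), 50, TimeUnit.MILLISECONDS, 100);
```

Capping the async capacity (the last argument of `unorderedWait`) is what keeps downstream traffic from overwhelming the dimension store, which is exactly the concern raised above.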
Based on the design above, it is easy to see that stream splitting, field cleaning and standardization, and dimension association for DWD and DWS apply the same logic to different formats. That basic logic can be packaged as a templated SDK, with the same SDK API methods reused for the same logic later on. This has two advantages: repeated logic no longer needs to be copied, and optimization experience and lessons learned are accumulated in the template.
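The templated SDK itself is internal to Kuaishou; the sketch below only illustrates the shape such a template might take, with hypothetical hook names:

```java
import org.apache.flink.streaming.api.datastream.DataStream;

/**
 * Hypothetical template illustrating the "templated SDK" idea: the skeleton
 * (split -> standardize -> dimension association) is fixed, and each log format
 * (client / server / Binlog) only implements the per-format hooks.
 */
public abstract class DwdPipelineTemplate<RAW, CLEAN> {

    /** Split the large source topic into scene-specific sub-streams. */
    protected abstract DataStream<RAW> splitByScene(DataStream<RAW> source);

    /** Field standardization: dimension normalization, dirty-data filtering, IP-to-geo mapping. */
    protected abstract DataStream<CLEAN> standardize(DataStream<RAW> split);

    /** Associate common dimensions (typically via KV storage plus a local cache). */
    protected abstract DataStream<CLEAN> associateDimensions(DataStream<CLEAN> cleaned);

    /** The fixed skeleton shared by all DWD jobs; optimizations land here once and are reused everywhere. */
    public final DataStream<CLEAN> build(DataStream<RAW> source) {
        return associateDimensions(standardize(splitByScene(source)));
    }
}
```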
For ADS-layer data, we developed many solutions driven by business requirements, such as how to compute multi-dimensional PV/UV, how to compute leaderboards, how to express indicator cards in SQL, and how to produce distribution-type scenarios that involve retraction.
SQL itself is fast and efficient to develop with and can dramatically shorten development time, but its execution efficiency has some disadvantages compared with the API. Therefore, for the high-traffic base layer and DWS layer we still develop with the API, while the application layers are developed with SQL.
In most Kuaishou activities, the most important business indicators are the cumulative curves of participant counts and money received along certain dimensions, and we want to produce, every minute, a curve that accumulates from midnight to the current moment. Developing such indicators covers 60% of the activity-side demand. So what are the difficulties in developing them?
Deduplicating data with a conventional tumbling window plus custom state has a drawback: data that arrives out of order beyond the window causes serious data loss and hurts accuracy. If you want more accurate data you must accept a larger delay, and if you want lower delay the data may be inaccurate. In addition, in abnormal situations the data may have to be backfilled from some point in time; in that backfill scenario, increasing throughput causes intermediate results to be lost because only the maximum timestamp is taken.
To solve this problem, Kuaishou developed a progressive-window solution with two parameters: the day-level window size and the minute-level output step. The overall computation has two parts. First, a day-level window is created: the data source is read and partitioned by key, data with the same key is routed to the same bucket, and the watermark is advanced by event time; whenever it crosses a step boundary, the window computation for that step is triggered.
As shown in the figure above, data with key=1 is assigned to the same task. When the task's watermark advances past the small window generated by the step, the bitmap and PV results are merged and emitted downstream into a global window organized by server time and again triggered by the watermark mechanism. In the global window, the per-bucket results are merged, accumulated and deduplicated, and the final result is output. In this way, out-of-order and late-arriving data is not discarded but recorded at the delayed time point, which better guarantees accuracy; the overall data difference dropped from 1% to 0.5%.
On the other hand, the computation is triggered as soon as the watermark crosses a step boundary, so the curve's delay can be kept within one minute, which better guarantees timeliness. Finally, because the step windows are emitted under watermark control, every point of the step window is output, and the resulting curve is as smooth as possible.
The figure above shows a concrete SQL case: the data is bucketed by deviceId and a cumulate window is built on top. The window has two parts: one is the day-level accumulation parameter, the other is the watermark-driven step parameter, and the outer layer aggregates the indicators produced by the different buckets.
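The original SQL from the figure is not reproduced here, but the open-source Flink SQL cumulate window TVF (available since Flink 1.13) expresses the same "day-sized window, minute-sized step" idea. The sketch below uses hypothetical table and column names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CumulativeUvSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical source table: one row per client event with an event-time attribute.
        tEnv.executeSql(
            "CREATE TABLE user_log (" +
            "  device_id  STRING," +
            "  amount     DOUBLE," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND" +
            ") WITH ('connector' = 'datagen')");

        // Cumulate window: a day-sized window that emits a result every minute,
        // i.e. the midnight-to-now curve described in the text.
        tEnv.executeSql(
            "SELECT window_start, window_end," +
            "       COUNT(DISTINCT device_id) AS uv," +
            "       SUM(amount)               AS total_amount " +
            "FROM TABLE(" +
            "  CUMULATE(TABLE user_log, DESCRIPTOR(event_time)," +
            "           INTERVAL '1' MINUTE, INTERVAL '1' DAY)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```

Each time the watermark crosses a one-minute step boundary, a new cumulative point from 00:00 up to that minute is emitted, which is exactly the curve shape described above.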
In the launch phase, the first thing is the timeline plan, which covers the timing, the operators, the plan content, operation records, and check items.
- Before the activity: deploy the tasks, make sure there are no computation hotspots, check whether parameters are reasonable, and observe job and cluster status;
- During the activity: check whether indicators are output normally, inspect task status, and handle faults and switch links when problems occur;
- After the activity: take the activity tasks offline, reclaim activity resources, restore the original link deployment, and conduct a review.
The link here goes from the Kafka data source through the ODS, DWD, and DWS layers; for C-end users the results are written to KV storage, for analysis scenarios they are written to ClickHouse, and finally data services are produced on top. We divide the tasks into four levels, P0 to P3.
- P0 tasks are the activity big screen and C-side applications. The SLA requires second-level latency and at most 0.5% error, but the overall assurance window is relatively short: an activity cycle is generally about 20 days, and New Year's Eve activities finish within 1 to 2 days. Our solution for latency is multi-data-center disaster recovery for both Kafka and the OLAP engine, and hot-standby dual-data-center deployment for Flink.
- For P1 tasks, we deploy Kafka and the OLAP engine in dual data centers. On the one hand, dual-data-center deployment provides disaster recovery and escape routes; on the other hand, the online data centers are better provisioned, so machine failures rarely cause job restarts.
- P2 and P3 tasks are deployed in the offline data centers. When resources run short, P3 tasks are stopped first to free up resources for other tasks.
Monitoring in the service phase is divided into four levels:
- First, SLA monitoring, which mainly monitors the quality, timeliness, and stability of the final output indicators.
- Second, link task monitoring, which mainly monitors task status, the data sources, the processing process, the output results, and the IO, CPU, and network information of the underlying tasks.
- Third, service monitoring, which mainly covers service availability and latency.
- Finally, underlying cluster monitoring, covering the CPU, IO, memory, and network information of the underlying clusters.
The accuracy targets consist of three parts: offline/real-time indicator consistency ensures the overall data processing logic is correct, OLAP-engine/application-interface consistency ensures the serving logic is correct, and indicator-logic error alarms ensure the business logic is correct (a minimal consistency-check sketch follows this list).
- Accuracy alarms are further divided into four aspects: accuracy, volatility, consistency, and completeness. Accuracy includes comparisons between the primary and standby links and whether dimensional drill-downs are correct; volatility measures how much continuous indicators fluctuate, to catch anomalies caused by large swings; consistency and completeness use enumeration and indicator measurement to ensure the output is consistent and has no gaps.
- Timeliness also has three targets: interface latency alarms, OLAP engine alarms, and interface-table Kafka latency alarms. Broken down to the link level, it can be analyzed from the input, processing, and output of the Flink tasks: on the input side the focus is latency and out-of-order data, to prevent data from being dropped; on the processing side the focus is data volume and processing performance indicators; on the output side the focus is the amount of output data and whether rate limiting is being triggered.
- There are two stability targets: one is the stability and batch/stream latency of the services and the OLAP engine, the other is the recovery speed of Flink jobs. Whether a Flink job can recover quickly after a failover is a great test of link stability. Stability mainly looks at job execution load, the state of the services the job depends on, overall cluster load, and the load of individual tasks. We alarm on the targets, monitor the sub-targets the targets decompose into, and thus build the overall monitoring and alarming system.
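As a minimal illustration of the offline/real-time consistency target, the check below compares a real-time indicator against its offline counterpart using the 0.5% threshold mentioned in the text; the class name and values are hypothetical:

```java
/** Hypothetical consistency check: flag when real-time and offline values diverge beyond the SLA. */
public final class ConsistencyCheck {

    private static final double MAX_RELATIVE_DIFF = 0.005;   // the 0.5% budget from the text

    public static boolean withinSla(double realtimeValue, double offlineValue) {
        if (offlineValue == 0) {
            return realtimeValue == 0;
        }
        double relativeDiff = Math.abs(realtimeValue - offlineValue) / Math.abs(offlineValue);
        return relativeDiff <= MAX_RELATIVE_DIFF;
    }

    public static void main(String[] args) {
        // Example: a real-time UV of 99.6M against an offline UV of 100M is within the 0.5% budget.
        System.out.println(withinSla(99_600_000d, 100_000_000d));   // prints true
    }
}
```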
2.2 Reverse guarantee
Normal development and testing of online activities can hardly simulate the real online environment and real traffic pressure. The key point of reverse guarantee is therefore to verify whether the activity can withstand the flood peak at the expected traffic, and how failures are handled.
The core idea is to simulate the real flood-peak scenario of an activity through stress-test drills. First, single-job stress tests determine the resource profile of each job and the layout of the cluster it runs on, and full-link stress tests ensure that cluster resources are used at a reasonable level and that peak consumption is stable, neither too high nor too low. Second, disaster-recovery construction proposes safeguards mainly for job failures, consumption delays, and data-center failures. Then, drills verify that these measures can actually be used and achieve the expected effect. Finally, link risks are reviewed and improved against the expectations and goals of the drills.
We built our own stress-test link: the upper one in the figure is the normal link, the lower one is the stress-test link. First, data from an online topic is read as the initial source of the stress-test link, and a rate-limiting algorithm controls the traffic. For example, with 4 tasks and a target of 10,000 QPS, the QPS generated by each task is limited to 2,500. While generating the data, a crowd package is used to rewrite the corresponding users and the generated timestamps, simulating the real number of users on the activity day.
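The rate-limiting step can be sketched as follows. This is not Kuaishou's stress-test tool, just a minimal illustration that assumes Guava's RateLimiter is on the classpath and divides a global target QPS evenly across parallel subtasks:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import com.google.common.util.concurrent.RateLimiter;   // assumes Guava on the classpath

/**
 * Hypothetical throttle for the stress-test source: a global target QPS is divided
 * evenly across parallel subtasks (e.g. 10,000 QPS over 4 subtasks = 2,500 each).
 */
public class ThrottleFunction<T> extends RichMapFunction<T, T> {

    private final double globalTargetQps;
    private transient RateLimiter rateLimiter;

    public ThrottleFunction(double globalTargetQps) {
        this.globalTargetQps = globalTargetQps;
    }

    @Override
    public void open(Configuration parameters) {
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
        rateLimiter = RateLimiter.create(globalTargetQps / parallelism);
    }

    @Override
    public T map(T record) {
        rateLimiter.acquire();   // blocks until this subtask's share of the QPS budget allows another record
        return record;
    }
}
```

With a parallelism of 4 and `new ThrottleFunction<>(10_000)`, each subtask is limited to 2,500 QPS, matching the example above.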
After reading the stress-test source topic and producing a new topic through the job, how do we judge whether the stress test really passed? There are three criteria:
- First, ensure that the job's input read latency stays at the millisecond level and that the job itself has no backpressure.
- Second, the CPU utilization does not exceed 60% of the overall resources, ensuring that the cluster has spare buffers.
- Third, the calculation results are consistent with the crowd package, proving that the logic is correct.
A single-job stress test yields a lot of information to guide follow-up work. For example, it proves that the job can hold the SLA at the expected traffic, it reveals the job's performance bottlenecks to guide optimization toward the corresponding standards and scenario benchmarks, and it informs the resource deployment of lower-priority jobs.
A single-job stress test alone still cannot tell how things behave when all jobs are started together, in terms of the overall CPU, IO, and memory pressure on the Flink data centers. So we start each job at its target stress-test load and observe the behavior of the jobs and the cluster as a whole.
So how do we judge whether the full-link stress test passed? There are also three criteria:
- First, job input read latency stays at the millisecond level with no backpressure.
- Second, the overall CPU utilization does not exceed 60%.
- Third, the calculation results are finally consistent with the crowd package.
After the full-link stress test, we can prove that the activity can hold the SLA at the expected traffic peak, verify job resource scheduling under that QPS, determine in advance the resources and deployment parameters each job needs, and obtain the maximum upstream traffic of each data source, which provides the basis for later rate-limiting protection.
Fault drills are done in two ways:
- One is single-job fault drills, including Kafka topic failures, Flink job failures, and Flink job checkpoint (CP) failures.
- The other is more systemic failures, such as link-level failures: how to keep output normal when a single data center fails, how to avoid an avalanche effect when activity traffic exceeds expectations, and how long recovery takes when a job lags by more than an hour.
Disaster-recovery construction has two parts: link fault disaster recovery and link capacity assurance.
The core of link fault recovery and disaster recovery is to address long recovery times and service stability when a single data center or a single job fails. Kafka itself supports dual-data-center disaster recovery: the generated traffic is written to Kafka in both data centers, and when one data center fails the traffic is automatically switched to the other, with the Flink jobs unaware of the switch. Conversely, after the failed data center recovers, the state of that Kafka data center is detected automatically and its traffic is added back.
The same disaster-recovery strategy applies to the OLAP engine. For Flink tasks, we deploy dual links in hot standby: the primary and standby links have the same logic, and when one data center fails, the application-side OLAP engine can switch directly to the other link, so the application side never notices the failure.
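A minimal sketch of what the application-side switch might look like, assuming a ClickHouse JDBC driver on the classpath; the endpoints and the simple try-then-fallback policy are illustrative assumptions, not the actual switching mechanism:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Hypothetical application-side switch between the primary and hot-standby OLAP links.
 * Endpoints are placeholders; the real switch is only described at a high level in the talk.
 */
public class OlapFailoverClient {

    private static final String PRIMARY = "jdbc:clickhouse://room-a:8123/rt_dw";
    private static final String STANDBY = "jdbc:clickhouse://room-b:8123/rt_dw";

    /** Try the primary link first and fall back to the standby link if the primary room is down. */
    public Connection connect() throws SQLException {
        try {
            return DriverManager.getConnection(PRIMARY);
        } catch (SQLException primaryDown) {
            return DriverManager.getConnection(STANDBY);
        }
    }
}
```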
Link capacity assurance solves two problems: if the activity traffic far exceeds expectations, how is stability guaranteed? And if a lag builds up, how long will it take to catch up with the consumption delay?
From the results of the earlier full-link stress test, we know the maximum inbound traffic of each task, and this value is used as the job's maximum rate limit. When the activity traffic exceeds expectations, the data source side triggers read rate limiting, and the Flink job runs at the maximum load measured in the stress test. The job's consumption is delayed at that point, but it protects the normal operation of the other jobs in the link. After the flood peak passes, the time the job needs to return to normal can be calculated from the lag and the ingress traffic. These are the core measures for link fault recovery and capacity assurance.
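The catch-up estimate mentioned above is simple arithmetic: divide the accumulated lag by the job's spare throughput (stress-tested maximum minus current ingress). A minimal sketch with hypothetical numbers:

```java
/**
 * Hypothetical back-of-the-envelope estimate of how long a lagging job needs to catch up,
 * given the consumer lag and the throughput numbers obtained from the stress test.
 */
public final class CatchUpEstimator {

    /**
     * @param lagRecords        records currently accumulated in the source (consumer lag)
     * @param maxConsumeQps     maximum consumption rate measured in the single-job stress test
     * @param currentIngressQps current ingress rate after the flood peak has passed
     * @return estimated catch-up time in seconds, or -1 if the job cannot catch up
     */
    public static long estimateCatchUpSeconds(long lagRecords,
                                              double maxConsumeQps,
                                              double currentIngressQps) {
        double spareQps = maxConsumeQps - currentIngressQps;
        if (spareQps <= 0) {
            return -1;   // ingress still exceeds the job's maximum throughput
        }
        return (long) Math.ceil(lagRecords / spareQps);
    }

    public static void main(String[] args) {
        // Example: 36M records of lag, 120k QPS max throughput, 100k QPS ingress -> 1800s (30 min).
        System.out.println(estimateCatchUpSeconds(36_000_000L, 120_000, 100_000));
    }
}
```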
3. Real-time guarantee practice for Spring Festival activities
The Spring Festival activities have the following requirements:
- High stability: with massive data, the link must remain stable overall, or recover quickly if a fault occurs.
- High timeliness: under hundred-million-level traffic, the big-screen indicator cards must have second-level latency and the curves a latency within 1 minute.
- High accuracy: with complex links, the difference between offline and real-time indicators must not exceed 0.5%.
- High flexibility: able to support multi-dimensional analysis scenarios during the activity.
The overall plan for the Spring Festival activities is divided into forward and reverse safeguards.
The basis of the forward safeguards is the monitoring and alarm system, which has two parts: on the one hand, SLA target alarms for timeliness, accuracy, and stability; on the other hand, a link-based monitoring system, including link monitoring, availability monitoring of the services the link depends on, and cluster resource monitoring.
On top of the monitoring system, the forward safeguards mainly standardize the development, testing, and launch stages. In the development stage, 80% of the requirements are handled through standardized templates, while the remaining 20% go through review to address risks. In the testing stage, logical accuracy is ensured through comparisons, and in the launch stage, staged deployment and task inspection are performed.
Reverse safeguards require two fundamental capabilities. The first is stress testing: single-job stress tests identify task performance bottlenecks to guide optimization, and full-link stress tests determine whether the jobs can withstand the flood peak and provide the data foundation for disaster recovery. Disaster recovery mainly relies on multi-data-center deployment, rate limiting, retries, and degradation to ensure there is a corresponding response to any fault.
Finally, fault drills on the one hand inject faults into each component and on the other hand simulate traffic peaks, ensuring that the stress-testing and disaster-recovery capabilities are truly in place.
In the launch stage, a timeline plan makes the steps before, during, and after the activity traceable. After the activity, the project is reviewed, and the problems found are fed back into both the forward and reverse capabilities of the assurance system.
The Spring Festival activities were a great success. In terms of timeliness, facing hundred-million-level traffic peaks, the core big-screen indicator cards achieved second-level latency and the curve-type indicators stayed within one minute; single tasks processed data at the trillion level and still kept second-level latency during traffic peaks. In terms of accuracy, the difference between offline and real-time results on core links stayed within 0.5%, and no data-quality problems occurred during the activities; effective use of Flink SQL progressive (cumulate) windows greatly reduced the accuracy loss caused by window data loss, cutting the data difference from 1% to 0.5%. In terms of stability, the core links were backed by dual-data-center disaster recovery and hot-standby dual-link deployment of the Flink clusters, with second-level switching when problems occurred; the accumulated stress-testing and disaster-recovery capabilities lay the foundation for future activity assurance systems.
4. Future planning
Based on our existing methodology and application scenarios, we have also laid out plans for the future.
- First, assurance capability building. Form standardized scripted plans for stress testing and fault injection, automate plan execution through platform capabilities, diagnose problems intelligently after stress tests, and accumulate past expert experience.
- Second, stream-batch unification. In past activity scenarios, batch and stream were two completely separate systems. We have practiced stream-batch unification in some scenarios and are pushing forward the overall platform construction, improving development efficiency through unified SQL and reducing operational pressure by using machines during off-peak hours.
- Third, real-time data warehouse construction. By enriching the content layers of the real-time data warehouse, accumulating development components, and moving toward SQL-based development, we aim to improve development efficiency and ultimately achieve cost reduction and efficiency gains.