
1 Introduction

As a PaaS provider, NetEase Yunxin needs to monitor its online businesses in real time and continuously sense the health of its services: their "heartbeat", "pulse", and "blood pressure". The heartbeat logs collected from the SDK, the server, and other sources form a huge and disorderly data set; how can we use this data effectively? The job of the service monitoring platform is to analyze this massive data in real time, aggregate the core indicators that characterize the service's "heartbeat", "pulse", and "blood pressure", and display them visually to the relevant engineers. The core capabilities are real-time analysis and real-time aggregation.

In the earlier article "NetEase Yunxin Service Monitoring Platform Practice", we introduced the overall framework of the NetEase Yunxin service monitoring platform around four links: data collection, data processing, monitoring alarms, and data applications. This article goes a step further and describes the data processing link in detail.

An aggregation indicator is produced by real-time aggregation over a detailed data set. Common implementations in the industry are Spark Streaming and Flink SQL / Stream API. Whichever is chosen, we must specify the data source, data cleaning logic, aggregation dimensions, aggregation window size, aggregation operators, and so on by writing code. Logic and code this intricate costs a great deal of manpower to develop, test, and maintain. What we programmers should do is simplify the complex.

This article will explain how NetEase Yunxin implements a common aggregation indicator calculation framework based on Flink's Stream API.

2 Overall architecture

[Figure: overall processing link of the self-developed Flink-based aggregation framework]

The figure above shows the complete processing link of our self-developed aggregation framework based on Flink. The modules involved are:

  • source : periodically loads the aggregation rules, creates Kafka consumers on demand according to those rules, and continuously consumes data.
  • process : includes the grouping logic, window logic, aggregation logic, chained-calculation logic, etc. As the figure shows, we split the aggregation stage in two. What is the purpose, and what is the benefit? Anyone who has done distributed or concurrent computing has met a common enemy: data skew. In our PaaS service the traffic of head customers dominates, so the skew is severe. The trick behind the two-stage aggregation is explained in detail below.
  • sink : the data output layer. By default it writes to Kafka and InfluxDB; the former drives subsequent computation (such as alarm notifications), and the latter serves data display and query services.
  • reporter : collects the running state of every link in the whole pipeline, such as input/output QPS, calculation time, consumption backlog, late data volume, etc.

The design and implementation ideas of these modules will be introduced in detail below.

3 source

Rule configuration

In order to facilitate the production and maintenance of aggregated indicators, we abstracted the key parameters involved in the indicator calculation process and provided a visual configuration page, as shown in the following figure. The usage of each parameter will be introduced below in combination with specific scenarios.

[Figure: visual configuration page for aggregation rules]

Rule loading

While the aggregation task is running, we periodically reload the configuration. If a newly added rule is detected, we create a kafka-consumer thread for it to receive the upstream real-time data stream. Likewise, for a rule that has expired, we close its consumer thread and clean up the related reporters.

Data consumption

Aggregation indicators that share a data source share one kafka-consumer. After records are pulled and parsed, collect() is called for each aggregation indicator to distribute the data. If an indicator's data filtering rule (a configuration item) is not empty, the data is filtered before distribution, and records that do not match the conditions are discarded directly.
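A minimal sketch of this fan-out step, with hypothetical Rule and dispatch names (the article does not show Yunxin's actual classes): one shared consumer hands each parsed record to every indicator on the same source, applying the indicator's filter first.

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch: distribute one parsed record to all aggregation rules sharing the
// data source, filtering per rule before collect().
public class RecordDispatcher {
    public interface Rule {
        Predicate<Map<String, String>> filter(); // null means "no filtering"
        void collect(Map<String, String> record);
    }

    public static void dispatch(Map<String, String> record, List<Rule> rules) {
        for (Rule rule : rules) {
            Predicate<Map<String, String>> f = rule.filter();
            if (f == null || f.test(record)) {
                rule.collect(record); // record matches: hand it to this indicator
            }
            // otherwise the record is silently discarded for this indicator
        }
    }
}
```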

4 process

Overall calculation process

The core code for aggregate computing based on Flink's Stream API is as follows:

SingleOutputStreamOperator<MetricContext> aggResult = src
        // extract the event-time timestamp and advance the watermark
        .assignTimestampsAndWatermarks(new MetricWatermark())
        // group by the configured aggregation dimensions
        .keyBy(new MetricKeyBy())
        // sliding or tumbling window, per the rule configuration
        .window(new MetricTimeWindow())
        // run the configured operators (sum/avg/tp95/...)
        .aggregate(new MetricAggFuction());
  • MetricWatermark(): extracts the event timestamp from the time field specified by configuration item ⑧ and drives the watermark of the computation flow forward.
  • MetricKeyBy(): specifies the aggregation dimensions, similar to GROUP BY in MySQL. It reads the values of the grouping fields (configuration item ⑥) from the data and concatenates them into the group key.
  • MetricTimeWindow(): the window size for the aggregation is specified in configuration item ⑧. If periodic output is configured, we create a sliding window; otherwise we create a tumbling window.
  • MetricAggFuction(): implements the operators specified by configuration item ②; the implementation of each operator is described in detail below.

Two-stage aggregation

For large-scale aggregation, data skew is a problem that has to be considered. Data skew means that the aggregation key specified by the grouping fields (configuration item ⑥) has hot spots. Our computing framework addressed this from the start of its design by splitting the aggregation into two stages:

  • Stage 1: randomly shard the data and pre-aggregate each shard.
  • Stage 2: take the stage-1 pre-aggregation results as input and perform the final aggregation.

The implementation: check whether the parallelism parameter (configuration item ⑦) is greater than 1. If it is, generate a random number in [0, parallelism) as a randomKey, and in the stage-1 keyBy() splice this randomKey onto the key built from the grouping fields (configuration item ⑥) to form the final aggregation key, thereby sharding the data randomly.
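The salting step can be sketched as follows. The key format ("#" separator) and method names are illustrative assumptions, not the framework's actual code:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: stage 1 appends a random shard id in [0, parallelism) to the
// business group key; stage 2 strips the suffix to merge the shards back.
public class SaltedKey {
    public static String stage1Key(String groupKey, int parallelism) {
        if (parallelism <= 1) return groupKey;   // no skew handling needed
        int salt = ThreadLocalRandom.current().nextInt(parallelism);
        return groupKey + "#" + salt;            // e.g. "appId=123#7"
    }

    public static String stage2Key(String saltedKey) {
        int idx = saltedKey.lastIndexOf('#');
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }
}
```

With this scheme a hot key is spread over `parallelism` stage-1 tasks, and stage 2 only has to merge `parallelism` partial results per group instead of the raw record stream.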

Aggregation operator

As a platform-based product, we provide the following common aggregation operators. Because of the two-stage aggregation logic, each operator uses a corresponding calculation strategy in stage 1 and stage 2.

| Operator | Stage-1 aggregation | Stage-2 aggregation |
| --- | --- | --- |
| min/max/sum/count | Pre-aggregate the input data directly and output the pre-aggregation result | Aggregate the stage-1 results again and output the final result |
| first/last | Compare input timestamps, keep the minimum/maximum timestamp and its value, and output the <timestamp, value> pair | Compare the <timestamp, value> pairs again and output the final first/last |
| avg | Compute the group's sum and record count, and output the <sum, cnt> pair | Sum the <sum, cnt> pairs separately, then output total sum / total cnt |
| median/tp90/tp95 | Record the distribution of the input data and output a NumericHistogram | Merge the input NumericHistograms and output the final median/tp90/tp95 |
| count-distinct | Output a RoaringArray recording bucket information and bitmaps | Merge the RoaringArrays and output the exact deduplicated count |
| count-distinct (approximate) | Output a HyperLoglog cardinality object | Merge the HyperLoglog objects and output the approximate deduplicated count |
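As an example of the paired strategies in the table, here is a minimal stand-alone sketch of the avg operator (method names are made up): stage 1 reduces each shard to a <sum, cnt> pair, stage 2 merges the pairs and divides.

```java
import java.util.List;

// Sketch of the two-stage avg strategy: only small <sum, cnt> pairs cross
// the shuffle boundary, never the raw values.
public class TwoStageAvg {
    // Stage 1: reduce one shard to a <sum, cnt> pair.
    public static double[] preAggregate(double[] shard) {
        double sum = 0;
        for (double v : shard) sum += v;
        return new double[] { sum, shard.length };
    }

    // Stage 2: merge the pairs, then output total sum / total cnt.
    public static double finalAggregate(List<double[]> pairs) {
        double sum = 0, cnt = 0;
        for (double[] p : pairs) { sum += p[0]; cnt += p[1]; }
        return sum / cnt;
    }
}
```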

For operators whose result depends on all of the data, such as count-distinct (deduplicated counting), the conventional approach is to exploit the deduplication property of a Set: put every value into a Set, and output the Set's size in the aggregate function's getResult(). If the amount of data is very large, the Set object becomes huge, and the time consumed by I/O operations on it becomes unacceptable.

For MapReduce-style big data computing frameworks, the performance bottleneck often lies in the I/O of large objects during the shuffle phase, because the data must be serialized, transmitted, and deserialized; Flink is no exception. median and tp95 are similar operators in this respect.

For this reason, these operators need special optimization. The idea is to minimize the size of the data objects used during the computation:

  • median/tp90/tp95: following the approximate algorithm of Hive's percentile_approx, the data distribution is recorded in a NumericHistogram (a histogram with non-uniform bins), and the corresponding tp value is obtained by interpolation (median is tp50).
  • count-distinct: the RoaringBitmap algorithm marks the input samples in compressed bitmaps, yielding an exact deduplicated count.
  • count-distinct (approximate): the HyperLoglog algorithm yields an approximate deduplicated count; it is suitable for deduplication counting over very large data sets.
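To illustrate why a mergeable bitmap beats a raw Set, here is a toy exact-count version using java.util.BitSet as a stand-in for the compressed RoaringBitmap structure described above (the real framework uses RoaringArray/RoaringBitmap; BitSet is uncompressed and only works for small integer ids):

```java
import java.util.BitSet;
import java.util.List;

// Sketch: each stage-1 task marks the ids it has seen; stage 2 ORs the
// bitmaps together, so the final cardinality is exact while the shuffled
// objects stay far smaller than a Set of the raw values.
public class BitmapDistinct {
    public static BitSet stage1(int[] ids) {
        BitSet seen = new BitSet();
        for (int id : ids) seen.set(id);   // duplicate ids flip the same bit
        return seen;
    }

    public static long stage2(List<BitSet> shards) {
        BitSet merged = new BitSet();
        for (BitSet s : shards) merged.or(s);  // union of all shards
        return merged.cardinality();           // exact distinct count
    }
}
```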

Post-processing

The post-processing module reprocesses the output of the stage-2 aggregation and has two main functions:

  • composite indicator calculation : combine the original statistical indicators into a new composite indicator. For example, to compute the login success rate, we first calculate the denominator (number of logins) and the numerator (number of successful logins), then divide the numerator by the denominator. Configuration item ③ configures the calculation rules of composite indicators.
  • relative indicator calculation : alarm rules often need to judge the relative change of an indicator (year-on-year / month-on-month). With Flink's state we can easily compute such relative indicators. Configuration item ④ configures the relative indicator rules.
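The two post-processing rules can be sketched as plain functions. The login-success-rate example follows the article; everything else (names, zero-denominator handling) is an assumption of this sketch:

```java
// Sketch of the two post-processing steps.
public class PostProcessor {
    // Composite indicator, e.g. login success rate = successCount / loginCount.
    public static double composite(double numerator, double denominator) {
        return denominator == 0 ? 0.0 : numerator / denominator;
    }

    // Relative indicator, e.g. period-over-period change against a value
    // kept in state: (current - previous) / previous.
    public static double relativeChange(double current, double previous) {
        return previous == 0 ? 0.0 : (current - previous) / previous;
    }
}
```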

Handling of abnormal data

The abnormal data mentioned here can be divided into two categories: late data and early data.

  • late data :

    • Severely late data (beyond the window's allowedLateness) is collected via sideOutputLateData and reported through the reporter, so it can be monitored visually on the monitoring page.
    • Slightly late data (within the window's allowedLateness) triggers a recalculation of the window. If every late record re-triggered the stage-1 window and the full recomputed result were fed into stage 2, some data would be counted repeatedly. To avoid this double counting, we use a special Trigger in the stage-1 aggregation: the window fires with FIRE_AND_PURGE (calculate and clean), so data that has already participated in a calculation is purged in time.
  • Early data : this data usually comes from reporting clients with inaccurate clocks. The timestamps of these records need manual intervention during calculation, to avoid corrupting the watermark of the whole computation flow.
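A toy model of the FIRE_AND_PURGE idea (not Flink's actual Trigger API): because the stage-1 window purges its state every time it fires, a late record re-fires only its own contribution, so a stage-2 consumer that simply sums incoming results never double-counts.

```java
// Toy stage-1 window: emit the sum accumulated since the last fire, then
// clear state, mimicking TriggerResult.FIRE_AND_PURGE semantics.
public class FireAndPurgeWindow {
    private double acc = 0;

    public void add(double v) { acc += v; }

    // FIRE_AND_PURGE: emit the accumulated value, then purge the state.
    public double fireAndPurge() {
        double out = acc;
        acc = 0;
        return out;
    }
}
```

With plain FIRE (no purge), the late re-fire would re-emit the on-time records as well, and the downstream sum would count them twice.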

5 sink

The metrics calculated by aggregation are output to Kafka and the time series database InfluxDB by default.

  • Kafka-sink: the indicator identifier (configuration item ①) is used as the Kafka topic when sending the aggregation results. Downstream consumers can process the stream further, for example to produce alarm events.
  • InfluxDB-sink: the indicator identifier (configuration item ①) is used as the table name in the time series database, and the aggregation results are persisted for API data queries and visual reports.
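As an illustration of the InfluxDB-sink, the sketch below formats one aggregation result as an InfluxDB line-protocol record, with the indicator identifier as the measurement name, the group fields as tags, and the operator results as fields. The details (sorted keys, the example names, the timestamp unit) are assumptions of this sketch, not confirmed by the article:

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Sketch: build one InfluxDB line-protocol record:
//   measurement,tag1=v1 field1=v1 timestamp
public class InfluxLine {
    public static String format(String measurement,
                                Map<String, String> tags,
                                Map<String, Double> fields,
                                long timestampNanos) {
        StringBuilder sb = new StringBuilder(measurement);
        // sort tag keys for a stable, canonical output
        for (Map.Entry<String, String> t : new TreeMap<>(tags).entrySet()) {
            sb.append(',').append(t.getKey()).append('=').append(t.getValue());
        }
        sb.append(' ');
        StringJoiner fj = new StringJoiner(",");
        for (Map.Entry<String, Double> f : new TreeMap<>(fields).entrySet()) {
            fj.add(f.getKey() + "=" + f.getValue());
        }
        return sb.append(fj).append(' ').append(timestampNanos).toString();
    }
}
```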

6 reporter

To monitor the running state of every data source and aggregation indicator in real time, we built full-link monitoring of the aggregation computation with InfluxDB + Grafana, covering input/output QPS of each link, calculation time, consumption backlog, late data volume, and so on.

[Figure: full-link monitoring dashboard (InfluxDB + Grafana)]

7 Conclusion

At present, this general aggregation framework carries 100+ indicator calculations of different dimensions for NetEase Yunxin, and the benefits are considerable:

  • Higher efficiency: aggregation indicators are produced through page configuration, shortening the development cycle from days to minutes. Even colleagues without data development experience can configure indicators themselves.
  • Simple maintenance, high resource utilization: 100+ indicators require only 1 flink-job to maintain, and resource consumption dropped from 300+ CUs to 40 CUs.
  • Transparent operation: with full-link monitoring, it is clear at a glance which computing link is a bottleneck and which data source has a problem.

about the author

Sheng Shaoyou, senior development engineer on the NetEase Yunxin data platform, works on data platform development and is responsible for the design and development of the service monitoring platform, data application platform, and quality service platform.

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.

