
Abstract: This article is compiled from a talk by Mu Chunjin, a big data technology expert at China Unicom, at the platform construction session of Flink Forward Asia 2021. The main contents include:

  1. Real-time computing platform background
  2. Real-time computing platform evolution and practice
  3. Flink-based cluster governance
  4. Future planning

FFA 2021 Live Replay & Presentation PDF Download

1. Background of real-time computing platform


First, some background on the real-time computing platform. Business systems in the telecommunications industry are very complex, so they involve a great many data sources. The real-time computing platform currently ingests more than 30 data sources, which is still a small share of all the data types available. Even so, the data volume has reached the trillion-record level, with a daily increment of 600 TB, and both the variety and the size of the data sources we ingest keep growing. Users of the platform come from companies in all 31 provinces across the country and from the various subsidiaries of China Unicom Group; during holidays in particular, large numbers of users subscribe to rules. To obtain data, a user must subscribe on the platform. We encapsulate the data sources into standardized scenarios; at present there are 26 standardized scenarios, supporting more than 5,000 rule subscriptions.


Subscription scenarios fall into three main categories:

  • 14 user behavior scenarios, such as location (how long a user stays after entering a certain area), terminal login (whether the user is attached to the 4G or 5G network), roaming, voice, product ordering, and so on;
  • 6 user usage scenarios, such as how much traffic a user consumes, the account balance, and whether the account is in arrears or suspended;
  • About 10 scenarios around users going online, such as business handling, recharge and payment, and new subscribers coming online.


The platform's real-time requirements are very high. There is a delay of about 5~20 seconds from when data is generated until it enters our system, and a further delay of about 3~10 seconds for normal processing. The maximum delay we allow is 5 minutes, so we must do a good job of monitoring end-to-end delays.

Data that matches a user-defined scenario must be delivered at least once; this is a strict requirement and no data may be missed, which Flink satisfies well. Data accuracy needs to reach 95%. While the platform delivers data in real time, a copy is also saved on HDFS. Every day we extract part of the subscription data and the offline data, generate results under the same rules, and compare data quality. If the difference is too large, we investigate the cause to ensure the quality of subsequent data.


This talk focuses on an in-depth practice of Flink in the telecommunications industry: how to use Flink to better support our needs.

A general-purpose platform cannot support our specialized scenarios. Take the location scenario as an example: we draw multiple electronic fences on the map, and when a user enters a fence, stays in it for a period of time, and matches the user characteristics in the subscription, such as gender, age, occupation, and income, the data is delivered.

Our platform can be viewed as a real-time scenario middle platform: it cleans and processes data in real time, encapsulates it into complex scenarios that support combinations of multiple conditions, intensively provides standardized real-time capabilities, and at the same time stays close to the business, offering business parties an easy-to-use, low-threshold access method. A business party only needs to call the standardized interface and pass gateway authentication and authorization to subscribe to a scenario. After a successful subscription, the Kafka connection information and the schema of the data are returned to the subscriber. The filter conditions of the subscription are matched against the data stream, and matching data is delivered via Kafka. That is the interaction flow between our real-time computing platform and downstream systems.
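As a rough illustration of the subscription response described above, the sketch below returns Kafka connection details and a data schema after a rule is registered. All scenario names, fields, topics, and addresses here are invented for illustration and are not the platform's actual API.

```python
import json
import uuid

# Hypothetical scenario catalog; real scenarios and schemas differ.
SCENARIOS = {
    "location_stay": {
        "schema": {"phone": "string", "fence_id": "string", "stay_seconds": "int"},
    },
}

def subscribe(scenario, filters):
    """Register a rule subscription and return Kafka connection info plus schema."""
    if scenario not in SCENARIOS:
        raise ValueError("unknown scenario: " + scenario)
    rule_id = uuid.uuid4().hex[:8]
    return {
        "rule_id": rule_id,
        "kafka": {
            "bootstrap.servers": "kafka-broker:9092",   # placeholder address
            "topic": f"sub_{scenario}_{rule_id}",
            "group.id": f"grp_{rule_id}",
        },
        "schema": SCENARIOS[scenario]["schema"],
        "filters": filters,
    }

resp = subscribe("location_stay", {"fence_id": "F001", "stay_seconds_min": 300})
print(json.dumps(resp["schema"]))
```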

2. Evolution and practice of real-time computing platform


Before 2020, our platform was implemented with Kafka + Spark Streaming on a third-party platform purchased from a vendor. We ran into many problems and bottlenecks, and it became difficult to meet daily needs. Many companies, including ours, are undergoing digital transformation, and the proportion of self-developed systems keeps rising; driven by demand as well, a self-developed, flexibly customizable, and controllable system became imperative. So in 2020 we began working with Flink and built the real-time computing platform on it. In the process we also came to appreciate the appeal of open source, and we will participate more in the community in the future.


The previous platform had many problems:

  • It was a third-party black box. We used a vendor's platform and depended too heavily on external systems; under high concurrency the load on those external systems became very high, and individual requirements could not be flexibly customized.
  • Kafka's load was particularly high. Each rule subscription corresponded to multiple topics, so as rules grew, the number of topics and partitions grew linearly, resulting in high latency.
  • Each subscription scenario corresponded to multiple real-time streams, and each stream occupied memory and CPU; too many scenarios drove up resource consumption and load.
  • The supported volume was small and the number of scenario subscriptions was limited. For example, user subscriptions surge during festivals, often requiring emergency firefighting, and the platform could not keep up with growing demand.
  • Monitoring granularity was insufficient: monitoring could not be flexibly customized, and end-to-end monitoring was impossible.


To address these problems, we fully self-developed a real-time computing platform based on Flink, with customized optimizations for the characteristics of each scenario to maximize resource efficiency. We use Flink state to reduce external dependencies, lower program complexity, and improve performance. Flexible customization optimizes resource usage, saving a great deal of resources at the same supported volume. To guarantee low system latency, we also built end-to-end monitoring, such as monitoring of data backlog, delay, and data transmission.

The architecture of the platform is relatively simple: it runs in Flink on YARN mode and depends only on HBase externally. Data is ingested from Kafka and delivered via Kafka.


The Flink cluster is built independently, with 550 dedicated servers, and is not co-located with offline computing, because it has high stability requirements and needs to process 1.5 trillion records per day, with a daily data increment of nearly 600 TB.


The main reason for our deep customization of scenarios is the large data volume: the same scenario has a very large number of subscriptions, and each subscription has different conditions. When a record is read from Kafka, it must be matched against multiple rules, and only on a match is it sent to the topic corresponding to the rule. So no matter how many subscriptions there are, data is read from Kafka only once, which reduces Kafka consumption.
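The read-once fan-out described above can be sketched as follows. The rule format (simple equality filters) and the rule/topic names are assumptions for illustration; the production rules support far richer condition combinations.

```python
# Each record is read from Kafka once and matched against every rule;
# matches are routed to the rule's own output topic.
RULES = {
    "rule_a": {"filters": {"network": "5G"}, "topic": "out_rule_a"},
    "rule_b": {"filters": {"network": "4G", "province": "Beijing"}, "topic": "out_rule_b"},
}

def match(record, filters):
    """A record matches when every filter field equals the record's value."""
    return all(record.get(k) == v for k, v in filters.items())

def route(record):
    """Return the output topics this record should be sent to."""
    return [r["topic"] for r in RULES.values() if match(record, r["filters"])]

print(route({"network": "5G", "province": "Beijing"}))   # -> ['out_rule_a']
```

However many rules exist, `route` consumes each record once; only the per-rule predicate evaluation grows with the subscription count, not the Kafka read load.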

A mobile phone attaches to a base station to make calls or access the Internet. Data from the same base station is compressed by a time window of a certain duration or a fixed message count, for example a three-second window, or a flush triggered when 1,000 messages accumulate, so the number of messages received downstream drops by an order of magnitude. Next comes fence matching: the pressure on external systems then scales with the number of base stations rather than the number of messages. We also make full use of Flink state: when a person enters and stays in a fence, that is stored in state, and the RocksDB state backend reduces external dependencies and simplifies the system. In addition, we achieve billion-scale tag association without depending on external systems. After data compression, fence matching, entry-and-stay detection, and tag association, we begin the formal rule matching.
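The count-or-time compression step can be sketched as below: per-base-station buffers flush either when the window duration elapses or when the count threshold is hit (1,000 in the text; 3 here so the demo is short). This is a minimal stand-in for the windowing Flink would do, not the production operator.

```python
import time

class CompressingBuffer:
    """Buffer messages per base station; flush on count or age trigger."""

    def __init__(self, max_count, max_age_seconds, clock=time.monotonic):
        self.max_count = max_count
        self.max_age = max_age_seconds
        self.clock = clock
        self.buffers = {}   # base_station_id -> (first_seen_time, [messages])

    def add(self, station, msg):
        """Buffer a message; return the flushed batch if a trigger fired, else None."""
        first_seen, msgs = self.buffers.setdefault(station, (self.clock(), []))
        msgs.append(msg)
        if len(msgs) >= self.max_count or self.clock() - first_seen >= self.max_age:
            del self.buffers[station]
            return msgs
        return None

buf = CompressingBuffer(max_count=3, max_age_seconds=3.0)
buf.add("bs1", "m1")
buf.add("bs1", "m2")
print(buf.add("bs1", "m3"))   # -> ['m1', 'm2', 'm3']
```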

After a user subscribes to a scenario, the subscribed rules are synchronized to the real-time computing platform via Flink CDC, which guarantees low latency. Since people's fence entry and stay are stored in state, the amount of data in the RocksDB state backend is relatively large. We analyze the state data to troubleshoot problems, such as whether a user is inside a fence.
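Conceptually, the CDC-based rule sync amounts to applying a stream of change events to the job's in-memory rule map. The event shape below is an assumption for illustration; Flink CDC delivers database changelog records, which the job would translate into updates like these.

```python
def apply_change(rules, event):
    """Apply one CDC-style change event to the in-memory rule map."""
    op, rule = event["op"], event["rule"]
    if op in ("insert", "update"):
        rules[rule["id"]] = rule        # new or changed subscription
    elif op == "delete":
        rules.pop(rule["id"], None)     # cancelled subscription
    return rules

rules = {}
apply_change(rules, {"op": "insert", "rule": {"id": "r1", "fence": "F001"}})
apply_change(rules, {"op": "update", "rule": {"id": "r1", "fence": "F002"}})
print(rules["r1"]["fence"])   # -> F002
apply_change(rules, {"op": "delete", "rule": {"id": "r1"}})
print(len(rules))             # -> 0
```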


We customized the hash algorithm to achieve billion-scale tag association without relying on external systems.

Under high concurrency, if every record had to be joined against an external system to obtain tag information, the pressure on that system would be very high, and at our data volume the cost of building and relying on such an external system would also be very high. These tags are offline tags and the data is relatively stable, changing for example daily or monthly. So we apply a custom hash to the user key: a mobile phone number is assigned by the hash algorithm to, say, the task_0 instance with index 0, and the mobile phone numbers in the tag file are assigned by the same hash algorithm to the tag file 0.tag.

The task_0 instance obtains its own index number in the open method, i.e. index = 0, splices together the tag file name 0.tag, and loads that file into its own memory. When task_0 receives a mobile phone number, it can fetch that number's tag data from local memory. This avoids memory redundancy and waste, improves system performance, and reduces external dependencies.
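The core idea, stripped of Flink specifics, is sketched below: one stable hash both routes a phone number to a task index and shards the offline tag records into the per-index file that task will load in its open method. The hash function and record layout are simplified stand-ins for the production logic.

```python
PARALLELISM = 4   # illustrative operator parallelism

def stable_hash(phone):
    # Deterministic hash (Python's built-in hash() is salted across runs).
    return sum(ord(c) * 31 ** i for i, c in enumerate(phone))

def task_index(phone):
    """Map a phone number to a task instance index."""
    return stable_hash(phone) % PARALLELISM

def shard_tag_records(tag_records):
    """Split offline tag records into per-task shards: index -> {phone: tags}.
    Shard i corresponds to the file 'i.tag' that task i loads in open()."""
    shards = {i: {} for i in range(PARALLELISM)}
    for phone, tags in tag_records:
        shards[task_index(phone)][phone] = tags
    return shards

shards = shard_tag_records([("13800000001", {"age": "30"}),
                            ("13800000002", {"age": "25"})])
phone = "13800000001"
# The task that receives `phone` holds exactly the shard containing it:
print(shards[task_index(phone)][phone])   # -> {'age': '30'}
```

Because routing and sharding use the same hash, each task's lookup is always a local in-memory hit, with no tag record duplicated across tasks.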

When tags are updated, the open method likewise loads the new tag file and refreshes it into memory.


The picture above shows our end-to-end latency monitoring. Because the business has high latency requirements, we mark event times on the messages, such as the times of entering and leaving Kafka. For operator delay monitoring, we compute the delay from the marked time and the current time. We do not compute the delay for every arriving message; instead we use sampling.
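A minimal sketch of that sampled delay computation, assuming each message carries its marked event time; the 1% sampling rate is an illustrative choice, not the platform's actual setting.

```python
import random
import time

class SampledLatency:
    """Compute delay (now - marked event time) for a sampled subset of messages."""

    def __init__(self, sample_rate=0.01, rng=random.random, clock=time.time):
        self.sample_rate = sample_rate
        self.rng = rng        # injectable for deterministic tests
        self.clock = clock
        self.samples = []

    def observe(self, marked_event_time):
        if self.rng() < self.sample_rate:   # only a fraction of messages is measured
            self.samples.append(self.clock() - marked_event_time)

# Force sampling on (rng always 0) with a fixed clock for a deterministic demo:
mon = SampledLatency(sample_rate=0.01, rng=lambda: 0.0, clock=lambda: 105.0)
mon.observe(marked_event_time=100.0)
print(mon.samples)   # -> [5.0]
```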

Backlog and disconnection are also monitored, judged by collecting Kafka offsets and comparing successive snapshots. In addition, there is monitoring of data delay: computing the delay from the event time and the current time lets us monitor the data delay of upstream systems.
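The before-and-after offset comparison can be sketched as follows: backlog is the gap between the log-end offset and the committed consumer offset, and a consumer whose offset has not moved while the log end advanced is flagged as possibly disconnected. The snapshot format is an assumption for illustration.

```python
def diagnose(prev, curr):
    """prev/curr: partition -> (log_end_offset, consumer_offset) snapshots."""
    report = {}
    for p, (log_end, consumed) in curr.items():
        backlog = log_end - consumed
        stalled = False
        if p in prev:
            prev_end, prev_consumed = prev[p]
            # No consumer progress while producers kept writing -> possible disconnect.
            stalled = consumed == prev_consumed and log_end > prev_end
        report[p] = {"backlog": backlog, "stalled": stalled}
    return report

prev = {0: (1000, 990), 1: (500, 500)}
curr = {0: (1200, 990), 1: (600, 598)}
print(diagnose(prev, curr))
# Partition 0 shows a growing backlog with no consumer progress.
```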


The image above shows the end-to-end latency monitoring and backpressure monitoring graphs. The end-to-end delay is normally between 2 and 6 seconds, which matches our expectations, since the location conditions are relatively complicated. We also monitor backpressure, locating the backpressure produced by each operator by monitoring the usage rate of the operator's input channels; once the specific operator is located, we investigate the cause to keep system latency low.


The picture above shows the offset of each topic partition in the Kafka cluster and the position each consumer has consumed to, which lets us locate disconnections and backlogs.

First, a custom source obtains the topic list and consumer group list and sends them downstream; the downstream operators then collect the offset values in a distributed manner, which also takes advantage of Flink's characteristics. Finally the results are written to ClickHouse for analysis.


The daily monitoring of Flink mainly includes the following categories:

  • Flink job monitoring and alerting, connected to Unicom's unified alerting platform, Tianyan;
  • Job running status and abnormal checkpoint durations;
  • Operator delay, backpressure, throughput, and record counts;
  • TaskManager CPU and memory usage, JVM GC, and other metrics.

3. Cluster governance based on Flink


We also built our cluster governance platform based on Flink. The background: our clusters total more than 10,000 nodes, with a maximum of 900 nodes in a single cluster and more than 40 clusters in all; the total single-replica data volume has reached 100 PB; 600,000 jobs run every day; and the number of files on a single cluster's NameNode has reached 150 million at peak.


With the rapid development of the company's business, data demands are becoming more complex, the computing power required keeps increasing, the cluster scale keeps growing, and the data products it carries keep multiplying, which brings big challenges:

  • The large number of files puts great pressure on the NameNode and affects the stability of the storage system.
  • There are many small files, so reading the same amount of data requires scanning more files, producing more NameNode RPCs.
  • There are many empty files, which likewise means more files to scan and more RPCs.
  • The average file size is small, which at the macro level also reflects a large number of small files.
  • Production keeps generating files, so the files output by jobs need tuning.
  • There is a lot of cold data and no cleanup mechanism, wasting storage resources.
  • Resource load is high, expansion is too costly, and expansion cannot be sustained for long.
  • Long job runtimes affect product delivery.
  • Jobs consume too many resources, occupying too many CPU cores and too much memory.
  • Jobs have data skew, resulting in very long execution times.


In response to these challenges, we built a Flink-based cluster governance architecture. By collecting resource queue information, parsing the NameNode metadata file Fsimage, and collecting compute engine job information, we build cluster-level HDFS portraits, job portraits, data lineage, redundant-computation portraits, RPC portraits, and resource portraits.
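The HDFS portrait side of this can be sketched as an aggregation over parsed Fsimage records: per directory, count files, small files, empty files, and cold data. The record fields and thresholds below are assumptions for illustration (the 10 MB small-file bucket appears later in the text).

```python
SMALL_FILE_BYTES = 10 * 1024 * 1024   # 10 MB small-file threshold

def storage_portrait(files, cold_cutoff_ts):
    """files: iterable of {"dir", "size", "atime"} parsed from Fsimage.
    Returns per-directory stats: file count, small/empty files, cold bytes."""
    stats = {}
    for f in files:
        s = stats.setdefault(f["dir"], {"files": 0, "small": 0,
                                        "empty": 0, "cold_bytes": 0})
        s["files"] += 1
        if f["size"] == 0:
            s["empty"] += 1
        elif f["size"] < SMALL_FILE_BYTES:
            s["small"] += 1
        if f["atime"] < cold_cutoff_ts:      # not accessed since the cutoff
            s["cold_bytes"] += f["size"]
    return stats

files = [
    {"dir": "/warehouse/db1/t1", "size": 0, "atime": 100},
    {"dir": "/warehouse/db1/t1", "size": 1024, "atime": 100},
    {"dir": "/warehouse/db1/t1", "size": 200 * 1024 * 1024, "atime": 9000},
]
print(storage_portrait(files, cold_cutoff_ts=5000))
```

In production this aggregation runs over hundreds of millions of Fsimage entries, which is why it is done as a distributed job rather than a single-process scan.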


Resource portrait: we collect the state of multiple resource queues across multiple clusters at the same time, such as IO and other metrics, at minute granularity, and can view resource usage trends for the whole cluster and for sub-queues in real time.

Storage portrait: we analyze multi-cluster distributed storage globally and across multiple dimensions in a non-invasive way, for example where files, small files, empty files, and cold data are distributed. We also build a refined portrait down to the partition directories of every table in every database.

Job portrait: we collect real-time jobs from different compute engines across multiple clusters and all product lines, and from dimensions such as time, queue, and job submission source gain insight into time consumption, resource consumption, data skew, high throughput, high RPC, and more, identifying problematic jobs and filtering out those that need optimization.

Data lineage: by analyzing SQL statements at the 100,000 level in the production environment, we draw non-invasive, global, and highly accurate data lineage. It also provides features such as table-level and account-level call frequency, table-level dependencies, process changes along production pipelines, the impact of processing failures, and insight into garbage tables over any period.
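As a toy sketch of table-level lineage extraction from SQL: find the target table of an INSERT and the source tables in FROM/JOIN clauses. Production lineage uses a real SQL parser; this regex version only illustrates the idea and would miss many real-world statements (subqueries, CTEs, quoting, etc.).

```python
import re

def lineage(sql):
    """Return (target_table, [source_tables]) for a simple INSERT ... SELECT."""
    target = re.search(r"insert\s+(?:into|overwrite\s+table)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return (target.group(1) if target else None, sources)

sql = """
INSERT OVERWRITE TABLE dw.user_daily
SELECT a.phone, b.tag
FROM ods.user_log a JOIN dim.user_tag b ON a.phone = b.phone
"""
print(lineage(sql))   # -> ('dw.user_daily', ['ods.user_log', 'dim.user_tag'])
```

Aggregating such (target, sources) edges over all production SQL yields the table-level dependency graph that powers the failure-impact and garbage-table insights mentioned above.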

In addition, we also built portraits for user operation auditing and metadata.


The picture above is the large screen for cluster governance storage. Besides macro indicators such as the total number of files, the number of empty files, and empty folders, it shows the number of cold directories, the amount of cold data, and the proportion of small files. We also analyze the cold data, for example when data was last accessed and the data volume in a given month, from which we can see the time distribution of cold data; and we show how files under 10 MB, 50 MB, and 100 MB are distributed across tenants. Beyond these indicators, we can also precisely locate which database, table, and partition the small files are in.


The figure above shows the effect of cluster governance. The time the resource load stays at 100% is significantly shortened, the number of files is reduced by more than 60%, and the RPC load is also greatly reduced. Governance saves tens of millions every year, resolving the long-standing resource shortage and reducing the number of machines that need to be added.

4. Future planning


At present we do not have a complete real-time stream management platform, and monitoring is relatively scattered, so developing a unified management and monitoring platform is imperative.

Facing growing demand: although deep customization saves resources and increases the supported scale, its development efficiency is not ideal. For scenarios with small data volumes, we are also considering using Flink SQL to build a general-purpose platform and improve R&D efficiency.

Finally, we will continue to explore the application of Flink in the data lake.



