Abstract: This article is compiled from the talk given by Xie Lei, a software development engineer at China Mobile, at the platform construction session of Flink Forward Asia 2021. It is divided into four parts:
- Construction of real-time computing platform
- China Mobile Signaling Service Optimization
- Stability Practice
- Exploring future directions
China Mobile (Suzhou) Software Technology Co., Ltd. is a wholly-owned subsidiary of China Mobile Communications Co., Ltd. The company is positioned as the builder of China Mobile's cloud infrastructure, a provider of cloud services, and a creator of the cloud ecosystem. It centers its operations on Mobile Cloud, and its products and services are widely used in telecommunications, government affairs, finance, transportation, and other fields.
1. Introduction to the real-time computing platform
The evolution of the real-time computing engine on Mobile Cloud went through several stages:
- From 2015 to 2016, we used Apache Storm as the first-generation real-time computing engine;
- In 2017, we began to investigate Apache Spark Streaming, which could be integrated with our self-developed framework, reducing operations pressure and maintenance costs;
- In 2018, user demand on the cloud kept growing, and Storm and Spark Streaming could no longer serve the business well. At the same time, we studied several well-known papers on stream computing and found that Apache Flink already implemented some of the semantics described in them;
- In 2019-2020, we began productizing it as a cloud service and launched the real-time computing platform on both public and private clouds;
- From 2020 to 2021, we began investigating real-time data warehouses and launched LakeHouse on Mobile Cloud.
At present, Flink is mainly used for processing China Mobile's signaling data, real-time user profiling and event tracking, real-time data warehouses, real-time operations monitoring, real-time recommendation, and Mobile Cloud's data pipeline service.
The functions of China Mobile's real-time computing platform are divided into three parts.
- The first part is service management, which supports task lifecycle hosting, Flink and SQL jobs, Spark Streaming jobs, and multiple engine versions;
- The second part is SQL support, which provides online Notebook authoring, SQL syntax checking, UDF management, and metadata management;
- The third part is task operations, which supports log retrieval for real-time tasks, real-time performance metric collection, message delay alarms, and task backpressure alarms.
This article mainly shares two core designs: multi-version engine support and real-time task log retrieval.
In daily task scenarios, we found that the cost of debugging user programs is relatively high and the cycle for users to try a new engine version is long. In addition, we cannot prevent users from hacking the engine, and some tasks fail without any exception information. For these reasons we introduced the multi-version engine design.
The multi-version submission process is as follows: a user's task is first submitted to the rtp service, which uploads the user program to HDFS for storage; when the task needs to be submitted, the program is pulled back from HDFS and submitted to the YARN cluster. These tasks had one thing in common: the job package bundled the Apache Flink core packages, which caused many problems.
Therefore, we first communicated with the business teams and asked them not to bundle the Flink core packages in the job package, but the benefit of this approach was relatively small. So we added a check on the platform side: when a user uploads a jar package, we proactively detect whether it contains the core packages. If an illegal core package is found in the job, the user is blocked from submitting it.
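As an illustration of this kind of check, below is a minimal sketch of how an uploaded jar could be scanned for bundled Flink core classes. The package prefixes treated as "core" and the class name are assumptions for illustration, not the platform's actual implementation.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class FlinkCoreJarChecker {

    // Prefixes treated as "core" here are illustrative; a real check would use the
    // platform's own blocklist of Flink runtime packages.
    private static final String[] CORE_PREFIXES = {
            "org/apache/flink/runtime/",
            "org/apache/flink/streaming/api/",
            "org/apache/flink/api/common/"
    };

    /** Returns true if the uploaded user jar bundles Flink core classes. */
    public static boolean containsFlinkCore(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                for (String prefix : CORE_PREFIXES) {
                    if (name.startsWith(prefix) && name.endsWith(".class")) {
                        return true; // block the submission and warn the user
                    }
                }
            }
        }
        return false;
    }
}
```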
This simple measure brought considerable benefits:
- First, it greatly reduces the cost of locating low-level bugs;
- Second, job upgrades and rollbacks are more convenient;
- Third, the stability and safety of the operation are improved.
In daily business scenarios, we need log retrieval to verify complex processing logic. In addition, the native TaskManager log UI is slow to open and easily gets stuck, and it does not support searching. As shown in the figure above, when the business logic is very complex, the Flink UI cannot provide these capabilities. Therefore, we designed a real-time task log retrieval feature.
The design of real-time task log retrieval needs to consider the following questions: how to collect the logs of job programs whose TaskManagers are spread across different machines? How to collect logs without intruding on the job? How to limit jobs that print large volumes of useless logs?
- For the first problem, we use a push model to reduce the pressure of log collection;
- For the second problem, borrowing the AOP mechanism from Spring, we use AspectJWeaver; the pointcut is log4j's input, i.e. the logging event, and the log is then handed to a Sender;
- For the third problem, we use a RateLimiter for rate limiting.
The figure above shows the overall design of real-time task log retrieval. We added an AOP layer beneath the native TaskManager: logs produced by the task in the TaskManager first pass through the AOP layer, and because it is implemented as an aspect, the user job is completely unaware of it. The logs then go to the RateLimiter, which performs rate limiting, and on to the Sender, which writes them to Kafka; for retrieval, the logs are finally indexed into Elasticsearch.
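The following is a minimal sketch of this idea, assuming log4j 1.x appenders, AspectJ load-time weaving (AspectJWeaver) of the appender classes, Guava's RateLimiter, and the Kafka Java producer. The topic name, broker address, and rate limit are illustrative, not the platform's actual values.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.log4j.spi.LoggingEvent;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

import java.util.Properties;

@Aspect
public class TaskLogAspect {

    // Illustrative value: forward at most 1000 log events per second per TaskManager.
    private static final RateLimiter RATE_LIMITER = RateLimiter.create(1000.0);
    private static final KafkaProducer<String, String> PRODUCER = buildProducer();

    private static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    // Pointcut on log4j 1.x appenders: every LoggingEvent passed to doAppend is intercepted.
    @Around("execution(* org.apache.log4j.AppenderSkeleton.doAppend(..)) && args(event)")
    public Object interceptLog(ProceedingJoinPoint pjp, LoggingEvent event) throws Throwable {
        // Forward to Kafka only while under the rate limit; excess logs are simply not shipped.
        if (RATE_LIMITER.tryAcquire()) {
            PRODUCER.send(new ProducerRecord<>("flink-task-logs",
                    event.getLoggerName(), event.getRenderedMessage()));
        }
        // Always invoke the original appender so local logging behavior is unchanged.
        return pjp.proceed();
    }
}
```

Because the interception happens at the appender level via weaving, the user job needs no code or dependency changes, which matches the "non-intrusive" goal described above.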
With real-time task log retrieval, business programs gain log search without any code changes, and developers can easily verify business logic. Thanks to the rate limiting, there is no log storage bottleneck, and the platform's management burden is also reduced.
2. China Mobile's signaling service optimization
China Mobile's signaling service was created to meet the needs of government departments at all levels, including tourism, emergency management, and transportation, for mobile user resource data; typical scenarios include traffic planning, traffic surveys, population flow monitoring in key areas such as tourist attractions, and monitoring and management of the floating population.
Relying on the high penetration rate of China Mobile's mobile phone users, and using mobile network location technology and GIS technology, the service analyzes and forecasts urban population, mobility, and other factors based on statistics over mobile user signaling data, providing decision-making support for government activities such as urban planning, transportation planning, management, resource allocation, migrant population administration, and policy formulation.
The business handles about 10 PB of data per day, roughly 20 trillion records, with a single record around 0.5 KB. The data includes 2/3/4/5G internet access records, location signaling, province and city, network type, interface type, and so on. Processing is also fairly complex, requiring data encryption, compression, and version unification. The figure above shows the conditions and business logic involved in processing the signaling data.
The simplified architecture is as follows: signaling data from all regions is uploaded through the reporting gateway, received by the Flume cluster, and then transmitted to the Hadoop cluster. As shown in the figure above, there is a firewall between Flume and Hadoop.
As the amount of data increased, we also encountered many problems:
- First, the Flume cluster kept raising "Flume channel full" alarms;
- Second, when the firewall connection limit was exceeded, alarms were triggered as well;
- Third, when Flume wrote to Kafka, the Kafka producer raised send timeout alarms;
- Fourth, downstream Spark Streaming processing of the signaling data was unstable.
The above problems can be summarized into two categories:
- The first category is write performance issues: Kafka writes frequently timed out, production performance hit a bottleneck, and Flume could not reach the NIC's maximum speed when sending data;
- The second category is architecture design issues: the architecture involves many components, which drives up maintenance costs; component responsibilities are unclear, for example data cleaning logic lives inside Flume; and the Spark processing logic is complex, with multiple shuffles, so processing performance is unstable.
The first thing to solve was the timeout when producing to Kafka. To address it, we made the following optimizations:
- Optimized firewall ports;
- Tuned some performance parameters on the Kafka server side.
But this did not completely solve the timeouts when Flume wrote to Kafka, so we turned to the client side. The first question was how to tune the client parameters, especially batch.size, buffer.memory, and request.timeout.ms. The second was how to reach the maximum speed of a single machine's NIC, that is, how many concurrent clients to run on a single machine.
Through testing, we found that performance was best when batch.size was 256 MB and buffer.memory was 128 MB, but even then the NIC's maximum speed was not reached.
So we ran a second round of tests and added compression.type, expecting to increase the effective send bandwidth by compressing the data, but the results did not meet our expectations.
This is due to a problem in the lower Kafka version we used: in its performance test script, every generated value is identical, so the compression ratio appears very high; in a real production environment every record is different, so the actual compression ratio is very small.
The other question was how to reach the NIC's maximum speed. The simplest way is to increase parallelism, but higher parallelism is not always better. In practice we found that with a concurrency of 4 the NIC's maximum speed can be reached; beyond 4, the average send latency increases significantly, which again causes Kafka write timeouts.
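As a hedged sketch of the client-side tuning described above, the factory below exposes the three knobs (batch.size, buffer.memory, request.timeout.ms) as parameters so the optimum found in your own load test can be plugged in; the serializer choice and broker address are assumptions. To approach the NIC limit, several such producers would be run in parallel per machine (a concurrency of 4 in the tests described above).

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class TunedProducerFactory {

    /** Builds a producer with the client-side knobs discussed above exposed as parameters. */
    public static KafkaProducer<byte[], byte[]> build(String bootstrapServers,
                                                      int batchSizeBytes,
                                                      long bufferMemoryBytes,
                                                      int requestTimeoutMs) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // The three parameters tuned on the client side in the text above.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, batchSizeBytes);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, bufferMemoryBytes);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, requestTimeoutMs);

        // compression.type was also tested but brought little gain on real, non-repetitive
        // signaling data (the perf-test script's identical payloads inflate the ratio).
        // props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        return new KafkaProducer<>(props);
    }
}
```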
The second point is the problem of Flume channel full.
Flume's transaction API is relatively low-level and has to be handled manually. In addition, Flume's transactions copy data: as shown in the figure above, when data is sent from the source to the channel, a copy is first made in memory, and when it is sent from the channel to the sink, it is copied from the channel into memory again. These two copies waste resources. Flink's transactions, by contrast, rely on state management, so its processing performance is stable; Flink also offers a wealth of sources and sinks and strong scalability.
Therefore, we decided to replace Flume with Flink. After the replacement, collection performance improved, the performance bottleneck of massive data transmission was resolved, and stability improved significantly. At the same time, component responsibilities were clarified: we moved all the logic that used to live in the collection service to the downstream real-time processing, so the acquisition layer focuses on data aggregation and the processing layer focuses on data sorting. In addition, we unified the technology stack and adopted the Flink framework end to end, which brought higher performance and reduced development and operations costs.
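Below is a minimal sketch of the kind of thin Flink collection job that could stand in for Flume in the acquisition layer, using the Flink 1.14 Kafka connectors; the topic names, broker addresses, and checkpoint interval are placeholders, not the production configuration.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SignalingCollectJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // transactions backed by checkpointed state

        // Source: raw signaling records relayed from the reporting gateway (placeholder names).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("gateway-kafka:9092")
                .setTopics("signaling-raw")
                .setGroupId("signaling-collector")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Sink: hand records to the downstream processing layer (placeholder names).
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("processing-kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("signaling-aggregated")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        // The acquisition layer stays thin: no cleaning logic here, it lives downstream.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "signaling-source")
                .sinkTo(sink);

        env.execute("signaling-collect");
    }
}
```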
The result is a 1/3 improvement in overall performance and reduced maintenance costs.
3. Stability Practice
Job stability mainly concerns service failures and their remedies. Typical failures include job failure, job consumption delay, job OOM, and abnormal job restarts; the corresponding measures are physical isolation of jobs, service degradation, strengthened resource monitoring, and service splitting.
What platform maintainers worry about most are problems that affect the platform as a whole. For example, if one server in the ZooKeeper cluster suffers a network interruption, it can cause a large batch of tasks to restart, because the Flink JobManager relies on ZooKeeper for leader election and for maintaining the checkpoint ID counter.
So we analyzed ZooKeeper's connection state transitions. When a client connects to the ZooKeeper cluster, its state is initially Connected. After a network interruption it becomes Suspended; the Suspended state can then turn into Lost, and later Reconnected. When using ZooKeeper, Flink depends on the Curator 2.x component, which has a defect: when it encounters the Suspended state, it immediately revokes the leader, which causes most jobs to restart. That was unacceptable for our business.
The community did not fix this issue until Flink 1.14. On earlier versions, LeaderLatch needs to be rewritten; and if Flink 1.8 is used, ZooKeeperCheckpointIDCounter also needs to be modified.
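To illustrate the intended behavior change, here is a minimal sketch of a Curator ConnectionStateListener that tolerates a transient Suspended state and only reacts to Lost; the revocation callback is hypothetical, and this is not the actual LeaderLatch rewrite used in production.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Only a LOST session revokes leadership; a brief SUSPENDED blip to one ZooKeeper
// server is tolerated, which is the behavior the LeaderLatch rewrite aims for.
public class TolerantConnectionListener implements ConnectionStateListener {

    private final Runnable revokeLeadership; // hypothetical callback into the job's HA logic

    public TolerantConnectionListener(Runnable revokeLeadership) {
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Connection temporarily lost; keep leadership and wait for RECONNECTED.
                break;
            case LOST:
                // Session expired on the ZooKeeper side; leadership is genuinely gone.
                revokeLeadership.run();
                break;
            case RECONNECTED:
                // Back to normal; nothing to revoke.
                break;
            default:
                break;
        }
    }
}
```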
4. Exploration of future directions
In the future, we will mainly continue to explore in these two directions:
- First, resource utilization. This includes research on elastic scaling and on the YuniKorn resource queue on K8s. We found that after Flink moves to the cloud there is a resource-queue problem, so users' resources need to be managed in queues;
- Second, the data lake direction. The first item is a unified stream-batch service gateway: a real-time data warehouse may use different engines, such as Flink and Spark, which belong to two different sets of services, so a unified stream-batch service gateway is needed. The second item is offering data lineage, data assets, and data quality as services.