
This article is compiled from the talk "Practice and Optimization of Apache Flink in JD.com" shared by Fu Haitao, a senior technical expert at JD.com, at Flink Forward Asia 2020. The content includes:

  • Business evolution and scale
  • Containerized practice
  • Flink optimization and improvement
  • Future planning

1. Business evolution and scale

1. Business Evolution

JD.com built its first-generation streaming platform on Storm in 2014, which served the business requirements for real-time data processing fairly well. However, it had limitations: for business scenarios with very large data volumes that were not particularly sensitive to latency, it fell short. So in 2017 we introduced Spark Streaming and used its micro-batch processing to handle that class of scenarios.

As the business grew in scope and scale, we urgently needed a computing engine that offered both low latency and high throughput and supported window computation, state, and exactly-once semantics.

  • So in 2018 we introduced Flink and, at the same time, began containerizing real-time computing on K8s;
  • By 2019, all of our real-time computing tasks were running on K8s. In the same year, we built a new SQL platform based on Flink 1.8 to make it easier for business teams to develop real-time computing applications;
  • By 2020, the new real-time computing platform based on Flink and K8s was fairly complete. We unified the computing engine and added support for intelligent diagnosis, reducing the cost and difficulty of developing, operating, and maintaining applications. Stream processing had always been our focus, but in the same year we also began to support batch processing, so the whole real-time computing platform started to evolve towards unified batch and stream processing.

img

2. Business scenario

JD Flink serves many business lines within JD. The main application scenarios include real-time data warehousing, real-time dashboards, real-time recommendation, real-time reporting, real-time risk control, and real-time monitoring, among others. In short, real-time computing requirements across the business are generally developed with Flink.

img

3. Business scale

At present, our K8s cluster consists of more than 5,000 machines and serves more than 20 first-level departments within JD.com. There are more than 3,000 streaming computing tasks online, and peak streaming throughput reaches 500 million records per second.

img

2. Containerized practice

Let's share the practice of containerization.

In 2017, most of JD's internal tasks were still Storm jobs running on physical machines, with a small portion of Spark Streaming jobs running on YARN. The different runtime environments led to extremely high deployment and operations costs and some waste of resources, so we urgently needed a unified cluster resource management and scheduling system to solve this problem.

After a series of trials, comparisons, and optimizations, we chose K8s. It not only solves the problems of deployment, operations, and resource utilization, but also offers cloud-native elasticity and self-healing, natural container-level isolation, and easier scaling and migration. So at the beginning of 2018, we started the containerization upgrade.

During the 618 promotion in 2018, only 20% of our tasks ran on K8s; by February 2019, all real-time computing tasks were running on K8s. The containerized real-time computing platform has since gone through several 618 and Double 11 promotions, withstanding the peak load and running very stably.

However, our previous Flink containerization solution was based on static, pre-allocated resources and could not meet many business scenarios, so in 2020 we also upgraded the containerization solution, which is described in detail later.

img

Containerization brings many benefits; here we emphasize three points:

  • First, it makes mixed deployment of services easy, which greatly improves resource sharing and saves machine resources.
  • Second, it provides natural elastic scaling and a degree of self-healing, achieves more complete resource isolation, and better guarantees the stability of the business.
  • Third, containerization gives us consistent development, testing, and production environments, improves deployment and automated operations capabilities, and cuts management and operations costs in half.

Our previous containerization solution was based on a Standalone Session cluster deployed with K8s Deployments. It required the user to estimate the resources the cluster needed when creating it on the platform, such as the resource specifications and the number of JobManagers and TaskManagers; the platform then sent requests to the K8s master through the K8s client to create the JobManager Deployment and the TaskManager Deployment.
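As a rough illustration of this static pre-allocation, the sketch below uses the Fabric8 Kubernetes client to create the JobManager and TaskManager Deployments with the sizes a user estimated in advance. The client library, image name, namespace, and resource figures are assumptions made for the example, not JD's actual platform code.

```java
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class SessionClusterProvisioner {

    // Hypothetical helper: builds a Deployment for one Flink role
    // ("jobmanager" or "taskmanager") with a user-estimated replica count and size.
    static Deployment flinkDeployment(String role, int replicas, String cpu, String memory) {
        return new DeploymentBuilder()
            .withNewMetadata().withName("flink-" + role).endMetadata()
            .withNewSpec()
                .withReplicas(replicas)
                .withNewSelector().addToMatchLabels("app", "flink-" + role).endSelector()
                .withNewTemplate()
                    .withNewMetadata().addToLabels("app", "flink-" + role).endMetadata()
                    .withNewSpec()
                        .addNewContainer()
                            .withName(role)
                            .withImage("flink:1.8")   // image name is an assumption
                            .withArgs(role)           // standalone entrypoint argument
                            .withNewResources()
                                .addToRequests("cpu", new Quantity(cpu))
                                .addToRequests("memory", new Quantity(memory))
                            .endResources()
                        .endContainer()
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build();
    }

    public static void main(String[] args) {
        // Static pre-allocation: both Deployments are created up front with the
        // specifications the user estimated when creating the cluster.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.apps().deployments().inNamespace("flink-clusters")
                  .resource(flinkDeployment("jobmanager", 1, "1", "2Gi")).create();
            client.apps().deployments().inNamespace("flink-clusters")
                  .resource(flinkDeployment("taskmanager", 10, "2", "4Gi")).create();
        }
    }
}
```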

In this setup, high availability of the cluster is implemented with ZK; state is stored mainly in HDFS, with a small part in OSS; monitoring indicators (container metrics, JVM metrics, task metrics) are reported to Prometheus and displayed intuitively with Grafana; logs are collected, stored, and queried with JD's internal Logbook system.

In practice, we found that this solution has two shortcomings:

  • First, resources have to be allocated in advance, which cannot satisfy flexible and changing business needs or allocate on demand.
  • Second, in extreme scenarios a Pod cannot be pulled up normally, which affects task recovery.

img

So we upgraded the containerization solution and implemented dynamic resource allocation on K8s. When the cluster is created, we first create the JobManager Deployment according to the number of JobManagers specified by the user; when the user submits a task, we dynamically apply to the platform for resources and create TaskManagers according to the resources the task requires.

While the task is running, if it needs to scale out, the JobManager interacts with the platform to expand dynamically; when resources are being wasted, it scales in. In this way, the problems caused by static pre-allocation are solved and resource utilization improves.
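A minimal sketch of the dynamic side, again assuming the Fabric8 client and one TaskManager Deployment per cluster (the real platform may manage resources differently): the required TaskManager count is derived from the job's slot demand and the Deployment is resized up or down.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class TaskManagerScaler {

    // Hypothetical platform-side call: given how many slots the submitted job
    // needs, derive the TaskManager count and resize the TaskManager Deployment.
    public static void ensureTaskManagers(String namespace, String tmDeployment,
                                          int requiredSlots, int slotsPerTaskManager) {
        int requiredTaskManagers =
            (requiredSlots + slotsPerTaskManager - 1) / slotsPerTaskManager; // ceiling
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Scale out when a job is submitted or expanded, scale in when resources sit idle.
            client.apps().deployments()
                  .inNamespace(namespace)
                  .withName(tmDeployment)
                  .scale(requiredTaskManagers);
        }
    }
}
```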

Here it is the platform that interacts with K8s to create and destroy resources, mainly for four reasons:

  • It keeps resource usage under the supervision of the computing platform.
  • It prevents changes to the platform's cluster configuration and logic from affecting the container images.
  • It shields the differences between container platforms.
  • It reuses the platform's existing K8s interaction code.

In addition, to stay compatible with the original slot allocation strategy (spreading tasks across slots), when a task is submitted we estimate the resources it needs and apply for them all at once, then wait according to a certain policy. Slot allocation is performed only when there are enough resources for the task to run, which keeps us largely compatible with the original spread-out slot allocation strategy, as sketched below.
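To illustrate the spread-by-slot idea in isolation, here is a simplified sketch (not Flink's actual scheduler code): once enough TaskManagers are available to cover the estimated slot demand, slots are handed out round-robin so the load spreads evenly rather than filling one TaskManager at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class SpreadSlotAssigner {

    // Assigns the i-th requested slot to TaskManager i % n, i.e. round-robin,
    // assuming enough TaskManagers have already registered to cover the demand.
    public static List<String> assign(int requiredSlots, List<String> taskManagers) {
        List<String> assignment = new ArrayList<>();
        for (int i = 0; i < requiredSlots; i++) {
            assignment.add(taskManagers.get(i % taskManagers.size()));
        }
        return assignment;
    }
}
```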

img

3. Flink optimization and improvement

The following describes our optimizations and improvements to Flink.

1. Preview the topology

While the platform was being used by the business, we found several pain points:

  • First, task tuning is cumbersome. After a task has been submitted and is running on the platform, adjusting its parallelism, slot sharing groups, chaining strategy, and so on requires modifying the program again or tuning through command-line parameters, which is very tedious.
  • Second, SQL tasks cannot flexibly specify per-operator configuration.
  • Third, after a task is submitted to the cluster, it is unclear in advance how many resources and how many slots it needs.
  • Fourth, network buffers become insufficient after the parallelism is adjusted.

To solve these problems, we developed the preview topology feature:

  • First, topology configuration. After the user submits a task to the platform, we preview its topology and allow the user to flexibly configure the parallelism of its operators.
  • Second, slot group preview. We clearly show the task's slot sharing groups and how many slots are needed.
  • Third, network buffer estimation. All of this makes it as convenient as possible for users to tune their jobs on the platform.

img

The workflow of the preview topology is briefly as follows. The user submits a SQL or Jar job on the platform. After the job is submitted, its operator configuration information is generated and fed back to the platform, which previews the whole topology; the user can then adjust the operator configuration online and resubmit the adjusted configuration to the platform. This process can be repeated until the user is satisfied, after which the task is submitted and all the online adjustments take effect.

Since a task here can be submitted multiple times, how do we ensure a stable correspondence between the operators generated by two successive submissions? We adopt the following strategy: if a uidHash or uid is specified, we use it as the key of the correspondence. If not, we traverse the topology in breadth-first order and generate a deterministic unique ID for each operator based on its position in the graph. Once each operator has a unique ID, the correspondence is determined.
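The sketch below shows one way such a stable key could be derived; the types are simplified stand-ins rather than Flink's real graph classes. It prefers an explicit uidHash, then uid, and otherwise falls back to the operator's position in a breadth-first traversal.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class OperatorKeyGenerator {

    // Simplified stand-in for an operator node in the topology.
    static class Node {
        String uidHash;            // may be null
        String uid;                // may be null
        List<Node> downstream;
        Node(String uidHash, String uid, List<Node> downstream) {
            this.uidHash = uidHash;
            this.uid = uid;
            this.downstream = downstream;
        }
    }

    public static Map<Node, String> generateKeys(List<Node> sources) {
        Map<Node, String> keys = new HashMap<>();
        Queue<Node> queue = new ArrayDeque<>(sources);
        int position = 0;
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            if (keys.containsKey(node)) {
                continue; // already visited through another input
            }
            String key;
            if (node.uidHash != null) {
                key = node.uidHash;          // explicit hash wins
            } else if (node.uid != null) {
                key = node.uid;              // then the explicit uid
            } else {
                key = "bfs-" + position;     // fall back to the BFS position
            }
            keys.put(node, key);
            position++;
            queue.addAll(node.downstream);
        }
        return keys;
    }
}
```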

img

2. Back pressure quantification

Let's introduce our second improvement, back pressure quantification. At present, back pressure can be observed in two ways:

  • The first way is through the back pressure panel of the Flink UI, which gives a very intuitive view of the current back pressure situation. But it has some problems:

    • First, in some scenarios back pressure cannot be collected.
    • Second, historical back pressure cannot be tracked.
    • Third, the effect of back pressure is not intuitive.
    • Fourth, collecting back pressure at high parallelism puts a certain amount of pressure on the job.
  • The other way is based on Flink Task Metrics. For example, indicators such as inPoolUsage and outPoolUsage are reported, collected into Prometheus, and then queried; this solves the problem of tracking back pressure history. But it has other problems:

    • First, the meaning of the back pressure indicators differs somewhat between Flink versions.
    • Second, there is a certain barrier to analyzing back pressure: you need a deep understanding of all the back-pressure-related indicators and have to analyze them jointly.
    • Third, the impact of back pressure is still not intuitive, and it is difficult to measure its effect on the business.

img

In response to these problems, our solution is to collect indicators of where back pressure occurs, when, and how often, and to report them. Combining these quantitative back pressure indicators with the runtime topology, you can see the impact of back pressure intuitively: which tasks are affected, for how long, and how many times.
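For illustration, a quantified back pressure report might carry fields like the following; the field names and schema are assumptions for the sketch, not JD's actual metric format.

```java
public class BackPressureEvent {
    public final String jobId;
    public final String taskName;      // operator / subtask where back pressure occurred
    public final int subtaskIndex;
    public final long startTimestamp;  // when it started, in epoch milliseconds
    public final long durationMs;      // how long it lasted
    public final long occurrences;     // how many times within the reporting window

    public BackPressureEvent(String jobId, String taskName, int subtaskIndex,
                             long startTimestamp, long durationMs, long occurrences) {
        this.jobId = jobId;
        this.taskName = taskName;
        this.subtaskIndex = subtaskIndex;
        this.startTimestamp = startTimestamp;
        this.durationMs = durationMs;
        this.occurrences = occurrences;
    }
}
```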

img

3. File system support for multiple configurations

The following describes the file system's support for multiple configurations.

Currently, when a file system is used in Flink, FileSystem.get is called with a URI. FileSystem uses scheme+authority as the key to look up a cached file system; if none exists, it finds the FileSystemFactory for the scheme and calls create to create one, after which files can be operated on. However, in platform practice we often ran into the following problems:

  • First, how do we write checkpoints to a shared HDFS cluster while writing business data to another HDFS cluster? For example, the platform manages state uniformly, so users do not care where state is stored; they only care about reading and writing their own business data in HDFS. How do we satisfy such a business scenario?

    • One option is to merge the configurations of the multiple HDFS clusters, but if those configurations conflict, the merge causes problems.
    • Another option is a federation mechanism such as ViewFs, but that may be a bit heavyweight. Is there a better solution?
  • Second, how do we read data from one OSS store and, after processing, write it to another OSS store?

img

Both of these problems require the same Flink file system to support multiple configurations. Our solution is to specify and isolate the different configurations with different schemes. Taking HDFS with multiple configurations as an example, as shown in the figure below:

  • The first step is to set, in the configuration, the binding between the custom scheme (aaHDFS) and the real scheme (HDFS), together with the path of the corresponding HDFS configuration.
  • The second step is to load the Hadoop configuration from the path bound to aaHDFS when FileSystem.get is called.
  • The third step is to use a HadoopFileSystemWrapper to convert paths with the user-defined scheme (aaHDFS://) to real Hadoop paths (hdfs://) when reading and writing HDFS, as in the sketch below.

img
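Below is a minimal sketch of the scheme-mapping idea against the plain Hadoop FileSystem API, assuming a configurable mapping from custom scheme to Hadoop configuration directory; the directory paths, the second scheme bbHDFS, and the class itself are illustrative, not JD's actual HadoopFileSystemWrapper.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeMappedHdfs {

    // Custom scheme -> directory holding that cluster's core-site.xml / hdfs-site.xml.
    private static final Map<String, String> CONF_DIRS = new HashMap<>();
    static {
        CONF_DIRS.put("aahdfs", "/etc/flink/aa-hdfs-conf");
        CONF_DIRS.put("bbhdfs", "/etc/flink/bb-hdfs-conf");
    }

    public static FileSystem forUri(URI uri) throws Exception {
        // Load the Hadoop configuration bound to the custom scheme.
        Configuration conf = new Configuration();
        String confDir = CONF_DIRS.get(uri.getScheme());
        conf.addResource(new Path(confDir, "core-site.xml"));
        conf.addResource(new Path(confDir, "hdfs-site.xml"));
        // Rewrite the custom scheme back to the real hdfs:// scheme before creating the FS.
        URI realUri = new URI("hdfs", uri.getAuthority(), uri.getPath(), null, null);
        return FileSystem.get(realUri, conf);
    }

    public static void main(String[] args) throws Exception {
        // Checkpoints can point at one cluster and business data at another,
        // each resolved against its own configuration.
        FileSystem checkpointFs = forUri(new URI("aahdfs://ns1/flink/checkpoints"));
        FileSystem businessFs   = forUri(new URI("bbhdfs://ns2/warehouse/orders"));
        System.out.println(checkpointFs.getUri() + " / " + businessFs.getUri());
    }
}
```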

We have also done many other optimizations and extensions, mainly divided into three parts.

  • The first part is performance, including HDFS optimization (merging small files, reducing RPC calls), load-based dynamic rebalancing, slot allocation strategy extensions (sequential, random, spread by slot), and so on.
  • The second part is stability, including ZK anti-jitter, JM failover optimization, using the last checkpoint as a savepoint, and so on.
  • The third part is usability, including log enhancements (log separation, dynamic log level configuration), SQL extensions (incremental computation for windows, offset support), intelligent diagnosis, and so on.

img

4. Future planning

Finally, our future plans, summarized in four points:

  • First, continue to improve the SQL platform, enhancing it so that users develop more of their tasks in SQL.
  • Second, intelligent diagnosis and automatic adjustment: fully automatic intelligent diagnosis, adaptive tuning of runtime parameters, and job autonomy.
  • Third, unified batch and stream processing: batch-stream unification at the SQL level, with both low-latency stream processing and highly stable batch processing.
  • Fourth, AI exploration and practice: unifying batch-stream processing with real-time AI, and exploring and practicing artificial intelligence scenarios.

img

