Abstract: This article is compiled from the speech of Fu Haitao, a senior technical expert of JD.com, at the Flink Forward Asia 2021 platform construction session. The main contents include:
- Basic introduction
- Production practice
- Optimization and improvement
- Future planning
1. Basic introduction
K8s is a very popular container orchestration and management platform in the industry. It can manage containerized applications on multiple hosts in the cloud simply and efficiently. Around 2017, our real-time computing ran on multiple coexisting engines, including Storm, Spark Streaming, and the new-generation engine Flink, which was just being introduced. Storm clusters ran on physical machines and Spark Streaming ran on YARN. The different operating environments made deployment and operation costs particularly high and wasted resources, so we urgently needed a unified cluster resource management and scheduling system to solve this problem.
K8s solves these problems very well: it can easily manage thousands of containerized applications and is easy to deploy and operate; streaming and batch workloads can be co-located to achieve better resource utilization; in addition, it has natural container isolation and native elastic self-healing capabilities, which provide better isolation and security.
After a series of trials, optimizations and performance comparisons, we chose K8s.
At the beginning of 2018, the real-time computing platform began full containerization; by June 2018, 20% of tasks were running on K8s. The results showed clear improvements in resource sharing, business processing capability, and agility and efficiency, initially achieving the expected effect. By February 2019, all real-time computing had been containerized. Since then, we have continued to optimize and practice in the K8s environment, for example elastic scaling, service co-location, and building the capability for fast task recovery.
The benefits of moving everything onto K8s are quite obvious. First, hybrid deployment and resource sharing capabilities have improved, saving 30% of machine resources. Second, it provides better resource isolation and elastic self-healing, making it easier to scale resources elastically according to the business load and to ensure business stability. Finally, a consistent environment across development, testing, and production avoids problems caused by environment differences, greatly improves deployment and operation automation, and reduces management and operation-and-maintenance costs.
The platform architecture of JD Flink on K8s is shown in the figure above. At the bottom are physical machines and cloud hosts, and above them is K8s. We adopt JDOS, a platform developed by JD based on standard K8s with many customizations and optimizations to better fit our production environment. Most of JDOS runs on physical machines, with a small part running on cloud hosts. Further up is a Flink engine deeply customized on top of the community version of Flink.
At the top is JD.com's real-time computing platform JRC. It supports SQL jobs and jar-package jobs, provides one-stop stream and batch computing over massive data with high throughput, low latency, high availability, elastic self-healing and ease of use, supports rich data sources and sinks, offers comprehensive job management, configuration, deployment, log monitoring and self-service operation and maintenance functions, and provides backup rollback and one-click migration.
Our real-time computing platform serves many business lines within JD.com. The main application scenarios include real-time data warehouses, real-time dashboards, real-time recommendation, real-time reports, real-time risk control and real-time monitoring. At present, our real-time K8s cluster consists of more than 7,000 machines, more than 5,000 Flink tasks run online, and peak data processing exceeds 1 billion records per second.
2. Production practice
In the beginning, the containerization solution adopted a standalone session cluster deployed via K8s Deployments, which is a static resource allocation mode. As shown in the figure above, when creating a cluster the user needs to decide in advance the number and specification of the JobManager nodes (number of CPU cores, memory and disk size, etc.), the number and specification of the TaskManager nodes (CPU, memory and disk size, etc.), and the number of slots per TaskManager. After the cluster is created, the JRC platform sends a request to the K8s master through the K8s client to create the JobManager Deployment. ZK is used to ensure high availability, and HDFS and OSS are used for state storage. Once the cluster is created, tasks can be submitted.
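To make the static mode concrete, below is a minimal sketch of what the JobManager part of such a statically sized standalone session cluster can look like, based on the community Flink standalone-Kubernetes examples rather than JD's internal manifests; the names, image tag and resource numbers are illustrative assumptions.

```yaml
# Sketch of a statically sized standalone session cluster (illustrative values only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager          # hypothetical name chosen when the cluster is created
spec:
  replicas: 1                     # number of JobManager nodes fixed up front
  selector:
    matchLabels: { app: flink, component: jobmanager }
  template:
    metadata:
      labels: { app: flink, component: jobmanager }
    spec:
      containers:
        - name: jobmanager
          image: flink:1.12       # a customized image would be used in practice
          args: ["jobmanager"]
          resources:              # specification (CPU / memory) also fixed at creation time
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
```

A TaskManager Deployment is declared the same way, with its replica count and slots-per-TaskManager fixed in advance, which is exactly the estimation burden discussed next.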
However, in practice we found that this solution has some shortcomings. It requires the business to estimate the required resources in advance, which is not friendly to the business and cannot meet flexible and changeable business scenarios. For example, for complex topologies or scenarios where one cluster runs multiple tasks, it is difficult for the business to accurately determine the required resources in advance. In that case a larger cluster is generally created first, which leads to a certain waste of resources. And while a task is running, there is no way to dynamically scale resources on demand according to its running status.
So we upgraded the containerization solution to an elastic resource mode, which allocates resources on demand. As shown in the figure above, the user specifies the number and specification of the JobManagers and the specification of the TaskManagers, while the number of TaskManagers can be left unspecified. After clicking Create Cluster, the JRC platform sends a request to the K8s master through the K8s client to create the JobManager Deployment and, optionally, to pre-create the specified number of TaskManager pods.
After the platform submits a task, the JobMaster sends a REST request for resources to the JRC platform through the JDResourceManager, and the platform then dynamically applies to the K8s master to create the pods that run the TaskManagers. During running, if a TaskManager is found to be idle for a long time, it can be configured to release its resources dynamically. Creating and destroying resources through the interaction between the platform and K8s mainly ensures that the computing platform keeps control of resources while avoiding the impact of cluster configuration and logic changes on the image; by letting the user configure the number of TaskManagers to pre-allocate, task submission can be as fast as with static resource allocation; and by customizing the resource allocation strategy, we can achieve balanced scheduling compatible with the original scattered distribution of slots.
In a Flink on K8s environment, logs and monitoring indicators are very important. They help us observe the operation of the entire cluster, the containers and the tasks, quickly locate problems based on logs and metrics, and deal with them in time.
The monitoring indicators include physical-machine indicators (CPU, memory, load, network, connectivity, disk, etc.), container indicators (CPU, memory, network, etc.), JVM indicators and Flink indicators (cluster indicators and task indicators). Physical-machine and container indicators are collected and reported to the Origin system through a metric agent, while JVM and Flink indicators are reported to the Baize system through customized metric reporters in the JobManager and TaskManager, and then monitored, viewed and alerted on in a unified way on the computing platform.
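For reference, Flink's pluggable metric reporters are registered in flink-conf.yaml. A minimal sketch of wiring in a custom reporter such as the Baize one might look like the following; the reporter class name and interval are assumptions for illustration, not the actual JD configuration.

```yaml
# flink-conf.yaml (sketch): register a custom metric reporter
metrics.reporter.baize.class: com.jd.jrc.metrics.BaizeReporter   # hypothetical reporter class
metrics.reporter.baize.interval: 30 SECONDS                      # illustrative reporting interval
```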
Log collection uses JD's Logbook service. Its basic mechanism is to run a log agent on each node to collect logs from a specified path; the JobManager or TaskManager outputs logs to the specified directory according to the specified rules, and the logs are then automatically collected into the Logbook system; finally, real-time and historical logs can be retrieved and queried through the computing platform.
Next is the performance issue of the container network. Generally speaking, virtualization brings a certain performance loss. As an important component of container virtualization, the container network inevitably loses some performance compared with the physical-machine network, and the degree of degradation varies with the network plug-in, the protocol type and the packet size.
The figure above shows a performance evaluation of cross-host container network communication. The reference baseline is the server and client communicating on the same host. As can be seen from the figure, host mode achieves throughput and latency close to the baseline; NAT and Calico show a larger performance loss due to the overhead of address translation and network packet routing; and all overlay networks show a very large performance penalty. In general, encapsulating and decapsulating network packets is more expensive than address translation and routing, so which network to use is a trade-off. For example, an overlay network has a lot of overhead from packet encapsulation and decapsulation and performs relatively poorly, but it allows more flexible and secure network management; NAT and host-mode networks more easily achieve good performance, but their security is poorer; routed networks also perform well but require additional support.
In addition, network loss has a great impact on checkpoint speed. In our comparison test, running the same task in the same environment under different network modes, the checkpoint time with the container network was more than double that with the host network. So how do we solve the performance problem of the container network?
- First, choose the appropriate network mode according to the machine-room environment. For example, in some of our older machine rooms the container network performance degrades particularly badly and the network architecture cannot be upgraded, so we use the host network (configuring hostNetwork: true in the pod yaml, as shown in the figure above) to avoid the loss. Although this is not quite in the K8s style, it is a trade-off made according to the conditions. For new machine rooms, thanks to improvements in the underlying network and the use of new high-performance network plug-ins, the performance loss compared with the host network is very small, so we use the container network;
- Second, try not to use a heterogeneous network environment, avoid K8s clusters spanning machine rooms, and properly adjust the cluster's network-related parameters to increase network fault tolerance. For example, you can appropriately increase the two parameters `akka.ask.timeout` and `taskmanager.network.request-backoff.max`, as shown in the sketch below.
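As a sketch combining the two points above (the concrete values are illustrative, not our production settings), the host-network switch lives in the pod spec while the timeout parameters go in flink-conf.yaml:

```yaml
# Pod spec fragment: use the host network to avoid container-network overhead
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS working together with hostNetwork

# flink-conf.yaml fragment: loosen network-related timeouts (illustrative values)
akka.ask.timeout: 60 s                             # default is 10 s
taskmanager.network.request-backoff.max: 20000     # max partition-request backoff in ms (default 10000)
```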
Let's talk about disk performance. The storage space in a container consists of two parts, as shown in the figure above: the bottom layer is the read-only image layer, and on top of it is the read-write container layer. While the container is running, all file write operations are done in the container layer, and a storage driver is required to provide a union file system to manage this. Storage drivers are generally optimized for space efficiency; the additional abstraction brings a certain performance penalty (depending on the specific storage driver), and the write speed is lower than that of the local file system, especially for copy-on-write storage drivers, where the loss is larger. This has a greater impact on write-intensive applications. In Flink, many places involve reading and writing local disks, such as log output, RocksDB reads and writes, and batch-task shuffle. So how do we reduce the impact?
- First, consider using an external volume, such as a local storage volume, and write data directly to the host file system to improve performance (see the sketch after this list);
- In addition, you can tune disk-IO-related parameters, such as RocksDB parameters, to improve disk access performance;
- Finally, you can also consider adopting some solutions for separating storage and computing, such as using remote shuffle to improve the performance and stability of local shuffle.
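As mentioned in the first point above, a sketch of such an external volume (a hostPath-style local volume mounted into the TaskManager pod and pointed to by Flink's local directories) might look like the following; the paths and names are illustrative assumptions.

```yaml
# Pod spec fragment (sketch): write Flink's local data to the host file system
spec:
  containers:
    - name: taskmanager
      volumeMounts:
        - name: flink-local            # illustrative volume name
          mountPath: /data/flink-tmp
  volumes:
    - name: flink-local
      hostPath:
        path: /export/flink-tmp       # illustrative host path
        type: DirectoryOrCreate

# flink-conf.yaml fragment: point temporary and RocksDB directories at the mounted volume
io.tmp.dirs: /data/flink-tmp
state.backend.rocksdb.localdir: /data/flink-tmp/rocksdb
```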
In practice, we often find that the configuration of many business computing tasks is unreasonable, occupying too many resources and causing waste. In addition, traffic has peaks and troughs. How to automatically scale up during peaks and scale down during troughs, reducing manual intervention and ensuring business stability while improving resource utilization, are all questions of elastic resource scaling. To this end, we developed an elastic scaling service that dynamically adjusts the parallelism of tasks and the specification of TaskManagers according to how jobs are running, in order to solve problems such as insufficient job throughput and wasted resources.
As shown in the figure above, the general workflow is as follows. First, the scaling configuration of a task is set on the JRC platform, including the upper and lower limits of parallelism adjustment and some scaling policy thresholds; this configuration is sent to the scaling service. The scaling service monitors cluster and task metrics in real time (mainly CPU usage and operator busyness, etc.) and, combined with the scaling configuration and adjustment strategy, generates an adjustment result and sends it to the JRC platform. Finally, the JRC platform adjusts the cluster and the task according to that result.
At present, this scaling service solves fairly well the resource waste in some scenarios and the performance problems of tasks whose throughput is linearly related to operator parallelism. However, it still has certain limitations: external system bottlenecks, data skew, performance bottlenecks of the task itself, and scenarios that cannot be improved by increasing parallelism are not handled well.
In addition, combined with elastic scaling, we have also tried running real-time stream tasks and offline batch tasks at staggered peaks. As shown on the right side of the figure above, around the early morning the stream tasks are relatively idle, so they are scaled down and release some resources to batch tasks; these released resources can then be used to run batch tasks at night; during the daytime, the resources released by batch tasks are in turn given back to the stream tasks for scaling up to cope with traffic peaks, thereby improving overall resource utilization.
Compared with the physical-machine or YARN environment, it is relatively more difficult to troubleshoot Flink on K8s when a problem occurs, because many K8s components are also involved, such as the container network, DNS resolution and K8s scheduling, each of which has a certain learning threshold.
To solve this problem, we developed an intelligent diagnosis service. It associates the monitoring indicators of all the dimensions related to a job (physical machine, container, cluster and task indicators) with the task topology, integrates with K8s, and analyzes them together with pod logs and task logs; methods from daily manual operation and maintenance are distilled into analysis strategies to diagnose job problems and give optimization suggestions. Currently it supports diagnosing common problems such as task restarts, task back pressure, checkpoint failures and low cluster resource utilization, and it will continue to be enriched and improved in the future.
3. Optimization and improvement
In practice, when the static resource allocation mode is adopted, slots are generally spread across TaskManagers so that resource-consuming operators are also spread across TaskManagers, achieving balanced scheduling of jobs and improving job performance.
As shown in the upper-right figure, there are 2 TaskManagers, each with 4 slots; 1 job has 2 operators (shown in green and red), and each operator has a parallelism of 2. With the default scheduling strategy (sequential scheduling), all the operators of this job would be concentrated in one TaskManager; with balanced scheduling, the operators of this job are spread horizontally across the TaskManagers, and each TaskManager gets one parallel instance of each of the two operators (green and red).
When the dynamic resource allocation mode (native K8s) is used, resources are created pod by pod, so how can balanced scheduling be achieved? We solve this by pre-allocating resources before task scheduling. The specific process is as follows: after the user submits a job, if resource pre-allocation is enabled, the JobMaster does not schedule tasks immediately but pre-applies to the ResourceManager for all the resources the job needs in one go. When the required resources are in place, the JobMaster is notified and then schedules the tasks, achieving the same balanced scheduling as in the static resource allocation mode. A timeout can also be configured for the JobMaster; after it expires, the normal task scheduling process is followed instead of waiting for resources indefinitely.
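Since this pre-allocation logic lives in our customized Flink rather than the community version, the exact switches are internal; conceptually, enabling it amounts to something like the following hypothetical flink-conf.yaml entries, whose key names are invented purely for illustration.

```yaml
# Hypothetical configuration sketch -- these keys illustrate the idea and do not exist in community Flink
jobmanager.resource-preallocation.enabled: true    # pre-request all required slots before scheduling tasks
jobmanager.resource-preallocation.timeout: 120 s   # fall back to normal scheduling after this timeout
```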
We compared the performance in a real scenario. As shown on the right side of the figure above, with sequential scheduling the job throughput was 57 million records per minute; after resource pre-allocation and balanced scheduling were enabled, the throughput reached 89.47 million records per minute, a 57% performance increase, which is a fairly obvious improvement.
Many businesses on our platform use one cluster to run multiple tasks, so a TaskManager may be running tasks of different jobs, which causes different jobs to affect each other. How do we solve this problem?
We customized the slot allocation strategy. When the JobManager requests a slot from the ResourceManager, if task resource isolation is enabled, the SlotManager labels the TaskManager whose slot has been allocated with that job, and the remaining free slots of that TaskManager can then only serve slot requests from the same job. By grouping TaskManagers by job, resource isolation among the multiple tasks of a cluster is achieved.
As shown on the right of the figure above, one TaskManager provides 3 slots, and there are 3 jobs, each with one operator with a parallelism of 3 (shown in green, blue and red). With slot spreading enabled but before isolation, the three jobs share the three TaskManagers, and each TaskManager runs one parallel instance of each job. After task resource isolation is enabled, each job has its own exclusive TaskManager and the jobs no longer affect each other.
The container environment is complex and changeable, and pods may be evicted or restarted: for example, hardware failures, docker failures and high node load can cause pods to be evicted; unhealthy processes, abnormal process exits and abnormal docker restarts can cause pods to restart. In those cases, tasks are restarted and recovered, which affects the business. So how can we reduce the impact on the business?
One aspect is to speed up the detection of pod exceptions (eviction or restart) in the container environment and quickly recover the job. In the official default implementation, an abnormal pod may be noticed via two paths: either a downstream operator of the faulty pod notices that the network connection has been disconnected, which raises an exception and triggers failover; or the JobManager notices that the TaskManager heartbeat has timed out, which also triggers failover. Either way, the time required is longer than the timeout; with our default system configuration, it takes more than 60 seconds.
Here we optimize the speed of pod anomaly detection. When a pod is stopped abnormally, there is by default a graceful stop period of 30 seconds, during which the container's main-process startup script receives the TERM signal from K8s. Besides the necessary cleanup actions, we added a step that notifies the JobManager of the exception. In addition, when the TaskManager worker process inside the container exits abnormally, the main process (the startup script) also notices it and likewise notifies the JobManager which TaskManager failed. In this way, the JobManager is notified immediately when a pod becomes abnormal and can perform job failure recovery in time.
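The K8s side of this mechanism is standard: when a pod is deleted or evicted, the kubelet sends SIGTERM to the container's main process (PID 1, the startup script here) and waits for the graceful-termination period before sending SIGKILL. A minimal sketch of the relevant pod setting is shown below; the notification itself happens inside the customized startup script and is not shown, and the image name is illustrative.

```yaml
# Pod spec fragment (sketch): the graceful-stop window during which the startup script
# receives SIGTERM and can notify the JobManager before the container is killed
spec:
  terminationGracePeriodSeconds: 30   # K8s default; the TERM-to-KILL window mentioned above
  containers:
    - name: taskmanager
      image: flink:1.12               # a customized image whose entrypoint traps TERM
```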
With this optimization, in a typical test scenario, when the cluster has spare resources, the task failover time is shortened from more than 60 seconds to a few seconds; when there are no spare resources and the pod has to be rebuilt, the failover time is also shortened by more than 30 seconds. The effect is quite obvious.
Another aspect is to reduce the scope of a pod exception's impact on the job. Community versions after 1.9 provide a region-based local recovery strategy that restarts only the tasks in the region associated with a failed task, which reduces the impact in some scenarios. However, in many cases the operators of a job are fully connected through rebalance or hash, and the region strategy does not help much. To this end, in versions 1.10 and 1.12 we developed a single-point recovery strategy based on the failed task: when a task fails, only that failed task is recovered, and non-failed tasks are not affected.
As shown in the figure above, this job has three operators: source, map and sink. Source and map have a parallelism of 1, and sink has a parallelism of 2. The first parallel instance of map, map(1/1), and the second parallel instance of sink, sink(2/2), are located on pod_B. When pod_B is evicted, the JobManager detects the abnormality of pod_B and redeploys these two tasks on a new pod_D, denoted map(1/1)' and sink(2/2)'. After the deployment is completed, the downstream sink(1/1) of the failed task map(1/1) is notified that the new upstream task map(1/1)' is ready, and sink(1/1) re-establishes a connection with the upstream map(1/1)' to communicate.
There are a few things to keep in mind when implementing:
- First, before the failure is recovered, how do we handle the data that the upstream of the failed task was about to send and the residual incomplete data its downstream has already received? Here we directly discard the data that the upstream outputs to the failed task, and if the downstream has collected incomplete data, it is also discarded;
- Second, when the upstream and downstream cannot perceive each other's abnormality, how should recovery be handled? A forced update process may be needed here;
- Third, when multiple tasks are distributed on one pod and that pod fails, there are multiple failed tasks; if there are dependencies among them, how are they handled correctly? They need to be redeployed in order according to the dependencies.
With the single-point recovery strategy, the online application has achieved good results: the impact on the job is greatly reduced (depending on the specific job, to between a few tenths and a few hundredths of the original), avoiding business interruption, and the recovery time is greatly shortened (from more than a minute in a typical scenario to seconds or tens of seconds).
Of course, this strategy comes at a price: it causes a small amount of data loss during recovery, so it is suitable for business scenarios that are not sensitive to a small amount of data loss, such as traffic services.
4. Future planning
In the future, we will continue to explore the following aspects:
The first is scheduling optimization:
- One is resource scheduling optimization at the K8s level, to manage online services and offline big-data jobs more efficiently and improve the utilization and operating efficiency of the K8s cluster;
- The other is Flink job scheduling optimization, to support richer and finer-grained scheduling strategies, improve the utilization and stability of Flink job resources, and meet the needs of different business scenarios.
- The second is service co-location: co-locating services with different loads to improve resource utilization and maximize the value of servers while ensuring service stability;
- Then there is intelligent operation and maintenance: supporting intelligent diagnosis of tasks and adaptively adjusting operating parameters so that jobs tune themselves, reducing the cost of user tuning and platform operation and maintenance;
- Finally, support for Flink AI: in artificial-intelligence application scenarios, Flink has some unique advantages, including in feature engineering, online learning and resource prediction. We will also explore and practice these scenarios at the platform level.