
Abstract: This article is based on the talk delivered by Alibaba senior development engineer Guo Yangze (Tianling) at Flink Forward Asia 2021. The main contents include:

  1. Fine-grained resource management and applicable scenarios
  2. Flink resource scheduling framework
  3. Resource configuration interface based on SlotSharingGroup
  4. Dynamic resource cutting mechanism
  5. Resource allocation strategy
  6. Summary and future outlook

FFA 2021 Live Replay & Presentation PDF Download

1. Fine-grained resource management and applicable scenarios

img

Before Flink 1.14, Flink used coarse-grained resource management: the resources required by each slot request of each operator were unspecified and represented inside Flink by the special value UNKNOWN, which can match a physical slot of any resource specification. From the perspective of the TaskManager (hereinafter referred to as TM), the number of slots it holds and the resource dimensions of each slot are statically determined by the Flink configuration.
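The matching behavior described above can be illustrated with a small Python sketch (illustrative only, not Flink's actual classes): an UNKNOWN requirement matches a physical slot of any specification, while a concrete requirement must fit inside the slot.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ResourceProfile:
    """Simplified stand-in for a slot resource profile (illustrative)."""
    cpu: Optional[float] = None      # None plays the role of UNKNOWN
    memory_mb: Optional[int] = None

    def is_unknown(self) -> bool:
        return self.cpu is None and self.memory_mb is None

def matches(requirement: ResourceProfile, physical_slot: ResourceProfile) -> bool:
    # An UNKNOWN requirement matches a physical slot of any specification.
    if requirement.is_unknown():
        return True
    # A concrete requirement must fit within the slot in every dimension.
    return (requirement.cpu <= physical_slot.cpu
            and requirement.memory_mb <= physical_slot.memory_mb)

slot = ResourceProfile(cpu=1.0, memory_mb=4096)
assert matches(ResourceProfile(), slot)               # UNKNOWN matches anything
assert matches(ResourceProfile(0.5, 1024), slot)      # fits in the slot
assert not matches(ResourceProfile(2.0, 1024), slot)  # too much CPU
```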

img

For most simple jobs, the existing coarse-grained resource management can basically meet the requirements for resource efficiency. For example, in the job above, data is read from Kafka, goes through some simple processing, and is finally written to Redis. For this kind of job, we can easily keep the upstream and downstream parallelism consistent and put the job's entire pipeline into one SlotSharingGroup (hereinafter referred to as SSG). In this case, the resource requirements of the slots are basically the same, and users can directly adjust the default slot configuration to achieve high resource utilization. Moreover, since the load peaks of different tasks do not necessarily coincide, putting different tasks into one large slot has a peak-shaving effect that can further reduce the overall resource overhead.

However, for some complex jobs that may be encountered in production, coarse-grained resource management cannot meet their needs well.

img

For example, in the job shown in the figure, there are two Kafka sources with parallelism 128 and one Redis dimension table with parallelism 32, and the data flows along two processing paths. In one path, the two Kafka sources are joined and go through some aggregation operations, and the result is finally sunk into a third Kafka with parallelism 16. In the other path, the Kafka data is joined with the Redis dimension table, the result flows into a TensorFlow-based online inference module, and is finally stored in Redis.

Coarse-grained resource management in this job may lead to lower resource utilization efficiency.

First of all, the upstream and downstream parallelism of the job differs. To put the entire job into one slot, the parallelism can only be aligned to the highest value, 128. The alignment itself is not a big problem for lightweight operators, but for operators with heavy resource consumption it leads to a great waste of resources. For example, the Redis dimension table caches all its data in memory to improve performance, while the aggregation operator needs relatively large managed memory to store state. These two operators originally need only 32 and 16 sets of resources respectively, but after the parallelism is aligned they each have to request 128.
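The waste caused by parallelism alignment can be quantified with a quick back-of-the-envelope calculation. Only the parallelisms (128, 32, 16) come from the example job; the per-instance memory sizes below are hypothetical numbers chosen for illustration:

```python
# Parallelisms from the example job; per-instance memory sizes are hypothetical.
dim_table_needed = 32          # Redis dimension table really needs 32 instances
agg_needed = 16                # aggregation really needs 16 instances
aligned = 128                  # everything aligned to the largest parallelism

mem_per_dim_instance_gb = 4    # hypothetical per-instance memory
mem_per_agg_instance_gb = 2    # hypothetical per-instance managed memory

# Memory requested for instances that only exist because of alignment.
waste_gb = ((aligned - dim_table_needed) * mem_per_dim_instance_gb
            + (aligned - agg_needed) * mem_per_agg_instance_gb)
print(waste_gb)  # 608 GB requested but never actually needed
```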

At the same time, the pipeline of the entire job may not fit into one slot or one TM because it requires too many resources, such as the memory of the operators above; in addition, the TensorFlow module requires GPUs to ensure computing efficiency. Since GPUs are very expensive, the cluster may not have enough of them, so that after aligning parallelism the job cannot acquire enough resources and ultimately cannot run.

img

We can split the whole job into multiple SSGs. As shown in the figure, we divide the operators into 4 SSGs by parallelism to ensure that the parallelism within each SSG is aligned. However, since each slot has only one default specification, all resource dimensions of that slot still have to be aligned to the maximum of each SSG: the memory must be aligned to the demand of the Redis dimension table, the managed memory to that of the aggregation operator, and a GPU even has to be added to the extended resources. So this still does not solve the problem of resource waste.

To solve this problem, we propose fine-grained resource management. The basic idea is that the resource specification of each slot can be customized individually, so users can request resources on demand and maximize resource utilization.

img

To sum up, fine-grained resource management improves overall resource utilization by letting each part of a job request and use resources on demand. Its applicable scenarios include the following: the parallelism of upstream and downstream tasks in the job differs significantly; the resources of the whole pipeline are too large to fit into one slot or TM; or the job contains relatively expensive extended resources. In these cases, the job needs to be split into multiple SSGs with different resource requirements, and fine-grained resource management can reduce the resulting waste. In addition, batch jobs may contain one or more stages with significantly different resource consumption, which likewise requires fine-grained resource management to reduce resource overhead.

2. Flink resource scheduling framework

img

There are three main roles in Flink's resource scheduling framework: the JobMaster (hereinafter referred to as JM), the ResourceManager (hereinafter referred to as RM) and the TaskManager. The job written by the user is first compiled into a JobGraph, which, after the resource configuration is injected, is submitted to the JM. The role of the JM is to manage the JobGraph's resource requests and carry out the deployment.

The scheduling-related component in the JM is the Scheduler. It generates a series of SlotRequests from the JobGraph, aggregates them into ResourceRequirements, and sends this declaration to the RM. After receiving the resource declaration, the RM first checks the existing resources in the cluster. If they can satisfy the requirements, the RM asks the corresponding TM to offer slots to the JM (the allocation of slots here is done by the SlotManager component). If the existing resources are not enough, the RM requests new resources from the external Kubernetes or Yarn cluster through its internal driver. Finally, once the JM has received enough slots, it starts deploying the operators and the job can run.
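The declarative step above can be sketched in a few lines of Python (a simplified illustration, not Flink's actual classes): individual SlotRequests carrying the same resource profile are aggregated into (profile, count) pairs before being declared to the RM.

```python
from collections import Counter

# Each SlotRequest carries a resource profile; hashable tuples stand in here
# for real resource profiles: (cpu_cores, memory_mb).
slot_requests = [
    (0.25, 1024), (0.25, 1024), (0.25, 1024),  # three slots for one SSG
    (0.5, 2048),                               # one slot for another SSG
]

def to_resource_requirements(requests):
    """Aggregate per-slot requests into the declaration sent to the RM."""
    return Counter(requests)

requirements = to_resource_requirements(slot_requests)
print(requirements)  # Counter({(0.25, 1024): 3, (0.5, 2048): 1})
```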

Following this framework, we now analyze the technical implementation details and design choices of fine-grained resource management.

3. Resource configuration interface based on SlotSharingGroup

img

At the entry point, Flink needs to inject the resource configuration into the JobGraph. This part is the SlotSharingGroup-based resource configuration interface proposed in FLIP-156. For the design of this interface, the main question is the granularity of resource configuration:

The first option is the finest granularity, the operator. If users configure resources per operator, Flink needs to aggregate those resources to slot level according to chaining and slot sharing before scheduling.

The advantage of this granularity is that resource configuration is decoupled from the logic of chaining and slot sharing. Users only need to consider the needs of the current operator, without caring whether it is chained with other operators or which slot it is scheduled into. Second, it allows Flink to calculate the resources of each slot more accurately: if the operators in an SSG have different parallelism, the physical slots corresponding to that SSG may also require different resources; and if Flink knows the resources of every operator, it has the opportunity to further optimize resource efficiency.

Of course, it also has some shortcomings. First, the configuration cost for users is too high: complex production jobs contain a large number of operators, which are hard to configure one by one. Second, mixed coarse- and fine-grained configuration is difficult to support in this case: if an SSG contains both coarse-grained and fine-grained operators, Flink cannot tell how many resources the SSG needs. Finally, since users' resource configurations or estimates always carry some deviation, these per-operator deviations keep accumulating, and the peak-shaving effect between operators cannot be exploited effectively.

img

The second option is to use the task formed after operator chaining as the granularity of resource configuration. In this case, Flink's internal chaining logic has to be exposed to users, and Flink's runtime still needs to aggregate task resources to slot level according to the tasks' slot sharing configuration before scheduling.

Its advantages and disadvantages are roughly the same as those of operator granularity. Compared with operators, it reduces the configuration cost to a certain extent, but that cost remains a pain point. And it comes at the price of coupling resource configuration with chaining: exposing Flink's internal chaining logic to users limits potential internal optimizations, because once a user has configured the resources of a task, a change in the chaining logic may split that task into two or three, making the user's configuration incompatible.

img

The third option is to use the SlotSharingGroup directly as the granularity of resource configuration. For Flink, resource configuration then becomes what-you-see-is-what-you-get, and the resource aggregation step above is no longer needed.

At the same time, this choice has the following advantages:

  • First, it makes the user's configuration more flexible. The choice of configuration granularity is left to the user: you can configure the resources of an operator, a task, or even a subgraph, simply by putting it into its own SSG and configuring that SSG's resources.
  • Second, it can support mixed coarse- and fine-grained configuration relatively easily. Since everything is configured at slot granularity, there is no need to worry about a slot containing both coarse-grained and fine-grained tasks, and the size of a coarse-grained slot can simply be calculated from the TM's default specification. This also makes the allocation logic of fine-grained resource management compatible with coarse-grained scheduling: coarse-grained becomes a special case of fine-grained.
  • Third, it lets users take advantage of the peak-shaving effect between different operators, effectively reducing the impact of estimation deviations.

Of course, it also introduces some restrictions: it couples resource configuration with chaining and slot sharing. In addition, if the operators within an SSG have different parallelism, users may need to split the group manually to maximize resource utilization.

img

All things considered, we finally chose the SlotSharingGroup-based resource configuration interface of FLIP-156. Besides the advantages mentioned above, the most important point, which can be seen from the resource scheduling framework, is that the slot is the basic unit of resource scheduling: from the Scheduler to the RM and TM, resources are requested and scheduled in units of slots. Using this granularity avoids adding complexity to the system.

img

Going back to the example job: with the fine-grained resource configuration interface, we can configure different resources for the 4 SSGs, as shown in the figure above. As long as the scheduling framework matches slots strictly to these specifications, resource utilization can be maximized.

4. Dynamic resource cutting mechanism

After the resource configuration is solved, the next step is to request slots with these resources. This step relies on the dynamic resource cutting mechanism proposed in FLIP-56.

img

Looking back at this picture briefly: the JobGraph on the far left now carries resource configurations, and moving to the right we enter the resource scheduling of JM, RM and TM. Under coarse-grained resource management, a TM's slots are of fixed size, determined by the startup configuration, so the RM cannot satisfy slot requests of different specifications. Therefore, we need to change the way slots are created.

img

Let's first look at the existing static slot allocation mechanism. When a TM starts, its slots are already divided and numbered, and it reports these slots to the SlotManager. When slot requests arrive, the SlotManager decides to allocate, say, slot1 and slot3; later, after the task on slot1 finishes running, that slot is released, leaving only slot3 occupied. We can see that although the TM now has 0.75 core and 3 GB of free resources, if a job requests a single slot of that size, the TM cannot satisfy it, because the slots were divided in advance.
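This limitation can be reproduced in a few lines (the numbers follow the figure: a 1-core / 4 GB TM pre-divided into four 0.25-core / 1 GB slots):

```python
# Static slot mechanism: the TM's resources are pre-divided at startup.
SLOT = (0.25, 1.0)                              # (cpu, memory_gb) of each slot
slots = {1: None, 2: None, 3: None, 4: None}    # slot id -> occupying task

slots[1] = "task-a"; slots[3] = "task-b"        # SlotManager allocates slot1, slot3
slots[1] = None                                 # task on slot1 finishes, slot released

free_cpu = sum(SLOT[0] for t in slots.values() if t is None)
free_mem = sum(SLOT[1] for t in slots.values() if t is None)
print(free_cpu, free_mem)       # 0.75 core and 3.0 GB are free in total...

def can_serve(cpu, mem):
    # ...but a request must fit into a single pre-divided slot.
    return any(t is None and cpu <= SLOT[0] and mem <= SLOT[1]
               for t in slots.values())

assert can_serve(0.25, 1.0)     # a slot-sized request is fine
assert not can_serve(0.75, 3.0) # the free total cannot be offered as one slot
```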

img

Therefore, we propose a dynamic resource cutting mechanism. Slots are no longer fixed once the TM has started; instead, they are dynamically cut out of the TM according to the actual slot requests. When a TM starts, we regard the resources that can be allocated to slots as a whole resource pool, for example the 1 core and 4 GB of memory in the figure above. Now a fine-grained job arrives and the SlotManager decides to request a 0.25 core / 1 GB slot from this TM. The TM checks whether its resource pool can accommodate this slot, then dynamically creates the slot and allocates the corresponding resources to the JM. When the job then requests a 0.5 core / 2 GB slot, the SlotManager can still allocate it from the same TM, as long as the request does not exceed the free resources. When a slot is no longer needed, we can destroy it, and its resources return to the free resource pool.

With this mechanism, we solve the problem of how fine-grained resource requests are satisfied.
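A minimal sketch of the dynamic cutting idea (a toy model, not Flink's implementation): the TM keeps a free resource pool, cuts slots out of it on request, and returns the resources when a slot is destroyed.

```python
class TaskManagerPool:
    """Toy model of a TM that cuts slots dynamically from a resource pool."""

    def __init__(self, cpu, memory_gb):
        self.free_cpu = cpu
        self.free_mem = memory_gb
        self.slots = {}
        self._next_id = 0

    def cut_slot(self, cpu, memory_gb):
        # Cut a slot only if the free pool still covers the request.
        if cpu > self.free_cpu or memory_gb > self.free_mem:
            return None
        self.free_cpu -= cpu
        self.free_mem -= memory_gb
        self._next_id += 1
        self.slots[self._next_id] = (cpu, memory_gb)
        return self._next_id

    def destroy_slot(self, slot_id):
        cpu, mem = self.slots.pop(slot_id)
        self.free_cpu += cpu           # resources return to the free pool
        self.free_mem += mem

tm = TaskManagerPool(cpu=1.0, memory_gb=4.0)
a = tm.cut_slot(0.25, 1.0)             # first fine-grained request
b = tm.cut_slot(0.5, 2.0)              # second request, same TM, still fits
assert a is not None and b is not None
assert tm.cut_slot(0.5, 2.0) is None   # exceeds the remaining free resources
tm.destroy_slot(b)
assert tm.cut_slot(0.5, 2.0) is not None  # freed resources are reusable
```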

img

Going back to the example job, we now need only 8 TMs of the same specification to schedule it: each TM has one GPU to satisfy SSG4, and the CPU-intensive SSG1 is co-located with the memory-intensive SSG2 and SSG3 so that the overall CPU-to-memory ratio on each TM stays balanced.

5. Resource allocation strategy

img

What is the resource allocation strategy? It covers two decisions the RM makes when interacting with the Resource Provider and the TMs: first, what TM specifications to request from the Resource Provider and how many TMs of each specification are needed; second, how to place slots onto the TMs. Both decisions are actually made inside the SlotManager component.

img

The coarse-grained strategy is relatively simple: there is only one TM specification and all slots have the same size, so the placement strategy only needs to decide whether to spread slots across the TMs as evenly as possible. Strategies under fine-grained resource management, however, need to take more requirements into account.

First, with the dynamic resource cutting mechanism we introduced, slot scheduling becomes a multi-dimensional bin-packing problem: we have to reduce resource fragmentation while keeping resource scheduling efficient. We also need to consider whether slots should be spread evenly across TMs. In addition, the cluster may impose requirements on the resource specification of TMs. For example, a TM should not be too small: on Kubernetes, a TM with too few resources starts too slowly and may eventually time out during registration. It should not be too large either, as that would hurt Kubernetes's scheduling efficiency.

Faced with this complexity, we abstracted the resource allocation strategy into a ResourceAllocationStrategy interface. The SlotManager tells the strategy the current resource requirements and the available resources in the cluster; the strategy makes the decisions and tells the SlotManager how the existing resources should be allocated, how many new TMs of which specifications still need to be requested, and which requirements cannot be fulfilled.

img

At present, fine-grained resource management is still in beta, and the community has built a simple default resource allocation strategy. Under this strategy, the TM specification is fixed and determined by the coarse-grained configuration; if a slot request exceeds that specification, it cannot be allocated, which is the strategy's limitation. For placement, the strategy scans the currently free TMs sequentially and cuts a slot from the first one that can satisfy the request. This ensures that resource scheduling does not become a bottleneck even for large-scale jobs, at the cost of not being able to avoid resource fragmentation.
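The placement logic just described amounts to first-fit. A sketch (illustrative, not the community's code) also shows how first-fit can leave fragments:

```python
def first_fit(requests, tms):
    """Place each (cpu, mem) request on the first TM with enough free resources.

    `tms` is a list of [free_cpu, free_mem] pairs, mutated in place.
    Returns the TM index chosen for each request, or None if unfulfilled.
    """
    placements = []
    for cpu, mem in requests:
        for i, tm in enumerate(tms):
            if cpu <= tm[0] and mem <= tm[1]:
                tm[0] -= cpu
                tm[1] -= mem
                placements.append(i)
                break
        else:
            placements.append(None)   # would require a new TM
    return placements

tms = [[1.0, 4.0], [1.0, 4.0]]
reqs = [(0.6, 2.0), (0.6, 2.0), (0.8, 3.0)]
print(first_fit(reqs, tms))
# [0, 1, None]: after the first two cuts each TM keeps only 0.4 core free, so
# the 0.8-core request fails even though 0.8 core is free cluster-wide --
# exactly the fragmentation this strategy cannot avoid.
```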

6. Summary and future outlook

img

Fine-grained resource management is currently only a beta feature in Flink. As can be seen from the figure above, on the runtime side, the work on fine-grained resource management has been basically completed through FLIP-56 and FLIP-156. On the user interface side, FLIP-169 has exposed fine-grained configuration in the DataStream API; for details, please refer to the community's user documentation.

img

In the future, our development direction is mainly in the following aspects:

  • First, customize more resource allocation strategies to meet the needs of different scenarios, such as session clusters and OLAP;
  • Second, extended resources are currently treated as TM-level resources, and every slot on the TM can see their information; we will further limit their scope to individual slots;
  • Third, fine-grained resource management already supports mixed coarse- and fine-grained configuration, but there are still resource-efficiency problems, e.g. a coarse-grained slot request can be satisfied by a slot of any size; we will further optimize the matching logic to better support mixed configuration;
  • Fourth, we will consider adapting to the Reactive Mode newly proposed by the community;
  • Finally, we will optimize the WebUI to display slot cutting information and more.


For more technical questions related to Flink, you can scan the QR code to join the community DingTalk group.
To get the latest technical articles and community news first, please follow the official account.


