This article introduces some of the resource management features of Flink 1.12, including memory management, resource scheduling, and the extended resource framework.
This article was compiled by community volunteer Chen Zhengyu from a talk given by Song Xintong (Apache Flink Committer, Alibaba technical expert) and Guo Yangze (Apache Flink Contributor, Alibaba senior development engineer). It mainly introduces the resource management features of Flink 1.12, divided into four parts:
- Memory management
- Resource Scheduling
- Extended resource framework
- Future plans
GitHub address
https://github.com/apache/flink
Everyone is welcome to give Flink a like and a star~
One, memory management
First, let's review the changes in Flink's memory model. The figure below shows the memory model introduced in Flink 1.10 and Flink 1.11. Although many modules are involved, 80%-90% of users only need to pay attention to the four parts actually used for task execution: Task Heap Memory, Task Off-Heap Memory, Network Memory, and Managed Memory.
Most of the other modules are Flink's framework memory and normally do not need to be adjusted; even if problems are encountered, they can usually be resolved with the help of the community documentation. Beyond that, "how much memory does a job need to meet actual production requirements" is a question everyone has to face: for example, whether a job's performance suffers because of insufficient memory, and whether resources are being wasted.
In response to the above, Flink 1.12 provides a brand-new Web UI for the TaskManager and the JobManager.
In the new Web UI, the configured value and the actual usage of each monitoring metric are mapped directly onto the memory model for intuitive display. On this basis, you can see more clearly how the job is actually running, what to tune, and which configuration parameters to adjust (the community also provides corresponding documentation). Through the new Web UI, everyone can better understand a job's memory usage, and memory management becomes much more convenient.
1. Local memory (Managed Memory)
Flink's managed memory is a kind of local memory unique to Flink: it is managed not by the JVM and its GC, but by Flink itself.
The characteristics of local memory are mainly reflected in two aspects:
- On the one hand, there is budget planning, which ensures that operators or tasks will not fail to run due to insufficient memory during job execution, and that resources will not be wasted by reserving too much memory that goes unused. At the same time, Flink can ensure that memory is released precisely when a task finishes, so that enough memory is available when the TaskManager executes a new task.
- On the other hand, resource adaptability is also one of the most important characteristics of managed memory: an operator's demand for memory is dynamically adjustable. With adaptability, an operator will not waste resources because it was given too much memory, nor will the whole job fail to run because it was given relatively little; memory usage is kept within a reasonable range.
Of course, when the allocated memory is relatively small, the job will be subject to certain restrictions. For example, frequent spilling to disk may be required to keep the job running, which can affect performance.
Currently, for managed memory, Flink's usage scenarios are as follows:
- RocksDB state backend: in streaming scenarios, all stateful operators in a slot share the same underlying RocksDB cache;
- Flink built-in operators: includes batch processing, Table/SQL, DataSet API and other operators; each operator has an independent resource budget and does not share with the others;
- Python processes: when users define UDFs in Python with PyFlink, a Python virtual machine process needs to be started.
2. Job Graph compilation and execution phases
Flink's management of managed memory is mainly divided into two stages.
2.1 Job Graph compilation phase of the job
Three issues need to be paid attention to at this stage:
- The first question: which operators or tasks in a slot will be executed at the same time. This determines how memory should be planned within a job, and whether other tasks will also need managed memory, so that the corresponding amount can be set aside. For a streaming job the answer is relatively simple: all operators must execute at the same time so that upstream output can be consumed by the downstream in time and the data can flow through the entire job graph. In batch-processing scenarios, however, there are in fact two data-shuffle modes:
- One is the pipelined mode. It is the same as the streaming mode, i.e. the bounded-stream processing method we mentioned earlier, and also requires upstream and downstream operators to run at the same time, with the upstream producing output and the downstream consuming it continuously.
- The other is the so-called batch blocking mode, which requires the upstream to finish outputting all of its data before the downstream can start reading it.
These two modes affect which tasks can be executed simultaneously. Currently in Flink, based on the type of each edge in the job topology graph (as shown above), we define a concept called the pipelined region: a subgraph connected entirely by pipelined edges. We identify these subgraphs and use them to determine which tasks will be executed at the same time.
- The second question: what usage scenarios exist in the slot. We just introduced three usage scenarios of managed memory. At this stage, for streaming jobs, Python UDFs and stateful operators may appear. What needs attention here is that it is not certain whether a stateful operator will use managed memory, because that depends on its state backend type:
- If it uses the RocksDB state backend, it needs managed memory;
- But if it uses the Heap state backend, it does not.
However, during the compilation phase of the job, the state type is not actually known yet; this is a point to watch out for (a configuration sketch follows after this list).
- The third question: for batch jobs, in addition to knowing the usage scenarios, we also need to be clear about one thing: as mentioned earlier, batch operators use managed memory in an operator-exclusive way rather than sharing it at slot granularity, so we need to know how much memory should be allocated to each different operator. This is currently set automatically by Flink when it schedules the job.
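Regarding the second question above: one reason the state type is unknown during Job Graph compilation is that the state backend is commonly chosen in the cluster configuration rather than in the job code. A minimal flink-conf.yaml sketch (the checkpoint path is an illustrative assumption):

# Choose RocksDB as the state backend for jobs on this cluster
state.backend: rocksdb
# Where checkpoints are stored (illustrative path)
state.checkpoints.dir: hdfs:///flink/checkpoints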
2.2 Execution phase
The first step is to determine, from the type of the state backend, whether RocksDB is used. As shown in the figure above, suppose a slot contains three operators A, B and C, where B and C both use Python and C is also a stateful operator. In the Heap case we take the upper branch, and there is only one managed-memory consumer in the whole slot: Python. The other case is the RocksDB state backend. After this first-step judgment, in the second step we decide how the slot's managed memory is shared according to the user's configuration:
- In this streaming example, we defined the weight of Python as 30% and the weight of the state backend as 70%. If Python is the only consumer, the Python part naturally uses 100% of the memory (the Heap state backend branch of Streaming);
- In the second case (the RocksDB state backend branch of Streaming), the two operators B and C share the 30% of memory reserved for the Python UDFs, while C exclusively uses the 70% reserved for the RocksDB state backend. Finally, Flink determines the amount of memory each operator can actually use based on the TaskManager's resource configuration and the amount of managed memory in a slot.
Batch processing differs from streaming in two respects. First, it does not need to determine the type of the state backend, which is a simplification. Second, as mentioned above, each batch operator uses managed memory exclusively. So we first calculate, from the configured weights, how much of the shared memory each usage scenario receives, and then further subdivide that share among the operators within it.
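To make this concrete, here is a small worked example with assumed numbers: suppose a slot has 1000 MB of managed memory and the default weights DATAPROC:70, PYTHON:30 apply. In the RocksDB branch above, C's RocksDB state backend receives 700 MB, while B and C together share the remaining 300 MB for their Python UDFs. If the slot contained no RocksDB state backend, the Python UDFs would receive the full 1000 MB, because the weights are only applied among the consumers actually present in the slot.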
3. Parameter configuration
| Configuration parameter | Default | Remark |
| --- | --- | --- |
| taskmanager.memory.managed.size | / | Absolute size |
| taskmanager.memory.managed.fraction | 0.4 | Relative size (fraction of Flink total memory) |
| taskmanager.memory.managed.consumer-weight | DATAPROC:70,PYTHON:30 | Weights assigned when multiple consumers coexist |
As the table above shows, the size of managed memory can be configured in two ways:
- one is the absolute-value configuration method;
- the other configures it as a fraction of the TaskManager's total memory.
taskmanager.memory.managed.consumer-weight is a newly added configuration item. Its data type is a map: each entry is a key, a colon and a value, with entries separated by commas. We currently support two kinds of consumer keys:
- one is DATAPROC, which covers both the state backend memory in stream processing and the Batch Operator memory in batch processing;
- the other is PYTHON.
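As a minimal sketch, a flink-conf.yaml using these options might look like the following (the concrete values are illustrative assumptions, not recommendations):

# Either an absolute size (takes precedence if both are configured) ...
taskmanager.memory.managed.size: 1024m
# ... or a fraction of Flink total memory (0.4 is the default)
taskmanager.memory.managed.fraction: 0.4
# Weights used when several consumers coexist in a slot
taskmanager.memory.managed.consumer-weight: DATAPROC:70,PYTHON:30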
Two, resource scheduling
Some resource-scheduling features are frequently asked about in version release discussions and on the mailing lists, so we introduce them here as well.
1. Maximum number of slots
Flink 1.12 supports limiting the maximum number of slots (slotmanager.number-of-slots.max). As mentioned before, for streaming jobs we require all operators to execute at the same time to keep the data flowing smoothly. In that case, the job's parallelism determines how many slots and resources the task needs.
However, this is not the case for batch jobs. A batch job can often have a large degree of parallelism while not actually needing that many resources: after the earlier tasks finish running, their slots are freed for subsequent tasks. Executing tasks serially in this way avoids excessive occupation of YARN/Kubernetes cluster resources. Currently this parameter is supported on YARN, Mesos and native Kubernetes.
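A minimal configuration sketch (the limit of 10 slots is just an assumed example):

# Cap the total number of slots the cluster may allocate
slotmanager.number-of-slots.max: 10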
2. TaskManager fault tolerance
In real production, problems such as program errors, network jitter and hardware failures may make a TaskManager unreachable or even crash it; what we commonly see in the log is a TaskManagerLost error. In that case the job needs to be restarted, and during the restart resources must be re-applied for and the TaskManager process restarted, which is very costly.
For jobs with relatively high stability requirements, Flink 1.12 provides a new feature that supports keeping a small number of redundant TaskManagers in the Flink cluster. These redundant TaskManagers enable quick recovery from single points of failure, without waiting for a new resource-application process.
Redundant TaskManagers can be enabled by configuring slotmanager.redundant-taskmanager-num. "Redundant" here does not mean that two TaskManagers run completely idle; it means there will be that many extra TaskManagers (two in this example) beyond the total resources the job needs.
Tasks may be distributed relatively evenly across all of them, so that besides making use of otherwise idle TaskManagers, a reasonably good load balance is achieved. Once a failure occurs, tasks can be quickly rescheduled onto the surviving TaskManagers, and only then does a new round of resource application proceed. Currently this parameter is supported on YARN, Mesos and native Kubernetes.
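A sketch of the corresponding configuration, assuming two redundant TaskManagers as in the example above:

# Keep two extra TaskManagers as warm standbys for fast failover
slotmanager.redundant-taskmanager-num: 2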
3. Task tiling
The task-tiling problem mainly occurs in Flink Standalone mode or in older-version Kubernetes deployments. In these modes the number of TaskManagers, and the number of slots on each TaskManager, are defined in advance, which often leads to uneven scheduling: some TaskManagers are packed full of tasks while others are only loosely occupied.
The parameter cluster.evenly-spread-out-slots, introduced in version 1.11, can make the scheduling relatively balanced.
Notice:
- First, this parameter only targets Standalone mode, because in the YARN and Kubernetes modes the number of TaskManagers is determined by the actual needs of the job: there is a demand first and then the TaskManagers, not TaskManagers first and then a slot-scheduling demand. Each time a task is scheduled, Flink can only see the currently registered TaskManagers; it has no way to know globally how many TaskManagers will register later. This is why many people ask why the feature does not seem to work well after being turned on. That is the first point.
- The second point to note is that we can only decide how many free slots there are on each TaskManager; we cannot control how each operator's parallel instances are placed. Flink cannot guarantee that each operator is distributed uniformly across TaskManagers, because in Flink's resource-scheduling logic, tasks are completely invisible at the slot-allocation level.
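A sketch of enabling this option on a Standalone cluster:

# Spread slots evenly across all registered TaskManagers
cluster.evenly-spread-out-slots: true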
Three, extended resource framework
1. Background
In recent years, with the continuous development of artificial intelligence, deep learning models have been applied to a variety of production needs, in typical scenarios such as recommendation systems, advertisement push and intelligent risk control. These are also scenarios where Flink has long been widely used, so supporting artificial intelligence has always been one of the long-term goals of the Flink community. Many third-party open-source extensions already work toward this goal; the open-source work from Alibaba mainly consists of two projects:
- One is Flink AI Extended, a deep-learning extension framework based on Flink that currently supports integrating TensorFlow, PyTorch and other frameworks. It allows users to treat TensorFlow as an operator and place it in a Flink task.
- The other is Alink, a general-purpose algorithm platform based on Flink that also contains many commonly used machine learning algorithms.
The above two projects extend Flink functionally. From the perspective of computing power, however, deep learning models and machine learning algorithms are usually the computational bottleneck of the whole task. The GPU is widely used in this field to accelerate training and prediction, so supporting GPU resources to accelerate computation is an indispensable function for Flink in the development of the AI field.
2. Use extended resources
At present, the only resource dimensions Flink supports for user configuration are CPU and memory. In practice, besides the GPU we will also encounter other resource requirements, such as SSDs or network-acceleration devices like RDMA. We therefore want to provide a universal extended resource framework: any extended resource can be added to this framework as a plug-in, and the GPU is just one kind of extended resource.
For the use of extended resources, two general requirements can be abstracted:
- It needs to support the configuration and scheduling of such extended resources. The user can declare requirements for them in the configuration, for example one GPU card per TaskManager, and when Flink is deployed on a resource base such as Kubernetes/YARN, it must forward the user's extended-resource requirements to ensure that the requested Container/Pod contains the corresponding extended resources.
- It needs to provide the operator with extended-resource information at runtime. The user may need some runtime information in a custom operator in order to use the extended resources. Taking the GPU as an example, the operator needs to know which GPU card its internal model should be deployed on, so this information must be provided to the operator.
3. How to use the extended resource framework
Using the extended resource framework can be divided into the following 3 steps:
- First, set the relevant configuration for the extended resource;
- Then prepare the plug-ins in the extended resource framework for the required extended resources;
- Finally, in the operator, obtain the extended-resource information from the RuntimeContext and use these resources.
3.1 Configuration parameters
# Define the names of the extended resources; here "gpu"
external-resources: gpu
# Define the number of GPUs required per TaskManager
external-resource.gpu.amount: 1
# Define the configuration keys of the extended resource in Yarn or Kubernetes
external-resource.gpu.yarn.config-key: yarn.io/gpu
external-resource.gpu.kubernetes.config-key: nvidia.com/gpu
# Define the factory class of the GPUDriver plugin
external-resource.gpu.driver-factory.class: org.apache.flink.externalresource.gpu.GPUDriverFactory
The above is a configuration example that uses GPU resources:
- For any extended resource, the user first needs to add its name to "external-resources". This name is also used as the prefix for the resource's other configuration options. In the example, we define a resource named "gpu".
- At the scheduling layer, extended-resource requirements can currently be configured at TaskManager granularity. In the example, we define the number of GPU devices on each TaskManager as 1.
- When deploying Flink on Kubernetes or Yarn, we also need to configure the extended resource's configuration key on the corresponding resource base, so that Flink can forward the resource requirement. The configuration corresponding to the GPU is shown in the example.
- If a plug-in is provided, the factory class name of the plug-in needs to be put into the configuration.
3.2 Preparation
Before actually using the extended resources, some preparatory work needs to be done. Take the GPU as an example:
- In Standalone mode, the cluster administrator needs to ensure that GPU resources are visible to the TaskManager process.
- In Kubernetes mode, the cluster needs to support the Device Plugin (Kubernetes 1.10 or later), and the plug-in corresponding to the GPU must be installed in the cluster.
- In Yarn mode, GPU scheduling requires a cluster Hadoop version of 2.10+ or 3.1+, with resource-types.xml and related files configured correctly.
3.3 Extended Resource Framework Plugin
After the extended resource has been scheduled, a user-defined operator may also need the extended resource's information at runtime in order to use it. The plug-in in the extended resource framework is responsible for obtaining this information; its interface is as follows:
public interface ExternalResourceDriverFactory {
    /**
     * Create the driver of the extended resource according to the provided settings.
     */
    ExternalResourceDriver createExternalResourceDriver(Configuration config) throws Exception;
}

public interface ExternalResourceDriver {
    /**
     * Retrieve the information of the required amount of extended resources.
     */
    Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount) throws Exception;
}
The ExternalResourceDriver is started on each TaskManager, and the extended resource framework calls each driver's retrieveResourceInfo interface to obtain the extended-resource information on that TaskManager, then passes the obtained information to the operators' RuntimeContext. ExternalResourceDriverFactory is the plug-in's factory class.
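As a minimal sketch of what a custom plug-in could look like under the interfaces above: the FpgaDriverFactory, FpgaDriver and FpgaInfo names are hypothetical, not part of Flink, and the imports assume the Flink 1.12 package org.apache.flink.api.common.externalresource. The driver here simply reports device indexes 0..amount-1.

import org.apache.flink.api.common.externalresource.ExternalResourceDriver;
import org.apache.flink.api.common.externalresource.ExternalResourceDriverFactory;
import org.apache.flink.api.common.externalresource.ExternalResourceInfo;
import org.apache.flink.configuration.Configuration;

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Factory class referenced from external-resource.<name>.driver-factory.class. */
public class FpgaDriverFactory implements ExternalResourceDriverFactory {
    @Override
    public ExternalResourceDriver createExternalResourceDriver(Configuration config) {
        return new FpgaDriver();
    }
}

/** Driver started on each TaskManager; reports device indexes 0..amount-1. */
class FpgaDriver implements ExternalResourceDriver {
    @Override
    public Set<FpgaInfo> retrieveResourceInfo(long amount) {
        Set<FpgaInfo> infos = new HashSet<>();
        for (long i = 0; i < amount; i++) {
            infos.add(new FpgaInfo(String.valueOf(i)));
        }
        return infos;
    }
}

/** Per-device info exposing a single "index" property to operators. */
class FpgaInfo implements ExternalResourceInfo {
    private final String index;

    FpgaInfo(String index) {
        this.index = index;
    }

    @Override
    public Optional<String> getProperty(String key) {
        return "index".equals(key) ? Optional.of(index) : Optional.empty();
    }

    @Override
    public Collection<String> getKeys() {
        return Collections.singleton("index");
    }
}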
4. GPU plug-in
Flink currently provides a built-in plug-in for GPU resources. It executes a script called the discovery script to obtain the GPU information available in the current environment; at present this information includes the index of each GPU device.
Flink provides a default script, located in the "plugins/external-resource-gpu/" directory of the distribution. Users can also implement a custom discovery script and specify it through configuration (see the sketch after the following list). The contract between the script and the GPU plug-in is:
- When the script is called, the number of GPUs required is passed in as the first argument, followed by the user-defined list of parameters.
- If the script executes normally, it outputs a list of GPU indexes, separated by commas.
- If an error occurs in the script or the execution result is not as expected, the script exits with a non-zero value. This causes TaskManager initialization to fail, and the script's error message is printed in the log.
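Pointing the plug-in at a custom script goes through the plug-in's parameters; based on the options documented for the GPU plug-in, it might look like this (the path and argument are assumed examples):

# Use a custom discovery script instead of the default one
external-resource.gpu.param.discovery-script.path: /opt/flink/scripts/my-gpu-discovery.sh
# Extra user-defined arguments passed to the script after the GPU amount
external-resource.gpu.param.discovery-script.args: --my-custom-arg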
The default script provided by Flink uses the "nvidia-smi" tool to obtain the number and indexes of the GPUs available on the current machine, and returns a GPU index list of the requested size. When the required number of GPUs cannot be obtained, the script exits with a non-zero value.
A GPU device's resources come in two dimensions, stream processors and video memory, and the video memory only supports exclusive use. When multiple TaskManagers run on the same machine, a GPU used by multiple processes may therefore run out of video memory (OOM). So in Standalone mode, a TaskManager-level resource isolation mechanism is needed.
The default script provides a Coordination Mode to support GPU resource isolation between multiple TaskManager processes on a single machine. This mode uses a file lock to synchronize GPU usage information among processes and to coordinate the use of GPU resources by the TaskManager processes on the same machine.
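Coordination Mode is enabled through the default script's arguments; based on the flag documented for the default script, the configuration might look like this:

# Let TaskManager processes on one machine coordinate GPU usage via a file lock
external-resource.gpu.param.discovery-script.args: --enable-coordination-mode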
5. Get extended resource information in the operator
In a user-defined operator, the resource name defined in "external-resources" can be used to call the getExternalResourceInfos interface of the RuntimeContext and obtain the corresponding extended-resource information. Taking the GPU as an example, each ExternalResourceInfo obtained represents one GPU card, and the field named "index" in it is the device index of that GPU card.
public class ExternalResourceMapFunction extends RichMapFunction<String, String> {
    private static final String RESOURCE_NAME = "gpu";

    @Override
    public String map(String value) {
        Set<ExternalResourceInfo> gpuInfos = getRuntimeContext().getExternalResourceInfos(RESOURCE_NAME);
        List<String> indexes = gpuInfos.stream()
                .map(gpuInfo -> gpuInfo.getProperty("index").get()).collect(Collectors.toList());
        // Map function that uses the GPUs identified by "indexes"
        // ...
    }
}
6. MNIST Demo
The following uses the MNIST dataset recognition task to demonstrate accelerating a Flink job with GPUs.
As shown in the figure above, MNIST is a dataset of handwritten digit images, and each image can be represented as a 28×28 matrix. In this task, we use a pre-trained DNN model: the image input passes through a layer of fully connected network to produce a 10-dimensional vector, and the index of the vector's largest element is the recognition result.
We start a Standalone cluster with two TaskManager processes on an ECS instance with two GPU cards. With the Coordination Mode provided by the default script, we can ensure that each TaskManager uses one of the two GPU cards.
The core operator of this task is the image recognition function MNISTClassifier; its core implementation is as follows.
class MNISTClassifier extends RichMapFunction<List<Float>, Integer> {
    @Override
    public void open(Configuration parameters) {
        // Get the GPU information and select the first GPU
        Set<ExternalResourceInfo> externalResourceInfos = getRuntimeContext().getExternalResourceInfos(resourceName);
        final Optional<String> firstIndexOptional = externalResourceInfos.iterator().next().getProperty("index");
        // Initialize the JCUDA components with the index of the first GPU
        JCuda.cudaSetDevice(Integer.parseInt(firstIndexOptional.get()));
        JCublas.cublasInit();
    }
}
In the open method, we get the GPUs available to the current TaskManager from the RuntimeContext and select the first one to initialize the JCuda and JCublas libraries.
class MNISTClassifier extends RichMapFunction<List<Float>, Integer> {
    @Override
    public Integer map(List<Float> value) {
        // Perform the matrix-vector multiplication with JCublas
        JCublas.cublasSgemv('n', DIMENSIONS.f1, DIMENSIONS.f0, 1.0f,
                matrixPointer, DIMENSIONS.f1, inputPointer, 1, 0.0f, outputPointer, 1);
        // Fetch the multiplication result and derive the digit the image represents
        JCublas.cublasGetVector(DIMENSIONS.f1, Sizeof.FLOAT, outputPointer, 1, Pointer.to(output), 1);
        JCublas.cublasFree(inputPointer);
        JCublas.cublasFree(outputPointer);
        int result = 0;
        for (int i = 0; i < DIMENSIONS.f1; ++i) {
            result = output[i] > output[result] ? i : result;
        }
        return result;
    }
}
In the map method, the pre-trained model parameters and the input matrix are put into GPU video memory, JCublas is used to perform the matrix multiplication on the GPU, and finally the result vector is copied out of GPU video memory to obtain the recognized digit.
For the concrete demonstration, you can watch the video or follow the link on GitHub to try it out.
Four, future plans
In addition to the features described above, the Apache Flink community is also actively preparing more resource management optimizations, which will be available in future versions.
- Passive resource scheduling mode: managed memory allows Flink tasks to adapt flexibly to different TaskManager/Slot resources, make full use of what is available, and provide the best computing power under the given resource constraints. However, the user still needs to specify the parallelism of the computing task, and Flink must acquire TaskManagers/Slots matching that parallelism for the job to execute smoothly. Passive resource scheduling will enable Flink to change the parallelism dynamically based on the available resources, processing data on a best-effort basis when resources are insufficient and restoring the specified parallelism, and thus the processing performance, once resources are sufficient.
- Fine-grained resource management: Flink's current resource management and scheduling mechanism is slot-based and assumes all slots have the same specification. For some complex large-scale production tasks, it is often necessary to split the computing task into multiple subgraphs, each executed by a slot. When resource requirements differ greatly between subgraphs, slots of a single specification often cannot meet resource-efficiency requirements, especially for expensive extended resources such as GPUs. Fine-grained resource management allows users to specify resource requirements per job subgraph; Flink will then use TaskManagers/Slots of different specifications to execute the computing tasks according to those requirements, thereby optimizing resource efficiency.
Five, summary
Through this article, I believe everyone now has a clearer understanding of Flink's memory management.
- First, we walked through managed memory, the Job Graph compilation phase and the execution phase to answer how memory is managed and allocated in each process, and how to control the TaskManager's memory allocation through the new configuration parameters;
- Then, we looked at commonly encountered resource-scheduling problems, including using the maximum number of slots, how TaskManager fault tolerance works, and how to distribute task resources evenly through task tiling;
- Finally, GPUs are widely used in machine learning and deep learning to accelerate computation. By explaining how Flink 1.12 uses the extended resource framework, together with a demo, we showed how resource extension works. On resource utilization, the community has two future plans: the passive resource scheduling mode and fine-grained resource management.
Six, appendix
[1] Accelerating your workload with GPU and other external resources
[2] Extended Resource Framework Document
[3] FLIP-108: Add GPU support in Flink