Improvements of Job Scheduler and Query Execution on Flink OLAP

Abstract: This article is compiled from the talk given by ByteDance infrastructure engineer Fang Yong in the core technology session of Flink Forward Asia 2021. The main contents include:
Background
Problems and analysis
Scheduling and execution optimization
Future plans


1. Background

Many business teams at ByteDance have mixed computing needs and want a single system that supports both transactional (TP) and analytical (AP) computing.

The figure above is the overall architecture of our HTAP system. The TP side uses our internally developed database as the TP computing engine, and the AP side uses Flink as the AP computing engine. We use the MySQL protocol as the unified external entry point. If a query is an AP computation, it is forwarded to the Flink Gateway, which generates an execution plan and submits it to the Flink engine for execution. The AP side has a columnar store, and the Flink engine interacts with the store's metadata through the Catalog interface and with the storage layer through the Connector interface. After the AP computation is completed, the client sends a proxy data request to the Flink Gateway, and the Gateway proxies the request to the Flink cluster to fetch the result data. At this point, the query interaction and computation of the entire AP query are complete.

As we all know, Flink is a unified stream and batch computing engine that supports both stream computing and batch computing. Why do many systems also choose Flink for OLAP computing?

We compared Flink with Presto. First, from the perspective of architecture, Flink supports a variety of deployment modes. A Flink session cluster is a very typical MPP architecture, which is the premise and foundation for Flink to support OLAP computing. Flink's computing execution can be divided into four parts: the execution plan, job runtime management, computing task execution management, and cluster deployment and failover management. Judging from the architecture and functional module diagrams of Presto and Flink OLAP, the two systems differ greatly in how they implement these computing functions, but the system capabilities and module functions they provide are basically the same.

Therefore, in terms of architecture and functionality, the Flink engine can fully support the computing requirements of OLAP.

Inside ByteDance, Flink was originally used for stream computing. Later, because of Flink's unified stream and batch capability, we also used Flink as a batch computing engine for some real-time data warehouse scenarios. The final choice of Flink as the AP computing engine is mainly based on three considerations:

First, a unified engine reduces operation and maintenance costs. We have very rich experience in operating, maintaining, and optimizing Flink. Building on unified stream and batch processing, using Flink as the AP computing engine reduces development, operation, and maintenance costs;
Second, ecosystem support. Flink connects to many storage systems, many business teams already use Flink SQL to develop streaming and batch jobs, and we have internally developed integrations with many other systems, so it is very convenient for users to use Flink for OLAP computing;
The last one is the performance advantage. We have run TPC-DS benchmark tests internally. Compared with Presto and Spark SQL, the Flink computing engine is not inferior in computing performance, and even has an advantage in some queries.

2. Problems and Analysis

First, let's introduce how we use Flink for OLAP computing.

In the access layer, we use Flink SQL Gateway to provide a REST protocol that receives SQL queries directly. In terms of architecture, we launch a Flink session cluster on K8s, which is a very typical MPP architecture. In terms of computing mode, we use batch mode together with the full pull-up scheduling mode, in which all computing tasks of a job are started together. This reduces data materialization between computing nodes and improves the performance of OLAP computing.

In the Flink OLAP calculation process, there are mainly the following problems:

First, compared with streaming and batch computing, the biggest feature of Flink OLAP computing is that its jobs are small and finish within seconds or milliseconds. Jobs frequently apply for memory, network, and disk resources during startup, which produces a large number of resource fragments in the Flink cluster.
Another major feature of OLAP is that query jobs have requirements for both latency and QPS. The engine must provide relatively high concurrent scheduling and execution capabilities while keeping latency under control, which is a new requirement for the Flink engine.

In order to test Flink's ability to perform OLAP computing, we ran benchmark tests of Flink job scheduling. We designed three groups of jobs of different complexity: a single-node job, a two-node WordCount job, and a six-node join job. The concurrency of every computing node in each group is 128.

We selected 5 physical machines to start a Flink session cluster with more than 10,000 slots. We also implemented a benchmark client that submits jobs concurrently from multiple threads, and then counted the number of jobs completed within 10 minutes and the average latency of the completed jobs.
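The measurement loop of such a benchmark client can be sketched roughly as follows. This is only an illustrative sketch, not the client used in the talk: `run_benchmark` and `submit_job` are hypothetical names, and `submit_job` stands in for a blocking call that returns once a job has finished.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(submit_job, num_threads, duration_s):
    """Submit jobs from num_threads workers until duration_s has elapsed.

    submit_job is a hypothetical blocking callable that returns when a job
    has finished. Returns (completed_jobs, qps, average_latency_seconds).
    """
    deadline = time.monotonic() + duration_s

    def worker():
        latencies = []
        # Each worker submits jobs back to back until the deadline passes.
        while time.monotonic() < deadline:
            start = time.monotonic()
            submit_job()
            latencies.append(time.monotonic() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(worker) for _ in range(num_threads)]
        latencies = [lat for f in futures for lat in f.result()]

    completed = len(latencies)
    qps = completed / duration_s
    avg_latency = sum(latencies) / completed if completed else 0.0
    return completed, qps, avg_latency
```

With this shape, the test described above would correspond to calling `run_benchmark` with a duration of 600 seconds and 1 to 32 threads.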

The result is shown in the figure below.

First, the QPS results:

For the single-node job, the QPS is 7.81 when the client is single-threaded; with 4 threads it already reaches the QPS limit of about 17;
For the two-node WordCount job, the QPS is 1.38 when the client is single-threaded and 7.53 with 32 threads;
The join job performs worst. The QPS is only 0.44 when the client is single-threaded, and only 2.17 when the number of threads increases to 32.

Next, the latency results:

As the number of client threads increases, the latency of the single-node job increases from more than 100 milliseconds to 2 seconds, the latency of the WordCount job increases from more than 700 milliseconds to 4 seconds, and the latency of the join job increases from 2 seconds to more than 15 seconds, a several-fold increase.

Such job scheduling performance is unacceptable during online use.

To improve the performance of concurrent job scheduling in Flink, we first tried simple optimizations for some of the bottlenecks, but the results were not ideal. Therefore, we analyzed the whole path of Flink job scheduling and execution, divided the execution of a Flink job into three main stages (job management, resource application, and computing task execution), and then optimized and improved the performance of each stage.

3. Scheduling and execution optimization

3.1 Optimization of job management

The first is job management optimization. Flink receives and manages jobs through the Dispatcher module. The entire job execution process can be divided into four steps: initialization, job execution preparation, starting job execution, and ending job execution.

There are 3 thread pools inside the Dispatcher responsible for executing the 4 steps of the job, namely Netty/Rest, Dispatcher Actor and Akka thread pool. According to testing and analysis:

The default size of the Netty/Rest thread pool is too small;
The Dispatcher Actor is a single point of processing and performs some very heavyweight job operations;
The Akka thread pool is too busy. It is responsible not only for job management in the Dispatcher, but also for much of the actual job execution in the JobMaster and for resource management in the ResourceManager.

We optimized each of these problems. We increased the size of the Netty/Rest thread pool, split up the job operations, and created two independent thread pools, the IO thread pool and the Store thread pool. These pools perform the heavyweight operations in the job management process and reduce the pressure on the Dispatcher Actor and the Akka thread pool.

While a job is running, there are many scheduled tasks, including timeout checks and heartbeat checks between job modules, and timeout checks during the job's resource application. In Flink's current implementation, these timeout tasks are placed in the Akka thread pool, which schedules and executes them. Even after a job has ended, there is no way to reclaim and release them directly, so a large number of scheduled tasks accumulate in the Akka thread pool and cause many full GCs on the JobManager node. About 90% of the memory of the JobManager process is occupied by these scheduled tasks.

We optimized this as well. When a job starts, a job-level local thread pool is created for it. The job's scheduled tasks are submitted to the local thread pool first, and when a task actually needs to run, the local thread pool hands it to the Akka thread pool for direct execution. After the job finishes, the local thread pool is released directly, so its scheduled tasks are reclaimed quickly.
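The idea of holding a job's timers locally and dropping them when the job ends can be sketched as below. This is a simplified illustration under stated assumptions, not Flink code: `JobTimerService` is a hypothetical name, `shared_execute` stands in for the shared Akka thread pool, and time is passed in explicitly so the behavior is deterministic.

```python
import heapq
import itertools


class JobTimerService:
    """Job-local holder for scheduled tasks (hypothetical name)."""

    def __init__(self, shared_execute):
        self._shared_execute = shared_execute  # stands in for the shared Akka pool
        self._pending = []                     # min-heap of (due_time, seq, task)
        self._seq = itertools.count()          # tie-breaker so tasks are never compared
        self._closed = False

    def schedule(self, due_time, task):
        """Register a task locally; nothing reaches the shared pool yet."""
        if self._closed:
            return False
        heapq.heappush(self._pending, (due_time, next(self._seq), task))
        return True

    def fire_due(self, now):
        """Hand every task whose due time has arrived to the shared pool."""
        fired = 0
        while self._pending and self._pending[0][0] <= now:
            _, _, task = heapq.heappop(self._pending)
            self._shared_execute(task)
            fired += 1
        return fired

    def close(self):
        """Called when the job finishes: drop all timers that have not fired."""
        dropped = len(self._pending)
        self._pending.clear()
        self._closed = True
        return dropped
```

Because pending timers live only in the job-local structure, closing it when the job ends releases them at once instead of leaving them queued in the shared pool.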

3.2 Resource application optimization

ByteDance currently uses Flink 1.11 internally. In this version, resources are applied for mainly at slot granularity. Because we use the full pull-up scheduling mode, a job waits until all of its slot applications have completed before it schedules its computing tasks. For example, suppose the resource manager has 4 slots and two jobs apply for resources concurrently, each requiring three slots. If each job obtains only two slots, the two jobs wait for each other's slots, resulting in a deadlock.

To solve this problem, we changed resource application from slot granularity to job-level batch granularity. Batch resource application has two main difficulties:

The first is compatibility with the original slot-granularity resource application. Many mechanisms, such as the resource application timeout, are based on slot granularity, so the slot-granularity and batch-granularity mechanisms need to be integrated seamlessly;
The second is the transactional nature of batch resource application. We need to ensure that the resources in a batch are applied for or released together, and if an exception occurs, all of these resource applications must be cancelled together.
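The all-or-nothing behavior of a batch application can be illustrated with the small sketch below. It is a toy model under stated assumptions, not Flink's resource manager: `SlotPool`, `request_batch`, and `release_batch` are hypothetical names, and slots are modeled as a simple counter.

```python
import threading


class SlotPool:
    """Toy resource manager that grants slots per job batch, all or nothing."""

    def __init__(self, total_slots):
        self._free = total_slots
        self._lock = threading.Lock()
        self._held = {}  # job_id -> number of slots held

    def request_batch(self, job_id, count):
        """Grant all `count` slots atomically, or none of them."""
        with self._lock:
            if job_id in self._held or count > self._free:
                return False
            self._free -= count
            self._held[job_id] = count
            return True

    def release_batch(self, job_id):
        """Return every slot of the job at once (job finished or failed)."""
        with self._lock:
            count = self._held.pop(job_id, 0)
            self._free += count
            return count

    def free_slots(self):
        with self._lock:
            return self._free
```

In the 4-slot example from the text, the first job's request for three slots succeeds, the second job's request is rejected as a whole instead of holding one slot, and it can succeed once the first job releases its batch.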

3.3 Task Execution Optimization

3.3.1 Connection multiplexing between jobs

The first is the problem of connection multiplexing. In Flink, upstream and downstream computing tasks transmit data through channels. Within a Flink job, the network connections of the same computing node can be reused, but network connections cannot be reused between different jobs. After all computing tasks of a job have finished, its network connections between TaskManagers are closed and released. When another job runs, the TaskManagers need to create new network connections. Connections are therefore created and closed frequently while jobs execute, which ultimately hurts the latency and QPS of computing tasks and query jobs. It also makes resource usage unstable, increasing CPU usage and causing peaks and valleys in it.

Multiplexing TaskManager network connections across jobs has the following main difficulties:

First, stability. A channel is used not only for data transmission but also for the back pressure of computing tasks, so multiplexing connections may cause problems such as starvation and deadlock of computing tasks;
Second, dirty data. When different jobs multiplex the same connections, dirty data may be produced while computing tasks execute;
Third, the expansion and recycling of network connections. For network connections that are no longer used, we need to detect them, close them, and release their resources in time.

We implemented connection multiplexing between Flink jobs. The main solution is to add a network connection pool to the TaskManager. When a job needs to create a network connection, it sends a request to the connection pool, and the pool creates a connection or reuses an existing one as needed. After a computing task finishes, it releases the connection back to the connection pool.

To keep the system stable, the way connections are used within an existing Flink job remains unchanged. Each network connection has three states: idle, busy, and invalid, and the connection pool manages these states. The pool also creates network connections on demand, performs additional checks on them, and recycles connections, with a background scheduled task that checks connection status.
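The three connection states and the acquire/release cycle can be sketched as follows. This is a simplified illustration, not the actual implementation: all class and method names are hypothetical, a connection is modeled as a plain object, each connection is held by one user at a time, and the background check is reduced to an explicit `evict_invalid` call.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    BUSY = "busy"
    INVALID = "invalid"


class PooledConnection:
    """Placeholder for a real network connection to a remote TaskManager."""

    def __init__(self, remote):
        self.remote = remote
        self.state = State.BUSY


class ConnectionPool:
    def __init__(self, connect=PooledConnection):
        self._connect = connect  # hypothetical factory that opens a connection
        self._connections = []
        self.created = 0

    def acquire(self, remote):
        """Reuse an idle connection to `remote`, or create one on demand."""
        for conn in self._connections:
            if conn.remote == remote and conn.state is State.IDLE:
                conn.state = State.BUSY
                return conn
        conn = self._connect(remote)
        conn.state = State.BUSY
        self._connections.append(conn)
        self.created += 1
        return conn

    def release(self, conn):
        """A finished task hands the connection back for later reuse."""
        if conn.state is State.BUSY:
            conn.state = State.IDLE

    def invalidate(self, conn):
        """Mark a broken connection so that it is never reused."""
        conn.state = State.INVALID

    def evict_invalid(self):
        """Background check: drop invalid connections, return how many."""
        before = len(self._connections)
        self._connections = [
            c for c in self._connections if c.state is not State.INVALID
        ]
        return before - len(self._connections)
```

A second job that acquires a connection to the same TaskManager after the first job released it gets the existing connection back, so no new connection is created.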

3.3.2 PartitionRequest Optimization

The second part is partition request optimization, which covers two aspects: batch optimization and notification mechanism optimization.

After the upstream and downstream computing tasks in a job create a connection, the downstream computing task sends a partition request message to the upstream task, telling it which partition's data it needs to receive. The biggest problem with partition request messages is that there are too many of them: the number grows with the square of the concurrency of the upstream and downstream computing nodes.

The main purpose of batch optimization is to bundle the partition request messages between upstream and downstream computing tasks that run in the same TaskManager, which reduces the number of partition requests. For example, with a concurrency of 100 and two TaskManagers, the number of partition requests drops from 10,000 to 4, that is, from the square of the concurrency to the square of the number of TaskManagers. The improvement is very significant.
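The arithmetic behind this example can be written out as a short sketch. It assumes an all-to-all exchange in which every downstream task requests a partition from every upstream task, and `partition_requests` is a hypothetical helper name.

```python
def partition_requests(parallelism, task_managers):
    """Return (unbatched, batched) partition request counts.

    Assumes an all-to-all exchange: every downstream task sends one request
    to every upstream task. With batching, requests are merged per pair of
    TaskManagers instead of per pair of tasks.
    """
    unbatched = parallelism * parallelism
    batched = task_managers * task_managers
    return unbatched, batched


# The example from the text: concurrency 100 on two TaskManagers.
print(partition_requests(100, 2))  # (10000, 4)
```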

Since the upstream and downstream computing tasks are deployed in parallel, a downstream computing task may finish deploying before its upstream computing task has started to be deployed. When the downstream computing task then sends a partition request to the upstream, the upstream TaskManager returns a partition-not-found exception, and the downstream computing task keeps retrying and polling until the request succeeds.

There are two problems in this process. One is that the number of partition requests is too large. The other is the time gaps introduced while the downstream computing task polls and retries, which increase the latency of computing tasks. So we implemented a listen+notify mechanism for the interaction between upstream and downstream computing tasks. When the upstream TaskManager receives a partition request from a downstream computing task and the upstream computing task has not been deployed yet, it puts the partition request into a listen list. After the upstream computing task is deployed, the TaskManager takes the partition request from the listen list and invokes the callback to complete the interaction.
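A minimal sketch of the listen+notify idea is shown below. It is illustrative only: `PartitionRegistry`, `request`, and `register` are hypothetical names, the sketch is single-threaded, and a real implementation would also need locking and timeouts for parked requests.

```python
class PartitionRegistry:
    """Upstream-side bookkeeping for partition requests (hypothetical)."""

    def __init__(self):
        self._partitions = {}  # partition_id -> partition object
        self._listeners = {}   # partition_id -> callbacks waiting for it

    def request(self, partition_id, callback):
        """Downstream request: answer now if deployed, otherwise park it."""
        if partition_id in self._partitions:
            callback(self._partitions[partition_id])
            return True
        # Instead of replying "partition not found", remember the request.
        self._listeners.setdefault(partition_id, []).append(callback)
        return False

    def register(self, partition_id, partition):
        """Upstream task deployed: publish the partition and notify waiters."""
        self._partitions[partition_id] = partition
        waiters = self._listeners.pop(partition_id, [])
        for callback in waiters:
            callback(partition)
        return len(waiters)
```

The downstream task sends a single request and is called back once the upstream task is deployed, so the retry loop and its waiting intervals disappear.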

3.3.3 Network Memory Pool Optimization

The last part is network memory optimization. After a TaskManager starts, it pre-allocates a block of memory as the network memory pool. When a computing task is deployed in the TaskManager, a local memory pool is allocated from the network memory pool and added to the network memory pool's list of local memory pools. Whenever a computing task creates a local memory pool, applies for memory segments, or releases its local memory pool, the network memory pool traverses the list of local memory pools. When many computing tasks run in parallel in the TaskManager, the number of traversals becomes very large. It is the number of slots multiplied by the upstream concurrency, and can even reach the order of tens of millions.

The main purpose of the traversal is to release free memory segments held by other local memory pools ahead of time and so improve memory utilization. Our main optimization is to remove this traversal. Although this wastes some memory, it greatly improves the execution performance of computing tasks.

In addition, we have made many other optimizations and modifications. In terms of scheduling, we support a scheduling mode that combines full pull-up and blocking. In terms of execution plans, we implemented many computation push-down optimizations that push computation down to the storage layer. In terms of task execution, we made many optimizations to task startup and initialization.

3.4 Benchmark

The figure above shows the benchmark results after optimization.

In terms of QPS, the single-node job has increased from the original 17 to 33, the two-node WordCount job from a peak of 7.5 to about 20, and the join job from a peak of about 2 to about 11. The effect is very obvious.

The latency improvement is also very good. With 32 threads, the latency of the single-node job is reduced from the original 1.8 seconds to about 200 milliseconds, and that of the two-node WordCount job from 4 seconds to less than 2 seconds. The most obvious improvement is the join job, which drops from the original 15 seconds to about 2.5 seconds.

4. Future plans

Although Flink OLAP has been used in actual business scenarios, I think it has only gone from 0 to 1. In the future, we hope to do better, from 1 to 100. The indicators of the Flink OLAP system can be mainly divided into three parts: stability, performance and function.

In terms of stability, we must first improve the stability of single points, including the single point of resource management and the single point of job management. Second, we will improve the management of resource usage and computing threads at runtime, as well as the management of OLAP computing results. There are also many other stability-related optimizations.

In terms of performance, the work includes plan optimization, fine-grained optimization of computing task execution and management, and optimization of mixed row-based and column-based computing, all of which can greatly improve the performance of Flink OLAP computing.

In terms of functions, we hope to keep improving the product, including the continued development of the history server. We will also improve web analysis tools to help business teams better locate problems found during queries.
