
This article describes how sort-shuffle helps Flink handle large-scale batch data processing jobs more easily. The main contents include:

  1. Introduction to Data Shuffle
  2. The Significance of Introducing Sort-Shuffle
  3. Flink's Sort-Shuffle Implementation
  4. Test Results
  5. Tuning Parameters
  6. Future Outlook

Flink is a unified batch and stream computing engine, and large-scale batch data processing is an important part of its data processing capabilities. Flink's batch processing has been continuously strengthened from version to version, and the introduction of sort-shuffle makes Flink better suited to large-scale batch data processing jobs.

1. Introduction to Data Shuffle

Data shuffle is an important stage of batch data processing. In this stage, the output data of the upstream processing node is persisted to external storage, and the downstream computing node then reads and processes it. The persisted data is not only a form of data exchange between computing nodes, but also plays an important role in error recovery.

At present, existing large-scale distributed computing systems adopt two batch data shuffle models: the hash-based approach and the sort-based approach:

  1. The core idea of the hash-based approach is to write the data destined for each downstream consumer task to a separate file, so that the file itself becomes a natural boundary between data partitions;
  2. The core idea of the sort-based approach is to write the data of all partitions into the same place, and then use sorting to distinguish the boundaries between data partitions (see the sketch after this list).
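
To make the contrast concrete, the following minimal Java sketch shows the two write styles side by side. It is only an illustration; the partitioner, the file layout, and all names are toy assumptions, not Flink code:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.*;

// Toy illustration of the two shuffle write styles; not Flink's actual code.
public class ShuffleWriteStyles {

    // Hash style: one file per downstream data partition; the file itself
    // is the boundary between partitions.
    static void hashStyle(List<byte[]> records, int numPartitions, Path dir) throws IOException {
        for (int i = 0; i < records.size(); i++) {
            int partition = i % numPartitions; // toy partitioner
            Files.write(dir.resolve("partition-" + partition + ".data"), records.get(i),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    // Sort style: tag every record with its partition, sort by partition,
    // then write a single file; partition boundaries are tracked separately.
    static void sortStyle(List<byte[]> records, int numPartitions, Path file) throws IOException {
        List<Map.Entry<Integer, byte[]>> tagged = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            tagged.add(Map.entry(i % numPartitions, records.get(i)));
        }
        // Stable sort by partition only, so records inside one partition keep write order.
        tagged.sort(Map.Entry.comparingByKey());
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Map.Entry<Integer, byte[]> e : tagged) {
                out.write(e.getValue());
            }
        }
    }
}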

We introduced the sort-based batch shuffle implementation to Flink in version 1.12 and kept optimizing its performance and stability in subsequent releases; as of Flink 1.13, sort-shuffle is ready for production use.

2. The Significance of Introducing Sort-Shuffle

An important reason for introducing sort-shuffle into Flink is that Flink's original hash-based implementation is unusable for large-scale batch jobs. This has also been proven by other existing large-scale distributed computing systems:

  1. Stability: for high-concurrency batch jobs, the hash-based implementation generates a large number of files that are read and written concurrently, which consumes a lot of resources and puts great pressure on the file system. The file system has to maintain a large amount of file metadata, which creates instability risks such as file handle and inode exhaustion.
  2. Performance: for high-concurrency batch jobs, concurrently reading and writing a large number of files means a lot of random IO, and the amount of data actually read or written per IO operation may be very small, which is a huge challenge for IO performance. On mechanical hard drives, this easily makes data shuffle the performance bottleneck of batch jobs.

By introducing a sort-based batch shuffle implementation, the number of concurrently read and written files can be greatly reduced, which helps achieve better sequential IO and thereby improves the stability and performance of Flink's large-scale batch jobs. In addition, the new implementation reduces memory buffer consumption. With the hash-based implementation, each data partition needs a read buffer and a write buffer, so memory buffer consumption is proportional to concurrency. The sort-based implementation decouples memory buffer consumption from job concurrency (although larger memory may bring higher performance).

More importantly, we implemented a new storage structure and read/write IO optimizations, which give Flink's batch data shuffle an advantage over other large-scale distributed data processing systems. The following sections introduce Flink's sort-shuffle implementation and the results obtained in more detail.

3. Flink's Sort-Shuffle Implementation

Similar to the batch sort-shuffle implementations of other distributed systems, Flink's entire shuffle process is divided into several important stages: writing data into the memory buffer, sorting the memory buffer, writing the sorted data out to a file, and reading the shuffle data from the file and sending it downstream. However, compared with other systems, Flink's implementation has some fundamental differences, including the multi-region file storage structure, the elimination of the data merging process, and read IO scheduling. All of these give Flink's implementation better performance.

1. Design goals

Throughout the implementation of Flink's sort-shuffle, we took the following points as the main design goals:

1.1 Reduce the number of files

As discussed above, the hash-based implementation generates a large number of files, and reducing the number of files helps improve stability and performance. The sort-spill-merge approach is widely adopted by distributed computing systems to achieve this goal: data is first written into the memory buffer; when the buffer is filled, the data is sorted and the sorted data is spilled out to a file, so the total number of files is (total data volume / memory buffer size) and the number of files is reduced. For example, spilling 100 GB of shuffle data through a 512 MB sort buffer produces about 200 files. After all the data has been written out, the spilled files are merged into one file, further reducing the number of files and increasing the size of each data partition (which facilitates sequential reading).

Compared with the implementations of other systems, Flink's has an important difference: Flink always appends data to the same file instead of writing multiple files and then merging them. The advantage is that there is always only one file per write task, so the number of files is minimized.

1.2 Open fewer files

Opening too many files consumes resources and easily leads to running out of file handles, which harms stability, so opening fewer files helps improve the stability of the system. For data writing, as described above, each concurrent task always opens only one file because data is always appended to the same file. For data reading, although each file needs to be read by many downstream concurrent tasks, Flink still opens each file only once and shares the file handle among these concurrent readers.

1.3 Maximize sequential reads and writes

Sequential file reads and writes are critical to file IO performance. By reducing the number of shuffle files, we have already reduced random file IO to some extent. In addition, Flink's batch sort-shuffle implements further IO optimizations to maximize sequential access. In the data writing stage, better sequential writes are achieved by aggregating the data buffers to be written into larger batches and flushing them out through the writev system call. In the data reading stage, read IO scheduling ensures that data read requests are always served in the order of their file offsets, maximizing sequential reads. Experiments show that these optimizations greatly improve batch shuffle performance.
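
As a small illustration of the batched-write idea (a sketch under our own naming, not Flink's internal code), Java's gathering write on FileChannel, which is typically backed by the writev system call on Linux, flushes several small buffers to the file in one call:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class GatheringWriteExample {
    public static void main(String[] args) throws IOException {
        // Several small buffers that would otherwise cost one syscall each.
        ByteBuffer[] batch = new ByteBuffer[] {
            ByteBuffer.wrap("partition-0 data;".getBytes(StandardCharsets.UTF_8)),
            ByteBuffer.wrap("partition-1 data;".getBytes(StandardCharsets.UTF_8)),
            ByteBuffer.wrap("partition-2 data;".getBytes(StandardCharsets.UTF_8))
        };
        try (FileChannel channel = FileChannel.open(
                Path.of("shuffle.data"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            // One gathering write appends the whole batch sequentially;
            // on Linux this maps to the writev system call.
            long written = channel.write(batch);
            System.out.println("wrote " + written + " bytes in one call");
        }
    }
}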

1.4 Reduce read and write IO amplification

The traditional sort-spill-merge approach enlarges the data blocks that can be read sequentially by merging the spilled files into a larger file. Although this brings benefits, it also has shortcomings, the most important of which is read/write IO amplification: for data shuffle between computing nodes, in the absence of failures the data only needs to be written and read once, but merging causes the same data to be read and written multiple times, which increases the total amount of IO and the consumption of storage space.

Flink's implementation avoids the file merging process by continuously appending data to the same file, combined with a unique storage structure. Although each individual data block is smaller than a merged file, the implementation avoids the overhead of merging, and combined with Flink's unique IO scheduling it ultimately achieves higher performance than the sort-spill-merge approach.

1.5 Reduce memory buffer consumption

Similar to the sort-shuffle implementations of other distributed computing systems, Flink uses a fixed-size memory buffer to cache and sort data. The size of this buffer is independent of concurrency, which decouples the memory needed for upstream shuffle data writing from job concurrency. Combined with another memory management optimization, FLINK-16428, the memory buffer consumption of downstream shuffle data reading can also be made independent of concurrency, thereby reducing the memory buffer consumption of large-scale batch jobs. (Note: FLINK-16428 applies to both batch and streaming jobs.)

2. Implementation details

2.1 Sorting memory data

In the sort-spill stage of shuffle writing, each record is first serialized into the sort buffer. When the buffer is filled, all the binary data in the buffer is sorted by data partition, and the sorted data is then spilled out to the file partition by partition. Although the records themselves are not sorted at present, the interface of the sort buffer is generic enough to implement potentially more complex sorting requirements in the future. The interface of the sort buffer is defined as follows:

public interface SortBuffer {

    /** Appends data of the specified channel to this SortBuffer. */
    boolean append(ByteBuffer source, int targetChannel, Buffer.DataType dataType) throws IOException;

    /** Copies data in this SortBuffer to the target MemorySegment. */
    BufferWithChannel copyIntoSegment(MemorySegment target);

    long numRecords();

    long numBytes();

    boolean hasRemaining();

    void finish();

    boolean isFinished();

    void release();

    boolean isReleased();
}

For the sorting algorithm, we chose bucket-sort, which has low complexity. Specifically, 16 bytes of metadata are inserted before each serialized record: a 4-byte length, a 4-byte data type, and an 8-byte pointer to the next record belonging to the same data partition. The structure is shown in the figure below:

[Figure: in-memory record layout, 16-byte metadata (length, data type, next-record pointer) followed by the serialized record]

When reading data from the buffer, all records belonging to a data partition can be read by simply following that partition's chained index structure, and these records preserve the order in which they were written. Reading all data partition by partition in this way achieves the goal of sorting by data partition.
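
The sketch below illustrates this chained layout: a 16-byte metadata prefix per record plus per-partition chain heads and tails. All names and details are illustrative assumptions, not Flink's actual classes:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.function.Consumer;

// Illustrative sketch of the bucket-sort layout described above.
public class PartitionChainBuffer {

    private final ByteBuffer buffer;
    private final long[] firstRecord; // offset of the first record per partition (-1 = none)
    private final long[] lastRecord;  // offset of the last record per partition (-1 = none)

    public PartitionChainBuffer(int capacity, int numPartitions) {
        this.buffer = ByteBuffer.allocate(capacity);
        this.firstRecord = new long[numPartitions];
        this.lastRecord = new long[numPartitions];
        Arrays.fill(firstRecord, -1L);
        Arrays.fill(lastRecord, -1L);
    }

    /** Appends one serialized record, prefixed by 16 bytes of metadata; false if full. */
    public boolean append(byte[] record, int partition, int dataType) {
        if (buffer.remaining() < 16 + record.length) {
            return false; // buffer full: a sort-spill should be triggered
        }
        long offset = buffer.position();
        buffer.putInt(record.length); // 4 bytes: record length
        buffer.putInt(dataType);      // 4 bytes: data type
        buffer.putLong(-1L);          // 8 bytes: pointer to next record of this partition
        buffer.put(record);
        if (lastRecord[partition] >= 0) {
            // patch the previous record's pointer so the chain reaches this record
            buffer.putLong((int) lastRecord[partition] + 8, offset);
        } else {
            firstRecord[partition] = offset;
        }
        lastRecord[partition] = offset;
        return true;
    }

    /** Visits all records of one partition in write order by following the chain. */
    public void forEachRecord(int partition, Consumer<byte[]> consumer) {
        long offset = firstRecord[partition];
        while (offset >= 0) {
            int length = buffer.getInt((int) offset);
            long next = buffer.getLong((int) offset + 8);
            byte[] record = new byte[length];
            ByteBuffer view = buffer.duplicate();
            view.position((int) offset + 16);
            view.get(record);
            consumer.accept(record);
            offset = next;
        }
    }
}

Visiting the chains partition by partition yields partition-sorted output without ever comparing individual records, which is why the bucket-sort approach is cheap.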

2.2 File storage structure

As mentioned earlier, the shuffle data produced by each parallel task is written into one physical file. Each physical file contains multiple data regions, and each data region is produced by one sort-spill of the data buffer. Within each data region, all data belonging to the different data partitions (consumed by different parallel tasks of the downstream computing node) is sorted and aggregated by data partition index. The following figure shows the detailed structure of a shuffle data file, where (R1, R2, R3) are three different data regions corresponding to three sort-spill writes. Each data region contains three data partitions, which will be read by three different parallel consumer tasks (C1, C2, C3). In other words, data B1.1, B2.1 and B3.1 will be processed by C1; B1.2, B2.2 and B3.2 by C2; and B1.3, B2.3 and B3.3 by C3.

[Figure: shuffle data file structure with data regions R1-R3, each containing the partitions consumed by C1-C3]

Similar to other distributed processing systems, in Flink each data file has a corresponding index file. The index file is used to locate the data (data partition) belonging to each consumer at read time. It contains the same number of data regions as the data file, and each region contains one index entry per data partition. Each index entry consists of two parts: the offset within the data file and the length of the data. As an optimization, Flink caches up to 4M of index data for each index file. The correspondence between the data file and the index file is as follows:

[Figure: correspondence between the data file and its index file]
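
To illustrate how such an index can be used, the following sketch reads one consumer's data by visiting that partition's index entry in every region and issuing one sequential read per region. The entry width and all names are assumptions for illustration, not Flink's actual format:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative reader for the layout above: per region, one (offset, length)
// index entry per partition.
public class IndexedShuffleReader {

    private static final int INDEX_ENTRY_BYTES = 16; // assumed: 8-byte offset + 8-byte length

    /** Reads all data of one partition by visiting its index entry in every region. */
    public static void readPartition(Path dataFile, Path indexFile, int numRegions,
                                     int numPartitions, int partition) throws IOException {
        try (FileChannel index = FileChannel.open(indexFile, StandardOpenOption.READ);
             FileChannel data = FileChannel.open(dataFile, StandardOpenOption.READ)) {
            ByteBuffer entry = ByteBuffer.allocate(INDEX_ENTRY_BYTES);
            for (int region = 0; region < numRegions; region++) {
                // locate this partition's index entry inside this region's index block
                long entryPos = (long) (region * numPartitions + partition) * INDEX_ENTRY_BYTES;
                entry.clear();
                index.read(entry, entryPos);
                entry.flip();
                long offset = entry.getLong();
                long length = entry.getLong();
                ByteBuffer chunk = ByteBuffer.allocate((int) length);
                data.read(chunk, offset); // one sequential read per region
                chunk.flip();
                // ... hand "chunk" to the downstream consumer task
            }
        }
    }
}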

2.3 Read IO scheduling

To further improve file IO performance, Flink introduces an IO scheduling mechanism on top of the storage structure above, similar to the elevator algorithm used in disk scheduling: IO requests are always served in the order of their file offsets. More specifically, if a data file has n data regions and each region has m data partitions, with m downstream computing tasks reading the file concurrently, the following pseudocode shows the workflow of Flink's IO scheduling algorithm:

// let data_regions be the list of data regions indexed from 0 to n - 1
// let data_readers be the queue of concurrent downstream data readers indexed from 0 to m - 1
for (data_region in data_regions) {
    data_reader = poll_reader_of_the_smallest_file_offset(data_readers);
    if (data_reader == null)
        break;
    reading_buffers = request_reading_buffers();
    if (reading_buffers.isEmpty())
        break;
    read_data(data_region, data_reader, reading_buffers);
}
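
In Java terms, this policy could be sketched with a priority queue keyed by each reader's next file offset, so that pending reads are always served in ascending offset order (an illustrative sketch, not Flink's internals):

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch of offset-ordered read scheduling.
public class ReadScheduler {

    interface DataReader {
        long nextFileOffset();          // offset of the next region this reader needs
        boolean hasRemainingRegions();
        void readNextRegion();          // performs one sequential read at nextFileOffset()
    }

    private final PriorityQueue<DataReader> readers =
        new PriorityQueue<>(Comparator.comparingLong(DataReader::nextFileOffset));

    public void register(DataReader reader) {
        readers.add(reader);
    }

    /** Serves reads in ascending file-offset order, elevator-style, until all readers finish. */
    public void run() {
        DataReader reader;
        while ((reader = readers.poll()) != null) {
            reader.readNextRegion();
            if (reader.hasRemainingRegions()) {
                readers.add(reader); // re-queue at its new, larger offset
            }
        }
    }
}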

2.4 Data broadcasting optimization

Data broadcasting means sending the same data to all parallel tasks of the downstream computing node; a common application scenario is broadcast-join. Flink's sort-shuffle implementation optimizes this process so that only one copy of the broadcast data is kept in the in-memory sort buffer and in the shuffle file, which can greatly improve data broadcast performance. More specifically, when a broadcast record is written into the sort buffer, it is serialized and copied only once; likewise, when the data is spilled to the shuffle file, only one copy is written. In the index file, the index entries of all the data partitions point to the same piece of data in the data file. The following figure shows all the details of the data broadcast optimization:

[Figure: data broadcast optimization, one copy of the data shared by all partition index entries]
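
Continuing the illustrative index layout from the earlier sketch, the broadcast case can be pictured as writing the record bytes once and pointing every partition's index entry at the same span (entry width and names remain assumptions):

import java.nio.ByteBuffer;

// Illustrative: build one region's index block for broadcast data, where every
// partition's (offset, length) entry references the same single copy of the bytes.
public class BroadcastIndexSketch {
    public static ByteBuffer broadcastRegionIndex(int numPartitions, long offset, long length) {
        ByteBuffer index = ByteBuffer.allocate(numPartitions * 16);
        for (int p = 0; p < numPartitions; p++) {
            index.putLong(offset); // same data offset for every consumer
            index.putLong(length); // same data length for every consumer
        }
        index.flip();
        return index;
    }
}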

2.5 Data compression

Data compression is a simple yet effective optimization. Test results show that data compression can improve overall TPC-DS performance by more than 30%. Similar to Flink's hash-based batch shuffle implementation, data compression is performed in units of network buffers and never crosses data partition boundaries; that is, data sent to different downstream parallel tasks is compressed separately. Compression happens after the data is sorted and before it is written out, and the downstream consumer task decompresses the data after receiving it. The following figure shows the entire data compression process:

[Figure: shuffle data compression process]
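
As a rough illustration of per-buffer compression, the sketch below compresses and decompresses each buffer independently. Flink's actual codec is pluggable; the JDK Deflater/Inflater pair here is only a stand-in:

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative per-buffer compression: every buffer is compressed on its own and
// never across data partition boundaries.
public class BufferCompressionSketch {

    static byte[] compress(byte[] buffer) {
        Deflater deflater = new Deflater();
        deflater.setInput(buffer);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[4096];
        while (!deflater.finished()) {
            out.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed, int originalLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength]; // the consumer knows the uncompressed size
        inflater.inflate(out);
        inflater.end();
        return out;
    }
}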

4. Test Results

1. Stability

The new sort-shuffle implementation greatly improves the stability of Flink batch jobs. Besides resolving the potential file handle and inode exhaustion problems, it also fixes several known issues of Flink's original hash-shuffle, such as FLINK-21201 (creating too many files blocks the main thread) and FLINK-19925 (performing IO operations in the network netty threads affects network stability).

2. Performance

We ran TPC-DS at the 10T data scale with a concurrency of 1000. The results show that, compared with Flink's original batch shuffle implementation, the new implementation achieves a performance improvement of 2-6x; if computation time is excluded and only data shuffle time is counted, the improvement can be up to 10x. The following table shows the detailed performance numbers:

| Jobs | Time Used for Sort-Shuffle (s) | Time Used for Hash-Shuffle (s) | Speedup Factor |
| --- | --- | --- | --- |
| q4.sql | 986 | 5371 | 5.45 |
| q11.sql | 348 | 798 | 2.29 |
| q14b.sql | 883 | 2129 | 2.51 |
| q17.sql | 269 | 781 | 2.90 |
| q23a.sql | 418 | 1199 | 2.87 |
| q23b.sql | 376 | 843 | 2.24 |
| q25.sql | 413 | 873 | 2.11 |
| q29.sql | 354 | 1038 | 2.93 |
| q31.sql | 223 | 498 | 2.23 |
| q50.sql | 215 | 550 | 2.56 |
| q64.sql | 217 | 442 | 2.04 |
| q74.sql | 270 | 962 | 3.56 |
| q75.sql | 166 | 713 | 4.30 |
| q93.sql | 204 | 540 | 2.65 |

On our test cluster, the read and write bandwidth of each mechanical hard disk can reach about 160MB/s:

| Disk Name | SDI | SDJ | SDK |
| --- | --- | --- | --- |
| Writing Speed (MB/s) | 189 | 173 | 186 |
| Reading Speed (MB/s) | 112 | 154 | 158 |

Note: our test environment is configured as follows. Because the nodes have a large amount of memory, jobs that shuffle only a small amount of data effectively read and write memory rather than disk, so the table above lists only queries that shuffle a large amount of data and show obvious performance improvements:

| Number of Nodes | Memory Size Per Node | Cores Per Node | Disks Per Node |
| --- | --- | --- | --- |
| 12 | About 400G | 96 | 3 |

5. Tuning Parameters

In Flink, sort-shuffle is not enabled by default. To enable it, adjust the parameter taskmanager.network.sort-shuffle.min-parallelism. Its meaning is: if the number of data partitions (the number of downstream concurrent tasks a computing task sends data to) is lower than this value, the hash-shuffle implementation is used; otherwise, sort-shuffle is enabled. In practice, on mechanical hard disks it can be set to 1, that is, sort-shuffle is always used.

Flink does not enable data compression by default. For batch jobs it is recommended to enable it in most scenarios, unless the data compression ratio is low. The parameter to enable it is taskmanager.network.blocking-shuffle.compression.enabled.

Both shuffle data writing and reading require memory buffers. The size of the data write buffer is controlled by taskmanager.network.sort-shuffle.min-buffers, and the data read buffer by taskmanager.memory.framework.off-heap.batch-shuffle.size. The data write buffer is carved out of network memory; if you want to enlarge it, you may also need to increase the total size of network memory to avoid insufficient-network-memory errors. The data read buffer is carved out of the framework's off-heap memory; if you want to enlarge it, you may also need to increase the framework off-heap memory to avoid direct-memory OOM errors. Generally speaking, larger memory buffers bring better performance; for large-scale batch jobs, several hundred megabytes of write buffer and read buffer are sufficient.
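
Putting the parameters above together, a minimal flink-conf.yaml sketch might look as follows; the values are illustrative starting points rather than recommendations for every cluster:

# Enable sort-shuffle for all batch shuffles (on mechanical disks, 1 is a reasonable choice).
taskmanager.network.sort-shuffle.min-parallelism: 1

# Enable shuffle data compression (recommended unless the data compresses poorly).
taskmanager.network.blocking-shuffle.compression.enabled: true

# Data write buffer: taken from network memory; raise network memory accordingly.
taskmanager.network.sort-shuffle.min-buffers: 2048

# Data read buffer: taken from framework off-heap memory; raise it accordingly.
taskmanager.memory.framework.off-heap.batch-shuffle.size: 512m
taskmanager.memory.framework.off-heap.size: 1g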

6. Future Outlook

There is some follow-up optimization work planned, including but not limited to:

1) Network connection reuse, which can improve the performance and stability of connection establishment. Related Jira: FLINK-22643 and FLINK-15455;

2) Multi-disk load balancing, which helps solve the problem of uneven disk load. Related Jira: FLINK-21790 and FLINK-21789;

3) Implementing a remote data shuffle service, which further improves the performance and stability of batch data shuffle;

4) Allowing users to select the disk type, which improves ease of use: users can choose HDD or SSD according to job priority.

Original English links:

https://flink.apache.org/2021/10/26/sort-shuffle-part1.html

https://flink.apache.org/2021/10/26/sort-shuffle-part2.html





