
Abstract: This article is compiled from talks given by Li Yu (ASF Member, Apache Flink & HBase PMC member, senior technical expert at Alibaba) and Tang Yun (Apache Flink Committer, technical expert at Alibaba) at the core technology session of Flink Forward Asia 2021. The main contents include:

  1. State Backend Improvement
  2. Snapshot Improvement
  3. Future Work

FFA 2021 Live Replay & Presentation PDF Download

1. State Backend Improvement

Over the past year, the Flink community's state-backend module has made significant progress. Before version 1.13, users had no means of monitoring the performance of state-related operators and no good way to learn the latency of state read and write operations.

img

We therefore introduced latency tracking for state access. The principle is to take a system timestamp (System.nanoTime) before and after each state access to measure the access latency, and then record it in a histogram-type metric. The performance impact of this monitoring feature, especially on state access itself, is relatively small.

img

Regarding the configuration of state-access latency tracking, two options need emphasis: the sampling interval and the history size. The smaller the sampling interval, the more accurate the results, but the greater the performance impact on normal access; the more history entries retained, the more accurate the results, but the larger the memory footprint.
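As a rough sketch of how these options can be set programmatically (assuming Flink 1.13+; the values shown are illustrative):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LatencyTrackingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // turn on latency tracking for keyed state (off by default)
        conf.setString("state.backend.latency-track.keyed-state-enabled", "true");
        // sample every 100th state access; smaller = more accurate, more overhead
        conf.setString("state.backend.latency-track.sample-interval", "100");
        // keep the latest 128 samples per histogram; larger = more accurate, more memory
        conf.setString("state.backend.latency-track.history-size", "128");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... build and execute the job as usual
    }
}
```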

img

In Flink 1.14, we finally upgraded RocksDB from 5.17 to 6.20. Besides fixing several RocksDB bugs, the new version also brings features that can be used in Flink 1.14 and 1.15. First, it supports the ARM platform, so Flink jobs can now run on ARM machines. Second, it provides finer-grained WriteBuffer memory control, improving the stability of memory management. In addition, it provides the deleteRange interface, which will greatly help performance in rescaling scenarios.
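As a rough illustration of the deleteRange interface (a sketch against the RocksDB Java API; the helper method and key bounds are illustrative of how a whole key range, for example an out-of-range key group, could be dropped in one call):

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class DeleteRangeExample {
    // deletes all keys in [beginKey, endKey) in a single call, instead of
    // iterating over the range and writing one tombstone per key
    static void dropKeyRange(RocksDB db, ColumnFamilyHandle cf,
                             byte[] beginKey, byte[] endKey) throws RocksDBException {
        db.deleteRange(cf, beginKey, endKey);
    }
}
```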

img

With the growing popularity of cloud native, running Flink jobs in containerized environments via Kubernetes has become the choice of more and more vendors. This makes it inevitable to consider how jobs can run stably with limited resources, especially with respect to memory usage control. RocksDB, born in 2010, is inherently weak in this regard, and Flink 1.10 introduced memory control for it for the first time. Over the past year, RocksDB has made further progress and improvements in memory management.

First, let's review the RocksDB memory problem. Before talking about this problem, we must understand how Flink uses state and RocksDB.

  • Every state that Flink declares corresponds to a column family in RocksDB. A column family is an independent unit of memory allocation in RocksDB, and column families are isolated from each other at the physical resource level;
  • Secondly, Flink does not limit the number of states a user may declare in an operator, and therefore does not limit the number of column families;
  • Finally, under the slot-sharing mechanism, a single slot can hold multiple operators with keyed state.

For these three reasons, even before considering RocksDB's own memory-management limitations, Flink's usage pattern can theoretically lead to unbounded memory usage.
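To make the first two points concrete, here is a minimal sketch of a hypothetical operator (all names are illustrative): each state descriptor below maps to its own RocksDB column family, and nothing caps how many an operator may declare.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class MultiStateOperator extends RichFlatMapFunction<String, String> {
    private transient ValueState<Long> count;   // -> column family "count"
    private transient ListState<String> events; // -> column family "events"

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        events = getRuntimeContext().getListState(
                new ListStateDescriptor<>("events", String.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        Long c = count.value();
        count.update(c == null ? 1L : c + 1); // each access hits its own column family
        events.add(value);
    }
}
```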

img

The figure above shows multiple RocksDB instances within a single slot sharing one WriteBufferManager and its corresponding block cache. The WriteBufferManager, which manages multiple WriteBuffers, accounts the memory it requests against the block cache, while data-related blocks are cached in the block cache itself. The cached blocks include data blocks, index blocks, and filter blocks; the WriteBuffers and the block cache can thus be roughly understood as the write cache and the read cache, respectively.

In other words, the WriteBufferManager and the block cache cooperate by having the manager perform its memory accounting inside the block cache.
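A minimal RocksJava sketch of this cooperation, with illustrative sizes rather than Flink's actual defaults: the WriteBufferManager is constructed with the same cache instance, so write-buffer memory is charged against the block cache.

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.LRUCache;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class SharedMemoryExample {
    static {
        RocksDB.loadLibrary();
    }

    public static void main(String[] args) {
        // one shared cache serves as the "read cache" for blocks ...
        LRUCache blockCache = new LRUCache(256 * 1024 * 1024);
        // ... and the WriteBufferManager accounts all write-buffer
        // (memtable) memory against that same cache
        WriteBufferManager wbm =
                new WriteBufferManager(128 * 1024 * 1024, blockCache);

        DBOptions dbOptions = new DBOptions()
                .setCreateIfMissing(true)
                .setWriteBufferManager(wbm);

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockCache(blockCache); // data/index/filter blocks live here

        ColumnFamilyOptions cfOptions = new ColumnFamilyOptions()
                .setTableFormatConfig(tableConfig);
        // every RocksDB instance in the slot would be opened with these options
    }
}
```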

img

When a WriteBuffer requests memory, the manager reserves memory in the block cache in units of arena blocks. By default an arena block is 1/8 of a single WriteBuffer, and with the default WriteBuffer size of 64 MB this makes the arena block 8 MB. The 8 MB reservation is in turn split into several dummy entries and distributed across the block cache's shards. Notably, after Flink upgraded RocksDB, the dummy entry unit size was reduced to 256 KB, which lowers the probability of over-reserving memory.

Because RocksDB is designed for multi-threaded use, a single cache consists of multiple shards, which makes its memory accounting more complicated.

img

Inside the WriteBufferManager, the accounted memory determines when a mutable WriteBuffer is converted into an immutable one, which is then flushed to disk. By default the memory usage of mutable WriteBuffers is limited, and once the limit is reached, WriteBuffers are flushed ahead of schedule. This causes a problem: even when the amount of data written is small, allocating an arena block, especially when there are many arena blocks, can trigger a memtable flush prematurely. From the user's point of view, this shows up as a large number of small SST files on local disk and very poor overall read/write performance. For this reason, the Flink community added a dedicated sanity check on the arena block size configuration.
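For reference, a hedged sketch of the relationship that the sanity check guards, using the RocksDB Java API with illustrative values: the arena block size defaults to 1/8 of the write buffer size and can also be set explicitly.

```java
import org.rocksdb.ColumnFamilyOptions;

public class ArenaBlockExample {
    public static ColumnFamilyOptions withExplicitArenaBlock() {
        // illustrative values: with a 64 MB write buffer, the default
        // arena block would be 64 MB / 8 = 8 MB; setting it explicitly
        // keeps the WriteBufferManager's accounting granularity predictable
        return new ColumnFamilyOptions()
                .setWriteBufferSize(64 * 1024 * 1024)
                .setArenaBlockSize(8 * 1024 * 1024);
    }
}
```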

img

At present, RocksDB itself still has shortcomings and limitations in memory management and control, so a portion of external memory must be reserved for RocksDB to overuse in certain scenarios. Comparing this with the Flink process memory model in the figure above, the headroom should be reserved in jvm-overhead to keep RocksDB's overuse within budget. The table on the left lists the default jvm-overhead-related configuration values; to set jvm-overhead to 512 MB, simply configure both the min and max values to 512 MB.

img

With limited memory, data blocks, index blocks, and filter blocks compete with each other. The block instances in the figure above are drawn to their actual relative sizes: taking a 256 MB SST file as an example, the index block is about 0.5 MB, the filter block about 5 MB, and a data block typically 4 KB to 64 KB. It can be seen that block competition leads to frequent swapping in and out, which greatly hurts read performance.

img

To solve this problem, we encapsulated RocksDB's partitioned-index and partitioned-filter features to optimize performance under limited memory. The index and filter are stored in layers, so that only the small top level has to stay in memory. This leaves as much room as possible for data blocks in the limited memory, reduces the probability of disk reads, and thereby improves overall performance.
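At the RocksDB level, this layered storage corresponds roughly to the following table options (a sketch with illustrative values; Flink wraps these behind its own configuration rather than requiring users to set them directly):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.IndexType;

public class PartitionedIndexExample {
    public static BlockBasedTableConfig partitionedIndexAndFilter() {
        return new BlockBasedTableConfig()
                // two-level index: only the small top level must stay in memory
                .setIndexType(IndexType.kTwoLevelIndexSearch)
                // partition the bloom filter the same way
                .setFilterPolicy(new BloomFilter(10, false))
                .setPartitionFilters(true)
                // index/filter blocks go through the block cache, with the
                // top level pinned so it is never evicted
                .setCacheIndexAndFilterBlocks(true)
                .setPinTopLevelIndexAndFilter(true);
    }
}
```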

img

In addition to stability-related improvements, Flink has also refactored the state-related APIs to make them easier for beginners to understand.

The old API mixed the concept of the state backend, which is responsible for state reads and writes, with that of checkpointing, which is responsible for fault-tolerant backup. Take MemoryStateBackend and FsStateBackend as examples: they are exactly identical in how state is read, written, and accessed as objects, and differ only in fault-tolerant backup, so beginners easily confuse them.

img

The figure above shows the differences between the Flink state read/write and fault-tolerance (checkpointing) APIs before and after the update.

img

In the new version, state access and fault-tolerant backup are configured separately. The figure above compares the new API with the old one: the state reading and writing previously handled by MemoryStateBackend and FsStateBackend is now the responsibility of HashMapStateBackend.

The biggest difference between the two old backends lies in checkpoint fault tolerance: one maps to the purely in-memory JobManagerCheckpointStorage, while the other maps to the file-based FileSystemCheckpointStorage. I believe this API refactoring gives developers a deeper understanding of the two concepts.
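A minimal sketch of the new-style configuration (the checkpoint path is a placeholder):

```java
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NewStateApiExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // where state is read and written: heap objects, as with the old
        // MemoryStateBackend / FsStateBackend
        env.setStateBackend(new HashMapStateBackend());

        // where checkpoints are stored: a file system, i.e. the
        // FileSystemCheckpointStorage side of the old FsStateBackend
        env.getCheckpointConfig()
                .setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder path
    }
}
```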

2. Snapshot Improvement

img

Savepoints themselves are decoupled from the state backend and are not tied to any particular state-backend implementation. In earlier Flink versions, however, savepoints from different state backends used different formats. In the new version of Flink, the community has unified the savepoint format, so that for the same job the state backend can be switched seamlessly without losing state.

img

In addition, the community has further improved the stability of unaligned checkpoints. The buffers in the channels are treated as in-flight data and persisted early as part of the operator state, which avoids waiting for barrier alignment.

img

Moreover, the new version of Flink supports automatic switching between traditional aligned checkpoints and unaligned ones: once a global timeout is set, Flink automatically switches a checkpoint to unaligned mode when the threshold is reached. I believe this feature will further help developers achieve better checkpoint performance.
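A sketch of enabling this automatic switch (assuming Flink 1.14+; the interval and timeout values are illustrative):

```java
import java.time.Duration;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AlignedTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60 s

        CheckpointConfig config = env.getCheckpointConfig();
        config.enableUnalignedCheckpoints();
        // start aligned; if alignment takes longer than 30 s, the
        // in-progress checkpoint automatically switches to unaligned
        config.setAlignedCheckpointTimeout(Duration.ofSeconds(30));
    }
}
```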

3. Future Work

img

In the future, we will further improve the production usability of the RocksDB backend. On the one hand, we will expose some of RocksDB's key performance indicators, such as the block cache hit rate, among the standard monitoring metrics, making it easier to tune RocksDB's performance. On the other hand, we plan to redirect RocksDB's log files to the TM log directory or into the TM logs themselves, so that the RocksDB log can be inspected more easily when locating and tuning problems.

Secondly, we will further clarify the snapshot semantics of Flink. Currently, there are three forms of snapshots in Flink, namely checkpoint, savepoint and retained checkpoint.

  • Among them, checkpoint is a system snapshot, and its data life cycle is completely controlled by the Flink framework. It is used to fail over when an exception occurs. Once the job is stopped, it will be automatically deleted;
  • Savepoint is responsible for data backup in a unified format. Its life cycle is decoupled from Flink jobs and completely controlled by users. It can be used to implement Flink job version upgrades, cross-cluster migration, and state-backend switching.
  • The semantics and life cycle of the retained checkpoint are currently vague. It can outlive the Flink job that created it, but when a new job is restored from it with incremental snapshots enabled, the new job's checkpoints will depend on its data, which makes it hard for users to tell when it is safe to delete.

img

To solve this problem, the community proposed FLIP-193, which requires users to declare whether to use claim mode or no-claim mode when starting a job from a retained checkpoint.

  • In claim mode, the data life cycle of the retained checkpoint is fully controlled by the new job: as data compaction proceeds and new snapshots no longer depend on data in the retained checkpoint, the new job can safely delete it;
  • In no-claim mode, the new job must not modify the data of the retained checkpoint, which means it has to make a physical copy during its first snapshot and may not reference the retained checkpoint's data. The retained checkpoint can then be deleted manually at any time, without worrying about affecting jobs restored from it.

In addition, we plan to give clearer semantics to user-controlled snapshots in the future, and introduce the concept of savepoint in native format to replace retained checkpoint.

img

Finally, there is the work in progress under FLIP-158. It introduces a changelog-based state backend to achieve faster and more stable incremental snapshots, which effectively amounts to a log-based snapshot mechanism. Compared with the existing incremental snapshot mechanism, it offers a much shorter snapshot interval, but at the cost of some extra latency in state data processing. This is in essence a trade-off between latency and fault tolerance.

