This article was first published on the author's Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
| Version | Date | Remark |
| --- | --- | --- |
| 1.0 | 2021.12.20 | Article first published |
| 1.1 | 2021.12.22 | Typo corrections |
| 1.2 | 2022.2.24 | Description bugfix |
0. Preface
When I first came into contact with Flink, it was through talks shared by leading players in the industry - everyone was using it to process massive amounts of data. In that scenario, one question kept lingering in my mind: how does Flink avoid the stop-the-world side effects of JVM garbage collection? It was only after using Flink and reading the relevant source code (based on 1.14.0) that I finally found some answers, which I will share with you in this article.
1. Inadequate JVM memory management
In addition to the stop-the-world pauses mentioned above, the JVM's memory management brings the following problems:
- Memory waste: a Java object stored in memory is divided into three parts: object header, instance data, and alignment padding. The object header occupies 32 bits on a 32-bit JVM and 64 bits on a 64-bit JVM. To improve overall access efficiency, data in JVM memory is not packed tightly but aligned to multiples of 8 bytes; even if your instance data is only 1 byte, up to 7 bytes of padding may be added automatically (a sketch follows this list).
- Cache misses: as we all know, the CPU has L1, L2 and L3 caches. When the CPU reads data from memory, it also loads the adjacent memory into the cache - a practical application of the principle of locality: data recently accessed by the CPU is likely to be accessed again soon (temporal locality), and data near recently accessed data is also likely to be accessed soon (spatial locality). But as mentioned above, Java objects are not stored contiguously on the heap, so when the CPU reads an object from the JVM heap, the neighbouring memory that gets cached is often not what the next computation needs. The CPU then has to stall while waiting for data from main memory (the two speeds differ by orders of magnitude), and if the data happens to have been swapped out to disk, it is even worse.
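To make the memory-waste point concrete, here is a rough illustration. The exact header size and padding depend on the JVM version, bitness and flags such as compressed class pointers; the numbers in the comments are only typical values for a 64-bit HotSpot JVM, and the class itself is purely illustrative:

```java
// Hypothetical class used only to illustrate object layout.
class OneByteHolder {
    byte b; // 1 byte of actual instance data
}
// Typical layout on a 64-bit HotSpot JVM with compressed class pointers:
//   object header : ~12 bytes
//   instance data :   1 byte
//   padding       :   3 bytes (rounds the total up to 16, the next multiple of 8)
// Without compressed class pointers the header grows to 16 bytes and the padding to
// 7 bytes, which matches the "1 byte of data, 7 bytes of padding" case above.
// A tool such as OpenJDK JOL (ClassLayout.parseClass(OneByteHolder.class).toPrintable())
// can print the real layout on your JVM.
```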
2. Evolution scheme of Flink
Before v0.10, Flink used an on-heap memory implementation. In simple terms, it allocated contiguous memory through byte arrays, referenced data through those arrays, and maintained type information at the application layer to interpret the corresponding data. But this still had problems:
- When the heap is very large, JVM startup takes a long time, and a Full GC can reach the minute level.
- Low I/O efficiency: writing from the heap to disk or the network requires at least one extra memory copy.
Therefore, after v0.10, Flink introduced off-heap memory management. See the Jira ticket: Add an off-heap variant of the managed memory. Besides addressing the problems of on-heap memory, it also brings some benefits:
- Off-heap memory can be shared between processes. This means that Flink can use this for failure recovery.
Of course, everything has two sides; the disadvantages are:
- Allocating short-lived objects off-heap is more expensive than allocating them on-heap.
- Troubleshooting errors in off-heap memory is more complicated.
A similar implementation can also be found in Spark, where it is called MemoryPool and supports both on-heap and off-heap modes (see MemoryMode.scala for details). Kafka has a similar idea as well - it stores its messages through Java NIO's ByteBuffer.
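As a point of reference, the JDK exposes both flavors directly through java.nio.ByteBuffer. The following minimal sketch (plain JDK code, not Flink code) shows the difference between a heap buffer and a direct, off-heap buffer:

```java
import java.nio.ByteBuffer;

public class ByteBufferDemo {
    public static void main(String[] args) {
        // On-heap: backed by a byte[] that lives in the JVM heap and is subject to GC.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Off-heap: backed by native memory outside the heap; the GC only tracks the
        // small wrapper object, not the 1 KB of data itself.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        heap.putLong(42L);
        direct.putLong(42L);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}
```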
3. Source code analysis
In general, Flink's implementation in this area is fairly clear - much like an operating system, it has data structures such as memory segments and memory pages.
3.1 Memory segment
The core implementation is MemorySegment. Before v1.12, MemorySegment was just an interface with two implementations: HybridMemorySegment and HeapMemorySegment. Over time it turned out that HeapMemorySegment was basically never used and HybridMemorySegment was used everywhere. To optimize performance - avoiding the virtual function table lookup on every call to decide which implementation to invoke - HeapMemorySegment was removed and HybridMemorySegment was merged into MemorySegment, which brought a call speed improvement of nearly 2.7x. See: Off-heap Memory in Apache Flink and the curious JIT compiler, and the Jira ticket: Don't explicitly use HeapMemorySegment in raw format serde.
MemorySegment is mainly responsible for referencing a memory segment and reading and writing data in it - it supports basic types very well, while complex types need to be serialized externally. The concrete implementation is fairly simple and can be roughly understood from the field declarations. The only thing worth elaborating on is LITTLE_ENDIAN: different CPU architectures use different byte orders - PowerPC uses big-endian, storing the most significant byte at the lowest address, while x86 uses little-endian, storing the least significant byte at the lowest address.
To be honest, I was a little shocked when I read this code: after writing Java for so many years, I had been almost unaware of the underlying hardware. I did not expect that Java code would also need to account for the CPU architecture.
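For reference, here is a minimal sketch of how such a flag can be derived in plain Java. The constant name mirrors the one in MemorySegment, but the class itself is illustrative:

```java
import java.nio.ByteOrder;

public class EndiannessCheck {
    // Ask the JVM for the native byte order of the underlying hardware;
    // this is roughly what the LITTLE_ENDIAN flag boils down to.
    static final boolean LITTLE_ENDIAN =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    public static void main(String[] args) {
        // Prints "true" on x86/x86-64, typically "false" on big-endian platforms.
        System.out.println(LITTLE_ENDIAN);
    }
}
```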
At this point some readers may ask: how are these MemorySegments actually used in Flink? We can look at a test case, testPagesSer in BinaryRowDataTest: first there are MemorySegments; data is written into a RowData through the corresponding BinaryRowWriter, and then the RowData is written to a RandomAccessOutputView with BinaryRowDataSerializer:
@Test
public void testPagesSer() throws IOException {
    MemorySegment[] memorySegments = new MemorySegment[5];
    ArrayList<MemorySegment> memorySegmentList = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
        memorySegments[i] = MemorySegmentFactory.wrap(new byte[64]);
        memorySegmentList.add(memorySegments[i]);
    }

    {
        // multi memorySegments
        String str = "啦啦啦啦啦我是快乐的粉刷匠,啦啦啦啦啦我是快乐的粉刷匠," + "啦啦啦啦啦我是快乐的粉刷匠。";
        BinaryRowData row = new BinaryRowData(1);
        BinaryRowWriter writer = new BinaryRowWriter(row);
        writer.writeString(0, fromString(str));
        writer.complete();

        RandomAccessOutputView out = new RandomAccessOutputView(memorySegments, 64);
        BinaryRowDataSerializer serializer = new BinaryRowDataSerializer(1);
        serializer.serializeToPages(row, out);

        BinaryRowData mapRow = serializer.createInstance();
        mapRow =
                serializer.mapFromPages(
                        mapRow, new RandomAccessInputView(memorySegmentList, 64));
        writer.reset();
        writer.writeString(0, mapRow.getString(0));
        writer.complete();
        assertEquals(str, row.getString(0).toString());

        BinaryRowData deserRow =
                serializer.deserializeFromPages(
                        new RandomAccessInputView(memorySegmentList, 64));
        writer.reset();
        writer.writeString(0, deserRow.getString(0));
        writer.complete();
        assertEquals(str, row.getString(0).toString());
    }
    // ignore some code
}
3.2 Memory pages
A MemorySegment corresponds to a memory block of 32 KB by default. In stream processing it is easy to encounter data larger than 32 KB, which then has to span multiple MemorySegments. Whoever writes the corresponding logic would have to hold several MemorySegments, so Flink provides a memory page abstraction that holds multiple MemorySegment instances, allowing framework developers to quickly write memory-related code without worrying about individual MemorySegments.
The abstractions are DataInputView and DataOutputView, which read and write data respectively.
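Here is a small illustrative sketch of these two abstractions in isolation, using the byte-array backed implementations DataOutputSerializer and DataInputDeserializer from org.apache.flink.core.memory. This is demo code, not taken from the Flink code base:

```java
import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

public class ViewDemo {
    public static void main(String[] args) throws Exception {
        // DataOutputView: write primitive values into a growing byte buffer.
        DataOutputSerializer out = new DataOutputSerializer(32);
        out.writeInt(42);
        out.writeUTF("flink");

        // DataInputView: read the same values back from the produced bytes.
        DataInputDeserializer in = new DataInputDeserializer(out.getCopyOfBuffer());
        System.out.println(in.readInt());  // 42
        System.out.println(in.readUTF());  // flink
    }
}
```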
Next, let's look at the actual code, taking the most common usage with KafkaProducer as an example:
|-- KafkaProducer#invoke // the serializedValue is specified here
\-- KeyedSerializationSchema#serializeValue // serializes the value of the record
Let's pick an implementation and take a look, using TypeInformationKeyValueSerializationSchema as an example:
|-- TypeInformationKeyValueSerializationSchema#deserialize // an implementation of KeyedSerializationSchema
|-- DataInputDeserializer#setBuffer // an implementation of DataInputView that stores data in an internal byte array; curiously, MemorySegment is not used here
|-- TypeSerializer#deserialize // its implementations read data of the corresponding type from the DataInputView and return it
Actually, the example here is not ideal, because KeyedSerializationSchema has been marked as deprecated. The community recommends using KafkaSerializationSchema instead. The main reason is that the KeyedSerializationSchema abstraction does not fit Kafka well: when Kafka adds new fields to its Record, this interface is hard to extend - it only deals with key, value and topic.
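For context, a minimal, hedged sketch of what implementing KafkaSerializationSchema directly might look like; the class name and topic handling are illustrative, not Flink code:

```java
import java.nio.charset.StandardCharsets;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StringKafkaSerializationSchema implements KafkaSerializationSchema<String> {

    private final String topic;

    public StringKafkaSerializationSchema(String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
        // Only the value is populated here; key, partition and headers stay unset,
        // but the interface leaves room for them as the Kafka record format grows.
        return new ProducerRecord<>(topic, element.getBytes(StandardCharsets.UTF_8));
    }
}
```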
Moving on to KafkaSerializationSchema, we can look at a typical implementation - KafkaSerializationSchemaWrapper - where the part we care about is easy to find:
@Override
public ProducerRecord<byte[], byte[]> serialize(T element, @Nullable Long timestamp) {
    byte[] serialized = serializationSchema.serialize(element);
    final Integer partition;
    if (partitioner != null) {
        partition = partitioner.partition(element, null, serialized, topic, partitions);
    } else {
        partition = null;
    }

    final Long timestampToWrite;
    if (writeTimestamp) {
        timestampToWrite = timestamp;
    } else {
        timestampToWrite = null;
    }

    return new ProducerRecord<>(topic, partition, timestampToWrite, null, serialized);
}
The serializationSchema here is declared as an interface named SerializationSchema. You can see that it has a large number of implementations, many of which correspond to formats in the DataStream and SQL APIs. Let's take TypeInformationSerializationSchema as an example and continue tracing:
@Public
public class TypeInformationSerializationSchema<T>
        implements DeserializationSchema<T>, SerializationSchema<T> {

    // ignore some fields

    /** The serializer for the actual de-/serialization. */
    private final TypeSerializer<T> serializer;
    ....
Here we meet the familiar interface TypeSerializer again. As mentioned above, its implementations interact with DataInputView and DataOutputView for different types, providing serialization and deserialization capabilities. This is also visible in its method signatures:
/**
* Serializes the given record to the given target output view.
*
* @param record The record to serialize.
* @param target The output view to write the serialized data to.
* @throws IOException Thrown, if the serialization encountered an I/O related error. Typically
* raised by the output view, which may have an underlying I/O channel to which it
* delegates.
*/
public abstract void serialize(T record, DataOutputView target) throws IOException;
/**
* De-serializes a record from the given source input view.
*
* @param source The input view from which to read the data.
* @return The deserialized element.
* @throws IOException Thrown, if the de-serialization encountered an I/O related error.
* Typically raised by the input view, which may have an underlying I/O channel from which
* it reads.
*/
public abstract T deserialize(DataInputView source) throws IOException;
/**
* De-serializes a record from the given source input view into the given reuse record instance
* if mutable.
*
* @param reuse The record instance into which to de-serialize the data.
* @param source The input view from which to read the data.
* @return The deserialized element.
* @throws IOException Thrown, if the de-serialization encountered an I/O related error.
* Typically raised by the input view, which may have an underlying I/O channel from which
* it reads.
*/
public abstract T deserialize(T reuse, DataInputView source) throws IOException;
/**
* Copies exactly one record from the source input view to the target output view. Whether this
* operation works on binary data or partially de-serializes the record to determine its length
* (such as for records of variable length) is up to the implementer. Binary copies are
* typically faster. A copy of a record containing two integer numbers (8 bytes total) is most
* efficiently implemented as {@code target.write(source, 8);}.
*
* @param source The input view from which to read the record.
* @param target The target output view to which to write the record.
* @throws IOException Thrown if any of the two views raises an exception.
*/
public abstract void copy(DataInputView source, DataOutputView target) throws IOException;
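To see these signatures in action outside of Flink's internals, here is a small illustrative sketch that round-trips a string through StringSerializer and the byte-array backed views. This is demo code under those assumptions, not taken from Flink's tests:

```java
import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

public class TypeSerializerDemo {
    public static void main(String[] args) throws Exception {
        StringSerializer serializer = StringSerializer.INSTANCE;

        // serialize(record, DataOutputView): write the record into the output view.
        DataOutputSerializer out = new DataOutputSerializer(32);
        serializer.serialize("hello flink", out);

        // deserialize(DataInputView): read the record back from the input view.
        DataInputDeserializer in = new DataInputDeserializer(out.getCopyOfBuffer());
        System.out.println(serializer.deserialize(in)); // hello flink
    }
}
```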
So how is TypeSerializer#deserialize actually called inside Flink? Those details are beyond the scope of this article, so here we only show the call chain; interested readers can follow the specific code along it:
|-- TypeSerializer#deserialize
|-- StreamElementSerializer#deserialize
|-- TypeInformationKeyValueSerializationSchema#deserialize
|-- KafkaDeserializationSchema#deserialize
|-- KafkaFetcher#partitionConsumerRecordsHandler // by this point it is clear: this object is created by FlinkKafkaConsumer
3.3 Buffer pool
Another interesting class is LocalBufferPool, which encapsulates MemorySegment. It is generally used for the network buffer (NetworkBuffer), the wrapper around data exchanged over the network. When a result partition (ResultPartition) starts writing data, it needs to request Buffer resources from the LocalBufferPool.
Write logic:
|-- Task#constructor // construct the task
|-- NettyShuffleEnvironment#createResultPartitionWriters // create the result partitions used for writing results
|-- ResultPartitionFactory#create
\-- ResultPartitionFactory#createBufferPoolFactory // a simple BufferPoolFactory is created here
|-- PipelinedResultPartition#constructor
|-- BufferWritingResultPartition#constructor
|-- SortMergeResultPartition#constructor or BufferWritingResultPartition#constructor
|-- ResultPartition#constructor
\-- ResultPartition#setup // register the buffer pool with this result partition
Also, NetworkBuffer implements Netty's AbstractReferenceCountedByteBuf. This means the classic reference counting algorithm is used here: when a Buffer is no longer needed, it is released and recycled.
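A minimal sketch of what that reference counting looks like at the Netty level, using a plain unpooled buffer rather than Flink's NetworkBuffer:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(64); // refCnt == 1 after allocation
        buf.retain();                      // another owner -> refCnt == 2
        System.out.println(buf.refCnt());  // 2

        buf.release();                     // one owner done -> refCnt == 1
        buf.release();                     // last owner done -> refCnt == 0, buffer reclaimed
        System.out.println(buf.refCnt());  // 0
    }
}
```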
4. Other
4.1 Related Flink Jira
Here is the list of Jira tickets I referenced while writing this article:
- Add an off-heap variant of the managed memory: https://issues.apache.org/jira/browse/FLINK-1320
- Separate type specific memory segments: https://issues.apache.org/jira/browse/FLINK-21417
- Investigate potential out-of-memory problems due to managed unsafe memory allocation: https://issues.apache.org/jira/browse/FLINK-15758
- Adjust GC Cleaner for unsafe memory and Java 11: https://issues.apache.org/jira/browse/FLINK-14522
- FLIP-49 Unified Memory Configuration for TaskExecutors: https://issues.apache.org/jira/browse/FLINK-13980
- Don't explicitly use HeapMemorySegment in raw format serde: https://issues.apache.org/jira/browse/FLINK-21236
- Refactor HybridMemorySegment: https://issues.apache.org/jira/browse/FLINK-21375
- Use Flink's buffers in Netty: https://issues.apache.org/jira/browse/FLINK-7315
- Add copyToUnsafe, copyFromUnsafe and equalTo to MemorySegment: https://issues.apache.org/jira/browse/FLINK-11724