
Abstract: This article is based on the talk given by Alibaba senior development engineer Huang Xingbo (Duanchen) in the core technology session of Flink Forward Asia 2021. The main contents include:

  1. PyFlink latest features
  2. PyFlink Runtime
  3. FFI-based PEMJA
  4. PyFlink Runtime 2.0
  5. Future Work


The JCP project mentioned in the original Flink Forward Asia 2021 talk has been renamed PEMJA and was officially open sourced on January 14, 2022. The repository is at:

https://github.com/alibaba/pemja

P.S.: JCP has been replaced with PEMJA throughout this article.

1. New features of PyFlink

img

PyFlink 1.14 adds many new features, falling into three areas: functionality, ease of use, and performance.

In terms of functionality, State TTL config has been added. Before 1.14, the Python DataStream API already supported custom functions that operate on State, but there was no way to configure State TTL, which meant that State values could not be cleaned up automatically in custom Python DataStream API functions; users had to manage this manually, which was not user-friendly.
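A minimal sketch of what enabling State TTL looks like, assuming the 1.14 Python DataStream API where StateTtlConfig mirrors its Java counterpart:

```python
# Hedged sketch: State TTL config in the Python DataStream API (1.14+).
# Assumes StateTtlConfig is exposed in pyflink.datastream.state, mirroring the Java API.
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import StateTtlConfig, ValueStateDescriptor

ttl_config = StateTtlConfig \
    .new_builder(Time.hours(1)) \
    .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite) \
    .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired) \
    .build()

descriptor = ValueStateDescriptor("last-seen", Types.LONG())
descriptor.enable_time_to_live(ttl_config)  # expired values are cleaned up automatically
```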

In terms of ease of use, the following main functions have been added:

  • The tar.gz format is now supported in dependency management.
  • Profile support. When users write custom Python functions for PyFlink, it is often unclear where the performance bottleneck of those functions lies. With the profile feature, once a Python function shows a bottleneck, you can use the profile output to analyze its specific cause and optimize that part accordingly.
  • Print support. Before 1.14, printing custom log information required Python's standard logging module. For Python users, however, print is the more familiar way to emit log output, so this was added in 1.14 (see the sketch after this list).
  • Local Debug mode. Before 1.14, users who developed PyFlink jobs locally with custom Python functions had to use remote debug to step through the custom logic, which was cumbersome and raised the barrier to entry. In 1.14, a PyFlink job written locally with custom Python functions automatically switches to local debug mode, so the custom Python functions can be debugged directly in the IDE.
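As a hedged illustration of the print support, a trivial UDF whose print output is captured into the logs; the profile switch mentioned in the comment, python.profile.enabled, is the config key as best I recall and should be treated as an assumption:

```python
# Hedged sketch: print inside a Python UDF (1.14+); output is captured into the logs.
# To profile custom functions, the config key is assumed to be "python.profile.enabled".
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING())
def shout(s):
    print("processing:", s)  # with 1.14+, this shows up in the client/TaskManager logs
    return s.upper()
```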

In terms of performance, the following functions have been added:

  • Operator Fusion. This targets jobs in the Python DataStream API that perform several consecutive operator operations, for example two chained .map calls. Before 1.14, those two .maps ran in two separate Python workers; with Operator Fusion they are merged into the same operator and executed by a single Python worker, which yields a very good performance improvement (a minimal example follows this list).
  • State serialization/deserialization optimization. Before 1.14, State serialization/deserialization relied on Python's built-in pickle serializer, which can serialize arbitrary Python data structures but must embed the State's type information in the payload, producing larger serialized values. In 1.14 this is optimized with custom serializers, one serializer per type, so the serialized data is smaller.
  • Finish Bundle optimization. Before 1.14, Finish Bundle was a synchronous operation. It is now asynchronous, which improves performance and resolves some scenarios in which Checkpoints could not complete.
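A minimal sketch of the fusion scenario: two chained .map operations that, from 1.14 on, execute inside a single Python worker instead of two:

```python
# Two consecutive Python .map operations; with Operator Fusion (1.14+) they are
# merged and executed in one Python worker rather than two.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([1, 2, 3]) \
   .map(lambda x: x + 1) \
   .map(lambda x: x * 2) \
   .print()
env.execute("operator_fusion_demo")
```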

2. PyFlink Runtime

img

The figure above shows the existing PyFlink framework.

At the top left of the figure are the Python Table API & SQL and the Python DataStream API, the two Python APIs provided to users. Users write PyFlink jobs through these APIs, and the Python API calls are translated into calls to the corresponding Flink Java API through the py4j third-party library, which describes the job on the JVM side (a minimal py4j sketch follows).
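To make the bridge concrete, here is a minimal py4j sketch, independent of PyFlink; it assumes a Java process has already started py4j's GatewayServer on the default port. Python-side calls are proxied to objects living in the JVM, which is the same mechanism the PyFlink client uses to build the Java job description:

```python
# Minimal py4j sketch: Python driving objects that live in the JVM.
# Assumes a Java process is already running py4j's GatewayServer on the default port.
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                       # connect to the JVM
jvm_list = gateway.jvm.java.util.ArrayList()  # instantiate a Java object
jvm_list.add("built from Python")             # method calls are proxied over a socket
print(jvm_list.size())                        # -> 1
```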

For Table and SQL jobs there is an additional optimizer with two kinds of rules: common rules and Python rules. Common rules apply to all existing Table and SQL jobs, while Python rules target the scenario where custom Python functions are used in a PyFlink job, extracting the corresponding Python operators.

Once the job is described, it is translated into a JobGraph containing the corresponding Python operators. The JobGraph is submitted to the TM (Runtime) to run, and the Runtime likewise contains the Python operators.

The right side of the figure shows the components of a Python operator, which make up the core of the PyFlink Runtime. It is mainly divided into two parts: the Java operator and the Python worker.

The Java operator contains many components, including the data service and State service, as well as handling for checkpoints, watermarks, and State requests. Custom Python functions cannot run directly on Flink's existing architecture, which is based on the JVM; running Python functions requires a Python runtime, and the Python worker exists to solve this problem.

The solution works as follows: a Python process is launched to run the custom Python functions; the Java operator receives the upstream data, pre-processes it, and sends it to the corresponding Python worker over an inter-process communication channel, the data service in the figure. The State service handles State operations in the Python DataStream API: when Python code operates on State, the request goes from the Python worker back to the Java operator, which accesses the state backend, fetches the corresponding State data, and returns it to the Python worker, where the user can finally work with the result.
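As a hedged illustration of what triggers these State round-trips, here is a minimal keyed function in the Python DataStream API; each value()/update() call below travels from the Python worker through the State service to the Java operator and its state backend:

```python
# Minimal sketch: every self.count.value()/update() below is served by the
# Java operator's State service on behalf of the Python worker.
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class CountPerKey(KeyedProcessFunction):

    def open(self, runtime_context: RuntimeContext):
        self.count = runtime_context.get_state(
            ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        current = self.count.value() or 0  # State read: PVM -> JVM -> state backend
        self.count.update(current + 1)     # State write, same round trip
        yield value[0], current + 1

env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([("a", 1), ("b", 2), ("a", 3)]) \
   .key_by(lambda v: v[0]) \
   .process(CountPerKey(), output_type=Types.TUPLE([Types.STRING(), Types.LONG()])) \
   .print()
env.execute("state_service_demo")
```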

img

The figure above shows the PyFlink Runtime workflow. The roles involved are the Python operator, Python runner, bundle processor, coder, and Python operation, and they run in different places: the Python operator and Python runner run in the Java JVM and connect to the upstream and downstream Java operators, while the bundle processor, coder, and Python operation run in the PVM. The bundle processor reuses the existing Apache Beam framework and receives data from the Java side over inter-process communication. The coder is the custom serializer on the Python side. When the Java side emits a record, the Python operator passes it to the Python runner, which serializes it and sends it to the bundle processor via inter-process communication; the bundle processor uses the coder to deserialize the binary payload into a Python object; finally, the Python operation takes the deserialized value as the input argument of the function body and calls the custom Python function to produce the custom result.

The bottlenecks in this process lie in the following areas: first, on the computation side, there is framework-layer overhead written in Python before the user-defined function is invoked; second, custom serialization, since data must be serialized and deserialized on both the Java side and the Python side; third, the inter-process communication itself.

A series of optimizations addresses these bottlenecks:

  • On the computation side, codegen is used to turn all the variables in the existing Python framework functions into constants, so the functions execute more efficiently; in addition, all existing Python operation implementations have been rewritten in Cython, which effectively compiles the Python into C and brings a large performance improvement;
  • On the serialization side, custom serializers implemented in pure C are provided, which are more efficient than the Python implementations;
  • On the communication side, no optimization has been achieved yet.
  • The serialization and communication problems are, at bottom, the problem of Java and Python calling each other, that is, the problem of how to optimize PyFlink's runtime architecture.

3. FFI-based PEMJA

Mutual calling between Java and Python is a fairly common problem, and many implementations already exist.

img

The first is the inter-process approach, that is, communication over a network or IPC channel, including the following:

  • The socket scheme, in which all communication protocols are implemented by hand; this is very flexible but relatively cumbersome;
  • The py4j scheme: both PyFlink and PySpark use py4j on the client side when composing jobs;
  • The Alink scheme, which uses py4j at runtime as well and also supports custom Python functions;
  • The gRPC scheme, which uses the existing gRPC services, requires no hand-rolled protocol, and comes with its own service and message definitions;
  • In addition, shared memory is another inter-process communication scheme, for example Tensorflow on Flink, which is implemented via shared memory. There is also PyArrow Plasma, an object-based shared-memory store.

All of the above are inter-process communication schemes. Could Python and Java instead run in the same process, eliminating the overhead of inter-process communication entirely?

Some existing libraries do attempt this. The first option is to convert Python to Java: for example, p2j converts Python source code into Java source code, and voc converts Python code directly into Java bytecode. The essence of this option is to turn Python into code that can run directly on the JVM. But it has many flaws: Python keeps evolving and has a rich syntax, and mapping Python constructs onto corresponding Java ones is very difficult; they are different languages, after all.

The second option is a Python interpreter implemented in Java. The first scheme here is Jython: CPython is a Python interpreter written in C that runs on a C runtime, so by the same token a Python interpreter implemented in Java can run directly on the JVM. Another scheme is GraalVM, which provides the Truffle framework so that various programming languages can share a common structure that runs on the JVM, allowing different languages to run in the same process.

The premise of this option is being able to recognize Python code, which means being compatible with all existing Python code. At present, however, full compatibility is a hard problem to solve, and it blocks the further adoption of this Python-to-JVM approach.

The third is a set of schemes based on FFI.

img

The essence of FFI is how a host language invokes a guest language, here the mutual invocation between Java and Python. There are many concrete implementations of this idea.

Java provides JNI (Java Native Interface), which lets Java users call libraries implemented in C through the JNI interface, and vice versa. JVM vendors implement JNI against this interface specification, realizing mutual calls between Java and C.

The Python/C API is similar. Python is itself an interpreter implemented in C, so it naturally supports Python code calling third-party C libraries, and vice versa.

Cython provides a tool that translates source code into code in another language: Python code is converted into very efficient C code, which is then compiled and run directly inside the CPython interpreter, with high efficiency.

ctypes wraps C libraries so that Python can call into C efficiently.
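As a small runnable illustration of the FFI idea (assuming a Unix-like system where the C standard library can be located), ctypes lets Python call straight into C in the same process, with no serialization and no IPC:

```python
# Minimal ctypes sketch: Python calling a C library function in-process.
# Assumes a Unix-like system where the C standard library can be found.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]   # declare the C signature
libc.strlen.restype = ctypes.c_size_t
print(libc.strlen(b"hello"))               # -> 5; no IPC, no serialization
```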

img

The core of the FFI-based schemes above is C. With C as the bridge, code written in Java can call C through the JNI interface, and C in turn calls the CPython API, so that in the end Java and Python run in the same thread. This is the overall idea of PEMJA. It eliminates inter-process communication, and because it uses the Python/C API provided by CPython itself, it has no compatibility problems, overcoming the shortcomings of Java-implemented interpreters.

img

The figure above shows several implementations based on this idea, but each of them has problems to a greater or lesser degree.

JPype solves the problem of Python calling Java; it does not support Java calling Python, so it does not fit this scenario.

JEP implements Java calling Python, but its implementation has many limitations. First, it can only be installed from source, which places high demands on the environment and depends on third-party CPython source files, making cross-platform installation and use difficult. Second, the startup entry point must be the JEP program, which has to dynamically load the class library and requires environment variables to be set in advance, so it is hard to embed as a third-party middleware plug-in in another architecture. In addition, there are performance problems: it does not handle the Python GIL well, so it is not particularly efficient.

PEMJA largely overcomes these problems and achieves efficient mutual calling between Java and Python.

img

The figure above compares the performance of several frameworks. A fairly standard, simple String upper function is used: the comparison targets framework-layer overhead rather than the performance of the custom function itself, so the simplest possible function was chosen, and String was chosen because it is the most common data structure in existing functions.

The comparison covers 100-byte and 1000-byte inputs under these four implementations. Jython turns out to be less efficient than one might expect; in fact it has the lowest performance of the four. JEP's performance is also far below PEMJA's. PEMJA reaches roughly 40% of the pure Java implementation at 100 bytes and exceeds the pure Java implementation at 1000 bytes.

How can this be explained? String upper is implemented in Java on the JVM side, while in Python it is backed by a C implementation, and the function itself executes faster in C than in Java. Combined with a framework overhead that is small enough, the overall performance comes out ahead of Java. This means that in some scenarios, Python UDFs can actually outperform Java UDFs.

A key reason many users choose Java UDFs over Python UDFs today is that Python UDF performance has been far worse than Java's. But if Java no longer has the performance edge, Python gains the advantage: it is a scripting language, after all, and more convenient to write.

img

The figure above shows the architecture of PEMJA.

The daemon thread in Java is responsible for initialization and final destruction, as well as the creation and release of resources in PEMJA and in the corresponding Python PVM. The user works with a PEMJA instance in Java; it is mapped to a corresponding instance inside PEMJA, and that instance creates a Python sub-interpreter. Compared with the global Python interpreter, a sub-interpreter is a smaller-grained concept that can hold the GIL and has its own independent heap space, which achieves namespace isolation. Each thread here corresponds to one Python sub-interpreter, which executes its own Python functions in the corresponding PVM.

4. PyFlink Runtime 2.0

img

PyFlink Runtime 2.0 is based on PEMJA.

The left side of the figure shows the PyFlink 1.0 architecture. It contains two processes, a Java process and a Python process; data exchange between them is realized through the data service and State service, using inter-process (IPC) communication.

With PEMJA, the data service and State service can be replaced by the PEMJA library, and the JVM on the left and the PVM on the right then run in the same process, completely eliminating the IPC overhead.

img

The figure above compares the performance of the existing PyFlink UDF, a PEMJA-based PyFlink UDF, and a Java UDF, again using the String upper function at 100 bytes and 1000 bytes. At 100 bytes, the PEMJA-based UDF reaches roughly 50% of the Java UDF's performance; at 1000 bytes, it surpasses the Java UDF. Although this also depends on how the custom function is implemented, it demonstrates the high performance of the PEMJA framework.

5. Future Work

img

In the future, the PEMJA framework will be open sourced as an independent project (it was officially open sourced on January 14, 2022), because it is a general solution that applies not only to PyFlink but to any scenario where Java and Python call each other. Its first version only supports calling Python from Java; calling Java from Python will be supported later, since State access in custom Python DataStream API functions depends on it. In addition, PEMJA will add support for native NumPy data structures; once that lands, pandas UDFs will benefit from it and their performance will improve qualitatively.
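For context, a hedged sketch of the kind of pandas UDF that would benefit: the vectorized func_type="pandas" flavor, where whole batches arrive as pandas.Series backed by NumPy arrays:

```python
# Minimal sketch of a vectorized (pandas) UDF: batches arrive as pandas.Series,
# which is where native NumPy support in PEMJA would pay off.
import pandas as pd
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(result_type=DataTypes.FLOAT(), func_type="pandas")
def ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    return a / b  # element-wise over the whole batch, no per-row Python calls
```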

You are welcome to join the "PyFlink Exchange Group" to discuss PyFlink-related issues.

img


Flink CDC Meetup Online

img

Time: May 21st 9:00-12:25

Live streaming on PC: https://developer.aliyun.com/live/248997

On mobile, we recommend scanning the code on WeChat and following the ApacheFlink video account to book the livestream:

img

For more Flink-related technical questions, scan the code to join the community DingTalk group and get the latest technical articles and community news first-hand. Please also follow the official account~

img

Recommended activities

Alibaba Cloud's enterprise-grade product based on Apache Flink, Realtime Compute for Apache Flink, is now available:
Try Realtime Compute for Apache Flink for 99 yuan (subscription, 10 CU) for a chance to receive an exclusive Flink custom sweatshirt; subscriptions of 3 months or longer get a 15% discount!
Learn more about the event: https://www.aliyun.com/product/bigdata/en

img

