
PyFlink is the Python entry point of Flink. Python itself is easy to learn, but setting up a PyFlink development environment is not: one careless step can leave the environment in a mess that is hard to troubleshoot. Today I will introduce a tool that solves these problems for PyFlink development: Zeppelin Notebook. The main content is:

  1. Preparation
  2. Set up the PyFlink environment
  3. Summary and future work

You may have heard of Zeppelin before, but earlier articles focused on developing Flink SQL in Zeppelin. Today I will introduce how to develop PyFlink jobs efficiently in Zeppelin, and in particular how to solve PyFlink's environment problems.

To summarize the topic of this article in one sentence: use Conda in a Zeppelin notebook to create a Python environment that is automatically deployed to the Yarn cluster. You do not need to manually install any PyFlink packages on the cluster, and multiple isolated versions of PyFlink can be used in the same Yarn cluster at the same time. The final effect looks like this:

1. You can use third-party Python libraries on the PyFlink client, such as matplotlib:

img

2. You can use third-party Python libraries in a PyFlink UDF:

img

Let's take a look at how to achieve it.

1. Preparation

Step 1.

Prepare the latest version of Zeppelin; how to build it is not covered here. If you have any questions, you can join the Flink on Zeppelin DingTalk group (34517043) for help. One more thing to note: Zeppelin must be deployed on Linux. If it is deployed on a Mac, a Conda environment created on the Mac machine cannot be used in the Yarn cluster, because Conda packages are not compatible across operating systems.

Step 2.

Download Flink 1.13. Note that the features described in this article can only be used with Flink 1.13 and above. Then:

  • copy flink-python-*.jar into the lib folder of Flink;
  • copy the opt/python folder into the lib folder of Flink.

Step 3.

Install the following software, which is used to create the Conda env: miniconda, mamba, and conda-pack (install links for miniconda and mamba appear in the comments of the first code block below).

2. Set up the PyFlink environment

Then you can build and use PyFlink in Zeppelin.

Step 1. Create the PyFlink Conda environment for the JobManager

Since Zeppelin has built-in shell support, you can create the PyFlink environment from a shell paragraph in Zeppelin. Note that the third-party Python packages specified here are the ones required by the PyFlink client (JobManager side), such as matplotlib. Make sure to install at least the following packages:

  • A specific version of Python (3.7 here)
  • apache-flink (1.13.1 used here)
  • jupyter, grpcio, protobuf (these three packages are required by Zeppelin)

The remaining packages can be specified as needed:

%sh

# make sure you have conda and mamba installed.
# install miniconda: https://docs.conda.io/en/latest/miniconda.html
# install mamba: https://github.com/mamba-org/mamba

echo "name: pyflink_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - jupyter
  - grpcio
  - protobuf
  - matplotlib
  - pandasql
  - pandas
  - scipy
  - seaborn
  - plotnine
 " > pyflink_env.yml
    
mamba env remove -n pyflink_env
mamba env create -f pyflink_env.yml

Run the following code to package the PyFlink Conda environment and upload it to HDFS (note that the archive format used here is tar.gz):

%sh

rm -rf pyflink_env.tar.gz
conda pack --ignore-missing-files -n pyflink_env -o pyflink_env.tar.gz

hadoop fs -rmr /tmp/pyflink_env.tar.gz
hadoop fs -put pyflink_env.tar.gz /tmp
# The Python Conda archive must be publicly accessible, so change its permissions here.
hadoop fs -chmod 644 /tmp/pyflink_env.tar.gz

Step 2. Create the PyFlink Conda environment for the TaskManager

Run the following code to create the PyFlink Conda environment on the TaskManager. The PyFlink environment on the TaskManager contains at least the following two packages:

  • A specific version of Python (3.7 here)
  • apache-flink (1.13.1 is used here)

The remaining packages are the dependencies your Python UDFs need; pandas is specified here as an example:

echo "name: pyflink_tm_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - pandas
 " > pyflink_tm_env.yml
    
mamba env remove -n pyflink_tm_env
mamba env create -f pyflink_tm_env.yml

Run the following code to package the PyFlink Conda environment and upload it to HDFS (note that the zip format is used here):

%sh

rm -rf pyflink_tm_env.zip
conda pack --ignore-missing-files --zip-symlinks -n pyflink_tm_env -o pyflink_tm_env.zip

hadoop fs -rmr /tmp/pyflink_tm_env.zip
hadoop fs -put pyflink_tm_env.zip /tmp
# The Python Conda archive must be publicly accessible, so change its permissions here.
hadoop fs -chmod 644 /tmp/pyflink_tm_env.zip

Step 3. Use the Conda environments in PyFlink

Next, you can use the Conda environments created above in Zeppelin. First, you need to configure Flink in Zeppelin; the main configuration options are:

  • Set flink.execution.mode to yarn-application; the approach described in this article only works in yarn-application mode;
  • Set yarn.ship-archives, zeppelin.pyflink.python, and zeppelin.interpreter.conda.env.name to configure the PyFlink Conda environment on the JobManager side;
  • Set python.archives and python.executable to specify the PyFlink Conda environment on the TaskManager side;
  • Set other Flink configurations as needed, for example flink.jm.memory and flink.tm.memory here.
%flink.conf


flink.execution.mode yarn-application

yarn.ship-archives /mnt/disk1/jzhang/zeppelin/pyflink_env.tar.gz
zeppelin.pyflink.python pyflink_env.tar.gz/bin/python
zeppelin.interpreter.conda.env.name pyflink_env.tar.gz

python.archives hdfs:///tmp/pyflink_tm_env.zip
python.executable pyflink_tm_env.zip/bin/python3.7

flink.jm.memory 2048
flink.tm.memory 2048
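
As a quick sanity check (my own suggestion, not from the original article), after restarting the Flink interpreter with this configuration you can confirm that the client side is actually running from the shipped Conda environment rather than the system Python:

%flink.ipyflink

# Sanity check (illustrative): print the interpreter path; it should point
# into the unpacked pyflink_env.tar.gz archive rather than /usr/bin.
import sys
print(sys.executable)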

Then, as shown at the beginning, you can use PyFlink with the specified Conda environments in Zeppelin. There are two scenarios:

Scenario 1: use third-party Python libraries on the PyFlink client (JobManager side); matplotlib is used in the screenshot below.

img
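
The screenshot content is not reproduced here; as a minimal sketch (with made-up data, assuming matplotlib and numpy are available from pyflink_env.tar.gz), such a paragraph could look like this:

%flink.ipyflink

# Minimal sketch (assumed data): this paragraph runs on the PyFlink client
# (JobManager side), so any library installed in pyflink_env.tar.gz can be
# imported here.
import matplotlib.pyplot as plt
import numpy as np  # available as a matplotlib dependency

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("matplotlib on the PyFlink client")
plt.show()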

Scenario 2: use third-party libraries from the TaskManager-side Conda environment created above inside a PyFlink UDF.

img
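
The exact UDF code is in the note linked in the summary; as a hedged sketch (the function, table, and column names are illustrative assumptions), a vectorized pandas UDF along these lines would exercise pandas from pyflink_tm_env.zip on the TaskManagers:

%flink.ipyflink

# Minimal sketch (names are illustrative): a pandas UDF executed on the
# TaskManagers, where pandas is resolved from pyflink_tm_env.zip via
# python.archives / python.executable.
from pyflink.table import DataTypes
from pyflink.table.expressions import col
from pyflink.table.udf import udf

@udf(result_type=DataTypes.DOUBLE(), func_type="pandas")
def plus_one(series):
    # `series` arrives as a pandas.Series inside the Yarn container
    return series + 1

# st_env is the table environment predefined by Zeppelin's Flink interpreter
t = st_env.from_elements([(1.0,), (2.0,), (3.0,)], ['v'])
print(t.select(plus_one(col('v'))).to_pandas())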

3. Summary and future work

To recap, this article showed how to use Conda in a Zeppelin notebook to create Python environments that are automatically deployed to the Yarn cluster. There is no need to manually install any PyFlink packages on the cluster, and multiple versions of PyFlink can be used in the same Yarn cluster at the same time.

Each PyFlink environment is isolated, and the Conda environments can be customized and changed at any time. You can download the following note and import it into Zeppelin to reproduce everything covered today: http://23.254.161.240/#/notebook/2G8N1WTTS

In addition, there are many things that can be improved:

  • Currently we need to create two Conda envs because Zeppelin only supports the tar.gz format while Flink only supports zip. Once the two formats are unified, a single Conda env will suffice;
  • The apache-flink package currently bundles Flink's jar files, which makes the packaged Conda env extremely large and slows down Yarn container initialization. If the Flink community provides a lightweight Python package (without the Flink jars), the size of the Conda env can be reduced significantly.

Registration for the 3rd Apache Flink Geek Challenge has begun, with 300,000 in prize money waiting for you!

Under the impact of massive data, the business value of data processing and analysis capabilities grows by the day, and every industry is pushing the timeliness of data processing further. Apache Flink emerged as the leading real-time computing engine in this context.

To bring more real-time computing practice to the industry and encourage developers who love technology to deepen their mastery of Flink, the Apache Flink community, together with Alibaba Cloud, Intel, the Alibaba Artificial Intelligence Governance and Sustainability Lab (AAIG), and Occlum, has organized the "3rd Apache Flink Geek Challenge and AAIG CUP", which officially launched today.


