Flink on Zeppelin series: Yarn Application mode support

Author: Zhang Jianfeng (Jian Feng)

At last year's Flink Forward, when we discussed the future of the Flink on Zeppelin project, we mentioned support for Application mode. The good news is that the community has now implemented this feature, and everyone is welcome to download the latest version and try it.

Application mode is a deployment mode introduced in Flink 1.11. It reduces the load on the client by running the user's main function in the JobManager instead of on the client. This mode suits Flink on Zeppelin well, because the client of Flink on Zeppelin is the Flink interpreter process, and the Flink interpreter is effectively a long-running main function that continuously accepts commands from the front end and acts on them (submitting a job, stopping a job, and so on). The rest of this article explains how Zeppelin implements Yarn Application mode and how to use it.

Architecture

Before describing the Yarn Application mode architecture, let's briefly review how the architecture of Flink on Zeppelin has evolved.

Normal Flink on Yarn operating mode

In this mode the client (the Flink interpreter) runs on the Zeppelin server machine, and each client corresponds to one Flink cluster on Yarn. If there are many Flink interpreter processes, they put a lot of pressure on the Zeppelin server machine.

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/wt1g3h
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=6

Yarn Interpreter mode

Yarn Interpreter mode moves the client (the Flink interpreter) into the Yarn cluster, which shifts the resource pressure to the Yarn cluster and solves some of the problems of the normal Flink on Yarn operating mode above. However, this mode needs an additional Yarn container for each Flink cluster to run the Flink interpreter, so it is not very efficient in terms of resource utilization.

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/gcah8t
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=24

Yarn Application mode

Yarn Application mode solves the problems of both previous modes by running the Flink interpreter inside the JobManager. It puts no resource pressure on the Zeppelin server machine and wastes no Yarn cluster resources on an extra container.

How to use Yarn Application mode

Configuring Yarn Application mode is simple: set flink.execution.mode to yarn_application. All other configuration is the same as in the other modes, and all of the Flink on Zeppelin features listed below work as usual in Yarn Application mode. We take this opportunity to review them.
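As a rough illustration of the setting above, the following Python sketch renders interpreter properties as the text of a Zeppelin %flink.conf paragraph, which is one way to set them per note. Only flink.execution.mode and its value come from this article. The helper render_flink_conf is hypothetical, and FLINK_HOME and HADOOP_CONF_DIR with their placeholder paths are assumptions about a typical setup.

```python
def render_flink_conf(props):
    """Render interpreter properties as the text of a %flink.conf paragraph.

    Hypothetical helper for illustration only; it is not part of Zeppelin.
    """
    lines = ["%flink.conf", ""]
    for key, value in props.items():
        # One "key value" pair per line
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# Only flink.execution.mode is taken from the article.
# The other two entries and their paths are placeholders (assumptions).
yarn_application_props = {
    "FLINK_HOME": "/path/to/flink",
    "HADOOP_CONF_DIR": "/path/to/hadoop/conf",
    "flink.execution.mode": "yarn_application",
}

print(render_flink_conf(yarn_application_props))
```

The printed text can be pasted into the first paragraph of a note, which has to run before the Flink interpreter process starts. The same property can also be set on the interpreter setting page.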

Multi-language support

The following 3 languages are supported in the same Flink cluster, and they are interoperable (they share the same Catalog and the same ExecutionEnvironment):

Scala (%flink)
PyFlink (%flink.pyflink)
SQL (%flink.ssql, %flink.bsql)

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/pg5s82
https://www.yuque.com/jeffzhangjianfeng/gldg8w/ggxz76
https://www.yuque.com/jeffzhangjianfeng/gldg8w/te2l1c
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=4

Hive integration

Hive can be enabled with a simple configuration.
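The article does not list the settings themselves. As a hedged sketch, the properties below are the ones the Zeppelin Flink interpreter documentation describes for Hive, to the best of our understanding (zeppelin.flink.enableHive, zeppelin.flink.hive.version, HIVE_CONF_DIR). The values are placeholders, and the helper missing_hive_settings is hypothetical. Check the names against the Zeppelin version you run.

```python
# Property names as described in the Zeppelin Flink interpreter docs (assumption:
# verify against your Zeppelin version). Values are placeholders.
hive_props = {
    "zeppelin.flink.enableHive": "true",
    "zeppelin.flink.hive.version": "2.3.4",
    "HIVE_CONF_DIR": "/path/to/hive/conf",
}


def missing_hive_settings(props):
    """Return Hive settings that are still empty when Hive is enabled.

    Hypothetical helper for illustration only; it is not part of Zeppelin.
    """
    if props.get("zeppelin.flink.enableHive", "false").lower() != "true":
        return []  # Hive is off, nothing else is needed
    required = ["HIVE_CONF_DIR"]
    return [key for key in required if not props.get(key)]


print(missing_hive_settings(hive_props))
```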

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/agf94n
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=10

UDF support

There are 4 ways to define and use a Flink UDF:

Write a Scala UDF directly in Zeppelin
Write a PyFlink UDF directly in Zeppelin
Create a UDF with SQL
Use flink.udf.jars to specify the jars that contain UDFs

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/dthfu2
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=17
https://www.bilibili.com/video/BV1Te411W73b?p=18
https://www.bilibili.com/video/BV1Te411W73b?p=19

Third-party dependencies

In Zeppelin, you can specify third-party dependencies in the following 2 ways:

flink.execution.packages
flink.execution.jars (Note that in Yarn Application mode you need to specify an HDFS path here. The Flink interpreter runs in the JobManager, the JobManager runs in a Yarn container, and the jar you want to specify may not exist on the NodeManager machine that hosts that container.)
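The HDFS requirement above can be checked before a note is run. The sketch below is a minimal illustration, assuming flink.execution.jars holds a comma-separated list of paths. The helper non_hdfs_jars and the example paths are hypothetical and not part of Zeppelin.

```python
from urllib.parse import urlparse


def non_hdfs_jars(mode, jars):
    """Return the entries of a comma-separated jar list that are not hdfs:// URIs.

    The check only applies in yarn_application mode; other modes return [].
    Hypothetical helper for illustration only.
    """
    if mode != "yarn_application":
        return []
    paths = [p.strip() for p in jars.split(",") if p.strip()]
    # A local path such as /tmp/b.jar has an empty URI scheme
    return [p for p in paths if urlparse(p).scheme != "hdfs"]


# Example paths are placeholders
print(non_hdfs_jars("yarn_application", "hdfs:///libs/a.jar,/tmp/b.jar"))
```

Any path the helper returns would need to be uploaded to HDFS first and referenced by its HDFS location.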

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/rn6g1s
Reference video: https://www.bilibili.com/video/BV1Te411W73b?p=15

Checkpoint & Savepoint

Checkpoints and savepoints work as usual.

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/mlnswx

SQL advanced features

Zeppelin has made a series of enhancements to Flink SQL, and they all work as usual in Yarn Application mode. For example:

Support for both Batch SQL and Streaming SQL
Multi-statement support
Comment support
Job parallelism setting
Multiple insert support
JobName setting
Streaming data visualization for Stream SQL

Reference document: https://www.yuque.com/jeffzhangjianfeng/gldg8w/te2l1c
