How Zeppelin Implements and Uses Yarn Application Mode

Author: Zhang Jianfeng (Jian Feng)

At Flink Forward last year, when we discussed the future of the Flink on Zeppelin project, we mentioned support for Application mode. The good news is that the community has now implemented this feature. Everyone is welcome to join the Flink on Zeppelin DingTalk group (32803524) and download the latest version to try it out.

GitHub address

https://github.com/apache/flink

Everyone is welcome to like the Flink project and give it a star.

Application mode is a new deployment mode introduced in Flink 1.11. It reduces the load on the client by running the user's main function in the JobManager instead of on the client. This mode suits Flink on Zeppelin very well, because the client in Flink on Zeppelin is the Flink interpreter process, which is in effect a long-running main function that continuously accepts commands from the front end and carries them out (for example, submitting or stopping a job). The following sections describe how Zeppelin implements Yarn Application mode and how to use it.

1. Architecture

Before describing the architecture of Yarn Application mode, it helps to review how the architecture of Flink on Zeppelin has evolved.

Normal Flink on Yarn operating mode

In this mode, the client (the Flink interpreter process) runs on the Zeppelin server machine, and each client corresponds to one Flink cluster on Yarn. If there are many Flink interpreter processes, they put a lot of pressure on the Zeppelin machine.

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/wt1g3h
Reference video:
https://www.bilibili.com/video/BV1Te411W73b?p=6

Yarn Interpreter mode

Yarn Interpreter mode moves the client (the Flink interpreter) into the Yarn cluster. This shifts the resource pressure from the Zeppelin server to the Yarn cluster and solves part of the problem of the normal Flink on Yarn mode above. However, this mode requires an additional Yarn container for each Flink cluster just to run the Flink interpreter, so it is not very efficient in terms of resource utilization.

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/gcah8t
Reference video:
https://www.bilibili.com/video/BV1Te411W73b?p=24

Yarn Application mode

Yarn Application mode solves the problems of both previous modes. The Flink interpreter runs inside the JobManager, so it neither adds resource pressure to the Zeppelin server machine nor wastes Yarn cluster resources on an extra container.
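
The trade-offs among the three modes can be summarized in a small model. This is only an illustrative sketch based on the descriptions above: the mode labels, the ModeFootprint class, and the footprint function are hypothetical names for this example, not Zeppelin configuration values or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeFootprint:
    interpreter_location: str           # where the Flink interpreter process runs
    interpreters_on_zeppelin_host: int  # interpreter processes on the Zeppelin machine, per Flink cluster
    extra_yarn_containers: int          # Yarn containers beyond the Flink cluster itself, per Flink cluster


# Labels are descriptive names for this sketch only.
MODES = {
    "normal": ModeFootprint("Zeppelin server machine", 1, 0),
    "yarn-interpreter": ModeFootprint("dedicated Yarn container", 0, 1),
    "yarn-application": ModeFootprint("JobManager", 0, 0),
}


def footprint(mode, clusters):
    """Return the overhead described in the text for a given number of Flink clusters."""
    m = MODES[mode]
    return {
        "interpreters_on_zeppelin_host": m.interpreters_on_zeppelin_host * clusters,
        "extra_yarn_containers": m.extra_yarn_containers * clusters,
    }


for name in MODES:
    print(name, footprint(name, clusters=10))
```

With ten Flink clusters, the model shows ten interpreter processes on the Zeppelin machine in the normal mode, ten extra Yarn containers in Yarn Interpreter mode, and neither in Yarn Application mode.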

2. How to use Yarn Application mode

Configuring Yarn Application mode is simple: just set flink.execution.mode to yarn-application. All other configuration is the same as in the other modes. All the Flink on Zeppelin features listed below work as usual in Yarn Application mode, so we take this opportunity to review them.
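
As a rough illustration, assuming the interpreter is configured inline through a %flink.conf paragraph with one "key value" pair per line (the same property can also be set on the interpreter settings page), the Python sketch below renders such a paragraph from a dictionary. Only flink.execution.mode and its value come from the text; the helper name render_flink_conf is hypothetical.

```python
def render_flink_conf(properties):
    """Render an inline configuration paragraph: a %flink.conf header
    followed by one "key value" line per property."""
    lines = ["%flink.conf"]
    for key, value in properties.items():
        # Property names must be non-empty and contain no whitespace,
        # otherwise the "key value" line would be ambiguous.
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid property name: {key!r}")
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# flink.execution.mode is the setting named in the text.
paragraph = render_flink_conf({"flink.execution.mode": "yarn-application"})
print(paragraph)
```

The printed text could be pasted into the first paragraph of a note, before any Flink paragraph starts the interpreter.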

Multi-language support

The following three languages are supported within the same Flink cluster, and they are interoperable (they share the same Catalog and ExecutionEnvironment):

  • Scala (%flink)
  • PyFlink (%flink.pyflink)
  • SQL (%flink.ssql, %flink.bsql)

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/pg5s82
https://www.yuque.com/jeffzhangjianfeng/gldg8w/ggxz76
https://www.yuque.com/jeffzhangjianfeng/gldg8w/te2l1c

Reference video:
https://www.bilibili.com/video/BV1Te411W73b?p=4

Hive integration

Hive integration can be enabled with a simple configuration.

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/agf94n

Reference video:
https://www.bilibili.com/video/BV1Te411W73b?p=10

UDF support

Four ways to define and use Flink UDFs are supported:

  • Write a Scala UDF directly in Zeppelin;
  • Write a PyFlink UDF directly in Zeppelin;
  • Create a UDF with SQL;
  • Use flink.udf.jars to specify the jar that contains the UDFs.

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/dthfu2

Reference video:

https://www.bilibili.com/video/BV1Te411W73b?p=17
https://www.bilibili.com/video/BV1Te411W73b?p=18
https://www.bilibili.com/video/BV1Te411W73b?p=19

Third-party dependencies

There are two ways to specify third-party dependencies in Zeppelin, specifically:

  • flink.execution.packages
  • flink.execution.jars (note that in Yarn Application mode you need to specify an HDFS path here, because the Flink interpreter runs in the JobManager, the JobManager runs in a Yarn container, and the jar you specify may not exist on the NodeManager machine that hosts that container)
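
The HDFS requirement can be checked before a note is run. The following is a minimal sketch, assuming flink.execution.jars holds a comma-separated list of jar paths; the function names and the example paths are hypothetical and are not part of Zeppelin.

```python
from urllib.parse import urlparse


def non_hdfs_jars(jars_value):
    """Return the entries of a comma-separated jar list that are not hdfs:// URIs."""
    entries = [item.strip() for item in jars_value.split(",") if item.strip()]
    return [item for item in entries if urlparse(item).scheme.lower() != "hdfs"]


def check_jars_for_yarn_application(mode, jars_value):
    """Raise ValueError if yarn-application mode is combined with non-HDFS jar paths."""
    if mode != "yarn-application":
        return  # The text states the HDFS requirement only for yarn-application mode.
    bad = non_hdfs_jars(jars_value)
    if bad:
        raise ValueError(
            "jars must be on HDFS in yarn-application mode: " + ", ".join(bad)
        )


# Hypothetical example value: one jar on HDFS, one on the local file system.
jars = "hdfs:///libs/connector.jar, /opt/local/udf.jar"
print(non_hdfs_jars(jars))  # prints ['/opt/local/udf.jar']
```

A local path such as /opt/local/udf.jar is flagged because the Yarn container that hosts the JobManager may not be able to see it.
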
Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/rn6g1s

Reference video:
https://www.bilibili.com/video/BV1Te411W73b?p=15

Checkpoint & Savepoint

Checkpoints and savepoints work as usual.

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/mlnswx

SQL advanced features

Zeppelin adds a series of enhancements to Flink SQL, all of which work as usual in this mode, for example:

  • Support for both batch SQL and streaming SQL
  • Multi-statement support
  • Comment support
  • Job parallelism support
  • Multiple insert support
  • JobName settings
  • Visualization of streaming data from streaming SQL

Reference documents:
https://www.yuque.com/jeffzhangjianfeng/gldg8w/te2l1c

In addition, the Alibaba Cloud open platform team is hiring outstanding big data engineers on an ongoing basis (both internships and experienced hires). Our main responsibility is to provide foundational big data and AI services to the many small and medium-sized enterprise (SME) customers on Alibaba Cloud. Your job will be to build an easy-to-use, enterprise-level big data and AI open platform around Spark, Flink, Hadoop, TensorFlow, PyTorch, and other open source components. The work offers both technical challenges and the satisfaction of building products. We use a large number of open source technologies (Hadoop, Flink, Spark, Zeppelin, Kubernetes, TensorFlow, PyTorch, etc.) and are committed to giving back to the open source community.

If you are interested in open source, big data, or AI, this is an ideal place to grow. The team includes committers and PMC members from many open source projects, such as Apache Flink, Apache Kafka, Apache Zeppelin, Apache Beam, Apache Druid, and Apache HBase. If you are interested, please send your resume to: jeffzhang.zjf@alibaba-inc.com
