Inside the Xinye Container Cloud, Part 02: Moving Flink to the Cloud

Flink is a core framework and engine for the big data and real-time data teams. As its use grows, managing Flink clusters becomes increasingly difficult. To support moving Flink to the cloud, the container cloud team has done a great deal of exploratory work so that Flink can be containerized smoothly and reliably.

One: Choosing a deployment mode

The version of Flink we run on the cloud is 1.11. Flink 1.11 defines three deployment modes: Session, Per-Job, and Application. They are compared below.

The three modes differ in two respects:

The cluster life cycle and the resource isolation guarantees
Whether the application's main() method runs on the client or on the cluster

Picture: 1-1

Application mode

In Application mode, the main() method of the user program runs on the cluster rather than on the client. The user packages the program logic and its dependencies into an executable jar, and the cluster entry point (ApplicationClusterEntryPoint) calls main() to generate the JobGraph. Application mode creates one cluster per submitted application. That cluster can be viewed as a session cluster shared only among the jobs of that application, and it shuts down when the application finishes. With this architecture, Application mode provides resource isolation and load balancing at the granularity of a whole application. Because main() runs on the JobManager, the client is spared the CPU cycles and the bandwidth otherwise needed to download dependencies locally.
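As a rough illustration of Application mode on Kubernetes, the settings below are the kind of options that Flink 1.11 accepts either in flink-conf.yaml or as -D arguments to `flink run-application -t kubernetes-application`. The cluster ID, namespace, image name, and resource sizes are hypothetical placeholders, not values from our environment.

```yaml
# flink-conf.yaml fragment (sketch; every value below is an illustrative assumption)
execution.target: kubernetes-application    # main() runs on the JobManager inside the cluster
kubernetes.cluster-id: my-flink-job         # hypothetical; one cluster per application
kubernetes.namespace: flink-jobs            # hypothetical namespace
kubernetes.container.image: registry.example.com/flink/my-flink-job:1.0.0  # hypothetical image that contains the user jar
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 2048m
taskmanager.numberOfTaskSlots: 2
kubernetes.jobmanager.cpu: 1.0
kubernetes.taskmanager.cpu: 2.0
```

In this mode the user jar is typically built into the image and referenced with a local:// path, which is one reason the mode fits an image-based release pipeline.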

Per-Job mode

Per-Job mode starts a dedicated Flink cluster for each submitted job, with its own JobManager and TaskManagers. As a result, startup latency can be relatively high. When the job finishes, the cluster is torn down and all of its resources are cleaned up. This mode gives better resource isolation, because a misbehaving job cannot affect other jobs.

Session mode

Session mode assumes an already running cluster and uses that cluster's resources to execute every submitted application. Applications running in the same session cluster share, and therefore compete for, the same resources. The advantage is that you do not pay the cost of starting a full cluster for every job. However, if one job misbehaves or crashes a TaskManager, every job running on that TaskManager is affected. Beyond the impact on the job that caused the failure, this can trigger a large-scale recovery in which all restarted jobs access the file system at the same time and make it unavailable to other services. In addition, running multiple jobs in one cluster puts more load on the JobManager, which is responsible for the bookkeeping of every job in the cluster. This mode suits short jobs for which startup latency matters a great deal, such as interactive queries.

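Xinye Technology's support for containerized services is already mature, with a complete build and release process. After comparing the strengths and weaknesses of Flink's deployment modes, we chose Application mode. Its advantages over the other two modes are clear: it offers better isolation, makes good use of resources, and fits our existing release process standards.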

Two: Flink on k8s

Picture: 2-1

When a Flink cluster is created on Kubernetes, the Flink client first connects to the Kubernetes ApiServer and submits the cluster description, including the ConfigMap and the JobManager. Kubernetes then creates the JobManager: the Kubelet pulls the image, prepares and mounts the volumes, and runs the startup command. Once the JobManager has started, the Dispatcher and the KubernetesResourceManager are available and the cluster is ready to accept one or more jobs. When a user submits a job through the Flink client, the client generates the JobGraph and uploads it to the Dispatcher together with the user jar. The JobManager requests slots from the KubernetesResourceManager. If no slots are available, the resource manager creates a TaskManager, which then registers with the cluster. That, in brief, is how Flink interacts with Kubernetes.

Three: Build and release

Xinye uses Flink 1.11 in Application mode. We treat each job as an application, so a job is released through the same process as an ordinary application:

Request an application for the Flink job
Upgrade any job that is not on version 1.11 to 1.11, and integrate the Maven plugin that builds the image
Build the image on the aladdin packaging platform
On the stargate platform, select the application and the built image version, then release the job

1. Build

Picture: 3-1

2. Release

Picture: 3-2

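When a program is upgraded, the job can be stopped with a savepoint, which keeps a complete snapshot of the job. On the next start, the job can be restored from that savepoint.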
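A minimal sketch of the configuration side of this stop-and-restore cycle, assuming savepoints are written to HDFS. The directory and the savepoint path are hypothetical. With the Flink CLI, a job is usually stopped with a savepoint via `flink stop -p <targetDirectory> <jobId>`, and the next start then points at the savepoint that was written.

```yaml
# flink-conf.yaml fragment (sketch; both paths are hypothetical)
state.savepoints.dir: hdfs:///flink/savepoints     # default target directory for savepoints

# Set only for the start that should resume from the savepoint:
execution.savepoint.path: hdfs:///flink/savepoints/savepoint-xxxxxx-yyyyyyyyyyyy
execution.savepoint.ignore-unclaimed-state: false  # fail if savepoint state cannot be mapped to the new job
```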

Four: Monitoring and alerting

Once Flink runs on kubernetes, job monitoring and operations need supporting facilities of their own. When a Flink job goes down, how do we detect it and raise an alert? A running job can also become unhealthy, for example through long checkpoint times, frequent GC, or restarts. How do we monitor these conditions?

1. Configure the probe

While a Flink job runs, its JobManager handles resource management and job scheduling. We therefore add a probe to the JobManager of every Flink job to check whether the job is running normally. When the health check fails, we send an alert through the zmon platform.

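    readinessProbe:
      httpGet:
        path: /jobs/overview    # Flink REST endpoint listing jobs and their states
        port: 8081              # JobManager REST port
        scheme: HTTP
      initialDelaySeconds: 30   # give the JobManager time to start
      timeoutSeconds: 3
      periodSeconds: 5
      successThreshold: 3
      failureThreshold: 12      # 12 failures at a 5 s period: about 60 s before the pod is marked not ready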

2. Metrics collection

The company's applications on the cloud already use prometheus to collect metrics, so we use prometheus for Flink as well and display the metrics in grafana (the figure below shows part of the dashboard).

Figure: 4-1
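One common way to expose Flink metrics to prometheus is Flink's bundled PrometheusReporter. The fragment below is a sketch of that setup, not necessarily the exact configuration we run. The reporter name prom is an assumption, and 9249 is the reporter's default port. The reporter's jar must also be available to Flink.

```yaml
# flink-conf.yaml fragment (sketch)
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249   # port that prometheus scrapes on each JobManager and TaskManager pod
```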

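1. Among system metrics, job availability matters most, for example uptime (how long the job has been running continuously) and fullRestarts (the number of times the job has restarted).
2. Next comes checkpoint information, such as the duration of the most recently completed checkpoint (lastCheckpointDuration), the size of the most recently completed checkpoint (lastCheckpointSize), the job's ability to recover after a failure (lastCheckpointRestoreTimestamp), the numbers of completed and failed checkpoints (numberOfCompletedCheckpoints, numberOfFailedCheckpoints), and the barrier alignment time in exactly-once mode (checkpointAlignmentTime).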
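The metrics above can drive alerts. The rules below are a sketch in prometheus alerting rule format, not our production rules. They assume the metrics are exported by Flink's PrometheusReporter under its usual flink_jobmanager_job_ prefix with a job_name label; the thresholds, time windows, and alert names are hypothetical.

```yaml
# prometheus alerting rules (sketch; thresholds and names are illustrative assumptions)
groups:
  - name: flink-job-health
    rules:
      - alert: FlinkJobRestarted
        # fullRestarts is exported as a gauge, so delta() shows its growth over the window
        expr: delta(flink_jobmanager_job_fullRestarts[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Flink job {{ $labels.job_name }} restarted in the last 10 minutes"
      - alert: FlinkCheckpointSlow
        # lastCheckpointDuration is reported in milliseconds
        expr: flink_jobmanager_job_lastCheckpointDuration > 60000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Checkpoints of {{ $labels.job_name }} take longer than 60 s"
      - alert: FlinkCheckpointFailing
        expr: delta(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Flink job {{ $labels.job_name }} had failed checkpoints in the last 10 minutes"
```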

Five: High availability

A job may go down because of machine maintenance or a hardware failure, so restoring it quickly is another problem to solve. Our current approach is to have a program restore the job automatically.

As shown below:

Figure: 5-1

Note:

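Because this recovery method does not suit every job, it is configurable on the platform. If high availability is enabled for a job, then by default, when the job is detected as down, it is restored directly from a checkpoint. If it is not enabled, the owner of the job is notified when the job goes down and, after receiving the alert, must restore the job manually.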
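Automatic recovery from a checkpoint presupposes that checkpoints are taken periodically and retained after the job stops. The fragment below is a sketch of Flink settings that would provide this. The interval, path, and restart values are hypothetical, and the platform's own recovery program is not shown.

```yaml
# flink-conf.yaml fragment (sketch; all values are illustrative assumptions)
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints   # hypothetical path
execution.checkpointing.interval: 60s
# Keep the latest checkpoint when the job is cancelled or fails, so a new cluster can resume from it
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# In-cluster restarts for transient failures
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10s
```

A newly started cluster can then resume from the latest retained checkpoint by passing that checkpoint's path the same way a savepoint path is passed.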

Six: Problems encountered

1. Network issues

Flink needs to access the ApiServer during deployment, and the job reaches the ApiServer through its ClusterIP. The company's cluster uses a macvlan network, from which the ClusterIP is unreachable. To support Flink deployment, we therefore enabled a dual network interface mode (one macvlan, one bridge) for the relevant pods in the big data cluster.

2. ip management

Because Flink is deployed in kubernetes as a Deployment, we cannot directly manage the IPs of the Flink pods created this way. The container cloud team therefore deployed a webhook service that watches pod events in the cluster and allocates and releases the corresponding IPs.

Picture: 6-1

3. Configuration issues

Flink relies on hdfs for checkpoints and savepoints, so the pods need access to the hdfs service while running. We put the hdfs-site.xml and core-site.xml configuration into a ConfigMap in the kubernetes cluster in advance, and specify that ConfigMap when Flink starts.

Picture: 6-2
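A sketch of what such a ConfigMap could look like follows. The name, namespace, and the HDFS address are hypothetical, and the XML bodies are trimmed to placeholders. Flink's native kubernetes integration offers the option kubernetes.hadoop.conf.config-map.name for pointing at an existing ConfigMap of this kind, in versions that support it; whether that matches our exact wiring is an assumption.

```yaml
# kubernetes ConfigMap holding the Hadoop client configuration (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf          # hypothetical name
  namespace: flink-jobs      # hypothetical namespace
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://example-nameservice</value>  <!-- hypothetical address -->
      </property>
    </configuration>
  hdfs-site.xml: |
    <configuration>
      <!-- HDFS client settings such as nameservices and NameNode addresses go here -->
    </configuration>
```

With a ConfigMap like this in place, setting kubernetes.hadoop.conf.config-map.name to hadoop-conf at startup would make the files available to the JobManager and TaskManager pods.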

Xinye Container Cloud Platform

stargate, the container cloud platform of Xinye Technology, is an enterprise-grade container management platform built on kubernetes. It covers a wide range of business scenarios and has a rich ecosystem. It is now open source.

Open source repository: https://github.com/ppdaicorp/stargate
