Abstract: This article is compiled from a talk by He Jun, a senior R&D engineer on Xiaohongshu's data flow team, at the platform construction session of Flink Forward Asia 2021. It introduces how Xiaohongshu built its management of Flink tasks on K8s, the migration to the Native Flink on K8s solution, and some practical experience gained along the way. The main contents include:
- Multi-cloud deployment architecture
- Business scenarios
- Helm cluster management mode
- Native Flink on Kubernetes
- Stream-batch integrated job management and control platform
- Future outlook
1. Multi-cloud deployment architecture
The figure above shows the current multi-cloud deployment of our Flink clusters. Business data is scattered across cloud vendors, so Flink clusters are naturally deployed on multiple clouds to process it. The cloud storage products serve internal offline data storage on the one hand, and Flink checkpoint storage on the other.
On top of these cloud infrastructures, we built the Flink engine to support both SQL and JAR tasks. Thanks to earlier work promoting the SQLization of tasks, the ratio of internal SQL tasks to JAR tasks has reached 9:1.
Above this sits the stream-batch integrated job management and control platform, whose main functions include: job development, operations and maintenance, task monitoring and alerting, task version management, data lineage analysis, metadata management, and resource management.
The platform's input data has three main parts. The first is business data that lives in the business's internal DB systems, such as MySQL or MongoDB. The other two are front-end and back-end tracking (instrumentation) data: front-end tracking mainly captures user behavior logs in the Xiaohongshu app, while back-end tracking mainly covers application performance metrics inside the app. After being processed by the Flink clusters, the data flows to three main business destinations: first, message buses such as Kafka and RocketMQ clusters; second, OLAP engines such as StarRocks or ClickHouse; and finally, online systems such as Redkv or ES for online queries.
2. Business Scenarios
Flink has many application scenarios at Xiaohongshu, such as real-time anti-fraud monitoring, real-time data warehousing, real-time algorithm recommendation, and real-time data transmission. This chapter focuses on two of them.
The first is real-time recommendation algorithm training. The above figure shows the execution flow of recommendation algorithm training.
The Flink cluster first receives the raw data collected by the tracking service, performs attribution on it, and writes the result to a Kafka cluster. Another Flink task then aggregates this data to produce summary label data, which follows three real-time processing paths:
- First, the summary label data is joined with the feature data of the notes served by the recommendation engine. This join is also performed in a Flink task, internally called the FeatureJoiner task, and produces algorithm training samples. Training on these samples produces a recommendation model, which is eventually fed back to the real-time recommendation engine.
- Second, the summary label data is written in real time through Flink to OLAP engines, such as Hologres or ClickHouse.
- Finally, the summary label data is written through Flink to offline Hive tables for subsequent offline reports.
The second scenario is the real-time data warehouse. The business data includes front-end and back-end tracking data, which, after being routed according to business rules, is written to Kafka or RocketMQ. Flink then performs real-time ETL on this data, which finally lands in the real-time data center. At present, the real-time data center is mainly built on StarRocks, a very powerful OLAP engine that carries many of the company's real-time services. On top of the data center, we also support many important real-time indicators, such as real-time DAU, real-time GMV, real-time live-broadcast attribution, and real-time advertising billing.
3. Helm cluster management mode
For a long time before officially migrating to Native Flink on K8s, cluster management was based on Helm. Helm is a package manager for K8s that can define, install, and upgrade K8s applications and services, and it has the following features:
- First, it can manage complex K8s applications. Creating a Flink cluster involves many K8s resources, such as Services, ConfigMaps, and Deployments. Helm packages these resources into a Helm chart and manages them as a unit, so there is no need to deal with the underlying descriptor file of each individual resource.
- Second, upgrades and rollbacks are more convenient: a single command upgrades or rolls back the whole chart. Because the chart's code is isolated from the Flink client's code, the Flink client does not need to be modified during upgrades, decoupling the two.
- Third, it is very easy to share. Once the Helm chart is deployed to the company's private repository, it can manage Flink clusters across multiple cloud products at the same time.
The figure above shows the lifecycle of a Helm-managed Flink task, which is divided into two stages: starting the task and stopping it. There are three roles. The first is the Client, which may be an API request or a user's click on the interface. When starting a task, the Baichuan platform receives the API request and executes the install command through the Helm client to create the corresponding cluster resources. The integrated Flink client also checks whether the JobManager of the new cluster has started, and once it has, submits the job. After the job is submitted and running, the Flink client continuously checks its running status; this is how job status is maintained under the Helm management mode.
The second stage is stopping the task. The Client sends a stop command to the Baichuan platform, which forwards a cancel command to the JobManager through the Flink client. It then checks whether the cancel command succeeded; once the job is confirmed canceled, it executes the delete command through the Helm client to destroy the cluster resources.
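As a rough illustration of this lifecycle, the sketch below drives the same two stages with the Helm CLI and Flink's public REST API. The chart name, release naming, and helper structure are illustrative assumptions, not the Baichuan platform's actual code.

```python
import subprocess
import time

import requests

def start_task(release: str, values_file: str, rest_url: str, jar: str) -> None:
    # Stage 1a: create the cluster resources (Deployments, ConfigMaps, ...)
    # packaged in the chart. "flink-cluster-chart" is an assumed chart name.
    subprocess.run(["helm", "install", release, "flink-cluster-chart",
                    "-f", values_file], check=True)
    # Stage 1b: wait until the JobManager's REST endpoint is reachable.
    while True:
        try:
            requests.get(f"{rest_url}/overview", timeout=3)
            break
        except requests.RequestException:
            time.sleep(5)
    # Stage 1c: submit the job through the Flink client.
    subprocess.run(["flink", "run", "-d",
                    "-m", rest_url.removeprefix("http://"), jar], check=True)

def stop_task(release: str, rest_url: str, job_id: str) -> None:
    # Stage 2a: send a cancel command to the JobManager.
    requests.patch(f"{rest_url}/jobs/{job_id}?mode=cancel", timeout=10)
    # Stage 2b: confirm the cancel actually took effect before tearing down.
    while requests.get(f"{rest_url}/jobs/{job_id}", timeout=10) \
            .json()["state"] not in ("CANCELED", "FINISHED", "FAILED"):
        time.sleep(5)
    # Stage 2c: destroy the cluster resources.
    subprocess.run(["helm", "delete", release], check=True)
```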
The image above shows which K8s resources are created through Helm.
- The first is the most basic part: the JobManager and TaskManager Deployments;
- The second is the ConfigMap, mainly carrying the log4j configuration and the configuration for each cloud vendor's storage products;
- The third is the Ingress, currently used for accessing the Flink web UI and querying the current job status from the JobManager;
- The fourth is the NodePort Service: every time a JobManager starts, a NodePort Service is created for it and bound to the Ingress;
- The fifth is disk resources, with two main scenarios: efficient cloud disks must be mounted when the RocksDB state backend is used, and batch tasks need disks mounted for intermediate data exchange;
- The last is ServiceMesh: TaskManagers access third-party services, such as the Redkv service, through sidecars, and the configuration of these services is also created here.
As the figure above shows, the Helm client integrates the K8s-related configurations of the major cloud vendors. When it receives the parameters for creating a task, it renders different Helm templates according to those parameters and submits them to the corresponding cloud, where the cluster resources are created.
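A minimal sketch of that per-cloud rendering, assuming one Helm values file per vendor (the vendor keys, file names, and chart name below are made up for illustration):

```python
import subprocess

# Assumed mapping from cloud vendor to the chart's values file.
CLOUD_VALUES = {
    "vendor-a": "values-vendor-a.yaml",
    "vendor-b": "values-vendor-b.yaml",
}

def render_cluster(release: str, cloud: str, params: dict) -> str:
    """Render the chart for one cloud; the result is applied on that cloud."""
    args = ["helm", "template", release, "flink-cluster-chart",
            "-f", CLOUD_VALUES[cloud]]
    for key, value in params.items():  # task-specific overrides
        args += ["--set", f"{key}={value}"]
    return subprocess.run(args, check=True, capture_output=True,
                          text=True).stdout
```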
Under this cluster management mode, we still ran into many problems in actual production:
- The first is a K8s resource bottleneck. Every JobManager start creates a NodePort Service, which occupies a port and a ClusterIP in the whole cluster (the default K8s NodePort range, 30000-32767, allows only a few thousand such Services). Once the job scale reaches a certain level, these port and IP resources hit a performance bottleneck.
- The second is the high cost of ServiceMesh configuration changes. As mentioned above, TaskManagers access third-party services such as the Redkv service, so every time such a service is added, the corresponding configuration must be modified and released, which is a relatively costly process.
- The third is a degree of resource leakage. All resource creation and destruction is done by executing Helm commands; in some abnormal cases, a job failure can leave the Helm delete command unexecuted, leaking the resources.
- The fourth is that image versions are hard to converge. In daily production, when an online task has a problem, a temporary hotfix image is built and run online. Over time, many image versions accumulate online, which is a big challenge for subsequent operations and troubleshooting.
- The last is the high complexity of UDF management, a problem that any distributed computing platform encounters.
In response to the above problems, we optimized and solved them one by one in the Native Flink on K8s mode.
4. Native Flink on Kubernetes
First, why choose this deployment model? Because it has the following three characteristics:
- Shorter failover time;
- Resources are managed natively: TaskManager pods no longer need to be created manually, and they are destroyed automatically;
- More convenient HA. Before Flink 1.12, JobManager HA relied on third-party ZooKeeper. In the Native Flink on K8s mode, JobManager HA can instead rely on the native K8s leader-election mechanism.
The figure above is the architecture diagram of Native Flink on K8s. A K8s client is integrated into the Flink client and communicates directly with the K8s API server to create the JobManager Deployment and ConfigMap. After the JobManager Deployment is created, the resource manager module inside it talks directly to the K8s API server to create and destroy TaskManager pods, which is a big difference from the traditional session cluster mode.
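For reference, this is roughly what submitting a job in this mode looks like with the stock Flink 1.13 client; the image name, HA storage path, and JAR path below are placeholders:

```python
import subprocess

def submit_application(cluster_id: str, image: str) -> None:
    subprocess.run([
        "flink", "run-application", "-t", "kubernetes-application",
        f"-Dkubernetes.cluster-id={cluster_id}",
        f"-Dkubernetes.container.image={image}",
        # JobManager HA via K8s leader election instead of ZooKeeper:
        "-Dhigh-availability=org.apache.flink.kubernetes."
        "highavailability.KubernetesHaServicesFactory",
        "-Dhigh-availability.storageDir=oss://flink-ha/recovery",  # placeholder
        "local:///opt/flink/usrlib/job.jar",  # JAR baked into the image
    ], check=True)
```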
Internally, UDFs are divided into two categories:
- The first category is platform built-in UDFs: the UDFs frequently used in daily production are abstracted, summarized, and built into the image. The image contains a UDF configuration file listing each UDF's name and type and specifying its implementation class (see the sketch after this list).
- The other category is user-defined UDFs. Under the Helm management mode, user-defined UDF management was fairly coarse: all UDF-related JAR packages under a user's project were loaded into the classloader, which caused class conflicts. In the Native Flink mode, a CREATE FUNCTION ... USING JAR syntax was implemented so that only the JAR packages of the UDFs a job actually needs are loaded on demand, greatly alleviating class conflicts.
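A sketch of how the built-in UDF config file might be consumed at job startup. The file path and JSON schema are assumptions; the CREATE FUNCTION statement itself is standard Flink 1.13 SQL, while the USING JAR clause mentioned above is Xiaohongshu's internal extension.

```python
import json

def register_builtin_udfs(t_env) -> None:
    """Register the UDFs baked into the image from its config file."""
    with open("/opt/flink/conf/builtin-udfs.json") as f:  # assumed path
        for udf in json.load(f):
            # each entry: {"name": ..., "type": "SCALAR|TABLE|AGGREGATE",
            #              "class": "com.example.SomeUdf"}
            t_env.execute_sql(
                f"CREATE TEMPORARY SYSTEM FUNCTION {udf['name']} "
                f"AS '{udf['class']}' LANGUAGE JAVA"
            )
```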
In the original mode, image management meant packaging all the code into one big image, but then any change to any module required recompiling and repackaging the entire code base, which was very time-consuming.
Under the Native Flink version, image version management was optimized by splitting the Flink image into three parts: the Flink engine, the connectors, and third-party plug-ins. Each part has its own version number, and they can be freely assembled and combined. This optimization reduces how often the engine must be packaged, which in turn improves release efficiency.
After the split, how are these images combined into a runnable image? Below is the example of loading a Kafka SDK plug-in. When a job runs, it obtains the Kafka SDK version the job should use from a dynamic configuration repository and passes it to the Baichuan backend. That SDK version corresponds to an image in the Docker repository containing only the JAR package of that SDK. When the Baichuan backend renders the pod template, it loads this image in the InitContainer stage and moves its Kafka JAR package to a specified directory in the Flink container, completing the loading.
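The pod template produced in that step might look like the sketch below. Image names, paths, and the loader command are illustrative; only the flink-main-container name is mandated by Flink's pod template support.

```python
def kafka_sdk_pod_template(sdk_version: str) -> dict:
    """Pod template fragment loading one Kafka SDK JAR via an InitContainer."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "initContainers": [{
                "name": "kafka-sdk-loader",
                # image that contains only this SDK version's JAR
                "image": f"registry.example.com/kafka-sdk:{sdk_version}",
                "command": ["cp", "/sdk/kafka-sdk.jar", "/plugins/"],
                "volumeMounts": [{"name": "plugins", "mountPath": "/plugins"}],
            }],
            "containers": [{
                # Flink requires this exact container name in pod templates
                "name": "flink-main-container",
                "volumeMounts": [{"name": "plugins",
                                  "mountPath": "/opt/flink/plugins/kafka"}],
            }],
            "volumes": [{"name": "plugins", "emptyDir": {}}],
        },
    }
```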
In the new mode, the job status maintenance mechanism was refactored by introducing a headless Service and a status DB. Inside the JobManager, a job status listener continuously watches job status changes and uploads them to the status DB, so the Baichuan platform can obtain task status by querying the DB. In addition, in some scenarios the status upload may fail, leaving Baichuan unable to obtain the task status that way; Baichuan can then still follow the original path and query the JobManager through the Ingress. The difference from before is that this Ingress is now bound to a headless Service and does not occupy a ClusterIP, which solves the shortage of K8s ClusterIPs and NodePorts in the previous mode.
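The key property of the new setup is that the Service behind the Ingress is headless, so it allocates no ClusterIP at all. A minimal manifest sketch (names and labels are assumptions):

```python
def jm_rest_headless_service(cluster_id: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{cluster_id}-rest"},
        "spec": {
            "clusterIP": "None",  # headless: consumes no ClusterIP or NodePort
            "selector": {"app": cluster_id, "component": "jobmanager"},
            "ports": [{"name": "rest", "port": 8081}],
        },
    }
```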
After this optimization work, the biggest remaining problem was how to smoothly migrate tasks from the old version to Flink 1.13, which was a very challenging task. The work covered the following four aspects:
- First, a compatibility conversion tool. It converts SQL so that it passes syntax validation when running on 1.13. From 1.10 to 1.13 there were several major version changes, and SQL definitions are incompatible in many places. For example, in 1.10 and 1.11 the Kafka connector version property took the value 0.11, while from 1.13 the corresponding value is universal; the raw SQL definitely won't run on 1.13 without conversion.
- Second, a compatibility detection tool. Its purpose is to check whether SQL running on 1.13 can be restored from a lower-version savepoint, mainly from the following aspects: whether operator IDs or names change after the upgrade, and whether the max parallelism differs between the old and new versions, because in some scenarios a change in max parallelism makes it impossible to restore from the old savepoint.
- Third, precompilation. The converted SQL is precompiled on 1.13 to check whether compilation passes. This compatibility checking also surfaced many low-to-high-version incompatibilities. A new data type mechanism was introduced: 1.11 did not use ExternalSerializer, while 1.12 and later wrap types with ExternalSerializer; BaseRowSerializer was renamed to RowDataSerializer in Flink 1.11; and the serialVersionUID in data types, previously a random long, was fixed to 1 in 1.13. These incompatibilities made it impossible for 1.13 to restore directly from a lower-version savepoint, so corresponding modifications were made on the engine side.
Fourth, a migration tool, with three main goals (a rough sketch follows this list):
- First, minimize the impact on user jobs. To achieve this, we made a fairly large modification to the application mode of Native Flink on K8s: natively, application mode applies for resources while scheduling the job, but to reduce the impact during upgrades, we apply for resources in advance and finish compiling the SQL (i.e., pre-start the JobManager). Only after that is the old job stopped and the new job started, keeping the impact on the user's job within about 30 seconds for medium-scale tasks.
- Second, state must not be lost during migration: all migrations are started from a savepoint, so no state data is lost.
- Finally, if an exception occurs during the upgrade, automatic rollback is supported.
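A rough sketch of the migration flow under these three goals. The stop-with-savepoint call is Flink's public REST API; pre_start_new_cluster, wait_for_savepoint, run_from_savepoint, and rollback_to_old_version are assumed internal helpers, not real APIs.

```python
import requests

def migrate(old_rest: str, job_id: str, new_job_spec: dict) -> None:
    # Goal 1: pre-start the 1.13 JobManager and pre-compile the SQL, so the
    # gap visible to the job stays around 30 s for medium-scale tasks.
    new_cluster = pre_start_new_cluster(new_job_spec)
    # Goal 2: stop the old job with a savepoint so no state is lost.
    resp = requests.post(f"{old_rest}/jobs/{job_id}/stop",
                         json={"drain": False}, timeout=30).json()
    # polls GET /jobs/{job_id}/savepoints/{trigger_id} until it completes
    savepoint = wait_for_savepoint(old_rest, job_id, resp["request-id"])
    try:
        run_from_savepoint(new_cluster, new_job_spec, savepoint)
    except Exception:
        # Goal 3: automatic rollback, resuming the old-version job from
        # the same savepoint, if anything goes wrong during the upgrade.
        rollback_to_old_version(new_job_spec, savepoint)
        raise
```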
During actual use of application mode, some problems with native Flink were also found, and corresponding fixes were made.
For example, when the JobManager fails over, it pulls up a whole new batch of TaskManagers, temporarily doubling the TaskManager resources. If the resource pool cannot satisfy the doubled demand, the failover may fail. Moreover, even if the failover succeeds, the restarted job recovers from the recovery path specified at first startup, which may by then point to a position ten days old, leading to repeated data consumption. Our fix: when a JobManager failover is detected, the engine directly fails the job and raises an alert, and the situation is then handled by manual intervention.
5. Stream-batch integrated job management and control platform
The stream-batch integrated job management and control platform mainly provides the following modules: job development and operations, version management, monitoring and alerting, resource management, data lineage, metadata management, and an SDK. Resource management is divided into resource isolation and resource recommendation. Data lineage mainly displays the upstream and downstream relationships of Flink tasks. Metadata management mainly targets the user catalog tables.
The upper part of the figure above is the SQL development interface. The main body of the page is the SQL editor; on the right are the task's basic information, version information, job parameters, and interface elements for resource configuration.
The lower part is the task operations interface, which provides many routine operations, such as stopping the task, or taking a savepoint first and then stopping the task.
Job version management covers both Flink SQL tasks and Flink JAR tasks. On the SQL task interface, you can see that a SQL job may have gone through many releases, and the "More" button provides a rollback operation. For Flink JAR tasks, there are currently two submission methods: directly upload the user's JAR package to a distributed storage path, or specify the JAR version via a code repository tag.
Resource management is mainly divided into resource isolation and resource recommendation. The concept of a resource pool is introduced here, segmented along the following dimensions:
- The first factor is the cloud environment in which it runs;
- The second factor is the type of business;
- The third factor is whether the resource pool serves streaming or batch tasks.
In addition, for a task that has been running for some time, the platform combines its historical CPU, memory, consumer lag, and other metrics to recommend the optimal K8s resource configuration for the task.
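As an illustration of the idea (the percentile, headroom factor, and lag threshold below are invented for the sketch, not Xiaohongshu's actual policy):

```python
import statistics

def recommend_resources(cpu_samples, mem_mb_samples, max_lag,
                        lag_threshold=1000):
    """Suggest K8s resource requests from a job's historical usage metrics."""
    if max_lag > lag_threshold:
        return None  # the job is lagging; don't recommend shrinking it
    p95 = lambda xs: statistics.quantiles(xs, n=20)[18]  # 95th percentile
    return {
        "cpu": round(p95(cpu_samples) * 1.2, 2),       # 20% headroom
        "memory_mb": int(p95(mem_mb_samples) * 1.2),
    }
```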
Rugal is the company's internal scheduling platform. It can create tasks on a schedule and submit them to the Baichuan platform through the SDK that Baichuan provides. On the left of the figure above is a SQL editing template, in which many parameters appear as variables. When calling the SDK, the actual values for these variables are passed in and used to render the concrete SQL to be executed, generating a specific execution instance.
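The variable substitution could look like the sketch below; the template text and variable names are invented for illustration:

```python
from string import Template

# SQL template as edited on the platform; ${...} marks scheduler variables.
SQL_TEMPLATE = Template(
    "INSERT INTO ${sink_table} "
    "SELECT * FROM ${source_table} WHERE dt = '${dt}'"
)

def build_instance(params: dict) -> str:
    """Render the concrete SQL for one scheduled execution instance."""
    return SQL_TEMPLATE.substitute(params)

sql = build_instance({"sink_table": "dws_example",
                      "source_table": "dwd_example",
                      "dt": "2021-05-21"})
```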
6. Future Outlook
Finally, our plans for future work:
- First, dynamic resource adjustment. At present, once a Flink job is submitted, the resources an operator occupies cannot be modified while it runs. We hope that in the future an operator's resources can be adjusted without restarting the job.
- Second, a cross-cloud multi-active solution. Currently, the company's core P0 jobs basically run dual-link, but only within a single cloud. We hope to implement a cross-cloud active-active solution for these core tasks, so that when a task on one cloud has a problem, it can switch stably to another cloud.
- Third, batch task resource scheduling optimization. Most batch tasks start executing after midnight, and many are scheduled at the same time, so some may not run on time because they cannot obtain resources. There is still room for optimization in the task scheduling and execution strategy.