Background of platform construction
Traditional offline data development suffers from poor timeliness and cannot keep up with the rapid iteration required by Internet businesses. With the fast growth of real-time technologies represented by Flink, real-time computing is being adopted by more and more enterprises, but various problems also arise in practice. For example, developers face a high barrier to entry, the quality of the resulting business data is not guaranteed, and the lack of unified platform management makes maintenance difficult. Under the influence of these unfavorable factors, we decided to build a complete real-time computing platform on top of the existing Flink technology.
Overall platform architecture
From the perspective of the overall architecture, the real-time computing platform can be roughly divided into three layers:
- Computing platform
- Scheduling platform
- Resource platform
Each layer is responsible for its own functions, while interaction also takes place between layers, in line with the design principle of high cohesion and low coupling. The architecture diagram is as follows:
Computing platform
It is used directly by developers. It can connect to various external data sources according to business needs and provide them to downstream tasks. Once the data sources are configured, data synchronization and SQL-based data computation can be carried out on top of the Flink framework, and running tasks can be monitored and alerted on across multiple dimensions.
Scheduling platform
After this layer receives the task content and configuration from the computing platform, the core work begins; this is also the focus of the content that follows. Here is only a general introduction; different plug-ins are used for analysis depending on the task type.
- Data synchronization task: after receiving the JSON from the upper layer, it enters the FlinkX framework, generates the corresponding DataStream according to the source end and the write target end, and finally converts it into a JobGraph.
- Data computation task: after receiving the SQL from the upper layer, it enters the FlinkStreamSQL framework, parses the SQL, registers it as tables, generates the transformations, and finally converts them into a JobGraph.
The scheduling platform submits the obtained JobGraph to the corresponding resource platform to complete the task submission.
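Both paths end the same way: a Flink program is turned into a JobGraph for submission. A minimal sketch of that shared final step, using the standard Flink DataStream APIs (the actual FlinkX and FlinkStreamSQL code paths differ in detail and across Flink versions), looks roughly like this:

```java
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.graph.StreamGraph;

public class JobGraphSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder pipeline standing in for the DataStream / transformations
        // that FlinkX or FlinkStreamSQL would build from the user's JSON or SQL.
        env.fromElements("a", "b", "c").print();

        // Both task types end up as a StreamGraph, which is then translated into
        // the JobGraph that the scheduling platform submits to YARN or Kubernetes.
        StreamGraph streamGraph = env.getStreamGraph();
        JobGraph jobGraph = streamGraph.getJobGraph();
        System.out.println("JobGraph id: " + jobGraph.getJobID());
    }
}
```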
Resource platform
Currently, multiple resource clusters of different types can be connected, such as YARN and Kubernetes.
Data synchronization and data calculation
In the scheduling platform, after the user's task is received, a series of conversion operations begins and the task is eventually run. Let's look at how to build a real-time computing platform based on Flink from the underlying technical details, and how to use FlinkX and FlinkStreamSQL for one-stop development.
FlinkX
As the first and most basic step of data processing, let's take a look at how FlinkX carries out secondary development on top of Flink. The user only needs to pay attention to the JSON script of the synchronization task and some configuration, without caring about the details of calling Flink, and FlinkX supports the functions shown in the figure below.
Let's first look at the process involved in Flink task submission. The interaction flow chart is as follows:
So how does FlinkX encapsulate and call the above components on the basis of Flink, making Flink easier to use as a data synchronization tool?
The extensions mainly cover three parts: the Client, the JobManager, and the TaskManager. The content involved is as follows:
Client side
FlinkX carries out some customized development of the native Client. In the FlinkX-launcher module, the main steps are as follows:
- Parse parameters such as the parallelism, the savepoint path, the program's entry JAR (usually the one written in the Flink demo), the configuration in flink-conf.yaml, and so on;
- Generate a PackagedProgram from the program's entry JAR, the external input parameters, and the savepoint parameters;
- Call the main method of the entry JAR specified in the PackagedProgram via reflection; inside the main method, the corresponding plug-ins are loaded according to the reader and writer configured by the user;
- Generate the JobGraph, add the required resources (JARs required by Flink, reader and writer JARs, Flink configuration files, etc.) to the shipFiles of the YarnClusterDescriptor, and finally let the YarnClusterDescriptor interact with YARN to start the JobManager;
- After the task is submitted successfully, the client gets the applicationId returned by YARN, and the status of the task can then be tracked through this applicationId.
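A hedged sketch of these client-side steps using Flink's public APIs follows; class and method names vary across Flink versions, the file paths and arguments are hypothetical, and the real FlinkX-launcher code is more involved:

```java
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.jobgraph.JobGraph;

import java.io.File;

public class LauncherSketch {
    public static void main(String[] args) throws Exception {
        Configuration flinkConf = new Configuration();   // loaded from flink-conf.yaml in practice

        // Build the PackagedProgram from the entry JAR and the user's arguments
        // (the JSON describing reader/writer is passed through as a program argument).
        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(new File("/path/to/flinkx-entry.jar"))   // hypothetical path
                .setArguments("-job", "/path/to/sync-job.json")      // hypothetical arguments
                .build();

        // Reflectively run the main method and capture the resulting JobGraph.
        int parallelism = 2;
        JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, flinkConf, parallelism, false);

        // A YarnClusterDescriptor would now receive this JobGraph plus its shipFiles
        // (plug-in JARs, config files) and deploy the job to YARN, returning the
        // applicationId used to track the task afterwards.
        System.out.println("Built JobGraph " + jobGraph.getJobID() + " for submission");
    }
}
```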
JobManager side
After the client submits the job, YARN starts the JobManager; the JobManager starts some of its own internal services and builds the ExecutionGraph.
In this process, FlinkX mainly does the following two things:
- Each plug-in rewrites the createInputSplits method of the InputFormat interface to create splits. When the amount of upstream data is large or the data needs to be read with multiple degrees of parallelism, this method assigns a different split to each parallel instance based on the configured shard key (such as an auto-increment primary key id). For example, when reading MySQL with a parallelism of two:
- The SQL read by the first parallel instance is: select * from table where id mod 2=0;
- The SQL read by the second parallel instance is: select * from table where id mod 2=1;
- After the splits are created, getInputSplitAssigner assigns them to each parallel instance in order.
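The splitting idea can be sketched as follows. This is illustrative only, assuming a mod-based shard key; the real FlinkX plug-ins implement the InputFormat methods with considerably more logic:

```java
import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

// Illustrative only: shows how splits could be derived from a shard key.
public class ModSplitSketch {

    // Corresponds to createInputSplits(minNumSplits) in the InputFormat interface:
    // one split per channel, each carrying its own predicate on the shard column.
    public static String[] createSplitFilters(int numChannels, String splitKey) {
        String[] filters = new String[numChannels];
        for (int i = 0; i < numChannels; i++) {
            // channel i reads: select * from table where <splitKey> mod numChannels = i
            filters[i] = splitKey + " mod " + numChannels + " = " + i;
        }
        return filters;
    }

    // Corresponds to getInputSplitAssigner(...): splits are handed out to the
    // parallel instances in order.
    public static InputSplitAssigner assigner(GenericInputSplit[] splits) {
        return new DefaultInputSplitAssigner(splits);
    }

    public static void main(String[] args) {
        for (String filter : createSplitFilters(2, "id")) {
            System.out.println("select * from table where " + filter);
        }
    }
}
```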
TaskManager side
After the TaskManager receives the task scheduled by the JobManager, it starts its own life cycle, which mainly includes the following important stages:
- initialize-operator-states(): loops through all operators of the task and calls the initializeState method of those that implement the CheckpointedFunction interface; in FlinkX these are DtInputFormatSourceFunction and DtOutputFormatSinkFunction. This method is called when the task starts for the first time, and its role is to restore state: when the task fails, the read position can be restored from the last checkpoint so that the task resumes where it left off (a minimal state-restore sketch follows this list), as shown in the following figure:
- open-operators(): calls the open method of all StreamOperators in the OperatorChain, and finally the open method of BaseRichInputFormat, which mainly does the following things:
- Initialize the accumulators that record the numbers of records and bytes read and written;
- Initialize custom metrics;
- Start the rate limiter;
- Initialize the state;
- Open the connection to the data source (depending on the data source, each plug-in implements this separately).
- run(): calls the nextRecord method in InputFormat and the writeRecord method in OutputFormat to process data.
- close-operators(): does some closing operations, such as calling the close methods of InputFormat, OutputFormat, etc., and doing some cleanup work.
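A minimal, hedged sketch of the state-restore idea behind initialize-operator-states(): a source that implements CheckpointedFunction keeps its read offset in operator state, so a restart resumes from the last checkpoint. The class below is illustrative; DtInputFormatSourceFunction's real logic is more elaborate:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class OffsetRestoreSourceSketch extends RichSourceFunction<Long> implements CheckpointedFunction {

    private transient ListState<Long> offsetState;
    private volatile long offset = 0L;
    private volatile boolean running = true;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Called before open(); on recovery the last checkpointed offset is restored here.
        offsetState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("read-offset", Long.class));
        for (Long restored : offsetState.get()) {
            offset = restored;   // resume reading from the last checkpointed position
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(offset); // persist the current read position on each checkpoint
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(offset); // stand-in for the real nextRecord()/writeRecord() work
                offset++;
            }
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```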
That completes the overall life cycle of a StreamTask in the TaskManager. In addition to calling the Flink interfaces as described above, FlinkX also provides the following features.
- Custom accumulators: accumulators collect distributed statistics or aggregated information from user functions and operations. Each parallel instance creates and updates its own accumulator object; the system merges the accumulators of the different parallel instances at the end of the job, and the results can also be pushed to Prometheus, as shown in the figure:
- Support for offline and real-time synchronization: we know that FlinkX is a framework that supports both offline and real-time synchronization. Taking a MySQL data source as an example, let's see how this is implemented.
- Offline task: in the run method of DtInputFormatSourceFunction, the open method of the InputFormat is called to read data records into a resultSet, and the reachedEnd method is then called to determine whether the data in the resultSet has been fully read. Once reading is complete, the close process follows.
- Real-time task: the open method is the same as in the offline case. In reachedEnd, it is determined whether this is a polling task; if so, it enters the interval-polling branch, taking the largest value of the incremental field read in the previous poll as the starting position of the current poll, and then polls again (see the polling sketch after this list). The polling flow chart is as follows:
- Dirty data management and error control: records data that fails to be written to the data source, classifies the cause of the error, and writes it into the configured dirty data table. There are currently four error types: type conversion errors, null pointers, primary key conflicts, and other errors. Error control is based on Flink's accumulators: the number of error records during the run is recorded, and a separate thread checks whether that number has exceeded the configured maximum; if it has, an exception is thrown and the task fails. In this way, tasks with different requirements for data accuracy can apply different error controls. The control flow chart is as follows:
- Rate limiter: some tasks produce data so fast that they put pressure on the downstream database, so rate control needs to be applied at the source end. FlinkX uses a token-bucket style of rate limiting to control the rate (a minimal sketch follows this list). As shown in the figure below, when the rate of data produced by the source reaches a certain threshold, no new data is read. The rate limiter is initialized in the open phase of BaseRichInputFormat.
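The interval-polling branch can be sketched as follows. The class and method names are illustrative, not the actual FlinkX implementation; the point is that the largest incremental value read in one poll becomes the lower bound of the next:

```java
// Illustrative polling loop: after each round, the largest value of the
// incremental column that was read becomes the lower bound of the next round.
public class IntervalPollingSketch {

    private long startLocation = 0L;          // restored from state on recovery
    private final long pollingIntervalMs = 5000L;

    public void pollForever(java.util.function.LongFunction<long[]> readBatchGreaterThan)
            throws InterruptedException {
        while (true) {
            // e.g. "select * from table where id > startLocation order by id"
            long[] ids = readBatchGreaterThan.apply(startLocation);
            if (ids.length > 0) {
                // remember the largest incremental value read in this poll
                startLocation = ids[ids.length - 1];
            }
            Thread.sleep(pollingIntervalMs);  // wait before the next poll
        }
    }
}
```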
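For the source-side rate limit, here is a minimal sketch. FlinkX has its own token-bucket implementation; this sketch uses Guava's RateLimiter (also token-bucket style) purely to show where the limit is applied:

```java
import com.google.common.util.concurrent.RateLimiter;

public class SourceRateLimitSketch {

    private final RateLimiter rateLimiter;

    public SourceRateLimitSketch(double recordsPerSecond) {
        // Created once, e.g. during the open() phase of the input format.
        this.rateLimiter = RateLimiter.create(recordsPerSecond);
    }

    public <T> T readWithLimit(java.util.function.Supplier<T> nextRecord) {
        // Block until a token is available, so the source never exceeds the threshold.
        rateLimiter.acquire();
        return nextRecord.get();
    }
}
```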
The above covers the basic principles of FlinkX data synchronization, but in business scenarios data synchronization is only the first step. Since the current version of FlinkX only covers the E and L in ETL and has no ability to transform or compute data, the generated data needs to flow into the downstream FlinkStreamSQL.
FlinkStreamSQL
FlinkStreamSQL extends real-time SQL on the basis of Flink, mainly extending the join between streams and dimension tables, and it supports all the syntax of native Flink SQL. Currently the FlinkStreamSQL source can only connect to Kafka, so the default upstream data source is Kafka.
Next, let's take a look at how FlinkStreamSQL is built on Flink. Users only need to pay attention to the business SQL code; how the Flink API is called is shielded at the bottom layer. The overall process is basically similar to that of FlinkX introduced above, but the difference lies on the Client side, which mainly includes three parts: SQL parsing, table registration, and SQL execution.
Parse SQL
Here the four kinds of SQL statements written by users are parsed: create function, create table, create view, and insert into. They are encapsulated into a structured SQLTree data structure, which includes a collection of custom functions, a collection of external data source tables, a collection of view statements, and a collection of data writing statements.
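A hedged sketch of what such a parsed structure might look like; the class and field names here are illustrative rather than the actual FlinkStreamSQL types:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative container for the four kinds of parsed statements.
public class SqlTreeSketch {
    public final List<String> createFunctionStatements = new ArrayList<>(); // custom UDFs
    public final List<String> createTableStatements    = new ArrayList<>(); // external source/sink tables
    public final List<String> createViewStatements     = new ArrayList<>(); // intermediate views
    public final List<String> insertIntoStatements     = new ArrayList<>(); // data writing statements

    public void addStatement(String sql) {
        String s = sql.trim().toLowerCase();
        if (s.startsWith("create function"))    createFunctionStatements.add(sql);
        else if (s.startsWith("create table"))  createTableStatements.add(sql);
        else if (s.startsWith("create view"))   createViewStatements.add(sql);
        else if (s.startsWith("insert into"))   insertIntoStatements.add(sql);
        else throw new IllegalArgumentException("Unsupported statement: " + sql);
    }
}
```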
Table registration
With the SQLTree parsed above, the external data sources corresponding to the create table statements in the SQL can be registered as tables in the tableEnv, and the user-defined UDFs can also be registered in the tableEnv.
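As a hedged illustration of this registration step with the public Table API (exact calls vary by Flink version, the connector options are invented for the example, and FlinkStreamSQL builds its tables through its own plug-ins rather than raw DDL):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class RegistrationSketch {

    // Hypothetical UDF just to show function registration.
    public static class ToUpper extends ScalarFunction {
        public String eval(String s) { return s == null ? null : s.toUpperCase(); }
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Register a user-defined function parsed from a "create function" statement.
        tableEnv.createTemporarySystemFunction("to_upper", ToUpper.class);

        // Register an external source table parsed from a "create table" statement
        // (connector options are illustrative).
        tableEnv.executeSql(
                "CREATE TABLE user_events (" +
                "  user_id BIGINT," +
                "  event_time TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'user_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'" +
                ")");
    }
}
```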
Execute SQL
After the data sources are registered as tables, the subsequent insert into SQL statements can be executed. There are two situations when executing SQL:
- If there is no associated dimension table in SQL, execute SQL directly;
- If dimension tables are associated in the SQL: since the dimension table join syntax was not supported in early versions of Flink, we made an extension here; however, since FlinkStreamSQL v1.11 it has been aligned with the community and supports the dimension table join syntax. Depending on the type of dimension table, different association methods are used:
- Full-cache dimension table: takes upstream data as input and uses a RichFlatMapFunction as the query operator. At initialization, the entire dimension table is loaded into memory, then joined with the input data to obtain the widened data, after which a larger table is re-registered for subsequent SQL to use.
- Asynchronous dimension table: takes upstream data as input and uses a RichAsyncFunction as the query operator, with an LRU cache for the queried data; the result is joined with the input data to obtain the widened data, after which a larger table is re-registered for subsequent SQL to use (a sketch of the asynchronous pattern follows this list).
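A hedged sketch of the asynchronous dimension table pattern: a RichAsyncFunction with a simple LRU cache, applied through AsyncDataStream. The lookup, cache policy, and thread-safety handling in FlinkStreamSQL are more sophisticated than this illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncDimJoinSketch extends RichAsyncFunction<Long, Tuple2<Long, String>> {

    // A tiny LRU cache for looked-up dimension rows (capacity is illustrative;
    // access from callback threads is not synchronized in this sketch).
    private transient Map<Long, String> lruCache;

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) {
        lruCache = new LinkedHashMap<Long, String>(1024, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > 1000;
            }
        };
    }

    @Override
    public void asyncInvoke(Long key, ResultFuture<Tuple2<Long, String>> resultFuture) {
        String cached = lruCache.get(key);
        if (cached != null) {
            resultFuture.complete(Collections.singleton(Tuple2.of(key, cached)));
            return;
        }
        // Stand-in for an asynchronous lookup against the dimension store (e.g. MySQL, Redis).
        CompletableFuture.supplyAsync(() -> "dim-value-" + key)
                .thenAccept(value -> {
                    lruCache.put(key, value);
                    resultFuture.complete(Collections.singleton(Tuple2.of(key, value)));
                });
    }

    // Usage: widen the stream, then re-register the result as a table for later SQL.
    public static DataStream<Tuple2<Long, String>> widen(DataStream<Long> input) {
        return AsyncDataStream.unorderedWait(input, new AsyncDimJoinSketch(), 10, TimeUnit.SECONDS, 100);
    }
}
```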
The above is how FlinkStreamSQL differs from FlinkX on the Client side. Since the source side only supports Kafka and uses the community's native Kafka connector, there is no data sharding logic on the JobManager side, and the TaskManager logic is basically similar to that of FlinkX, so it is not introduced again.
Task operation and maintenance
After developing the business with FlinkX and FlinkStreamSQL, the next stage is task operation and maintenance. In the operation and maintenance phase, we mainly monitor task running information, data input and output metrics, data latency, backpressure, and data skew.
Task running information
We know that FlinkStreamSQL is encapsulated on top of FlinkSQL, so when a task is submitted to run, it goes through FlinkSQL parsing, validation, logical planning, logical plan optimization, and physical planning before the task finally runs, and we get the DAG diagram we often see:
However, because FlinkSQL applies many optimizations to the task, we can only see the coarse-grained DAG diagram shown above and cannot visually see the details of what happens inside the sub-DAGs. Therefore, we made some modifications to the original way the DAG graph is generated, so that we can intuitively see what happens in each operator and each parallel instance within the sub-DAG. With the detailed DAG graph, some of the other monitoring dimensions can be displayed intuitively, such as data input and output, latency, backpressure, and data skew, so problems can be located precisely when they occur, as shown in the following figure:
After understanding the above structure, let's take a look at how it is implemented. We know that when the Client submits a task, a JobGraph is generated; the taskVertices collection in the JobGraph encapsulates the complete information of the figure above. We serialize the taskVertices to JSON, combine it with the LatencyMarker and related metrics, render the figure above on the front end, and generate the corresponding alerts.
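A hedged sketch of extracting per-vertex information from the JobGraph for the detailed DAG view; the real implementation carries far more metadata and wires in the LatencyMarker-based metrics:

```java
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;

public class DagJsonSketch {

    // Builds a small JSON string from the vertices of a JobGraph; the front end
    // combines this structure with latency and throughput metrics to render the
    // detailed DAG and drive alerting.
    public static String toJson(JobGraph jobGraph) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (JobVertex vertex : jobGraph.getVertices()) {
            if (!first) {
                sb.append(",");
            }
            first = false;
            sb.append("{\"id\":\"").append(vertex.getID())
              .append("\",\"name\":\"").append(vertex.getName())
              .append("\",\"parallelism\":").append(vertex.getParallelism())
              .append("}");
        }
        return sb.append("]").toString();
    }
}
```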
In addition to the DAG above, there are also custom metrics, data latency collection, and so on, which are not introduced in detail here. Interested readers can refer to the FlinkStreamSQL project.
Use case
After the above introduction, let's look at an actual case on the platform. The following shows a complete case: FlinkX is used to synchronize new user data from MySQL to Kafka in real time, then FlinkStreamSQL consumes Kafka to compute the number of new users per minute in real time, and the output is written to downstream MySQL for business use.
Real-time synchronization of new MySQL data
Real-time calculation of the number of new users per minute
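As a hedged illustration of this computation, the per-minute new-user count could be expressed with a tumbling window in Flink SQL as below. Table names, fields, and connector options are invented for the example, and the FlinkStreamSQL syntax of the time differed slightly from current Flink SQL:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class NewUsersPerMinuteSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Source: the Kafka topic that FlinkX fills with new-user records (options illustrative).
        tableEnv.executeSql(
                "CREATE TABLE new_users (" +
                "  user_id BIGINT," +
                "  register_time TIMESTAMP(3)," +
                "  WATERMARK FOR register_time AS register_time - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'new_users'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json')");

        // Sink: the downstream MySQL table consumed by the business (options illustrative).
        tableEnv.executeSql(
                "CREATE TABLE new_users_per_minute (" +
                "  window_start TIMESTAMP(3)," +
                "  user_cnt BIGINT" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://localhost:3306/report'," +
                "  'table-name' = 'new_users_per_minute')");

        // Per-minute tumbling-window count of new users.
        tableEnv.executeSql(
                "INSERT INTO new_users_per_minute " +
                "SELECT TUMBLE_START(register_time, INTERVAL '1' MINUTE) AS window_start, " +
                "       COUNT(DISTINCT user_id) AS user_cnt " +
                "FROM new_users " +
                "GROUP BY TUMBLE(register_time, INTERVAL '1' MINUTE)");
    }
}
```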
Running information
The overall DAG intuitively displays the multiple indicators mentioned above.
Opening the detailed DAG diagram shows the multiple indicators inside the sub-DAGs.
The above is the overall architecture and some key technical points of the Kangaroo Cloud real-time computing platform based on Flink. If there are any shortcomings, you are welcome to point them out.