
This article is compiled from the talk "The Application and Practice of Apache Flink in Autohome", shared at Flink Forward Asia 2020 by Di Xingxing, head of the real-time computing platform at Autohome. The main contents include:

  1. Background and current situation
  2. AutoStream platform
  3. Real-time ecological construction based on Flink
  4. Follow-up planning


1. Background and current situation

1. The first stage

Before 2019, most of the real-time business of Autohome was running on Storm. As the early mainstream real-time computing engine, Storm captured a large number of users with its simple Spout and Bolt programming models and the stability of the cluster itself. We built the Storm platform in 2016.


With the increasing demand for real-time computing and the gradual growth of data volume, Storm's shortcomings in development and maintenance costs became more and more prominent. Here are a few pain points:

  1. High development cost

    We have always used the Lambda architecture: T+1 offline data is used to correct the real-time data, and the offline data ultimately prevails, so the real-time calculation logic must be fully consistent with the offline logic. The requirement document for real-time data development is the offline SQL, and the core work of real-time developers is to translate that offline SQL into Storm code. Although some general-purpose Bolts were encapsulated to simplify development, accurately translating hundreds of lines of offline SQL into code was still very challenging. Each run had to go through a cumbersome cycle of packaging, uploading, and restarting, so debugging costs were high.

  1. Low computational efficiency

    Storm has poor support for state, so intermediate state usually has to be maintained in external KV storage such as Redis or HBase, and we relied heavily on Redis. For example, in the common scenario of computing UV, the easiest way is to use the Redis SADD command to determine whether a uid already exists, but this brings heavy network IO; and if a big promotion or marketing activity is not reported to us in advance and traffic suddenly doubles, Redis memory can easily fill up and catch the operations team off guard. In addition, the throughput of Redis also limits the throughput of the entire job.

  1. Difficult to maintain and manage

    Because jobs are developed by writing Storm code, it is difficult to analyze metadata and data lineage; at the same time, the code is hard to read, the calculation logic is opaque, and the cost of business handover is high.

  1. Not friendly to the data warehouse team

    The data warehouse team directly faces business requirements. They are more familiar with the Hive-based SQL development model and are usually not good at developing Storm jobs, so some requirements that were originally real-time could only be delivered as T+1 data.

At this stage we supported only the most basic real-time computing requirements. Because the development threshold was relatively high, many real-time services were implemented by our platform developers ourselves, and both platform work and data development were quite scattered.

2. The second stage


We began investigating the Flink engine in 2018; its relatively complete SQL support and built-in state support attracted us. After study and research, we began to design and develop a Flink SQL platform in early 2019 and launched the AutoStream 1.0 platform in mid-2019. When the platform went online, it was adopted by the data warehouse team, the monitoring team, and the operation and maintenance team. Users were able to adopt it quickly mainly for the following reasons:

  1. Low development and maintenance costs: Most of Autohome's real-time tasks can be implemented with Flink SQL + UDF. The platform provides commonly used Sources and Sinks as well as commonly used UDFs, and users can also write their own UDFs. Development is completed in a "SQL + configuration" manner, which covers most needs. For custom tasks, we provide a convenient SDK to help users quickly develop custom Flink jobs. Platform users are no longer just professional data developers: ordinary developers, testers, and operations staff can complete daily real-time data development on the platform after some basic learning, which turns the platform into an enabler. Data assets become manageable: since SQL itself is structured, by analyzing a job's SQL and combining it with the DDL of its source and sink, we can easily know the job's upstream and downstream and naturally retain data lineage.
  1. High performance: Flink can perform calculations entirely based on state (memory and disk). Compared with the previous approach of relying on external storage for calculations, performance is greatly improved. In the pressure test for the 818 promotion, the reworked jobs could easily support real-time calculations on dozens of times the original traffic, with very good horizontal scaling.
  1. Comprehensive monitoring and alarms: users host tasks on the platform, the platform is responsible for keeping the tasks alive, and users can focus on the task's own logic. For SQL tasks, SQL is highly readable and easy to maintain; for custom tasks, development based on our SDK lets users focus more on sorting out business logic. Whether SQL tasks or SDK tasks, we have embedded a large number of monitoring metrics and connected them to the alarm platform, so that users can quickly discover, analyze, locate, and fix problems, improving stability.
  1. Enabling the business: the platform supports the layered data warehouse model and provides good SQL support, so data warehouse engineers can apply their offline data warehouse construction experience to real-time data warehouse construction using SQL. Since the platform went online, the data warehouse team has gradually begun to take on real-time computing requirements.


Pain points:

  1. Ease of use needs to be improved. For example, users cannot manage UDFs on their own; they can only use the platform's built-in UDFs or send the jar package to the platform administrator, who uploads it manually.
  1. With the rapid growth in the number of jobs on the platform, on-call costs are very high. First of all, we often face basic questions from new users:

    1. How to use the platform;
    2. Problems encountered during development, such as why packaging reports an error;
    3. How to use the Flink UI;
    4. The meaning of monitoring charts and how to configure alarms.

There are also some questions that are not easy to answer quickly:

  1. Jar package conflict;
  2. Why is Kafka consumption delayed;
  3. Why did the task report an error.

Latency issues in particular: for common causes such as data skew, GC, and back pressure, we can directly guide users to the Flink UI and monitoring charts, but sometimes we have to log in to the servers to inspect jmap, jstack, and other information manually, and sometimes we need to generate flame graphs to help users locate performance issues.

In the initial stage, we did not cooperate with the operation team. It was our developers who dealt with these issues directly. Although a large amount of documentation was added during the period, the overall on-call cost was still very high.

  1. When Kafka or Yarn fails, there is no quick recovery plan, which falls short for businesses that require strong availability guarantees. As we all know, no environment or component is always stable and failure-free, so when a major failure occurs, a plan to quickly restore the business is required.
  1. Resource usage is not properly controlled, and there is serious waste. As the number of users developing tasks on the platform grows, the number of tasks on the platform also keeps increasing. Some users cannot control their cluster resource usage well and often request far more resources than needed, resulting in low job efficiency or even idle resources, i.e. waste.

In the AutoStream 1.0 stage, the SQL-based development method greatly lowered the threshold of real-time development: each business side could develop its own real-time services, and data warehouse engineers could start taking on real-time work after some simple learning. Our platform team was freed from a large number of business requirements and could concentrate on platform work.

3. Current stage


In view of the above aspects, we have made the following upgrades in a targeted manner:

  1. Introduced a Jar Service: users can upload UDF jar packages themselves and reference them in SQL snippets, realizing self-management of UDFs. Custom jobs can also reference Jars managed by the Jar Service; when multiple jobs share the same Jar, the user only needs to configure the Jar Service path of the jar package in the job, avoiding the tedious re-upload of the Jar for every release;
  2. Self-service diagnosis: we developed functions such as dynamic adjustment of log levels and self-service viewing of flame graphs, helping users locate problems by themselves and reducing our daily on-call costs;
  3. Job health check: each Flink job is analyzed and scored along multiple dimensions, with corresponding suggestions given for every low-scoring item;
  4. Fast disaster recovery at the Flink job level: we built two YARN environments, each with its own HDFS, and Checkpoint data is replicated in both directions between the two HDFS clusters periodically via SNAPSHOTs. We also added a cluster-switching function to the platform, so that if one YARN cluster becomes unavailable, users can select a Checkpoint from the standby cluster on the platform by themselves;
  5. Kafka multi-cluster architecture support: our self-developed Kafka SDK supports fast switching of Kafka clusters;
  6. Integration with the budget system: the resources occupied by each job are attributed directly to a budget team, which to some extent ensures that resources will not be taken up by other teams; at the same time, each team's budget administrator can view budget usage details and understand which businesses within the team the budget supports.

At present, users have become familiar with the platform. With the launch of self-service health checks and self-diagnosis, our daily on-call frequency is gradually decreasing, and platform construction has gradually entered a virtuous circle.

4. Application Scenarios


The data used by Autohome for real-time calculations is mainly divided into three categories:

  1. Client logs, internally referred to as click-stream logs, including startup logs, duration logs, PV logs, click logs, and various event logs reported by the client. These are mainly user behavior logs and form the basis of the traffic wide tables, the UAS system, and real-time user profiles in our real-time data warehouse. On top of them we also support online services such as smart search and smart recommendation; at the same time, the basic traffic data is used to support traffic analysis and real-time effect statistics for each business line, supporting daily operational decisions.
  2. Server logs, including nginx logs, logs generated by various back-end applications, and logs of various middleware. These logs are mainly used for health monitoring and performance monitoring of back-end services.
  3. Real-time change records of business databases, mainly of three types: MySQL binlog, SQL Server CDC, and TiDB TiCDC data. Based on these real-time change records, we have built basic services such as the content center and the resource pool by abstracting and standardizing various content data; there are also some real-time statistics scenarios over business data with simple logic, whose results are used for real-time dashboards, business compasses, and other data presentation.

The above three types of data are written to Kafka clusters in real time, computed by Flink clusters for different scenarios, and the result data is written to Redis, MySQL, Elasticsearch, HBase, Kafka, Kylin, and other engines to support upper-layer applications.

Some application scenarios are listed below:


5. Cluster size

At present, our Flink clusters have 400+ servers, deployed on YARN (80%) and Kubernetes; 800+ jobs are running, the daily processing volume is about 1 trillion records, and the peak throughput is 20 million records per second.


2. AutoStream platform

1. Platform Architecture

[Figure: overall architecture of the AutoStream platform]

The figure above shows the current overall architecture of the AutoStream platform, which mainly consists of the following parts:

  1. AutoStream core System

    This is the core service of our platform. It is responsible for the integration of metadata services, Flink client services, Jar management services, and interactive result query services, and exposes platform functions to users through the front-end page.

It mainly includes modules such as SQL and Jar job management, database table information management, UDF management, operation records and historical version management, health checks, self-diagnosis, and alarm management. It also provides the ability to integrate with external systems, allowing other systems to manage database table information, SQL job information, and job start/stop operations through its interfaces. Task life cycle management and scheduling based on Akka provide efficient, simple, and low-latency operation guarantees and improve user efficiency and ease of use.

  1. Metadata Service (Catalog-like Unified Metastore)

    It mainly corresponds to the back-end implementation of the Flink Catalog. In addition to basic database and table information management, it supports table-granularity access control and, combined with our own characteristics, user-group-level authorization.

At the bottom layer we provide a Plugin Catalog mechanism, which can integrate Flink's existing catalog implementations and also makes it convenient to embed our own custom catalogs. Through the plugin mechanism, HiveCatalog, JdbcCatalog, and others can easily be reused, thereby ensuring consistency of the table life cycle.

At the same time, the metadata service is responsible for parsing the DML statements submitted by users and identifying the tables the current job depends on, which is used in job analysis and the submission process and also records data lineage.

  1. Jar Service

    The various SDKs provided by the platform are managed uniformly on the Jar Service. At the same time, users can also submit custom Jars, UDF jars, etc. on the platform to the Jar Service for unified management, and then reference them through configuration or DDL in the job.

  1. Flink Client Service (Customed Flink Job Client)

    Responsible for converting jobs on the platform into Flink jobs and submitting them to Yarn or Kubernetes. We have abstracted Yarn and Kubernetes at this layer, unifying the behavior of the two scheduling frameworks and exposing unified interfaces and standardized parameters. Masking the differences between Yarn and Kubernetes lays a good foundation for seamlessly switching Flink jobs between the two frameworks.

The dependencies of each job are not the same. In addition to managing the basic dependencies, we also need to support per-job dependencies, such as different versions of the SQL SDK and the Jars and UDFs uploaded by users, so the submission phases of different jobs need to be isolated.

We adopt the Jar service + process isolation method. By connecting with Jar Service, we select the corresponding Jar according to the type and configuration of the job, and submit it for execution in a separate process to achieve physical isolation.

  1. Result Cache Service

    It is a simple caching service for online debugging scenarios during the development phase of SQL jobs. When we analyze the user's SQL statements, the result set of each SELECT statement is stored in the cache service; the user can then view the result data of a SQL statement in real time by selecting its serial number on the platform (each complete SELECT statement corresponds to a serial number), which is convenient for development and problem analysis.

  1. Built-in Connectors (Source & Sink)

    The rightmost part is mainly the implementation of various Sources and Sinks. Some are reusing connectors provided by Flink, and some are connectors developed by ourselves.

For each type of connector, we have added the necessary metric and configured it as a separate monitoring chart, which is convenient for users to understand the operation status of the job, and also provides a data basis for locating problems.

2. SQL-based development process

On the basis of the above functions provided by the platform, users can quickly realize the development of SQL jobs:

  1. Create a SQL task;
  2. Write DDL declaration Source and Sink;
  3. Write DML to complete the realization of main business logic;
  4. Check the results online. If the data meets expectations, add an INSERT INTO statement and write it to the specified sink.

图片

By default, the platform will save the record of every change of SQL. Users can view the historical version online. At the same time, we will record various operations for the job. During the job maintenance stage, we can help users trace the change history and locate problems.

The following is a demo that counts the PV and UV of the current day:

[Figure: Flink SQL demo for counting daily PV and UV]
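
A rough sketch of what such a job looks like on the platform is shown below. The table names, fields, and connector options here are illustrative assumptions, not the platform's actual demo:

```sql
-- Source: click-stream topic (illustrative DDL, standard Flink Kafka connector options)
CREATE TABLE click_log (
    uid STRING,
    ts  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'click_log',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'pv_uv_demo',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- Sink: daily PV/UV results written to MySQL in upsert mode (primary key = dt)
CREATE TABLE pv_uv_result (
    dt STRING,
    pv BIGINT,
    uv BIGINT,
    PRIMARY KEY (dt) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://mysql:3306/rt_dw',
    'table-name' = 'pv_uv_result'
);

-- DML: PV is the event count, UV is the distinct uid count per day;
-- the result row for the current day is continuously updated
INSERT INTO pv_uv_result
SELECT
    DATE_FORMAT(ts, 'yyyy-MM-dd') AS dt,
    COUNT(*)                      AS pv,
    COUNT(DISTINCT uid)           AS uv
FROM click_log
GROUP BY DATE_FORMAT(ts, 'yyyy-MM-dd');
```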

3. Metadata management based on Catalog


The main content of metadata management:

  1. Access control: in addition to basic database and table information management, it supports table-granularity access control and, combined with our own characteristics, user-group-level authorization;
  2. Plugin Catalog mechanism: multiple other catalog implementations can be combined, reusing existing catalogs;
  3. Unified table life cycle: users can choose to keep the life cycle of a table on the platform and in the underlying storage unified, avoiding separate maintenance on both sides and repeated table creation;
  4. Full compatibility between old and new versions: AutoStream 1.0 did not introduce a separate Metastore service, and its DDL SQL parsing module was a self-developed component, so when building the Metastore service we had to consider compatibility with historical jobs and historical database table information.

    1. For database table information, the new Metastore converts both old- and new-format table information into a unified storage format at the bottom layer, thereby ensuring compatibility.
    2. For jobs, we use an abstract interface with two implementations, V1Service and V2Service, to ensure compatibility of old and new jobs at the user level.

The following is a schematic diagram of the interaction between several modules and Metastore:

[Figure: interaction between the platform modules and the Metastore]

4. UDXF Management

We introduced the Jar Service to manage various Jars, including user custom jobs, the platform's internal SDK components, UDXFs, and so on. Based on the Jar Service, UDXF self-management is easy to implement. In the on-Kubernetes scenario, we provide a unified image; after a Pod starts, the corresponding Jar is downloaded from the Jar Service into the container to support job startup.

If the SQL submitted by the user contains Function DDL, we parse the DDL in the Job Client Service and download the corresponding Jar locally.

To avoid dependency conflicts with other jobs, we start a separate sub-process each time to perform the job submission, and the UDXF Jar is added to its classpath. We have made some modifications to Flink so that the Jar is uploaded to HDFS when the job is submitted; at the same time, the AutoSQL SDK registers the UDF according to the function name and class name for the current job.
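
As a hedged illustration of what such a Function DDL and its use can look like, here is the standard Flink CREATE FUNCTION syntax; the function name, class name, and table name are made up for the example, and how the platform resolves the jar from the Jar Service is a platform-specific detail not shown here:

```sql
-- Register a UDF by its implementing class (jar resolution is handled by the platform)
CREATE FUNCTION json_extract AS 'com.autohome.udf.JsonExtract' LANGUAGE JAVA;

-- Use the UDF in a query against an assumed raw-message stream table
SELECT json_extract(message, '$.uid') AS uid
FROM raw_event_stream;
```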


5. Monitoring alarm and log collection

Thanks to Flink's comprehensive metric mechanism, we can easily add new metrics. For connectors we have embedded a wealth of metrics and configured a default monitoring dashboard, through which users can view monitoring charts for CPU, memory, JVM, network transmission, checkpoints, and the various connectors. The platform is also connected to the company's cloud monitoring system and automatically generates a default alarm strategy that monitors key indicators such as job survival status and consumption delay. Users can modify the default alarm strategy in the cloud monitoring system and add new alarm items to achieve personalized monitoring and alarming.

Logs are written to the Elasticsearch cluster through the cloud Filebeat component, and Kibana is provided for users to query them.


The overall monitoring alarm and log collection framework is as follows:

[Figure: overall monitoring, alarm, and log collection architecture]

6. Health Check Mechanism

With the rapid growth in the number of jobs, there is a lot of unreasonable resource usage, such as the waste of resources mentioned earlier. Most of the time users are busy meeting new requirements and supporting new services, and rarely go back to evaluate whether a job's resource allocation is reasonable and optimize it. Therefore, the platform planned a cost evaluation model, now called the health check mechanism: the platform performs a multi-dimensional health scoring of every task each day, and users can check each task's score and its score curve over the last 30 days on the platform at any time.

Low-scoring jobs are flagged when users log on to the platform, and regular emails are sent to remind users to optimize and rectify them. After optimizing a job, users can proactively trigger a re-score to see the effect of the optimization.


We introduced a multi-dimensional, weighted scoring strategy that analyzes and evaluates indicators such as CPU and memory usage, idle slots, GC status, Kafka consumption delay, and records processed per second per core, combined with the job's computation topology, and finally produces a comprehensive score.

Each low score item will display the reason for the low score and the reference range, and display some guidance suggestions to assist users in optimization.

We also added a new metric that uses a number from 0% to 100% to reflect TaskManager CPU utilization, so users can intuitively evaluate whether CPU is being wasted.


The following is the general process of job scoring: first, we collect and organize the basic information and metrics of the running job; then we apply the rules we have defined to obtain per-dimension scores and basic suggestions; finally, the scores and suggestions are integrated and judged comprehensively to produce an overall score and a final report. Users can view the report on the platform, and for jobs with low scores we send an alarm to the job's owner.
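
Purely as an illustration of the weighted multi-dimensional scoring idea (the dimensions, weights, and table name below are invented for the example, not the platform's actual model), the final score can be thought of as something like:

```sql
-- Hypothetical weighted aggregation of per-dimension scores into one health score
SELECT
    job_id,
    0.3 * cpu_score
  + 0.2 * memory_score
  + 0.2 * gc_score
  + 0.2 * kafka_lag_score
  + 0.1 * idle_slot_score AS health_score
FROM job_dimension_scores;
```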


7. Self-diagnosis

As mentioned in the pain points above, when users needed to locate online problems they could only turn to our platform team, which caused a heavy on-call workload and a poor user experience. In view of this, we launched the following functions:

  1. Dynamically modify the log level: we borrowed Storm's approach of modifying log levels and implemented a similar function in Flink. By extending the REST API and RPC interfaces, we support changing a specified Logger to a given log level and setting an expiration time; when it expires, the level of the modified Logger is restored to INFO;
  2. Self-service viewing of thread stacks and heap memory information: the Flink UI already supports viewing thread stacks (jstack) online, and we directly reuse that interface; we added an extra interface for viewing heap memory (jmap) so that users can inspect it online;
  3. Online generation and viewing of flame graphs: flame graphs are a great tool for locating program performance problems. We used Alibaba's arthas component to add the ability to view flame graphs online for Flink, so that users can quickly locate performance bottlenecks when they encounter performance problems.


8. Fast disaster recovery based on Checkpoint replication


When real-time computing serves important business scenarios, a single Yarn cluster failure that cannot be recovered in the short term may have a major impact on the business.

In this context, we built a Yarn multi-cluster architecture: two independent Yarn clusters, each with its own independent HDFS environment, and Checkpoint data is regularly replicated between the two HDFS clusters. At present, the checkpoint replication delay is stable within 20 minutes.

At the platform level, we expose the cluster-switching function directly to users. Users can view the checkpoint replication status online, select an appropriate checkpoint (or choose not to restore from a checkpoint) to switch clusters, and then restart the job, achieving relatively smooth migration of jobs between clusters.

3. Real-time ecological construction based on Flink

The core scenario of the AutoStream platform is to serve real-time computing developers and make real-time development simple, efficient, monitorable, and easy to operate and maintain. As the platform gradually matured, we began to explore how to reuse it and how to apply Flink in more scenarios. Reusing AutoStream has the following advantages:

  1. Flink itself is an excellent distributed computing framework, with high computing performance, good fault tolerance and mature state management mechanism, the community is booming, and the function and stability are guaranteed;
  2. AutoStream has a complete monitoring and alarm mechanism. The job runs on the platform and does not need to be connected to the monitoring system separately. At the same time, Flink is very friendly to Metric support, and it is easy to add new Metrics;
  3. A large amount of accumulated expertise and operating experience: through more than two years of platform construction, we have achieved fairly complete management of the full life cycle of Flink jobs on AutoStream and built basic components such as the Jar Service; with simple upper-layer interface wrapping, it can be connected to other systems to give them real-time computing capabilities;
  4. Support Yarn and Kubernetes deployment.


Based on the above points, when building other systems we give priority to reusing the AutoStream platform and integrate with it through its interfaces: the entire life cycle of the underlying Flink jobs is fully hosted on the AutoStream platform, and each system only needs to focus on implementing its own business logic.

The AutoDTS (data access and distribution) and AutoKafka (Kafka cluster replication) systems in our team are currently built on top of AutoStream. Taking AutoDTS as an example, the integration works as follows:

  1. Make the tasks Flink-based: the access and distribution tasks on AutoDTS all run as Flink jobs;
  2. Integrate with the AutoStream platform, calling its interfaces to create, modify, start, and stop Flink jobs; the Flink job can be either a Jar job or a SQL job;
  3. The AutoDTS platform builds personalized front-end pages and form data according to business scenarios. After a form is submitted, the form data is stored in MySQL; at the same time, the job information, Jar package address, and other information are assembled into the format defined by the AutoStream interface, and a Flink task is automatically generated on the AutoStream platform through an interface call, whose ID is saved as well;
  4. To start an AutoDTS access task, the AutoStream interface is called directly to start the job.

1. AutoDTS data access and distribution platform

The AutoDTS system mainly includes two parts of functions:

  1. Data access: Write the change data (Change log) in the database to Kafka in real time;
  2. Data distribution: The data connected to Kafka is written to other storage engines in real time.

1.1 AutoDTS data access

The following is the architecture diagram of data access:

[Figure: AutoDTS data access architecture]

We maintain a Flink-based data access SDK and define a unified JSON data format, so that after MySQL binlog, SQL Server, and TiDB change data is written to Kafka, the data format is consistent. Downstream businesses develop against this unified format and do not need to care about the type of the original business database.

When data is connected to a Kafka topic, the topic is automatically registered as a flow table on the AutoStream platform, which is convenient for users.

Building data access on Flink has an additional benefit: based on Flink's exactly-once semantics, exactly-once data access can be achieved at low cost, which is a necessary condition for supporting services with high requirements on data accuracy.

At present we are working on ingesting the full data of business tables into Kafka topics. Based on Kafka's compacted topics, a topic can hold both the existing full data and the incremental changes, which is very friendly for data distribution scenarios: today, to synchronize data to another storage engine in real time, you first need a one-off full load driven by the scheduling system and can only then start the real-time distribution task for the change data; with a compacted topic, the separate full-load step can be omitted. Flink 1.12 already supports compacted topics through the upsert-kafka connector [1].

[1] https://cwiki.apache.org/confluence/display/Flink/FLIP-149%3A+Introduce+the+upsert-kafka+Connector
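
For reference, a minimal upsert-kafka table declaration (Flink 1.12 syntax; the topic and field names are illustrative) looks roughly like this:

```sql
-- A compacted topic exposed as an upsert table: the Kafka message key is the primary key,
-- so the topic can carry both the existing full data and subsequent changes.
CREATE TABLE dts_user (
    id   BIGINT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'dts.demo.user',
    'properties.bootstrap.servers' = 'kafka:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);
```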

The following is a data sample:

[Figure: sample record in the unified JSON format]

The flow table registered on the platform is schemaless by default, and users can use JSON-related UDFs to extract field data from it.


The following is an example of using a flow table:

[Figure: example of querying a flow table]
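
A rough sketch of such a query is shown below, assuming a schemaless flow table whose raw change-log JSON sits in a single column and a hypothetical JSON extraction UDF (the table, column, and function names are illustrative, not the platform's actual ones):

```sql
-- Extract business fields from the raw change-log JSON of an (assumed) schemaless flow table
SELECT
    json_extract(message, '$.after.id')   AS id,
    json_extract(message, '$.after.name') AS name,
    json_extract(message, '$.type')       AS op_type  -- e.g. INSERT / UPDATE / DELETE
FROM dts_demo_user_changelog;
```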

1.2 AutoDTS data distribution


We already know that data connected to Kafka can be used as a flow table, and a data distribution task is essentially writing the data of this flow table to another storage engine. Since the AutoStream platform already supports a variety of Table Sinks (Connectors), we only need to assemble SQL based on the type and address of the downstream storage filled in by the user to implement data distribution.

By directly reusing the Connector, the duplication of development work is avoided to the greatest extent.

The following is a SQL example corresponding to a distribution task:

[Figure: SQL generated for a distribution task]
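
A distribution task is essentially an INSERT INTO from the flow table into the chosen sink, roughly along the lines sketched below; the sink table, flow table, and connector options are illustrative assumptions:

```sql
-- Sink table assembled from the storage type and address the user fills in (illustrative)
CREATE TABLE es_user_sink (
    id   BIGINT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'elasticsearch-7',
    'hosts' = 'http://es:9200',
    'index' = 'demo_user'
);

-- The distribution task itself: continuously copy the flow table into the downstream storage
INSERT INTO es_user_sink
SELECT id, name
FROM dts_user;
```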

2. Kafka multi-cluster architecture

In the actual application of Kafka, some scenarios need to be supported by Kafka multi-cluster architecture. Here are a few common scenarios:

  • Data redundancy and disaster recovery: data is replicated in real time to a standby cluster, so that when one Kafka cluster becomes unavailable, applications can switch to the standby cluster and quickly restore the business;
  • Cluster migration: when a data center contract expires or when migrating to the cloud, the cluster needs to be migrated, and the entire cluster's data has to be copied to the cluster in the new data center so that the business migration is relatively smooth;
  • Read-write separation: most Kafka usage involves far more reads than writes, and to ensure stable writes a read-write separated Kafka cluster architecture can be built.

We have built a Kafka multi-cluster architecture; two main parts of it are related to Flink:

  1. The data replication program between Kafka clusters runs in the Flink cluster;
  2. The Flink Kafka Connector was transformed to support fast switching of Kafka clusters.

2.1 Overall architecture

[Figure: Kafka multi-cluster overall architecture]

Let's first look at data replication between Kafka clusters, which is the basis for building the multi-cluster architecture. We use MirrorMaker2 to implement data replication, reworked as an ordinary Flink job that runs in the Flink cluster.

We introduced the Route Service and the Kafka SDK so that clients can quickly switch the Kafka cluster they access.

The client needs to rely on the Kafka SDK released by us, and the bootstrap.servers parameter is no longer specified in the configuration, but the cluster.code parameter is set to declare the cluster that it wants to access. The SDK will access the Route Service to obtain the real address of the cluster according to the cluster.code parameter, and then create a Producer/Consumer to start producing/consuming data.

The SDK will monitor the changes in routing rules. When you need to switch clusters, you only need to switch the routing rules in the Route Service background. When the SDK finds that the routing cluster has changed, it will restart the Producer/Consumer instance and switch to the new cluster.

If the consumer switches clusters, the offsets of a topic in Cluster1 and Cluster2 differ, so it needs to obtain the offsets of the current consumer group in Cluster2 through the Offset Mapping Service and then start consuming from those offsets, achieving a relatively smooth cluster switch.

2.2 Data replication between Kafka clusters

We use MirrorMaker2 to implement data replication between clusters. MirrorMaker2 was introduced in Kafka version 2.4. The specific features are as follows:

  • Automatically recognizes new Topics and Partitions;
  • Automatically synchronizes Topic configuration to the target cluster;
  • Automatically synchronizes ACLs;
  • Provides an offset conversion tool: given the source cluster, target cluster, and group information, it can obtain the corresponding offsets of the group in the target cluster;
  • Supports extensible black/white list strategies, with flexible customization that takes effect dynamically.

    clusters = primary, backup
    primary.bootstrap.servers = vip1:9091
    backup.bootstrap.servers = vip2:9092
    primary->backup.enabled = true
    backup->primary.enabled = true

This configuration sets up two-way data replication between the primary and backup clusters. Data in topic1 of the primary cluster is copied to the primary.topic1 topic in the backup cluster; the topic naming rule in the target cluster is sourceCluster.sourceTopicName, and a custom naming strategy can be provided by implementing the ReplicationPolicy interface.


2.3 Topics related to MirrorMaker2

  • Topics in the source cluster

    heartbeats: stores heartbeat data;

    mm2-offset-syncs.targetCluster.internal: stores the correspondence between offsets in the source cluster (upstreamOffset) and in the target cluster (downstreamOffset).

  • Topics in the target cluster

    mm2-configs.sourceCluster.internal: built into the Connect framework, used to store connector configuration;

    mm2-offsets.sourceCluster.internal: built into the Connect framework, used to store the offsets currently processed by the WorkerSourceTask; in the MM2 scenario it records up to which offset of each source-cluster topic partition the data has been synchronized, which is similar in spirit to Flink's checkpoint;

    mm2-status.sourceCluster.internal: built into the Connect framework, used to store connector status.

The above three topics are handled by the KafkaBasedLog utility class in the Connect runtime module, which reads and writes compacted topic data; here MirrorMaker2 effectively uses the topics as KV storage.

sourceCluster.checkpoints.internal: records the offsets of the sourceCluster consumer groups mapped into the current cluster. MM2 periodically reads the offsets committed by consumer groups for each topic from the source Kafka cluster and writes them to the sourceCluster.checkpoints.internal topic of the target cluster.


2.4 Deployment of MirrorMaker2

The following is the running process of a MirrorMaker2 job: when a data replication job is created on the AutoKafka platform, the AutoStream platform interface is called and an MM2-type job is created accordingly; when the job is started, the AutoStream interface is called again to submit the MM2 job to the Flink cluster.

[Figure: running process of the MirrorMaker2 job]

2.5 Routing Service

The Route Service is responsible for processing the client's routing request, matching appropriate routing rules according to the client's information, and returning the final routing result, which is the cluster information, to the client.

It supports flexible configuration of routing rules based on cluster name, Topic, Group, ClientID, and custom client parameters.

The following example routes the consumer whose Flink job ID is 1234 to the cluster_a1 cluster.

[Figure: routing rule example]

2.6 Kafka SDK

Native kafka-clients cannot communicate with the Route Service, so clients need to rely on the Kafka SDK (developed in-house at Autohome) to communicate with the Route Service and achieve dynamic routing.

The Kafka SDK implements the Producer and Consumer interfaces and is essentially a proxy for kafka-clients, so the business can introduce it with few changes.

After the business relies on the Kafka SDK, the Kafka SDK will be responsible for communicating with the Route Service and monitoring routing changes. When the routing cluster is found to change, it will close the current Producer/Consumer, create a new Producer/Consumer, and access the new cluster.

In addition, the Kafka SDK is responsible for reporting Producer and Consumer metrics to the Prometheus of the cloud monitoring system; by viewing the platform's pre-configured dashboards, the production and consumption status of the business can be clearly seen.

At the same time, the SDK will collect some information, such as application name, IP port, process number, etc., which can be found on the AutoKafka platform to facilitate us and users to locate problems together.


2.7 Offset Mapping Service

When a Consumer's route changes and the cluster is switched, the situation is a bit more complicated: MirrorMaker2 first consumes data from the source cluster and then writes it to the target cluster, so the same record can be written to the same partition of the target topic but with an offset different from that in the source cluster.

To handle this offset inconsistency, MirrorMaker2 consumes the __consumer_offsets data of the source cluster, attaches the corresponding offsets in the target cluster, and writes the result to the sourceCluster.checkpoints.internal topic of the target cluster.

At the same time, the mm2-offset-syncs.targetCluster.internal topic of the source cluster records the mapping between source-cluster and target-cluster offsets. Combining these two topics, we built the Offset Mapping Service to convert offsets to the target cluster.

Therefore, when the Consumer needs to switch clusters, it will call the Offset Mapping Service interface to obtain the offsets of the target cluster, and then actively seek to these locations to start consumption, thus achieving relatively smooth cluster switching.


2.8 Integration of Flink and Kafka multi-cluster architecture

Since the Kafka SDK is compatible with the usage of kafka-clients, users only need to replace the dependencies, and then set parameters such as cluster.code and Flink.id.

When a Producer/Consumer cluster switch occurs, a new Producer/Consumer instance is created but Kafka's metrics are not re-registered, so metric data can no longer be reported normally. We added an unregister method to the AbstractMetricGroup class, and when the Producer/Consumer switching event is observed we simply re-register the Kafka metrics.

So far we have completed Flink's support for Kafka's multi-cluster architecture.


4. Follow-up planning


  1. At present, most of the data statistics scenarios we support are based on traffic data or user behavior data, which do not have high requirements for exactly-once semantics. With the community's gradually improving support for change logs, our data access system now supports exactly-once semantics, and we are working on ingesting business tables into Kafka in full, so that exactly-once data statistics can subsequently be realized to support statistics on transactions, leads, and finance.
  2. Some companies have put forward the concept of an integrated lakehouse. Data lake technology can indeed solve some pain points of the original data warehouse architecture, for example that data does not support update operations and quasi-real-time queries are not possible. We are currently experimenting with integrating Flink with Iceberg and Hudi, and will look for suitable scenarios within the company and put them into practice.
