A data warehouse has four basic characteristics: it is subject-oriented, integrated, relatively stable (nonvolatile), and reflects historical changes (time-variant). Among them, data integration is the primary prerequisite for building a data warehouse: it refers to bringing multiple scattered, heterogeneous data sources together to facilitate subsequent data analysis. Turning the data integration process into a platform greatly improves the efficiency of data developers. This article covers:
- Data integration vs. data synchronization
- Integration requirements
- Data Integration V1
- Data Integration V2
- Online results
- Summary
A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.
—— Bill Inmon
1. Data Integration vs. Data Synchronization

"Data integration" is often confused with "data synchronization", so we first distinguish the two concepts:

- "Data integration" refers specifically to the process of synchronizing data into the ODS layer of the data warehouse;
- "Data synchronization" refers to the general Source-to-Sink data transmission process.

The relationship between the two is shown in the figure below:

The "data synchronization platform" provides basic capabilities and carries no specific business logic. The "data integration platform" is built on top of the "data synchronization platform": besides synchronizing the raw data, it also contains some aggregation logic (such as restoring snapshot data from database change logs, described in detail below) and content related to data warehouse conventions (such as naming conventions for ODS-layer databases and tables). The "data synchronization platform" itself is still on our roadmap, but this does not block building the "data integration platform": some synchronization jobs can be created on the "real-time computing platform" in advance, and the two are decoupled by "convention".

It is worth mentioning that "data integration" should also cover "data collection" (performed by specific tools) and "data cleaning" (determined by collection granularity, log conventions, and so on). Since each company has its own implementation of these two parts, they are not covered in detail in this article.
2. Integration Requirements

At present, Banyu's internal data integration requirements fall mainly into three areas: Stat Log (standardized business logs, i.e. statistical logs), TiDB, and MongoDB. There are also some service logs, Nginx logs, and so on, but they are not representative and are not covered in this article. In addition, since the real-time data warehouse is still under construction, the current "data integration platform" only covers the offline data warehouse (Hive).

- Stat Log: business logs are collected by the Filebeat component and shipped to Kafka. Because the logs are append-only, Stat Log integration is relatively simple: the Kafka data only needs to be synchronized to Hive.
- DB (TiDB, MongoDB): DB data is more troublesome. The core requirement is to have a mirror of the business database in the data warehouse, i.e. a snapshot of the business database at a certain point in time (day level or hour level); occasionally there is also a need to analyze the process of data changes. DB data integration therefore needs to take both aspects into account.

Since these two types of data are integrated in quite different ways, they are discussed separately below.
3. Data Integration V1

Banyu's early "data integration platform" had already taken shape; at that stage it was mainly built from a series of open-source tools. As time went on, this version exposed more and more problems. Below, V1 is described mainly from the perspective of data flow; the more detailed issues are addressed in the design of V2.
3.1 Stat Log
Log integration had not yet been connected to the platform and followed a siloed, case-by-case development approach. The data integration link is shown in the figure below:

The data in Kafka is first synchronized to HDFS through Flume and then imported from HDFS into Hive by a Spark job, which also creates the partitions. The overall link is long, and introducing a third-party component (Flume) increases operation and maintenance costs; in addition, the redundant storage of Kafka's raw data on HDFS increases storage overhead.
3.2 DB
DB data integration was mainly query-based (batch mode: snapshot data is obtained by full table scans via Select queries). The link is shown in the figure below:

Users submit integration tasks through the platform, and scheduled Airflow tasks scan the integration platform's metadata database to generate the corresponding extraction tasks (TiDB data via the Sqoop tool, MongoDB data via the mongoexport tool). As can be seen, V1 does not capture the database change logs and therefore cannot meet the requirement of analyzing the process of data changes.

Since Sqoop tasks ultimately pull data from the business databases of the TiDB production environment, pulling large amounts of data inevitably has some impact on those databases. The mongoexport tasks run directly against MongoDB hidden nodes (which serve no business traffic), so their impact on the online business is negligible. For this reason, the DBAs built a separate TiDB big-data cluster and synchronized the larger business databases to it (based on the TiDB Pump and Drainer components), so that some Sqoop tasks could pull data from this cluster instead and the impact on the business databases is removed. From the perspective of data flow, the whole process is shown in the figure below:

Whether a TiDB business database in the production environment is synchronized to the TiDB big-data cluster is decided by the data warehouse requirements and the DBA's assessment of the data volume. As can be seen, this approach also introduces a large amount of data redundancy, and the cluster's resources reach a bottleneck as the number of synchronization tasks grows. Moreover, with subsequent evolution, the TiDB big-data cluster also came to host part of the business databases serving data applications in production, and the boundary of the cluster gradually blurred.
4. Data Integration V2

In V2 we introduced Flink to simplify the synchronization link, and DB data integration was changed from the query-based approach to a log-based (streaming) approach, which greatly reduces redundant storage.
4.1 Stat Log
With the Hive integration available since Flink 1.11, we can easily write Kafka data into Hive, so Stat Log integration becomes very simple (compared with V1, the dependency on the Flume component is removed and the redundant data storage is eliminated), and Flink's exactly-once semantics also guarantee the accuracy of the data. From the perspective of data flow, the whole process is shown in the figure below:

Currently, log partitions are generated at hourly granularity. The configuration parameters of the Flink jobs are as follows:
checkpoint: 10 min
watermark: 1 min
partition.time-extractor.kind: 'custom'
sink.partition-commit.delay: '3600s'
sink.partition-commit.policy.kind: 'metastore,success-file'
sink.partition-commit.trigger: 'partition-time'
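For reference, below is a minimal sketch of how such a Kafka-to-Hive streaming job can be expressed with the Flink Table API in Java (Flink 1.11+). The catalog paths, table names, fields, and the custom partition-time extractor class are hypothetical placeholders, and the Kafka source table stat_log_source is assumed to be defined elsewhere; this is a sketch under those assumptions, not the platform's actual implementation.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class StatLogToHiveJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10 * 60 * 1000L); // checkpoint every 10 min, as listed above

        StreamTableEnvironment tEnv = StreamTableEnvironment.create(
                env, EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Register and use a Hive catalog (names and conf dir are placeholders).
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "ods", "/etc/hive/conf"));
        tEnv.useCatalog("hive");

        // Create the partitioned Hive sink table with the partition-commit settings listed above.
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS ods.stat_log (log_id STRING, event STRING, ts BIGINT) " +
            "PARTITIONED BY (dt STRING, hr STRING) STORED AS ORC TBLPROPERTIES (" +
            "  'partition.time-extractor.kind' = 'custom'," +
            "  'partition.time-extractor.class' = 'com.example.HourPartitionTimeExtractor'," + // hypothetical extractor
            "  'sink.partition-commit.trigger' = 'partition-time'," +
            "  'sink.partition-commit.delay' = '3600s'," +
            "  'sink.partition-commit.policy.kind' = 'metastore,success-file')");

        // Continuously insert from the (assumed) Kafka source table into the Hive table.
        tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        tEnv.executeSql(
            "INSERT INTO ods.stat_log " +
            "SELECT log_id, event, ts, DATE_FORMAT(log_time, 'yyyy-MM-dd'), DATE_FORMAT(log_time, 'HH') " +
            "FROM default_catalog.default_database.stat_log_source");
    }
}

The 1-minute watermark from the configuration above would be declared on the event-time column of the Kafka source table.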
4.2 DB
Log-based integration of DB data means that the DB change logs must be collected. In our current implementation, TiDB logs are collected via the Pump and Drainer components (the current production database cluster version does not yet support enabling TiCDC), and MongoDB logs via the MongoShake component; the collected data is sent to Kafka. On the one hand this reduces the query pressure on the business databases, on the other hand it makes the process of data changes capturable, and redundant data storage is also eliminated. However, since the raw data is now log data, some means is needed to restore the snapshot data. The new link is shown in the figure below:

After a user submits an integration task, three tasks are created simultaneously:

- Incremental task (streaming): the incremental task synchronizes DB log data from Kafka to Hive. Because the collection components work at cluster granularity and the number of clusters is limited, these synchronization jobs are currently created manually on the "real-time computing platform"; when an integration task is created, the corresponding synchronization job is assumed to be ready. Once the "data synchronization platform" lands, more automation and validation can be added.
- Stock task (batch): to restore snapshot data, at least one initial copy of the snapshot is required, so the purpose of the stock task is to pull the initial snapshot of the integrated data from the business database.
- Merge task (batch): the merge task aggregates the stock data with the incremental data to restore the snapshot. The restored snapshot can serve as the next day's stock data, so the stock task only needs to be scheduled and executed once to obtain the initial snapshot.

The stock task and the merge task are scheduled and executed by the offline scheduling platform DolphinScheduler (DS for short); during execution they obtain the required information from the integration task's metadata database. Currently the merge task is scheduled at hourly granularity, i.e. the snapshot data is restored every hour.
From the perspective of data flow, the whole process is shown in the following figure:
Compared with Stat Log, DB data integration is more complex. Below we take TiDB data integration as an example to describe some key points of the design (the MongoDB process is similar; the differences lie in the stock synchronization tool and the data parsing).

4.2.1 Expressing requirements

For users, an integration task needs to provide the following two types of information:

- TiDB source information: cluster, database, and table.
- Integration mode: the integration mode describes the aggregation granularity of the snapshot data and can be either full or incremental. Full means the existing snapshot data is aggregated with today's incremental log data, while incremental means only today's incremental log data is aggregated (even though the incremental mode does not aggregate with the existing snapshot, obtaining the initial stock data is still necessary; how it is used is decided by the data warehouse team).
4.2.2 Stock task

Although the stock task is created and executed only once, in order to completely eliminate the impact of data integration on the business databases, we chose to implement it with the database backup-and-restore mechanism. Database backup and restore operations are already platformized inside the company: clusters are backed up regularly (at day granularity), the latest backup of a cluster can be queried through the platform, and a restore can be triggered through an API. The stock data can therefore be pulled directly from a restored database.

Since the time point of the database backup and the time point at which the integration task is submitted are not necessarily on the same day, there is a time gap between them, and the stock snapshot therefore does not match our expectation. The relationship between the time points is shown in the figure below:

By our definition, the stock snapshot should contain all data before T4, whereas the actual backup snapshot only contains all data before T1; there is a gap of N days of data between the two.

Note: the gap is not described as the data in the T1 to T4 interval, because the incremental binlog data is partitioned on the hour, and when an on-the-hour partition is merged with the stock data the merge also deduplicates. Therefore, merging the stock data as of T1 with the incremental data of T0-T3 is equivalent to merging the stock data as of T0 with the incremental data of T0-T3, so the data gap of T1-T4 is equivalent to the data gap of T0-T3, i.e. the N days of data in the figure.

This missing data can in fact be filled in within the stock task itself; on closer analysis, it can be achieved by a backfill run of the merge task described below.
The workflow of the entire stock task is shown in the figure below:

- Synchronously trigger a backup restore on the database platform and obtain a receipt ID.
- Poll the backup-restore status with the receipt ID. If the restore fails, the DBA needs to investigate the problem, so the whole workflow is taken offline; after the restore succeeds, the stock task can be brought back online on the platform and resumed. If the restore is still in progress, the workflow exits directly and waits to be woken up by the next DS scheduled run. If the restore has succeeded, proceed to the subsequent steps.
- Pull the stock data from the restored database and determine whether there is a gap in the stock data; if there is, run the merge task backfill. The whole operation is idempotent; if it fails, exit the workflow and wait for the next scheduled run.
- On success, take the whole workflow offline; the task is complete.
4.2.3 Merge task
The precondition of the merge task is that both the stock data and the incremental data are ready, which we mark with _SUCCESS files (a minimal readiness check is sketched after this list). The workflow of the entire merge task is shown in the figure below:

- Check whether the file markers exist. If they do not, the data is not ready: send an alarm and exit the workflow to wait for the next scheduled run.
- Execute the merge operation. On failure, send an alarm and exit the workflow to wait for the next scheduled run.
- On success, exit the workflow and wait for the next scheduled run.
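For illustration, here is a minimal sketch of such a readiness check using the Hadoop FileSystem API; the partition paths are hypothetical, and the actual check in the DS workflow may well be a script step rather than a Java program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadinessCheck {

    // Returns true only if both the stock and the incremental partitions carry a _SUCCESS marker.
    public static boolean ready(String stockPath, String incrementPath) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.exists(new Path(stockPath, "_SUCCESS"))
                && fs.exists(new Path(incrementPath, "_SUCCESS"));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical layout: the previous cycle's stock snapshot and the hourly binlog partition.
        boolean ok = ready(
                "/warehouse/ods/db_table_snapshot/dt=2021-11-30",
                "/warehouse/ods/db_table_binlog/dt=2021-12-01/hour=10");
        if (!ok) {
            System.err.println("data not ready, alarm and wait for the next scheduled run");
            System.exit(1); // a non-zero exit lets the scheduler mark the step as failed and raise an alarm
        }
    }
}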
The merge operation is implemented with the Flink DataSet API. The core logic is as follows:

- Load the stock and incremental data and unify their format. The core fields are: the primary key, used as the grouping key for records belonging to the same row; CommitTs, the binlog commit time (stock data defaults to 0, i.e. earlier than any incremental data); and OpType, the operation type (Insert, Update, or Delete; stock data defaults to Insert). Then union the two datasets.
- Group by the primary key.
- Within each group, keep the record with the largest CommitTs and discard the rest.
- Filter out records whose OpType is Delete.
- Write out the aggregated result.
Core code:
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.ReduceFunction;

allMergedData.groupBy(x -> x.getKeyCols())
        .reduce(new ReduceFunction<MergeTransform>() {
            @Override
            public MergeTransform reduce(MergeTransform value1, MergeTransform value2) throws Exception {
                // Keep the record with the larger CommitTs: the latest change wins.
                if (value1.getCommitTS() > value2.getCommitTS()) {
                    return value1;
                }
                return value2;
            }
        })
        .filter(new FilterFunction<MergeTransform>() {
            @Override
            public boolean filter(MergeTransform merge) throws Exception {
                // Incremental data: drop records whose latest operation is a Delete.
                if (merge.getOpType().equals(OPType.DELETE)) {
                    return false;
                }
                return true;
            }
        })
        .map(x -> x.getHiveColsText())
        .writeAsText(outPath);
The main idea is "the latest write wins": for Insert and Update operations, the newest value directly overwrites the old one; for a Delete operation, the record is simply discarded. This approach also naturally deduplicates the data.
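For completeness, here is a minimal sketch of what the MergeTransform record and OPType enum used above might look like; the actual classes are not shown in the original article, so everything beyond the accessor names used in the core code is hypothetical.

public class MergeTransform {
    private String keyCols;      // concatenated primary-key columns, used as the grouping key
    private long commitTS;       // binlog commit timestamp; 0 for stock data
    private OPType opType;       // INSERT / UPDATE / DELETE; stock data defaults to INSERT
    private String hiveColsText; // the row serialized in Hive text format

    public String getKeyCols() { return keyCols; }
    public long getCommitTS() { return commitTS; }
    public OPType getOpType() { return opType; }
    public String getHiveColsText() { return hiveColsText; }
}

enum OPType { INSERT, UPDATE, DELETE }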
4.2.4 Fault tolerance and data consistency guarantees

We can examine the fault tolerance of the solution by looking at how failures of the three kinds of tasks are handled.

- Stock task fails: usually caused by a failed backup restore. The DS task sends a failure alarm; because the "database platform" does not yet support restore retries, manual intervention is required. Meanwhile, the merge task cannot detect the stock _SUCCESS marker, so the workflow does not advance.
- Incremental task fails: Flink's own fault-tolerance mechanism and the external health checks of the "real-time computing platform" ensure the fault tolerance of the incremental task. If the incremental task has not recovered by the time the merge task is scheduled, the merge task will wrongly assume there is no incremental data for that hour and skip execution, which amounts to a delayed snapshot update (the merge aggregates the whole day's incremental data with the stock data, so if the incremental task recovers before a later scheduled run, the latest snapshot can still be produced); alternatively, the merge task can be triggered manually after the incremental task recovers.
- Merge task fails: the task is idempotent, so fault tolerance is provided by configuring retries on DS task failure, together with a failure alarm.

In summary, the automatic recovery and alarm mechanisms ensure the correct execution of the overall workflow. Next, let's look at the solution's consistency guarantees from the data perspective.

Data consistency comes down to the merge operation, which aggregates two datasets. The correctness of the algorithm itself can be ensured at the code level (it is verifiable and testable), so the only situations that could cause inconsistency lie in the two inputs, the stock data and the incremental data. There are two cases:

- The stock and incremental data overlap: this occurs when the initial stock is aggregated with the on-the-hour incremental partitions. Because the algorithm naturally deduplicates, data consistency is guaranteed.
- The stock or incremental data is missing: in practice this means missing incremental data. The incremental data is written from Kafka to Hive by Flink, and in this process there is a possibility of inconsistency, namely out-of-order data arriving after the partition has already been committed. Although the partition is committed again at the next checkpoint after the late data arrives, downstream tasks are generally triggered by the first partition commit, which leaves downstream data inconsistent.

There are two ways to handle out-of-order data when streaming from Flink into Hive. The first is to use a single Kafka partition: multiple partitions are the root cause of the disorder, so the disorder is eliminated by avoiding them. The second is alarm-based compensation: once the data is processed as a stream, disorder cannot be avoided completely (a tolerance for out-of-orderness can be set via the watermark, but only up to a limit), so the only remaining option is after-the-fact compensation driven by alarms. The problem then becomes how to detect the disorder: since out-of-order data triggers a second commit of an already committed partition, we only need to check, when a partition is committed, whether its _SUCCESS marker already exists; if it does, the commit was triggered by out-of-order data, and an alarm can be raised.
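As an illustration, below is a minimal sketch of this check written as a custom partition commit policy, assuming the PartitionCommitPolicy SPI of Flink 1.11's filesystem/Hive connector (enabled by adding 'custom' to sink.partition-commit.policy.kind and pointing sink.partition-commit.policy.class at this class). Package and method names may differ across Flink versions, and the alert hook is a placeholder, not the platform's actual implementation.

import org.apache.flink.table.filesystem.PartitionCommitPolicy;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Raises an alarm when a partition is committed for the second time, which indicates
 * that out-of-order (late) data arrived after the first commit. Assumes this policy
 * runs before the success-file policy, so any _SUCCESS it sees was written earlier.
 */
public class LateDataAlertPolicy implements PartitionCommitPolicy {

    @Override
    public void commit(Context context) throws Exception {
        Path successFile = new Path(context.partitionPath().toString(), "_SUCCESS");
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs.exists(successFile)) {
            // The partition was already committed once: this re-commit was triggered by late data.
            alert("late data detected for partition " + context.partitionValues());
        }
    }

    private void alert(String message) {
        // Placeholder: hook into the company's alerting system (e.g. an alarm topic or webhook).
        System.err.println("[ALERT] " + message);
    }
}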
5. Online Results
Overview

Stock task

Merge task
6. Summary
This article has explained the core design ideas of Banyu's "data integration platform". Some details of the overall solution, such as handling data schema changes and parsing DB log data, are not covered here, but they are also crucial to building the platform. At present, most of Banyu's integration tasks have been switched to the new approach and are running stably. We are also working on bringing real-time data warehouse integration tasks onto the platform to provide a more unified experience.

Original article (Chinese): Banyu Data Integration Platform Design and Implementation