This article introduces the construction of Baixin Bank's real-time computing platform, the design and practice of building a real-time data lake on Hudi, and how the real-time computing platform integrates with and uses Hudi. The main contents include:
- Background
- Design and practice of Baixin Bank's real-time computing platform based on Flink
- Integration practice of Baixin Bank's real-time computing platform and real-time data lake
- The future of Baixin Bank's real-time data lake
- Summary
1. Background
Baixin Bank, whose full name is "CITIC Baixin Bank Co., Ltd.", is the first direct bank approved to operate as an independent legal entity. As the first state-controlled Internet bank, Baixin Bank has higher requirements for data agility than the traditional financial industry.
Data agility requires not only accurate data, but also timely data arrival and secure data transmission. To meet these needs, the Big Data Department of Baixin Bank took on the responsibility of building a real-time computing platform that brings data online quickly, safely, and in a standardized way.
Benefiting from the continuous iteration of big data technology, the two pillars of the well-known stream-batch unification are the "unified computing engine" and the "unified storage engine".
- Flink, a leader in real-time big data computing, further strengthened its capabilities as a unified computing engine with the release of version 1.12;
- At the same time, with the development of the data lake technology Hudi, the unified storage engine has also ushered in a new generation of technological change.
Building on the work of the Flink and Hudi communities, Baixin Bank built a real-time computing platform and integrated the Hudi-based real-time data lake into it. Combined with industry data governance practices, this achieves the goals of real-time online data, safety and reliability, unified standards, and an agile data lake.
2. Design and practice of Baixin Bank's real-time computing platform based on Flink
1. Positioning of real-time computing platform
The real-time computing platform is a bank-wide platform developed in-house by the Big Data IaaS team. It is an enterprise-grade product that enables end-to-end online processing of real-time data.
- Its core functions include real-time collection, real-time computation, real-time loading into storage, complex event processing, a rule engine, visual management, one-click configuration, self-service deployment, and real-time monitoring and alerting.
- It currently supports scenarios such as real-time data warehousing, breakpoint recall, intelligent risk control, unified asset view, anti-fraud, and real-time feature variable processing.
- It serves many business lines across the bank, such as small and micro business, credit, anti-fraud, consumer finance, and risk.
To date, more than 320 real-time tasks run stably online, with a daily QPS of about 1.7 million across online tasks.
2. Architecture of real-time computing platform
According to the function, the architecture of the real-time computing platform is mainly divided into three layers:
■ 1) Data collection layer
The collection layer currently covers two main scenarios:
- The first scenario is collecting Binlog from the MySQL standby database into Kafka. Our bank did not adopt the CDC solutions commonly used in the industry, such as Canal or Debezium, for two reasons:
1. Our MySQL is an internal Baixin Bank build with a different Binlog protocol, so the existing solutions cannot read our Binlog.
2. In addition, the MySQL standby database we collect from may switch between data centers at any time, which could cause collected data to be lost. To address this, we developed the Databus project in-house to read MySQL Binlog, re-implemented the Databus logic as a Flink application, and deployed it on Yarn, so that Databus data extraction is highly available and its resources are controllable.
- The second scenario is integration with third-party applications that write data to Kafka. There are two ways to write to Kafka:
1. The first way is based on a JSON schema protocol that we defined, the UMF protocol:
(UMF protocol: {"col_name": "", "umf_id": "", "umf_ts": "", "umf_op_": "i/u/d"})
The protocol defines a unique id, a timestamp, and an operation type. Under this protocol, the producer can mark each message with an operation type of "insert", "update", or "delete", so that downstream consumers can handle the message accordingly; a hedged example of building such a message follows this list.
2. The other way is for users to write JSON data directly to Kafka without distinguishing operation types.
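As a hedged illustration only (the field names follow the UMF protocol sketch above; the business column, values, and class name are made up), a producer could build an "update" message like this before writing it to Kafka:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class UmfMessageExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode msg = mapper.createObjectNode();
        // Business column of the record (name and value are illustrative only).
        msg.put("col_name", "account_balance");
        // UMF metadata fields as defined by the protocol above.
        msg.put("umf_id", "acct_0001");                 // unique id of the record
        msg.put("umf_ts", System.currentTimeMillis());  // timestamp of the change
        msg.put("umf_op_", "u");                        // operation type: i / u / d
        // The serialized JSON string is what gets written to the Kafka topic.
        System.out.println(mapper.writeValueAsString(msg));
    }
}
```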
■ 2) Data calculation conversion layer
This layer consumes data from Kafka and applies a layer of transformation logic: it supports user-defined functions, standardizes data, and desensitizes and encrypts sensitive data.
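A minimal sketch of this conversion layer, assuming hypothetical Kafka topics and a simple phone-number masking rule standing in for the real desensitization UDFs:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class DesensitizeJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder address
        props.setProperty("group.id", "desensitize-demo");

        env.addSource(new FlinkKafkaConsumer<>("raw_topic", new SimpleStringSchema(), props))
           // Standardize / desensitize: mask the middle four digits of 11-digit phone numbers.
           .map(value -> value.replaceAll("(\\d{3})\\d{4}(\\d{4})", "$1****$2"))
           // Hand the cleaned records to the storage layer via a downstream topic.
           .addSink(new FlinkKafkaProducer<>("kafka:9092", "clean_topic", new SimpleStringSchema()));

        env.execute("desensitize-demo");
    }
}
```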
■ 3) Data storage layer
Data is stored in HDFS, Kudu, TiDB, Kafka, Hudi, MySQL and other storage media.
In the architecture diagram shown in the figure above, we can see that the main functions supported by the overall real-time computing platform are:
- Development level:
1. Support for standardized DataBus collection. This function synchronizes MySQL Binlog to Kafka and requires little configuration; users only need to specify the source MySQL instance to complete the standardized synchronization to Kafka.
2. Support for editing Flink SQL visually.
3. Support for user-defined Flink UDFs.
4. Support for complex event processing (CEP); a hedged CEP sketch follows this list.
5. Support for uploading, packaging, and compiling Flink applications.
- Operation and maintenance level:
1. Support for status management and savepoints for different types of tasks.
2. Support for end-to-end latency monitoring and alerting.
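As an illustration of the CEP support listed above, here is a minimal sketch assuming a hypothetical login-event stream and a made-up rule (three consecutive failed logins within 10 seconds raise a risk alert); it is not the platform's actual rule engine:

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.List;
import java.util.Map;

public class LoginFailAlert {
    // Hypothetical event type: who tried to log in and whether it succeeded.
    public static class LoginEvent {
        public String userId;
        public boolean success;
        public LoginEvent() {}
        public LoginEvent(String userId, boolean success) { this.userId = userId; this.success = success; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

        DataStream<LoginEvent> logins = env.fromElements(
                new LoginEvent("u1", false), new LoginEvent("u1", false), new LoginEvent("u1", false));

        // Pattern: three consecutive failed logins within 10 seconds.
        Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("fail")
                .where(new SimpleCondition<LoginEvent>() {
                    @Override
                    public boolean filter(LoginEvent e) { return !e.success; }
                })
                .times(3).consecutive()
                .within(Time.seconds(10));

        CEP.pattern(logins.keyBy(e -> e.userId), pattern)
           .select(new PatternSelectFunction<LoginEvent, String>() {
               @Override
               public String select(Map<String, List<LoginEvent>> match) {
                   return "risk alert for user " + match.get("fail").get(0).userId;
               }
           })
           .print();

        env.execute("cep-demo");
    }
}
```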
As the real-time computing platform has been upgraded and iterated, some community Flink releases have introduced backward incompatibilities. To upgrade Flink smoothly, we unified the multi-version modules of the computing engine and strictly isolated the versions at the JVM level, so there are no Jar package conflicts or Flink API incompatibilities between versions.
As shown in the figure above, we encapsulate each Flink version in its own JVM, started through a Thrift Server, and every Flink version has its own independent Thrift Server. At run time, as long as the user explicitly specifies a Flink version, the Flink application is started by the corresponding Thrift Server. We also embed the most commonly used Flink version into the real-time computing back-end service itself, to avoid the extra startup time of launching a Thrift Server.
At the same time, to meet the high-availability and multi-backup requirements of a financial system, the real-time computing platform supports multiple Hadoop clusters, so that real-time computing tasks can be migrated to a standby cluster after a failure. The overall scheme supports checkpoints and savepoints across clusters; after a task fails, it can be restarted in the standby data center.
3. Integration practice of Baixin Bank real-time computing platform and real-time data lake
Before introducing the integration, let's first look at the current state of our bank's data lake. Today, our bank still uses the mainstream Lambda architecture to build its data warehouse.
1. Lambda
Under the Lambda architecture, the data warehouse has the following drawbacks:
- Two sets of code for the same requirement: both batch and streaming logic must be developed and maintained, along with the logic that merges their results, and both must be online at the same time;
- High computing and storage resource usage: the same computing logic is executed twice, which increases overall resource consumption;
- Ambiguous data: with two sets of computing logic, real-time data and batch data often do not match, and it is hard to tell which is accurate;
- Reliance on the Kafka message queue: Kafka retention is usually set by day or month, so the full history cannot be retained, and the data cannot be analyzed with the existing ad hoc query engines.
2. Hudi
To solve the pain points of the Lambda architecture, our bank planned a new-generation data lake architecture. We spent a lot of time investigating the existing data lake technologies and finally chose Hudi as our storage engine, for the following reasons:
- Update/Delete of records: Hudi uses fine-grained file/record-level indexes to support updating and deleting records, provides transactional guarantees for writes, and supports ACID semantics. Queries process the latest committed snapshot and produce results based on it;
- Change streams: Hudi provides first-class support for obtaining data changes. It can produce an incremental stream of all records updated/inserted/deleted in a given table from a given point in time, and can query the state of the data at different times (a hedged streaming-read sketch follows this list);
- Unified technology stack: compatible with our existing ad hoc query engines, Presto and Spark;
- Fast community iteration: Flink already supports reading and writing both table types, COW and MOR.
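As a hedged sketch of the change-stream capability, the snippet below reads a Hudi table as a continuous incremental stream through the Flink SQL connector. The table name, schema, and path are hypothetical, and the option names ('read.streaming.enabled', 'read.streaming.start-commit') follow the Hudi Flink connector of the 0.8/0.9 era, so they may differ in other versions:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiStreamingReadExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Hudi table over an existing path in the datalake layer.
        tEnv.executeSql(
                "CREATE TABLE datalake_t_user (" +
                "  id BIGINT," +
                "  name STRING," +
                "  balance DECIMAL(16, 2)," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///datalake/t_user'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'read.streaming.enabled' = 'true'," +               // consume the table as a change stream
                "  'read.streaming.start-commit' = '20210301000000'" + // incremental read from this instant
                ")");

        // Every record inserted/updated/deleted after the start instant flows out incrementally.
        tEnv.executeSql("SELECT * FROM datalake_t_user").print();
    }
}
```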
In the new architecture, we write all real-time and batch source-layer data to Hudi storage, landing it in a new data lake layer called datalake (a Hive database). For historical reasons, and to stay compatible with the previous data warehouse model, we still keep the original ODS layer; the historical warehouse model remains unchanged, except that the ODS source layer now obtains its data from the datalake layer.
- First, for newly onboarded tables, we use Flink on the real-time computing platform to write them into the datalake (the new source layer, stored in Hudi format). Data analysts and data scientists can use the datalake layer directly for data analysis and machine learning modeling. If the data warehouse model needs to consume data from the datalake, a layer of conversion into ODS is required. This conversion falls into two cases:
1. For incremental models, the user only needs to run a snapshot query on the latest datalake partition and load the result into ODS.
2. For full models, the user merges the previous day's ODS snapshot with the latest snapshot query result from the datalake to form the latest snapshot, writes it into the current ODS partition, and repeats this day by day.
We did this so that the existing data warehouse model does not need to change; we only replace the data source of ODS with the datalake, which is far more timely. At the same time, it satisfies the demand of data analysts and data scientists to access data in near real time. A hedged SQL sketch of the full-model merge follows.
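A minimal sketch of the full-model conversion, written as Flink batch SQL run from Java (the same logic could run on Hive or Spark). The tables ods_t_user (daily full snapshots partitioned by dt) and datalake_t_user (the Hudi source layer), the column names, and the dates are all hypothetical and assumed to be registered in the catalog already:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OdsFullModelMerge {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Merge yesterday's ODS snapshot with the latest datalake snapshot, keep the newest
        // version of each primary key, and overwrite today's ODS partition with the result.
        tEnv.executeSql(
                "INSERT OVERWRITE ods_t_user PARTITION (dt = '2021-03-02') " +
                "SELECT id, name, balance, update_time FROM ( " +
                "  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn " +
                "  FROM ( " +
                "    SELECT id, name, balance, update_time FROM ods_t_user WHERE dt = '2021-03-01' " +
                "    UNION ALL " +
                "    SELECT id, name, balance, update_time FROM datalake_t_user " +
                "  ) merged " +
                ") ranked " +
                "WHERE rn = 1");
    }
}
```

The incremental-model case is simpler: a snapshot query over only the latest datalake partition, inserted into the corresponding ODS partition.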
- In addition, for the existing ODS data, we developed a script to initialize the ODS layer data into the datalake:
1. If the ODS layer stores a full snapshot every day, we initialize only the latest snapshot into the corresponding datalake partition and then bring the table onto the real-time ingestion link into the datalake;
2. If the ODS layer data is incremental, we do not initialize it for now; we only build a new real-time ingestion link into the datalake and then cut the daily incremental load of ODS over to it once a day.
- Finally, for one-off data ingestion into the lake, we use a batch import tool to load the data into the datalake; a hedged sketch follows.
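For the one-off ingestion case, a hedged sketch using the Hudi Flink connector's bulk-insert write mode. The table, path, and source query are hypothetical, and the 'write.operation' = 'bulk_insert' option is an assumption based on later Hudi Flink releases and may not exist in every version; Hudi's Spark-based batch import tools are an equivalent alternative:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OneOffBulkImport {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Target Hudi table in the datalake layer (schema, key, and path are hypothetical).
        tEnv.executeSql(
                "CREATE TABLE datalake_t_user (" +
                "  id BIGINT," +
                "  name STRING," +
                "  balance DECIMAL(16, 2)," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///datalake/t_user'," +
                "  'table.type' = 'COPY_ON_WRITE'," +
                "  'write.operation' = 'bulk_insert'" + // one-off batch load instead of upsert
                ")");

        // Load the historical snapshot from the existing ODS Hive table in one batch.
        tEnv.executeSql(
                "INSERT INTO datalake_t_user " +
                "SELECT id, name, balance, update_time FROM ods_t_user WHERE dt = '2021-03-01'");
    }
}
```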
The overall lake-to-warehouse conversion logic is shown in the figure:
3. Technical challenges
- In the early days of our research, Hudi's support for Flink was not yet mature, so we did a lot of development and testing with Spark Structured Streaming. From our PoC test results:
1. With the COW write mode and no partitioning, writes become slower and slower once the write volume reaches tens of millions of records;
2. Later we changed from unpartitioned writes to writes into incremental partitions, which improved the speed considerably.
The root cause is that Spark reads the base file index while writing: the larger the files, the more index data has to be read and the slower the lookup becomes, so writes keep slowing down. Writing into small incremental partitions keeps these index lookups bounded.
- Meanwhile, as Flink's support for Hudi improved, our goal became to integrate Hudi's lake-ingestion capability into the real-time computing platform. We therefore integrated and tested Hudi with the real-time computing platform and ran into some problems along the way. Typical issues were:
1. Class conflicts
2. Missing class files
3. RocksDB conflicts
To resolve these incompatibilities, we built a separate module around the Hudi dependencies; this project simply repackages the Hudi dependencies into a shaded jar.
4. When there are dependency conflicts, we exclude the conflicting dependencies related to the Flink or Hudi modules.
5. If other dependencies cannot be found, we bring the required dependencies in through the pom file.
- We also hit problems with the Hudi on Flink solution, such as failures caused by checkpoints that grew too large and therefore took too long. We solved this by setting a TTL on the state, switching from full checkpoints to incremental checkpoints, and increasing the parallelism. A hedged configuration sketch follows.
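A hedged sketch of those mitigations using Flink 1.12-era APIs; the TTL value, parallelism, checkpoint interval, and paths are made-up examples, not our production settings:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB backend with incremental checkpoints: only changed files are uploaded each time.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000); // checkpoint every 60 seconds
        env.setParallelism(8);           // more parallelism spreads the state across more subtasks

        // Expire state entries that have not been written for a day, so state stops growing unbounded.
        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .cleanupInRocksdbCompactFilter(1000)
                .build();
        ValueStateDescriptor<String> descriptor =
                new ValueStateDescriptor<>("dedup-state", String.class);
        descriptor.enableTimeToLive(ttlConfig); // attach the TTL to the state used inside operators
    }
}
```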
- The choice between COW and MOR: at present, most of the Hudi tables we use are COW, for two reasons:
1. First, our historical ODS data is imported into the datalake tables in a single batch, so there is no write amplification.
2. Second, COW's workflow is relatively simple and does not involve extra operations such as compaction.
For new datalake data with many updates and high real-time requirements, we choose the MOR format for writing; especially when the QPS is high, we use asynchronous compaction to avoid write amplification. In all other cases, we still prefer to write in COW format. A hedged MOR table sketch follows.
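A hedged sketch of an MOR table definition with asynchronous compaction enabled; the table name, schema, and path are hypothetical, and the option names ('compaction.async.enabled', 'compaction.delta_commits') follow the Hudi Flink connector and may vary across versions:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MorTableWithAsyncCompaction {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // MOR suits high-QPS, update-heavy tables: writes land in row-based log files and are
        // merged into columnar base files later by the asynchronous compaction task.
        tEnv.executeSql(
                "CREATE TABLE datalake_t_trade (" +
                "  trade_id STRING," +
                "  user_id STRING," +
                "  amount DECIMAL(16, 2)," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (trade_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///datalake/t_trade'," +
                "  'table.type' = 'MERGE_ON_READ'," +
                "  'compaction.async.enabled' = 'true'," + // compact without blocking the writer
                "  'compaction.delta_commits' = '5'" +     // trigger compaction every 5 delta commits
                ")");
    }
}
```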
4. The future of Baixin Bank's real-time data lake
In our bank's real-time data lake architecture, the goal is to build the entire real-time data warehouse pipeline on Hudi. The architecture is shown in the figure:
Our overall plan is to replace Kafka with Hudi as the intermediate storage, build the data warehouse layers on Hudi, and use Flink as the unified stream-batch computing engine. The advantages are:
- The MQ no longer serves as the intermediate storage of the real-time data warehouse; Hudi is stored on HDFS and can hold massive data sets;
- The intermediate layers of the real-time data warehouse can be queried with OLAP analysis engines;
- Stream-batch unification in the true sense, solving the T+1 data latency problem;
- The schema no longer needs to be strictly defined at read time, and schema evolution is supported;
- Primary key indexes are supported, improving query efficiency several times over, and ACID semantics ensure data is neither duplicated nor lost;
- Hudi's Timeline can retain more intermediate states of the data, giving stronger data completeness.
5. Summary
This article introduced the construction of Baixin Bank's real-time computing platform, the design and practice of building a real-time data lake on Hudi, and how the real-time computing platform integrates with and uses Hudi.
We encountered some problems while using Hudi, and we sincerely thank the community for its help, with special thanks to Danny Chan and leesf for answering our questions. Under the real-time data lake architecture, we are still exploring how to build our real-time data warehouse and an integrated stream-batch solution.