Introduction: With stream-batch integration, only one computing framework needs to be maintained for both real-time and offline workloads, which saves a great deal of manpower for business developers, platform providers, and computing engine maintainers.
Abstract: This article is compiled from the talk given by Xiaomi software development engineer Jin Feng at the stream-batch integration session of Flink Forward Asia 2021. It is organized into four parts:
- The evolution of Xiaomi's big data
- Stream-batch integrated platform construction
- Application scenarios of stream-batch integration
- Future plans
1. Xiaomi's big data development and evolution
- Before 2019, Xiaomi's real-time computing was mainly based on Spark Streaming, with a small portion on Storm, and offline computing was mainly based on Spark.
- In 2019, Flink was introduced and is now widely used in scenarios such as information feed search and recommendation, real-time advertisement samples, and real-time ETL, gradually replacing the original Spark Streaming jobs. Thanks to the excellent features of the Flink framework, job correctness, real-time performance, and resource utilization efficiency have all improved significantly.
- In 2020, Flink SQL was adopted and is now widely used for building real-time data warehouses and developing real-time ETL jobs. Flink SQL-based real-time data warehouses reduced the data link latency from T+1 to seconds.
- In 2021, the data lake Iceberg was introduced. A stream-batch integrated real-time data warehouse solution was built on Flink and Iceberg and rolled out in several of Xiaomi's internal businesses, proving that stream-batch integration is feasible for empowering the business, improving job development efficiency, simplifying links, and saving resources.
The figure above shows Xiaomi's current real-time and offline frameworks, where multiple frameworks coexist. Business developers have to maintain at least two sets of code whether they write SQL jobs or Jar jobs; the internal computing engine team also has to spend double the manpower maintaining the different computing frameworks; and the platform layer has to make separate adaptations for each computing engine.
With stream-batch integration, only one computing framework needs to be maintained for both real-time and offline workloads, saving roughly half of the manpower for business developers, platform providers, and computing engine maintainers.
2. Stream-batch integrated platform construction
We have done a great deal of exploration and practice around stream-batch integration.
The platform construction for stream-batch integration covers four aspects: metadata management, permission management, job scheduling, and Flink ecosystem construction.
2.1 Metadata Management
We unified metadata management with Metacat, which uniformly connects the different downstream storage systems and upstream computing engines.
Based on Metacat, all internal systems are organized into a uniform three-level structure that corresponds to the three-level structure of Flink SQL.
- The first-level Catalog is mainly composed of the service name and the cluster name.
- The second-level Database is consistent with the database concept of most systems; systems without a database concept use default instead.
- The third-level Table is likewise consistent with the system's table concept, such as a message queue topic name or an Elasticsearch index name.
With unified metadata management in place, developing a job that ingests message queue data into the lake in real time only requires writing a single DML statement, as sketched below.
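As an illustration, here is a minimal sketch of such a DML statement, assuming hypothetical catalog, database, and table names that follow the three-level convention above (the message queue and Iceberg catalogs shown are examples, not Xiaomi's actual names):

```sql
-- Ingest a message queue topic into an Iceberg table with a single INSERT:
-- both tables are resolved through the unified Metacat-backed catalog, so no
-- CREATE TABLE ... WITH (...) statements are needed in the job itself.
INSERT INTO iceberg_cluster1.ods.user_action_log   -- hypothetical lake table
SELECT user_id, action, event_time
FROM mq_cluster1.default.user_action_topic;        -- hypothetical topic table
```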
2.2 Permission Management
With unified metadata management, all systems are abstracted as three-level tables when developing Flink SQL jobs, and the business can reference any system's table by its three-level name. On top of this, we built unified permission management based on Ranger and manage all resource permissions at the SQL layer.
Permission management is unified at the computing engine layer and covers both Flink SQL and Flink Jar jobs. When a Flink SQL job generates its physical execution plan, we can obtain the Source and Sink tables it references as well as the selected fields of each Source table, which allows us to implement field-level authorization. For Flink Jar users we provide a unified utility class that also connects to the Flink Catalog, so Jar jobs can be authorized as well.
As shown in the figure above, with metadata and permissions managed uniformly, business developers can easily reference tables from different systems, including Doris, Kudu, and Hive, when developing Flink SQL jobs. Jobs are authorized by the backend at submission time, and the job's data lineage can also be obtained easily when it is submitted.
2.3 Job Scheduling
Xiaomi has also made some attempts at job scheduling. The SQL shown on the left of the figure above runs as a batch job under offline scheduling and as a stream job under real-time scheduling. Under batch-stream hybrid scheduling, the batch job is started first, and the stream job is started once the batch job completes.
To the scheduler, a batch-stream hybrid job is simply a real-time job. Our main change is in the Flink SQL template job: it first starts a Flink SQL batch job and, after that job finishes, starts the Flink SQL real-time job, as sketched below.
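A minimal sketch of the two statements the template submits in order, with hypothetical table names; the first statement is executed in batch mode to backfill history, and the second is started in streaming mode once the first succeeds:

```sql
-- Step 1: batch backfill of historical data (submitted in batch execution mode)
INSERT OVERWRITE dw.dwd.orders_wide
SELECT order_id, user_id, amount, dt
FROM hive_cluster.ods.orders
WHERE dt <= '2021-12-01';

-- Step 2: once the batch job completes, the same logic is started as a
-- streaming job that consumes the real-time change stream
INSERT INTO dw.dwd.orders_wide
SELECT order_id, user_id, amount, dt
FROM mq_cluster.ods.orders_stream;
```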
2.4 Flink Ecosystem Construction
Flink's pluggable connector design makes it easy to extend with new connectors, and both the Flink community and other communities provide a wealth of connectors. Xiaomi has also implemented many types of connectors internally. Only with a well-developed ecosystem can Flink's cross-platform computing capability truly come into play.
For the Iceberg connector, the community has implemented batch read/write and streaming ingestion into the lake. Streaming consumption is equally important: without it, only the ODS layer of the data warehouse link can be updated in a streaming fashion, while the downstream layers can only be processed in batch, so true end-to-end real-time processing cannot be achieved. Supporting incremental (streaming) consumption of Iceberg is therefore an essential part of the real-time link; a sketch follows.
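As a sketch, the community Iceberg connector exposes streaming reads through Flink SQL table hints; the option names below follow the open-source connector's documentation, and the table names are hypothetical:

```sql
-- Incrementally consume an Iceberg ODS table and feed the DWD layer in streaming
-- mode; 'streaming'='true' switches the scan to incremental snapshot consumption.
INSERT INTO dw.dwd.user_action_clean
SELECT user_id, action, event_time
FROM dw.ods.user_action_log
  /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */
WHERE action IS NOT NULL;
```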
Hybrid Source and CDC Sink are also important for the stream-batch integrated ecosystem.
Hybrid Source has been implemented in the community. It combines two different sources, most commonly a bounded stream followed by an unbounded stream, which makes batch-stream mixing possible.
Xiaomi manages the tables of all systems at the platform layer, so when implementing Hybrid Source there is no need to fill in table schemas or complicated connector parameters; only the names of the tables to read need to be configured, in order, and Hybrid Source reads them one by one according to that order. Different systems can also be combined; the most common combination is MySQL plus a message queue, where the data in MySQL is consumed in full first and the data in the message queue is then consumed incrementally. A purely illustrative sketch follows.
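Below is a hypothetical sketch of how such a platform-level Hybrid Source table might be declared; the 'hybrid' connector name and its options are only illustrations of the behavior described above, not a community API:

```sql
-- Hypothetical Hybrid Source declaration: only the ordered list of tables is
-- configured; schemas and connection details come from the unified catalog.
CREATE TABLE hybrid_orders (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'hybrid',
  -- read the MySQL table in full first, then switch to the incremental
  -- Binlog data in the message queue
  'tables' = 'mysql_cluster.shop.orders,mq_cluster.shop.orders_binlog'
);
```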
CDC Sink is mainly used together with Hybrid Source. CDC Sink also connects to the internal Catalog and manages schema changes uniformly, so when data reaches the downstream connector there is no need to handle cumbersome schema-change logic; the data is simply written into the target system with its actual schema.
In both Hybrid Source and CDC Sink, the field types at the Flink framework layer include an opaque "barrier" field that can encapsulate data of any structure and also carry schema changes; the trade-off is that some field type mismatches are only exposed at runtime.
3. Application scenarios of stream-batch integration
Most companies have data import and export requirements. Based on Flink's rich ecosystem, we can easily implement data integration in different scenarios, including offline integration, real-time integration, and batch-stream hybrid data integration.
The first is offline data integration. We replaced the previous DataX jobs with Flink SQL batch jobs. With Flink's ecosystem we can easily satisfy the data import and export needs of different systems and gain a richer set of Sources and Sinks. Flink SQL also makes field mapping very convenient, and as a distributed framework it readily meets the need for concurrent data export. A sketch follows.
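A minimal sketch of such a batch integration job, assuming hypothetical Hive and Doris tables; the field mapping is expressed in the SELECT list and parallelism is raised to speed up the export:

```sql
-- Run the job in batch mode with a higher parallelism for the export
SET 'execution.runtime-mode' = 'batch';
SET 'parallelism.default' = '32';

-- Export one Hive partition into Doris, mapping and casting fields along the way
INSERT INTO doris_cluster.report.daily_orders
SELECT order_id,
       user_id,
       CAST(amount AS DECIMAL(10, 2)) AS order_amount
FROM hive_cluster.dw.orders
WHERE dt = '2021-12-01';
```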
The second is real-time data integration, which is mainly divided into two parts:
- The first part is real-time data collection, which at Xiaomi falls into two categories: log data and DB Binlog data. Here we focus on Binlog collection from DB systems. Initially we used Xiaomi's self-developed LCS Binlog service, which is similar to Canal, to collect Binlog data uniformly into the message queue.
- The second part is the data dump, where Spark Streaming tasks imported the data from the message queue into other systems such as Kudu or HDFS.
We now use Flink to rebuild both the Binlog collection and dump links: Flink CDC collects the Binlog data and writes it to the message queue, and the message queue data is then dumped by Flink into other systems such as Kudu, Doris, and Iceberg. A sketch of the collection side follows.
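A minimal sketch of the collection side, assuming the open-source mysql-cdc SQL connector; host, credentials, and table names are placeholders, and the message queue sink table is assumed to accept changelog records. Note that the database-level, all-tables collection described below would typically use the Flink CDC DataStream API instead; the SQL form shown here collects a single table:

```sql
-- Binlog collection for one table with the Flink CDC SQL connector
CREATE TABLE shop_orders_binlog_source (
  order_id BIGINT,
  user_id  BIGINT,
  status   STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = 'mysql.example.internal',
  'port'          = '3306',
  'username'      = 'flink_cdc',
  'password'      = '******',
  'database-name' = 'shop',
  'table-name'    = 'orders'
);

-- Write the change stream into the message queue; separate dump jobs then move
-- it into Kudu, Doris, or Iceberg as described above.
INSERT INTO mq_cluster.shop.orders_binlog
SELECT order_id, user_id, status FROM shop_orders_binlog_source;
```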
In practice, streams and batches often need to be mixed for different scenarios, such as sharded databases and tables, redoing part of a link, and adding new database tables. The Flink CDC task currently collects Binlog at the database level, because collecting at the table level would put great pressure on the MySQL service. After the data is collected into the message queue, different dump jobs are started according to the scenario.
Because Flink CDC collects data at the database level, when the data of a single table needs to be redone, the database-level data cannot simply be redone, as that would affect other tables. Therefore, for the full data of a single table we read directly from MySQL and then read the incremental data from the message queue; this is where Hybrid Source is used to read MySQL and the message queue in turn.
Another mixed batch-stream data integration scenario is combining a batch job with a stream job.
When supporting data collection and dumping for TiDB, we cannot use Hybrid Source, because TiDB's full data is often very large and needs high parallelism to speed up the dump, while the incremental data only needs low parallelism. We therefore use a Flink SQL batch job for the full data, which can flexibly adjust parallelism and is more efficient than a streaming job, and dump the incremental part with a smaller parallelism.
Another important business scenario is the construction of real-time data warehouses. Xiaomi has evolved from the traditional offline data warehouse, through the Lambda architecture, to today's data-lake-based real-time warehouse. These three approaches have different strengths and weaknesses and are applied to different business scenarios. Two typical cases are Xiaomi's mobile phone activation statistics and the real-time data warehouse of Xiaomi's sales service.
The figure above shows the Xiaomi mobile phone activation business process. First comes activation data collection: logs from different channels are gathered, summarized, and cleaned uniformly. Through data collection and dimension table joins, devices activated in advance can be detected, and based on some dimension data the logs can be classified into natural activations and normal active logs.
The overall architecture of the mobile phone activation data warehouse involves both real-time and offline links; here we mainly introduce the real-time link. The dimension tables used are mainly HBase and FileSystem: HBase stores the full set of historical unique IDs, while Hive stores a small amount of dimension data. The final result is written to Kudu in real time, so the business can query real-time activation data through an OLAP engine. The offline link remains indispensable, and the overall consistency rate between real-time and offline data reaches 99.94%.
The most critical point in the link above is using HBase to store the full set of historical IDs for deduplication, that is, the join against the HBase table holding the full historical data. We initially used a synchronous lookup join and hit a severe performance bottleneck; after switching to an asynchronous lookup join, overall throughput improved by dozens of times. A sketch follows.
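A minimal sketch of the deduplication step, assuming the open-source HBase SQL connector and hypothetical table and column names; 'lookup.async' = 'true' enables the asynchronous lookup described above:

```sql
-- Dimension table holding the full history of device IDs that have activated
CREATE TABLE device_id_history (
  rowkey STRING,
  cf ROW<first_active_time BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector'        = 'hbase-2.2',
  'table-name'       = 'dw:device_id_history',
  'zookeeper.quorum' = 'zk.example.internal:2181',
  'lookup.async'     = 'true'   -- asynchronous lookup join
);

-- activation_log is assumed to be an existing stream table with a proc_time
-- processing-time attribute; keep only events whose device ID is unseen so far.
SELECT a.device_id, a.event_time
FROM activation_log AS a
LEFT JOIN device_id_history FOR SYSTEM_TIME AS OF a.proc_time AS h
  ON a.device_id = h.rowkey
WHERE h.rowkey IS NULL;
```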
Xiaomi's sales service involves multiple modules, including orders, logistics, commodities, after-sales, and stores. We encountered many problems during construction, and ultimately proved that building a real-time data warehouse by transforming the offline links with Flink SQL is feasible.
The figure above shows the overall architecture of the sales service data warehouse. Its dimension tables mainly come from the message queue and FileSystem. In the sales service scenario, both the order table and the commodity category table are updated in real time, and whenever either stream updates, the joined result must be updated as well. Therefore most of the sales service warehouse uses dual-stream joins, and dual-stream joins bring state problems with them.
In Flink SQL we cannot precisely control the state expiration policy of an individual operator; we can only set a unified state expiration time, so state that has not been accessed for a period of time is cleaned up. This has limitations: in scenarios such as logistics and after-sales, the lifecycle of a single record in the real-time stream may exceed one month, yet we generally cannot set the job's state timeout to one month, because the amount of state would become too large, which lowers processing efficiency and makes it hard to backfill the link; once the job has a problem, the upstream data has to be redone, and most message queues do not retain data for more than 7 days. A sketch of the TTL setting follows.
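The only available knob in this situation is the job-wide state TTL, set through Flink's table configuration; the value below is illustrative, not the one used in production:

```sql
-- Applies to every stateful operator in the job, including all dual-stream joins:
-- state entries not accessed within the TTL are cleaned up.
SET 'table.exec.state.ttl' = '7 d';
```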
In some sales service scenarios we therefore introduced offline jobs: based on the result table, we identify records whose status has not yet ended and write the corresponding dimension data back into the real-time stream, so that this dimension data does not expire. When the main table's data reaches the join operator, the correct result can still be obtained, even if a record of the main table changes after more than one month.
The figure above shows the architecture of the near-real-time data warehouse of the Xiaomi APP. Logs are collected into the message queue uniformly through the log collection module. Since this is log data, only Iceberg v1 tables are needed. The intermediate DWD and DM layers use Flink SQL for real-time consumption and processing and write the results to Iceberg v2 tables.
For tables that need to keep full historical data and guarantee accuracy, but do not require high timeliness, we take an offline approach and process them offline based on the DWD layer: every day, the data of t-1 is used to correct the data of t-2. By continually correcting historical data, the accuracy of these tables can be largely guaranteed.
For old links, or systems whose upstream data is provided by other business parties and cannot be changed in the short term to produce CDC data, we use scheduled Spark MERGE INTO jobs to generate incremental data and write it into Iceberg v2 tables. In production, the data delay introduced by MERGE INTO is about 5~8 minutes, and the delay of the downstream links can be kept within minutes, so the end-to-end delay can basically be kept within 10 minutes, a big improvement over the previous T+1 latency. A sketch of such a correction job follows.
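A sketch of such a scheduled correction job in Spark SQL, with hypothetical table names; MERGE INTO upserts the latest upstream snapshot into the Iceberg v2 table:

```sql
-- Upsert the latest partition of the upstream snapshot table into the lake table:
-- matched rows are updated in place, new rows are inserted.
MERGE INTO lake.dwd.orders AS t
USING (
  SELECT * FROM upstream.ods.orders_snapshot WHERE dt = '2021-12-01'
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```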
For systems that can generate CDC data at the source, we write the real-time data into the message queue and then ingest it into the lake in real time; the overall architecture is shown in the figure above. The real-time part mainly combines Flink with the message queue to achieve second-level latency, and the message queue data is also sunk into Iceberg to serve queries. Meanwhile, a downstream offline link continuously applies MERGE INTO corrections to ensure the correctness of the results in Iceberg.
The whole link relies on the message queue to guarantee end-to-end timeliness and on the data lake to guarantee query timeliness, while the continuous offline MERGE INTO correction guarantees the accuracy of the final results.
Although the overall data warehouse architectures are broadly similar, the actual link complexity differs considerably across business scenarios and requirements, so when building a data warehouse with Flink, a reasonable solution should be chosen according to actual needs.
4. Future planning
At present, the Flink + Iceberg data lake solution has seen initial adoption at Xiaomi. There is still much room for improvement, and we will keep following up with the community and continue to promote the construction of stream-batch integration inside Xiaomi.
In the future, we will apply Flink SQL Batch to more complex scenarios. At present, Flink SQL Batch is used in limited scenarios, mainly batch export; we believe it will deliver greater value going forward.
Second, we will follow the community's built-in dynamic table [1], combining message queues and data lakes to balance timeliness and accuracy and improve the user experience.
We will also upgrade the Hybrid Source connector. The community's Hybrid Source is adapted to the new version of the Source interface, which is more flexible when connecting to other systems.