Abstract: This article is shared by Zhang Jian, head of real-time computing in the miHoYo Big Data Department, and covers the application and practice of Flink at miHoYo. It is divided into four parts:
- Background introduction
- Real-time platform construction
- Real-time data warehouse and data lake exploration
- Future Development and Prospects
1. Background introduction
miHoYo was established in 2011 and is committed to providing users with beautiful products and content that exceed expectations. The company has launched a number of high-quality, popular products, including "Honkai Academy 2", "Honkai Impact 3rd", "Tears of Themis", "Genshin Impact", the dynamic desktop software "N0va Desktop" and the community product "miHoYo Community", and has created animation, comics, music, novels, merchandise and other products around its original IPs. Headquartered in Shanghai, China, the company also has a presence in Singapore, the United States, Canada, Japan, South Korea and other countries and regions.
Flink has always played an important role in the development of miHoYo's big data. Since the real-time computing platform was established, Flink has served as its real-time computing engine through multiple stages of development, and the platform has been iterated and improved along the way. Inside miHoYo, the real-time computing platform is called Mlink; it is mainly based on Flink and is also compatible with Spark Streaming tasks. It has evolved from the initial Jar-package-based Flink tasks to Flink SQL-based tasks, continuously lowering the barrier to entry and improving task development efficiency, and from basic Flink task development at the beginning to multi-version task management across regions and cloud vendors, meeting the needs of business growth. Throughout this process, miHoYo has kept a close eye on the development of the community and maintained close contact with the community and with Alibaba Cloud engineers.
Mlink is primarily a computing platform built on Yarn resource management, supporting data warehouse, algorithm, recommendation, risk control, real-time dashboard and other business lines. It runs 1000+ tasks, of which SQL tasks account for about 80%; it uses more than 5,000 Yarn vcores and about 10 TB of memory; the peak throughput of a single task reaches 5 million QPS, and daily data throughput exceeds 100 billion records.
2. Real-time platform construction
2.1 Problems encountered
In the process of exploring and adopting Flink, we ran into pain points that most Flink users encounter, and we felt them ourselves during our exploration and practice. They can be roughly grouped into the following five aspects:
- First, the development cost of Jar tasks is high. For engineers who are not familiar with Flink code, the barrier to use is too high, and the maintenance cost of Jar tasks is also high: even small changes to the code logic involve repackaging, uploading and redeploying;
- Second, task management capabilities were lacking. Multi-tenancy, historical version rollback, management of development versus online versions, UDF management, and lineage management are all important parts of real-time platform management;
- Third, management of the Flink engine itself, which mainly involves managing multiple Flink versions, task parameter configuration, secondary development of commonly used connectors, and management of multiple resource environments;
- Fourth, task monitoring, alerting and management, and task problem diagnosis;
- Fifth, integration with the offline data warehouse, including Hive Catalog management and dependency management between real-time and offline scheduling.
The above five problems are fairly common, and each company meets its own task development and management needs either by building internally or through secondary development of open-source projects. For miHoYo, in addition to these five problems, there were also cross-region and cross-cloud-vendor problems to solve, mainly task deployment and submission efficiency across regions, and inconsistent resource environments across cloud vendors.
2.2 Solutions
The real-time platform construction mainly revolves around the above problems. The current real-time platform architecture is as follows:
Figure 1: Multi-cloud and multi-environment real-time platform architecture
The front end controls switching between cloud environments. The Backend Service is mainly responsible for user permission management, multi-version task management, lineage management, task operation and maintenance, task deployment and decommissioning, and task monitoring and alerting. The Executor Service is mainly responsible for task parsing, task submission and running, taking tasks offline, and interaction with the various resource managers. The Backend Service and the Executor Service communicate through the Thrift protocol, so the Executor Service can be implemented and extended in multiple languages. This architecture design mainly solves the cross-region and cross-cloud-vendor problem and decouples task management from task execution.
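As a rough illustration of that decoupling, the contract between the Backend Service and an Executor Service might look like the sketch below. This is purely hypothetical: in Mlink the boundary is a Thrift service, and the method names and `TaskSpec`/`TaskStatus` fields here are assumptions, not the real API.

```java
import java.util.Map;

/**
 * Hypothetical sketch of the Backend <-> Executor contract. In Mlink this
 * boundary is a Thrift service; names and fields here are assumptions made
 * only to illustrate how task management is decoupled from task execution.
 */
public interface ExecutorContract {

    record TaskSpec(String name, String flinkVersion, String sqlOrJarPath,
                    Map<String, String> config) {}

    record TaskStatus(String jobId, String state, long lastCheckpointTs) {}

    /** Parse a SQL/Jar task and return an error message (empty if parsing succeeded). */
    String parseTask(TaskSpec spec);

    /** Submit a parsed task to the resource manager (e.g. Yarn) and return a job id. */
    String submitTask(TaskSpec spec);

    /** Stop a running task, optionally taking a savepoint first. */
    void stopTask(String jobId, boolean withSavepoint);

    /** Query current task status for monitoring and alerting. */
    TaskStatus getTaskStatus(String jobId);
}
```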
Figure 2: Mlink platform development page
Figure 3: Mlink platform operation and maintenance page
Figure 4: Mlink platform synchronization task page
The Mlink real-time computing platform mainly includes modules such as overview, development, resource management, operation and maintenance, data exploration, synchronization tasks, user management, and Executor management. The development page consists of the tasks written by users and their parameter configuration, including historical version management and related content. Resource management mainly covers Jar package tasks and UDF management. Operation and maintenance mainly includes starting and stopping tasks, task runtime monitoring, task alert configuration, etc. The data exploration module mainly provides data preview functions; for example, a Kafka topic can be previewed by partition, time or offset. The synchronization task module makes it easy to manage synchronization tasks, such as one-click synchronization from CDC to Iceberg and its operational management. Executor management handles the operation and maintenance of Executors, including bringing Executors online and offline, health monitoring, etc.
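For the Kafka preview feature mentioned above, previewing a topic from a given time can be done with the standard consumer API. The sketch below is a minimal example, not Mlink's actual code; the bootstrap address, group id and topic name are assumptions. It resolves the offset for a timestamp in each partition, seeks there, and reads a small sample:

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;

public class KafkaPreview {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mlink-preview");           // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long fromTimestamp = System.currentTimeMillis() - 10 * 60 * 1000L; // preview the last 10 minutes

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("demo_topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Translate the timestamp into an offset per partition, then seek there.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, fromTimestamp));
            consumer.offsetsForTimes(query).forEach((tp, ot) -> {
                if (ot != null) consumer.seek(tp, ot.offset());      // offset for the timestamp
                else consumer.seekToEnd(Collections.singleton(tp));  // no data after the timestamp
            });

            // Pull a small sample for preview only.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(3));
            records.forEach(r -> System.out.printf("p=%d offset=%d ts=%d value=%s%n",
                    r.partition(), r.offset(), r.timestamp(), r.value()));
        }
    }
}
```

Previewing by partition or by explicit offset follows the same pattern, using `assign` plus `seek` directly.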
2.3 Challenges encountered
In the process of platform construction and iteration, we encountered many challenges and accumulated some good practices. Four of them are shared here.
The first is the development and maintenance of Executor Service.
The Executor mainly handles the parsing and submission of Jar and SQL tasks. To solve the problem of cross-region transmission efficiency, especially the transmission of large Jar packages, the initial plan was to let the backend parse the task and transmit the resulting job graph to the Executor, which would then submit it through the resource manager API. The problem is that the backend's parsing environment is not consistent with the runtime environment: some task parsing triggers real actions, especially for tasks involving Hive tables and Iceberg tables. In the end, parsing was moved from the backend to the Executor. During parsing, the Executor ran into metaspace OOM after running for a long time, mainly because the Executor keeps loading the classes required by tasks, which steadily increases metaspace usage. This was solved mainly by unloading the class loader after task parsing completes and by tuning GC settings.
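A minimal sketch of that class-loader handling, assuming a disposable loader per parse (class names and JVM flags are illustrative, not Mlink's actual code): each parse gets its own `URLClassLoader`, which is closed afterwards so its classes become eligible for unloading, and the Executor JVM runs with class-unloading-friendly GC options.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolatedParse {

    /**
     * Parse one task with a dedicated, disposable class loader.
     *
     * Illustrative JVM options for the Executor process:
     *   -XX:MaxMetaspaceSize=512m -XX:+UseG1GC -XX:+ClassUnloadingWithConcurrentMark
     */
    public static void parseJob(URL[] userJars) throws Exception {
        // A real implementation would typically use child-first loading of the
        // user jar; plain parent delegation is used here to keep the sketch short.
        try (URLClassLoader taskClassLoader =
                     new URLClassLoader(userJars, IsolatedParse.class.getClassLoader())) {

            ClassLoader previous = Thread.currentThread().getContextClassLoader();
            Thread.currentThread().setContextClassLoader(taskClassLoader);
            try {
                // ... load the user's entry class and build the job graph here ...
            } finally {
                Thread.currentThread().setContextClassLoader(previous);
            }
        }
        // Once the loader is closed and no references to its classes remain,
        // GC can unload them and reclaim the metaspace they occupied.
    }
}
```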
The second is monitoring.
Monitoring adopts an InfluxDB plus Grafana solution. As the number of tasks grew, the number of series stored in InfluxDB exceeded one million, which affected the stability of monitoring and made queries slow. Two things were done. First, InfluxDB was scaled out: the Executor side routes each task's metrics to different InfluxDB instances via consistent hashing, and the metrics reported by Flink tasks were also simplified to some extent. Second, on the monitoring side, Kafka consumption monitoring originally only supported lag measured in number of records; a custom metric for Kafka consumption delay measured in time was added, which collects the consumption timestamp of the slowest parallel subtask. This reflects the maximum Kafka consumption delay and guarantees that data before that point in time has already been consumed.
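A minimal sketch of consistent-hash routing for metric reporting, under the assumption that each task's metrics go to exactly one of several InfluxDB endpoints (endpoint names and the virtual-node count are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Routes each job's metrics to a fixed InfluxDB instance via a hash ring. */
public class InfluxRouter {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public InfluxRouter(List<String> influxEndpoints, int virtualNodes) {
        for (String endpoint : influxEndpoints) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(endpoint + "#" + i), endpoint); // virtual nodes even out the load
            }
        }
    }

    /** The same jobId always maps to the same InfluxDB endpoint. */
    public String endpointFor(String jobId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(jobId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL); // first 8 digest bytes as a long
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InfluxRouter router = new InfluxRouter(
                List.of("http://influx-a:8086", "http://influx-b:8086", "http://influx-c:8086"), 100);
        System.out.println(router.endpointFor("mlink-job-42"));
    }
}
```

With a ring like this, adding or removing an InfluxDB instance only remaps a small fraction of the tasks, rather than reshuffling every task's metrics.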
Figure 5: Grafana monitoring example
The third is the secondary development of connectors.
On the basis of CDC 1.0, we iterated to support dynamically added fields when collecting MySQL, starting consumption from a specified point in time, and reporting schema information such as the collected databases and tables and binlog positions. On the basis of CDC 2.0, we added flow control for full-table reads and a full-initialization mode that does not require MySQL binlog to be enabled. Since multiple CDC instances synchronizing the same database may put pressure on the upstream MySQL, Kafka is used as a relay, with the primary key of the database table used as the Kafka message key to preserve binlog order, so downstream data will not be out of order.
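A minimal sketch of that Kafka relay step, with a made-up topic and change record: keying each message by the table's primary key sends all changes of the same row to the same partition, so their binlog order is preserved for downstream consumers.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CdcToKafkaRelay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Keep ordering guarantees even when the producer retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Illustrative change record: key = "<db>.<table>.<primary key>", value = the change event JSON.
            String key = "shop_db.orders.10001";
            String value = "{\"op\":\"u\",\"id\":10001,\"amount\":99}";
            // Records with the same key always land in the same partition,
            // so the binlog order of each row is preserved.
            producer.send(new ProducerRecord<>("cdc_orders", key, value));
            producer.flush();
        }
    }
}
```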
For the data lake solution, the Iceberg work mainly focuses on support for Iceberg V2 tables, i.e. upsert tables. An Iceberg management center was built, which regularly optimizes and cleans tables according to a merge strategy. On the Flink write path, the main concern is preserving the order of CDC data written to Iceberg V2 tables. To reduce delete files, BloomFilter support was added to the Iceberg write path, which significantly reduces the size of delete files. The Iceberg management center also handles V2 table compaction and the commit conflicts between compaction and Flink commits.
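The intuition behind the BloomFilter optimization can be sketched as follows. This is a conceptual illustration using Guava's `BloomFilter`, not Iceberg's or Mlink's actual implementation, and whether a delete can really be skipped also depends on what earlier snapshots may contain: before emitting an equality delete for a primary key, check whether the key could have been written before; a key the filter has definitely never seen needs no delete, so far fewer delete records are produced.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

/**
 * Conceptual sketch: skip equality deletes for keys that were certainly never
 * written. A Bloom filter has no false negatives, so "definitely not seen"
 * makes the delete safe to drop; "maybe seen" still emits the delete as usual.
 */
public class UpsertDeleteFilter {

    private final BloomFilter<CharSequence> seenKeys =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                               10_000_000, 0.01); // expected insertions, false-positive rate

    /** Called for every INSERT / UPDATE_AFTER row. */
    public void recordWrite(String primaryKey) {
        seenKeys.put(primaryKey);
    }

    /** Called for every DELETE / UPDATE_BEFORE row; returns true if a delete must be emitted. */
    public boolean shouldEmitDelete(String primaryKey) {
        return seenKeys.mightContain(primaryKey);
    }
}
```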
For ClickHouse, the write code was refactored, write performance was optimized, and writing to both local tables and distributed tables is now supported.
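As an illustration of writing to local tables directly, here is a hedged sketch using plain JDBC with a made-up table and shard address; it is not the refactored connector itself. Each sink subtask connects to one shard's local table and writes in batches, avoiding the extra distribution hop of inserting through the distributed table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class ClickHouseLocalWriter {

    /** Batch-insert rows into the local table of one shard (illustrative schema). */
    public static void writeBatch(String shardJdbcUrl, List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(shardJdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO game_db.events_local (event_time, uid, event) VALUES (?, ?, ?)")) {
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.setObject(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per batch keeps write overhead low
        }
    }

    public static void main(String[] args) throws Exception {
        // Each sink subtask would typically be pinned to one shard's local table;
        // the ClickHouse JDBC driver must be on the classpath.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[]{"2022-05-01 00:00:00", 42L, "login"});
        writeBatch("jdbc:clickhouse://ch-shard-1:8123/game_db", rows); // assumption
    }
}
```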
The fourth is data entry into the lake and offline scheduling.
The real-time platform integrates Iceberg and supports Iceberg catalogs on Hadoop, Hive, OSS and S3. The CDC-to-Iceberg ingestion link has been launched in the department's production business. When data enters the lake or warehouse, if a downstream table is used by the offline data warehouse, there is a dependency-scheduling problem: when should the offline task be started? At present, we mainly ensure that the data has landed by combining the task's collection delay and checkpoint time. Take CDC or Kafka to Iceberg as an example: we collect the collection delay on the CDC side (for Kafka, the delay of the slowest parallel subtask), and also collect the task's checkpoint time. Because a completed checkpoint does not necessarily mean the Iceberg version has been updated, the Iceberg write path was also modified accordingly. For such a synchronization task, if there is no delay on the CDC collection side and the checkpoint has completed, we can guarantee that the data of a given hour has been written to the warehouse. The real-time platform provides a task delay query interface, and offline scheduling uses this interface as a dependency node, which ensures the completeness of ingested data before the offline task starts.
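A minimal sketch of the readiness check behind such a delay query interface; the field names and thresholds here are assumptions for illustration, not the platform's actual API. An hourly partition is considered ready once the slowest source subtask has caught up past the end of the hour and a checkpoint committed after that point has made the corresponding Iceberg snapshot visible.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical readiness check used as an offline-scheduling dependency. */
public class IngestionReadinessCheck {

    /** Metrics the real-time platform would expose per synchronization task. */
    public record TaskLag(Instant slowestSourceEventTime,    // position of the slowest CDC/Kafka subtask
                          Instant lastCommittedCheckpoint) {} // checkpoint whose data is visible in Iceberg

    /**
     * The data of [hourStart, hourStart + 1h) is guaranteed complete once the
     * source has read past the end of the hour and a checkpoint after that
     * point has been committed to Iceberg.
     */
    public static boolean isHourReady(Instant hourStart, TaskLag lag) {
        Instant hourEnd = hourStart.plus(Duration.ofHours(1));
        boolean sourceCaughtUp = !lag.slowestSourceEventTime().isBefore(hourEnd);
        boolean committedAfterHour = !lag.lastCommittedCheckpoint().isBefore(hourEnd);
        return sourceCaughtUp && committedAfterHour;
    }

    public static void main(String[] args) {
        TaskLag lag = new TaskLag(Instant.parse("2022-05-01T03:05:00Z"),
                                  Instant.parse("2022-05-01T03:06:00Z"));
        System.out.println(isHourReady(Instant.parse("2022-05-01T02:00:00Z"), lag)); // true
    }
}
```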
3. Real-time data warehouse and data lake exploration
For real-time data collection, there are currently three main pipelines:
- One is logs, which are mainly collected and written to Kafka through Filebeat, with Elasticsearch used to monitor Filebeat;
- The second is an API reporting service, whose backend writes to Kafka;
- The third is CDC, which collects full and incremental MySQL data and writes it to Kafka or directly to Iceberg. Canal was previously used for incremental collection, but everything has now been migrated to CDC.
The architecture of the real-time data warehouse is basically in line with the industry: ODS, DWD and DWS layers, with output to various application systems such as ClickHouse, Doris, MySQL, Redis, etc. At present, Kafka is mainly used as the intermediate storage, and using Iceberg as the intermediate layer is also being explored. Although Iceberg supports streaming reads, the ordering of data during streaming reads has never had a good solution, and we are still exploring it. There are two main directions:
- One is to use Kafka and Iceberg together as a hybrid source. When a Flink task reads the hybrid source, the read range and the switch-over point are determined by the Kafka offsets recorded in the Iceberg snapshot (a sketch of this idea follows after this list);
- The second is to adopt the dynamic table storage proposed in the community's FLIP-188. Flink's built-in table would consist of two parts, a LogStore and a FileStore. The LogStore satisfies the needs of a message system, while the FileStore is a columnar file store. At any point in time, the LogStore and the FileStore hold exactly the same data for the most recently written records (the LogStore has a TTL), just with different physical layouts.
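A hedged sketch of the hybrid-source idea from the first direction above; all class names and the snapshot-summary key convention are illustrative, and this is not Flink's HybridSource or any real connector. The ingestion job records, per Kafka partition, the offset covered by each Iceberg snapshot; the reader then consumes the bounded history from Iceberg up to that snapshot and continues unbounded from those offsets in Kafka.

```java
import java.util.Map;

/**
 * Simplified model of planning a Kafka + Iceberg hybrid read: batch-read Iceberg
 * up to a snapshot, then stream Kafka from the offsets that snapshot recorded.
 */
public class HybridReadPlan {

    public record Plan(long icebergSnapshotId, Map<Integer, Long> kafkaStartOffsets) {}

    /**
     * Assumed convention: the ingestion job writes keys like "kafka.offset.<partition>"
     * into the Iceberg snapshot summary on every commit.
     */
    public static Plan planFromSnapshotSummary(long snapshotId, Map<String, String> summary) {
        Map<Integer, Long> offsets = new java.util.HashMap<>();
        summary.forEach((k, v) -> {
            if (k.startsWith("kafka.offset.")) {
                offsets.put(Integer.parseInt(k.substring("kafka.offset.".length())),
                            Long.parseLong(v));
            }
        });
        return new Plan(snapshotId, offsets);
    }

    public static void main(String[] args) {
        Plan plan = planFromSnapshotSummary(123L,
                Map.of("kafka.offset.0", "105000", "kafka.offset.1", "98000"));
        // Batch phase: read Iceberg up to snapshot 123; stream phase: resume Kafka from these offsets.
        System.out.println(plan);
    }
}
```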
In terms of data lake exploration, the main work so far is the CDC-to-Iceberg ingestion task, which is already used in production. Four main problems were solved:
- First, the CDC collection problem, especially for multi-database and multi-table collection: data is collected centrally into Kafka, reducing the impact of multiple CDC tasks on the same database;
- Second, Iceberg V2 table writing is supported, including index filtering on the write path to reduce delete files, and handling of compaction and commit conflicts in the Iceberg management center;
- Third, data verification and data delay checking are supported for sharded databases and tables;
- Fourth, one-click task generation: users only need to fill in the database information and the target Iceberg database and table names, with optional relaying through Kafka to avoid multiple CDC instances collecting from the same database instance.
By solving the above four problems, database data can be ingested into the lake with minute-level latency, with data verification and delay-based dependencies for ingestion, which makes it easy to schedule and start downstream offline tasks.
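For illustration, the information a user fills in for such a one-click synchronization task might look like the following sketch; the field names are assumptions, not Mlink's actual task model.

```java
import java.util.List;

/** Hypothetical one-click CDC-to-Iceberg synchronization task definition. */
public record SyncTaskSpec(
        // source MySQL connection info
        String mysqlHost,
        int mysqlPort,
        String username,
        String password,
        List<String> sourceTables,      // e.g. "shop_db.orders"; sharded tables can be listed together
        // optional Kafka relay, so several sync tasks don't all attach CDC to the same instance
        boolean relayThroughKafka,
        String relayTopic,
        // target Iceberg table
        String icebergDatabase,
        String icebergTable) {

    public static void main(String[] args) {
        SyncTaskSpec spec = new SyncTaskSpec(
                "mysql-prod-1", 3306, "sync_user", "******",
                List.of("shop_db.orders", "shop_db.orders_1"),
                true, "cdc_shop_orders",
                "ods", "orders");
        System.out.println(spec);
    }
}
```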
Figure 6: Data into the lake link
4. Future Development and Prospects
There are four main points:
- First, we hope Flink dynamic table storage can land as soon as possible, realizing a truly real-time data warehouse with unified streams and tables;
- Second, dynamic scaling of Flink tasks, proactive resource adjustment based on task diagnosis, and fine-grained resource tuning;
- Third, optimization of Flink's batch reads and writes. At present, Flink's batch experience is not as good as Spark's; if this gap is closed in the future, one engine can serve both streaming and batch, and development costs will drop significantly;
- Fourth, better integration and adoption of Flink with data lakes.