This article describes SmartNews's practice of using Flink to accelerate the production of Hive daily tables and to integrate Flink seamlessly into a batch processing system based on Airflow and Hive. It details the technical challenges encountered along the way and how they were solved, to share with the community. The main content is:
- Project background
- Definition of the problem
- Project goals
- Technology selection
- Technical challenges
- Overall solution and responses to the challenges
- Project achievements and outlook
- Postscript
1. Project background
SmartNews is an Internet company driven by machine learning. Founded in Tokyo, Japan, in 2012, it also has offices in the United States and China. After more than 8 years of development, SmartNews has grown into the top-ranked news app in Japan and the fastest-growing news app in the United States, covering more than 150 countries worldwide. According to statistics from early 2019, cumulative global downloads of the iOS and Android versions of SmartNews exceeded 50 million.
Over the past 9 years, SmartNews has built a large number of datasets on a stack of Airflow, Hive, and EMR. As data volumes grow, the processing time of these offline tables keeps getting longer. In addition, as the business side iterates faster, it places higher demands on the freshness of these tables. SmartNews therefore launched the Speedy Batch project internally to improve the production efficiency of existing offline tables.
This article presents one case from the Speedy Batch project: the practice of accelerating the user actions table.
User behavior logs reported from the app are turned into a daily table by a Hive job. This table is the source of many other tables and is therefore very important. The job takes about 3 hours to run, which in turn delays many downstream tables and clearly hurts the experience of users such as data scientists and product managers. We therefore needed to speed up this job so that each table becomes available earlier.
The company's business runs almost entirely on the public cloud. Raw server logs are uploaded to cloud storage as files, partitioned by day; the current job is scheduled by Airflow to run on EMR, producing the Hive daily table, and the data is stored back in cloud storage.
2. Definition of the problem
1. Input
The news server uploads a raw log file every 30 seconds into the cloud storage directory for the corresponding date and hour.
2. Output
After the raw logs are processed by ETL, the output is partitioned at two levels: day (dt) and action. There are about 300 action types, and the set is not fixed; actions are frequently added and removed.
3. Users
The table is used widely and through many channels: queries come from Hive as well as from Presto, Jupyter, and Spark, and we are not even sure these are all the access paths.
3. Project goals
- Reduce the latency of the actions table from 3 hours to 30 minutes;
- Be transparent to downstream users. Transparency has two aspects:
  - Functionality: users do not need to modify any code; the change is completely invisible to them
  - Performance: the table produced by the new pipeline must not degrade downstream read performance
4. Technology selection
Before this project, colleagues had already made several rounds of improvements to the job, but the results were limited.
The approaches tried included adding resources and more machines, but they ran into the IOPS limit of cloud storage: each prefix supports at most 3000 concurrent reads and writes. The problem is especially visible in the output stage, where multiple reducers write to the same action subdirectory at the same time and easily hit the limit. We also tried hourly preprocessing, merging the hourly results into the daily table in the early morning of each day, but the merge itself was time-consuming and the overall latency remained around 2.5 hours, not a big enough improvement.
Since the server-side logs are uploaded to cloud storage in near real time, the team proposed a streaming approach: instead of the batch job waiting for a full day of data and then processing for 3 hours, the computation is spread throughout the day, which shrinks the processing time needed after the day ends. The team has a solid Flink background, and Flink has recently made many improvements around Hive, so we decided on a Flink-based solution.
5. Technical challenges
The challenges are manifold.
1. Output RC file format
The current Hive table's file format is RCFile. To remain transparent to users, we can only upgrade the existing Hive table in place, that is, we have to reuse the current table, so the files output by Flink must also be in RCFile format, because a Hive table can only have one format.
RCFile is a bulk format (as opposed to a row format) and must be written out in one shot at each checkpoint. If we checkpoint every 5 minutes, then each action has to emit a file every 5 minutes, which greatly increases the number of result files and hurts downstream read performance; for low-frequency actions in particular, the file count grows by hundreds of times. We are aware of Flink's file-merging feature, but it merges the data of multiple sinks within one checkpoint, which does not solve our problem. What we need is file merging across checkpoints.
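For context, this is how the constraint surfaces in the sink API: Flink's StreamingFileSink accepts only a checkpoint-based rolling policy for bulk formats, so whatever BulkWriter we plugged in would emit at least one file per open partition per checkpoint. A minimal sketch, assuming Flink 1.12-era APIs; the RCFile writer factory below is hypothetical (Flink ships no such factory, which is part of the problem):

```java
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class BulkFormatConstraint {

    // Placeholder record type used only for this sketch.
    public static class ActionEvent {
        public String dt;
        public String action;
        public String payload;
    }

    // Hypothetical factory: Flink provides no built-in BulkWriter for RCFile.
    static BulkWriter.Factory<ActionEvent> rcFileWriterFactory() {
        throw new UnsupportedOperationException("illustrative placeholder only");
    }

    static StreamingFileSink<ActionEvent> buildSink() {
        return StreamingFileSink
                .forBulkFormat(new Path("s3://bucket/actions/"), rcFileWriterFactory())
                // forBulkFormat() only accepts CheckpointRollingPolicy subclasses:
                // part files are rolled on every checkpoint regardless of size,
                // hence at least one small file per action every 5 minutes.
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();
    }
}
```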
The team considered outputting in a row format (e.g. CSV) and then implementing a custom Hive SerDe that is compatible with both RCFile and CSV. But we quickly gave up on this idea, because we would then have to implement this hybrid SerDe for every query engine: once for Presto, once for Spark, and so on.
- On the one hand, we could not devote that many resources;
- On the other hand, such a solution would not be invisible to users; after all, they would still have to install the custom SerDe.
We had also proposed generating a new table in a new format, but this was rejected for not being transparent enough to users.
2. Partition readiness and completeness
How do downstream jobs know that a day's partitions are ready? The actions table has two partition levels, dt and action; action is a Hive dynamic partition, with a large and changing number of values. Today, downstream Airflow jobs wait for the insert_actions Hive task to finish before they start. That works, because when insert_actions ends, all action partitions are already in place. A Flink job, however, has no end signal; it can only commit partitions to Hive one by one, such as dt=2021-05-29/action=refresh. Because there are so many actions, committing all the partitions can take several minutes. So we also cannot simply let Airflow jobs watch for the dt-level partition, because that could trigger downstream jobs while only some of the actions are present.
3. Streaming reads of cloud storage files
The input of this project is a continuous stream of files uploaded to cloud storage, not messages from an MQ (message queue). Flink supports FileStreamingSource, which can read files in streaming mode, but it discovers new files by periodically listing the directory. That does not fit our scenario: our directories are so large that the list operation against cloud storage cannot finish at all.
4. Exactly Once Guarantee
Given the importance of the actions table, users cannot accept any data loss or duplication, so the entire solution must provide exactly-once processing.
6. Overall solution and responses to the challenges
1. Output RCFile and avoid small files
The solution we finally chose is a two-stage pipeline: the first Flink job outputs JSON (a row format), and a second Flink job converts the JSON into RCFile. This works around the fact that Flink cannot comfortably output RC files of an appropriate size.
Producing JSON as the intermediate result lets us control output file size through the rolling policy: a part file can grow across multiple checkpoints until it is large enough or has been open long enough, and only then is it finalized in cloud storage. Under the hood, Flink uses the Multipart Upload (MPU) feature of cloud storage: at every checkpoint, Flink uploads the data accumulated during that checkpoint, but as a part rather than as a complete file. When the accumulated parts satisfy the size or time requirement, a single cloud storage API call merges them into one file. That merge happens entirely on the cloud storage side; the application never has to read the parts back, merge them locally, and re-upload. A bulk format, by contrast, requires one-shot global processing, so it cannot be uploaded part by part and merged later; it must be uploaded in a single step.
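A minimal sketch of how the first job's sink might be configured (assuming Flink 1.12-era StreamingFileSink APIs; the path, sizes, and intervals are illustrative, not the production values): with a row format, the rolling policy can be size- and time-based, so a part file stays open across multiple checkpoints, and each checkpoint only uploads another MPU part.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class JsonStageSink {

    // Each record is one JSON line; a row format leaves the rolling policy free.
    static StreamingFileSink<String> buildJsonSink() {
        return StreamingFileSink
                .forRowFormat(new Path("s3://bucket/actions-json/"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                // close a part file once it reaches ~128 MB ...
                                .withMaxPartSize(128L * 1024 * 1024)
                                // ... or once it has been open for 20 minutes ...
                                .withRolloverInterval(TimeUnit.MINUTES.toMillis(20))
                                // ... or has received no data for 5 minutes.
                                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                                .build())
                .build();
    }
}
```

Conceptually, each checkpoint persists the data written so far as another multipart-upload part; only when the rolling policy fires is the multipart upload completed and the object becomes visible as a single file.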
When the second job detects that a new JSON file has been uploaded, it loads it, converts it to RCFile, and uploads the result to the final path. The extra latency this introduces is small and can be kept within about 10 seconds per file, which is acceptable.
2. Sensing new input files gracefully
On the input side, we do not use Flink's FileStreamingSource. Instead, we rely on cloud storage event notifications to detect the creation of new files, and actively load each file once its notification arrives.
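The article does not name the concrete notification service; as one hedged illustration, assuming an SQS-style queue that receives the bucket's object-created events, a simplified custom source could look like the sketch below. In production, message deletion would be tied to checkpoint completion, and duplicate deliveries are handled downstream (see the exactly-once section).

```java
import java.util.List;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Illustrative only: consume storage event notifications from a queue and emit
// the object keys of newly uploaded log files; a downstream operator loads
// each file. Delivery is at-least-once, so keys may repeat.
public class StorageEventSource extends RichSourceFunction<String> {

    private final String queueUrl;              // queue bound to the bucket's notifications
    private volatile boolean running = true;
    private transient AmazonSQS sqs;

    public StorageEventSource(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    @Override
    public void open(Configuration parameters) {
        sqs = AmazonSQSClientBuilder.defaultClient();
    }

    @Override
    public void run(SourceContext<String> ctx) {
        while (running) {
            List<Message> messages = sqs.receiveMessage(
                    new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(10)).getMessages();
            for (Message m : messages) {
                // Placeholder: a real implementation parses the provider-specific
                // notification JSON to extract the uploaded object's key.
                String objectKey = m.getBody();
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(objectKey);
                }
                // Simplified: production code would delete the message only after
                // the notification has been covered by a completed checkpoint.
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```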
3. Partition readiness and completeness
On the output side, we emit a success file at the dt level so that downstream jobs can reliably tell when the daily table is ready. We implement a custom StreamingFileWriter that emits partitionCreated and partitionInactive signals, and a custom PartitionCommitter that decides, based on those signals, when the daily table is complete.
The mechanism is as follows: when a cloud storage writer starts writing a given action, it sends a partitionCreated signal, and when it finishes, it sends a partitionInactive signal. The PartitionCommitter checks whether all partitions of a given day have become inactive; if so, the day's data is complete and the dt-level success file is written. Airflow determines whether Flink has finished the daily table by sensing this file.
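The committer logic can be pictured roughly as follows (an illustrative sketch with names of our own choosing; the real implementation lives in the custom StreamingFileWriter/PartitionCommitter operators and keeps this bookkeeping in Flink state so it survives failover):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Tracks, per dt, which action partitions have been opened and which have gone
// inactive; once every opened action is inactive, the dt-level success marker
// is written for Airflow to sense.
public class DtCompletionTracker {

    private final Map<String, Set<String>> createdActions = new HashMap<>();
    private final Map<String, Set<String>> inactiveActions = new HashMap<>();

    public void onPartitionCreated(String dt, String action) {
        createdActions.computeIfAbsent(dt, k -> new HashSet<>()).add(action);
    }

    public void onPartitionInactive(String dt, String action) {
        inactiveActions.computeIfAbsent(dt, k -> new HashSet<>()).add(action);
        maybeCommit(dt);
    }

    private void maybeCommit(String dt) {
        Set<String> created = createdActions.getOrDefault(dt, new HashSet<>());
        Set<String> inactive = inactiveActions.getOrDefault(dt, new HashSet<>());
        // Every action partition opened for this dt has gone inactive.
        if (!created.isEmpty() && inactive.containsAll(created)) {
            writeSuccessMarker(dt);
        }
    }

    private void writeSuccessMarker(String dt) {
        // Placeholder: write e.g. .../dt=<dt>/_SUCCESS to cloud storage so a
        // downstream Airflow sensor can wait on a single dt-level object.
        System.out.println("dt=" + dt + " is complete, writing success marker");
    }
}
```

On the Airflow side, the downstream DAG then only has to wait for that single marker object instead of enumerating hundreds of action partitions.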
4. Exactly Once
Cloud storage event notifications provide an at-least-once guarantee, so the Flink job deduplicates at the file level. The jobs run with the exactly-once checkpoint setting. The MPU-based file output to cloud storage is effectively able to truncate on recovery, which makes the output idempotent, so the pipeline as a whole is effectively exactly-once end to end.
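The file-level dedup can be as simple as keyed state holding a seen flag per object key; a minimal sketch, assuming the stream of object keys produced by the notification source above:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Drops duplicate file notifications: the stream is keyed by object key, and a
// flag in keyed state remembers whether the file has already been forwarded.
public class FileDedup extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen-file", Boolean.class));
    }

    @Override
    public void processElement(String objectKey, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(objectKey);   // first sighting: pass the file on for loading
        }
        // otherwise: duplicate notification, drop it silently
    }
}
```

Wired up as `keys.keyBy(k -> k).process(new FileDedup())`, only the first notification for each file reaches the loader.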
7. Project achievements and outlook
The project is now in production, and the latency holds at around 34 minutes, including 15 minutes of waiting for late-arriving files.
- The first Flink job needs about 8 minutes to complete its checkpoint and output, and the JSON-to-RC job needs another 12 minutes to finish all processing. We could compress this further, but weighing timeliness against cost, we settled on the current configuration.
- The JSON-to-RC job takes longer than originally expected, because the upstream job's last checkpoint outputs a large number of files, which stretches the total time. This can be reduced roughly linearly by increasing the job's parallelism.
- The number of output files has increased by about 50% compared with the batch job. This is an inherent disadvantage of streaming versus batch: a streaming job has to close a file when its time is up, even if the file has not yet reached the desired size. Fortunately, an increase of this magnitude does not noticeably affect downstream performance.
- The change was completely transparent to downstream users; no abnormal feedback was received before or after the launch.
This project let us verify, in a production environment, that the streaming framework Flink can be slotted seamlessly into a batch processing system and deliver localized improvements that users never notice. Going forward, we will use the same technique to accelerate the production of more Hive tables and to produce Hive tables at finer granularity, for example hourly. We will also explore using a data lake to manage unified batch and streaming data and gradually converge the technology stack.
8. Postscript
Because we used a completely different computing framework and had to stay fully consistent with the batch system, the team hit many pitfalls; space does not allow listing them all. Instead, we leave a few representative questions for readers to think about:
- To verify that the new job's output matches the original Hive output, we need to compare the two. How can we efficiently check the consistency of two Hive tables, given tens of billions of rows per day, each with hundreds of fields, including complex types (array, map, array&lt;map&gt;, etc.)?
- Must both Flink jobs use the exactly-once checkpoint mode? Which one can do without it, and which one must have it?
- The StreamingFileWriter only collects partitionCreated and partitionInactive signals at checkpoint time, so can we emit them to the downstream operator from its snapshotState() method (the downstream will store them in state)?
- And one last question: do you have a better solution we could learn from?