This article is compiled from a talk by BIGO Staff Engineer Zou Yunhe at Flink Forward Asia 2021. The main contents include:
- Business background
- Landing practice & feature improvements
- Application scenarios
- Future plans
1. Business Background
BIGO is a short-video and live-streaming company focused on overseas markets. Its main business currently consists of three products: BigoLive (a global live-streaming service), Likee (a short-video creation and sharing platform), and IMO (a free communication tool), serving 400 million users worldwide. As the business grows, the demands on the data platform's processing capacity keep rising, and the problems the platform faces become increasingly prominent. The following introduces the BIGO big data platform and the problems it faces. The data flow of the BIGO big data platform is as follows:
User behavior logs from the app and web pages, together with Binlog data from relational databases, are synchronized into the BIGO big data platform's message queues and offline storage systems, then processed by real-time and offline data analysis to serve scenarios such as real-time recommendation, monitoring, and ad hoc query. However, the following problems exist:
- No unified OLAP entry point: Presto and Spark task entry points coexist; users do not know which engine suits their SQL query, choose blindly, and get a poor experience. Worse, some users submit the same query to both entry points at the same time to get results faster, which wastes resources;
- Offline tasks have high computation latency and produce results too slowly: a typical business such as ABTest often does not produce its results until the afternoon;
- Each business team develops applications independently for its own scenarios, building real-time tasks in a siloed, chimney-style fashion, with no data layering and no data lineage.
Faced with these problems, the BIGO big data platform built the OneSQL OLAP analysis platform and a real-time data warehouse:
- The OneSQL OLAP analysis platform unifies the OLAP query entry point, reduces users' blind engine selection, and improves the platform's resource utilization;
- Real-time data warehouse tasks are built with Flink, with data layering done through Kafka/Pulsar;
- Some slow offline computing tasks are migrated to Flink streaming jobs to speed up the output of results.
In addition, the real-time computing platform Bigoflow was built to manage these real-time jobs and to track the lineage of real-time tasks.
2. Landing Practice & Feature Improvements
2.1 OneSQL OLAP Analysis Platform Practice and Optimization
The OneSQL OLAP analysis platform is a unified OLAP query and analysis engine that integrates Flink, Spark, and Presto. OLAP queries submitted by users are forwarded by the OneSQL backend to the clients of the different execution engines, which then submit them to the corresponding clusters for execution. The overall architecture is as follows:
The platform is divided, from top to bottom, into an entry layer, a forwarding layer, an execution layer, and a resource management layer. To improve the user experience, reduce the probability of execution failure, and raise the resource utilization of each cluster, the OneSQL OLAP analysis platform implements the following functions:
- Unified query entry: at the entry layer, users submit queries through a single Hue query page, with Hive SQL syntax as the standard;
- Unified query syntax: the platform integrates query engines such as Flink, Spark, and Presto, each of which executes user SQL by adapting to Hive SQL syntax;
- Intelligent routing: when selecting the execution engine, the platform considers the query's historical execution record (whether it succeeded on each engine and how long it took), how busy each cluster is, and whether each engine is compatible with the SQL syntax, then submits the query to the most suitable engine (see the sketch after this list);
- Failure retry: the OneSQL backend monitors SQL task execution; if a task fails, it picks a different engine and resubmits it.
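As a rough illustration of how such routing and retry logic can fit together, here is a minimal Java sketch; the class names, scoring weights, and helper interfaces are hypothetical, not BIGO's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.EnumSet;
import java.util.stream.Collectors;

/** Hypothetical sketch of OneSQL-style engine routing with failure retry. */
public class OneSqlRouter {

    enum Engine { FLINK, SPARK, PRESTO }

    // Rank candidate engines: prefer engines where this SQL succeeded before,
    // ran quickly, and whose cluster is currently idle.
    Deque<Engine> rankEngines(String sql, QueryHistory history, ClusterStats stats) {
        return EnumSet.allOf(Engine.class).stream()
                .filter(e -> history.isSyntaxCompatible(sql, e))
                .sorted(Comparator.comparingDouble((Engine e) ->
                        history.successRate(sql, e)
                                - history.avgRuntimeSeconds(sql, e) / 3600.0
                                - 0.1 * stats.pendingJobs(e)).reversed())
                .collect(Collectors.toCollection(ArrayDeque::new));
    }

    // Submit to the best engine; if execution fails, fall back to the next one.
    QueryResult submit(String sql, QueryHistory history, ClusterStats stats) throws Exception {
        Deque<Engine> candidates = rankEngines(sql, history, stats);
        Exception lastFailure = null;
        while (!candidates.isEmpty()) {
            try {
                return execute(candidates.pollFirst(), sql); // engine client call
            } catch (Exception e) {
                lastFailure = e; // record the failure, retry on the next engine
            }
        }
        throw lastFailure != null ? lastFailure
                : new IllegalStateException("no engine is compatible with this query");
    }

    QueryResult execute(Engine engine, String sql) throws Exception {
        throw new UnsupportedOperationException("forward to the engine's client here");
    }

    interface QueryHistory {
        boolean isSyntaxCompatible(String sql, Engine e);
        double successRate(String sql, Engine e);
        double avgRuntimeSeconds(String sql, Engine e);
    }
    interface ClusterStats { int pendingJobs(Engine e); }
    interface QueryResult {}
}
```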
In this way, the OneSQL OLAP analysis platform gives the BIGO big data platform a unified OLAP analysis entry point, reduces users' blind engine choices, and makes full use of each cluster's resources, reducing idle capacity.
2.1.1 Construction of Flink OLAP Analysis System
On the OneSQL analysis platform, Flink also serves as one of the OLAP engines. The Flink OLAP system consists of two components: the Flink SQL Gateway and Flink Session clusters. The SQL Gateway is the entry point for SQL submission: queries go through the Gateway to a Flink Session cluster for execution, while the Gateway tracks the progress of each query and returns its results to the client. A SQL query is executed as follows:
First, the SQL Gateway checks whether the submitted SQL needs its result persisted to a Hive table; if so, a Hive table is created through the HiveCatalog interface to hold the query's results. The Gateway then parses the SQL, sets the job parallelism, generates the Pipeline, and submits it to the Session cluster for execution.
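The Gateway-side flow can be approximated with Flink's Table API roughly as follows; the catalog path, table names, and the persistence flag are placeholders, and the real Gateway logic is of course more involved.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

// Illustrative sketch of the Gateway-side submission flow.
public class GatewaySubmitSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Register the HiveCatalog so queries can read Hive tables and
        // results can be persisted back to Hive.
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/path/to/hive-conf"));
        tEnv.useCatalog("hive");

        // Set the job parallelism before the pipeline is generated.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.resource.default-parallelism", "8");

        String userSql = "SELECT uid, COUNT(*) FROM events GROUP BY uid";
        boolean persistToHive = true; // decided by inspecting the request

        if (persistToHive) {
            // Persist the result into a Hive table created for this query.
            tEnv.executeSql("INSERT INTO query_result_tbl " + userSql);
        } else {
            // Otherwise stream the rows back to the client.
            tEnv.executeSql(userSql).collect()
                    .forEachRemaining(row -> System.out.println(row));
        }
    }
}
```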
To keep the whole Flink OLAP system stable and execute SQL queries efficiently, the following enhancements were made:
Stability:
- Zookeeper-based HA ensures the reliability of the Flink Session clusters; the SQL Gateway watches the Zookeeper nodes to discover the live Session clusters;
- The volume of data a query may scan from a Hive table, the number of partitions it may touch, and the size of the returned result are all capped, to keep the Session cluster's JobManager and TaskManager from going OOM (a sketch of such a guard follows below);
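A guard of this kind could be as simple as the following pre-submission check; the concrete limits here are made up for illustration.

```java
/** Hypothetical pre-submission guard for queries on the shared Session cluster. */
public class ScanLimitGuard {
    private static final long MAX_SCAN_BYTES  = 500L * 1024 * 1024 * 1024; // 500 GB
    private static final int  MAX_PARTITIONS  = 1000;
    private static final long MAX_RESULT_ROWS = 1_000_000L;

    // Reject queries whose Hive scan or result set would overload the cluster;
    // such queries can be routed to Spark/Presto instead.
    static void check(long scanBytes, int partitions, long resultRows) {
        if (scanBytes > MAX_SCAN_BYTES
                || partitions > MAX_PARTITIONS
                || resultRows > MAX_RESULT_ROWS) {
            throw new IllegalArgumentException(
                    "query exceeds the Session cluster's scan limits");
        }
    }
}
```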
Performance:
- Flink Session clusters pre-allocate resources, which reduces the time spent requesting resources after a job is submitted;
- The Flink JobManager parses splits asynchronously, so tasks can start running on splits that are already parsed while the rest are still being processed, reducing the time task execution is blocked on split parsing;
- The number of scanned partitions and the maximum number of splits are capped at job submission time, reducing the time needed to set the task parallelism.
Hive SQL compatibility:
Flink's compatibility with Hive SQL syntax was improved; the current Hive SQL compatibility is about 80%.
Monitoring and alerting:
The memory and CPU usage of the Flink Session cluster's JobManager and TaskManager and of the SQL Gateway, along with task submission, are monitored; as soon as a problem appears, an alert is raised and handled promptly.
2.1.2 Achievements of OneSQL OLAP Analysis Platform
The OneSQL OLAP analysis platform described above has delivered the following benefits:
- The unified query entry reduced users' blind engine choices: the user execution error rate dropped by 85.7%, and the SQL execution success rate rose by 3%;
- SQL execution time was shortened by 10%, as each cluster's resources are fully used and task queueing time is reduced;
- With Flink serving as part of the OLAP engine pool, the resource utilization of the real-time computing cluster rose by 15%.
2.2 Real-time data warehouse construction and optimization
To speed up the output of some business indicators on the BIGO big data platform and to manage Flink real-time tasks better, the platform team built the real-time computing platform Bigoflow, migrated some slow offline computing tasks to Flink streaming jobs, and layered the data through the message queues Kafka/Pulsar, thereby building a real-time data warehouse. On Bigoflow, real-time data warehouse tasks are managed in a platform-based way: there is a unified entry point for real-time task access, and the platform manages real-time task metadata and builds the lineage of real-time tasks.
2.2.1 Construction plan
The BIGO big data platform builds its real-time data warehouse mainly on Flink + ClickHouse. The general scheme is as follows:
Following the layering approach of traditional data warehouses, the data is divided into four layers: ODS, DWD, DWS, and ADS:
- ODS layer: raw data such as user behavior logs and business logs, stored in message queues such as Kafka/Pulsar;
- DWD layer: Flink jobs aggregate the raw data by UserId into per-user behavior detail data, which is saved back to Kafka/Pulsar;
- DWS layer: the Kafka stream table of user behavior details is joined with Hive/MySQL dimension tables (a stream-dimension table JOIN), and the resulting multi-dimensional detail data is written to ClickHouse tables (a sketch of this join follows the list);
- ADS layer: the multi-dimensional detail data in ClickHouse is aggregated along different dimensions and then applied to different businesses.
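For the DWS layer, a stream-dimension join expressed in Flink SQL might look like the sketch below; the table and column names are invented for the example, and the source and sink tables are assumed to be registered already.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Illustrative DWS-layer job: join the Kafka detail stream with a dimension
// table and write the widened rows to a ClickHouse sink table.
public class DwsJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // user_actions: Kafka stream table with a processing-time attribute
        // proc_time; user_dim: Hive/MySQL dimension table; wide_table:
        // ClickHouse sink. All three are assumed to be registered.
        tEnv.executeSql(
                "INSERT INTO wide_table " +
                "SELECT a.uid, a.event_id, a.event_time, d.country, d.platform " +
                "FROM user_actions AS a " +
                "JOIN user_dim FOR SYSTEM_TIME AS OF a.proc_time AS d " +
                "ON a.uid = d.uid");
    }
}
```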
In the process of building a real-time data warehouse according to the above scheme, we encountered some problems:
- After offline tasks were converted to real-time computing tasks, the more complex computing logic (multi-stream JOIN, deduplication) made the job state too large, causing OOM (out-of-memory) exceptions or heavy back pressure on job operators;
- In the dimension table join, joining the detail stream table with large dimension tables loads too much dimension data into memory, causing OOM and preventing the job from running;
- When Flink writes the multi-dimensional detail data produced by the stream-dimension join to ClickHouse, exactly-once semantics cannot be guaranteed: once the job fails over, data is written repeatedly.
2.2.2 Problem Solving & Optimization
Optimizing job execution logic to reduce state
The logic of offline computing tasks is relatively complex, involving join and deduplication operations between multiple Hive tables. The general logic is as follows:
When the offline job was converted into a Flink streaming job, the original offline join across multiple Hive tables became a join across multiple Kafka topics. Because the joined topics carry heavy traffic and the join window is long (the longest window is one day), a huge amount of state accumulates on the join operator once the job has run for a while, approaching 1 TB. The job used the RocksDB state backend to hold this state, but still could not avoid RocksDB's memory usage exceeding its limit and the container being killed by YARN, or throughput collapsing because too much state was kept in RocksDB, causing severe back pressure on the job.
To address this, we apply UNION ALL to these topics, which share the same schema, to obtain one large data stream; within that stream, the event_id of each record tells us which event stream it came from, so we can aggregate per event stream and compute the metrics of each event stream.
In this way, replacing JOIN with UNION ALL avoids the large state that the JOIN computation would create.
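In Flink SQL the rewrite looks roughly like the following sketch; the topic, column, and event names are made up for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Illustrative UNION ALL rewrite: merge same-schema topics into one stream,
// then branch on event_id inside the aggregation instead of joining streams.
public class UnionAllSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // topic_a and topic_b are Kafka source tables with identical schemas.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW all_events AS " +
                "SELECT uid, event_id, postid, event_time FROM topic_a " +
                "UNION ALL " +
                "SELECT uid, event_id, postid, event_time FROM topic_b");

        // Each metric filters its own event stream by event_id, so no
        // multi-stream JOIN (and none of its state) is needed.
        tEnv.executeSql(
                "SELECT uid, " +
                "  COUNT(DISTINCT IF(event_id = 'click', postid, NULL)) AS click_cnt, " +
                "  COUNT(DISTINCT IF(event_id = 'play',  postid, NULL)) AS play_cnt " +
                "FROM all_events GROUP BY uid");
    }
}
```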
In addition, the computing tasks contain many count distinct calculations, similar to the following:
```sql
select
    count(distinct if(events['a'] = 1, postid, null)) as cnt1,
    count(distinct if(events['b'] = 1, postid, null)) as cnt2,
    ……
    count(distinct if(events['x'] = 1, postid, null)) as cntx
from table_a
group by uid
```
These count distinct calculations live in the same GROUP BY and all deduplicate on the same postid, so they can share one set of deduplication keys, and a single MapState can store the state of all of them, as follows:
Since the deduplication keys of these count distinct functions are identical, they share the key of the MapState, which optimizes storage. The MapState's value is a byte array in which every bit is 0 or 1, and the n-th bit corresponds to the n-th count distinct function on that key: 1 means the function counts this key, 0 means it does not. When computing the aggregate results, adding up the n-th bit across all keys gives the value of the n-th count distinct, which further saves state storage.
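A minimal sketch of this shared-key layout inside a KeyedProcessFunction is shown below; the event type, the fixed function count, and the emission logic are simplified assumptions.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by uid; each postid maps to a bitmap marking which of the
// N count distinct functions this postid should be counted in.
public class SharedDistinctState extends KeyedProcessFunction<Long, SharedDistinctState.Event, long[]> {
    private static final int N = 16;               // number of count distinct functions
    private transient MapState<Long, byte[]> seen; // postid -> bitmap (N bits)

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("distinct-bitmap", Long.class, byte[].class));
    }

    @Override
    public void processElement(Event e, Context ctx, Collector<long[]> out) throws Exception {
        byte[] bits = seen.get(e.postid);
        if (bits == null) bits = new byte[(N + 7) / 8];
        int n = e.functionIndex;                 // which count distinct this event feeds
        bits[n / 8] |= (byte) (1 << (n % 8));    // set the n-th bit
        seen.put(e.postid, bits);
        // At emission time, summing the n-th bit over all postids yields cnt_n.
    }

    public static class Event { public long postid; public int functionIndex; }
}
```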
With these optimizations, the ABTest offline tasks were successfully migrated to Flink streaming jobs, and job state was kept under 100 GB, letting the jobs run normally.
Stream-dimension table JOIN optimization
Generating the multi-dimensional detail wide table requires a stream-dimension table JOIN, using Flink's support for joining a stream against a Hive dimension table: the dimension data is loaded into a HashMap and probed by join key. However, for Hive dimension tables with hundreds of millions or billions of rows, the amount of data loaded into memory is far too large and easily causes OOM. To address this, we hash-partition the large Hive dimension table by join key, as shown in the figure below:
The large dimension table's rows are run through a hash function and distributed to the HashMaps of the Flink job's different parallel subtasks, so each HashMap holds only one shard of the table. As long as the job's parallelism is large enough, the table can be split into shards small enough to store; for extremely large dimension tables, RocksDB MapState can also be used to hold the shards.
When a record from the Kafka stream table is dispatched to a subtask for the join, its join key is put through the same hash function, so the record is routed to the subtask that holds the matching dimension shard, and the joined result is emitted.
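The routing idea can be sketched in the DataStream API as follows, assuming the dimension table is re-read as a bounded stream; the record types and the reload cycle of the dimension data are simplified away.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class ShardedDimJoinSketch {
    // Both sides use the same hash, so a stream record always lands on the
    // subtask that holds the dimension shard containing its key.
    static class KeyHashPartitioner implements Partitioner<Long> {
        @Override
        public int partition(Long key, int numPartitions) {
            return (int) (Math.abs(key.hashCode()) % numPartitions);
        }
    }

    static DataStream<String> join(DataStream<DimRow> dim, DataStream<Fact> facts) {
        return dim.partitionCustom(new KeyHashPartitioner(), d -> d.uid)
                .connect(facts.partitionCustom(new KeyHashPartitioner(), f -> f.uid))
                .process(new CoProcessFunction<DimRow, Fact, String>() {
                    // Each subtask holds only its shard of the dimension table.
                    private final Map<Long, DimRow> shard = new HashMap<>();

                    @Override
                    public void processElement1(DimRow d, Context ctx, Collector<String> out) {
                        shard.put(d.uid, d); // build the local dimension shard
                    }

                    @Override
                    public void processElement2(Fact f, Context ctx, Collector<String> out) {
                        DimRow d = shard.get(f.uid); // probe the local shard
                        if (d != null) out.collect(f.uid + "," + f.eventId + "," + d.country);
                    }
                });
    }

    public static class DimRow { public long uid; public String country; }
    public static class Fact { public long uid; public String eventId; }
}
```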
With this optimization, stream-dimension joins against several large Hive dimension tables now run successfully, the largest of which exceeds one billion rows.
Exactly-once semantics for the ClickHouse sink
When the multi-dimensional detail data produced by the stream-dimension join is written to ClickHouse, exactly-once semantics cannot be guaranteed because community ClickHouse does not support transactions; once the job fails over, data is written to ClickHouse repeatedly.
To address this, BIGO's ClickHouse implements a two-phase-commit transaction mechanism: when writing data, the write mode is first set to temporary, marking the data being written as tentative; when the write completes, ClickHouse returns an insert id; a subsequent commit operation with that insert id converts the temporary data into formal data.
Combining BIGO ClickHouse's two-phase-commit mechanism with Flink's checkpoint mechanism, a ClickHouse connector was implemented that guarantees exactly-once write semantics for the ClickHouse sink (a condensed code sketch follows the list):
- In the normal write path, the connector randomly selects a ClickHouse shard and performs the insert with single or double replicas according to the user's configuration, recording the insert id after each write; several such inserts may happen between two checkpoints, yielding several insert ids; when the checkpoint completes, these insert ids are committed in a batch, turning the temporary data into formal data and completing the writes between the two checkpoints;
- If the job fails over, the Flink job restarts from the last completed checkpoint; the operator state of the ClickHouse sink may still contain insert ids that were not committed, and committing them is retried. Data that was written to ClickHouse but whose insert id was never recorded in the operator state remains temporary: it is invisible to queries and is cleaned up after a while by ClickHouse's expiry mechanism, so after the state rolls back to the last checkpoint no data is duplicated.
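A condensed sketch of this checkpoint interplay is shown below; the writeTemporary/commit calls are stand-ins for BIGO ClickHouse's internal API, and for brevity the sketch commits everything pending rather than tracking insert ids per checkpoint as a real connector would.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Sketch of a two-phase-commit ClickHouse sink.
public class ClickHouseTwoPhaseSink extends RichSinkFunction<ClickHouseTwoPhaseSink.Row>
        implements CheckpointedFunction, CheckpointListener {

    private final List<Long> pendingInsertIds = new ArrayList<>();
    private transient ListState<Long> checkpointedIds;

    @Override
    public void invoke(Row row, Context context) {
        // Phase 1: write as temporary data; remember the returned insert id.
        pendingInsertIds.add(ClickHouseClient.writeTemporary(row));
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Persist the uncommitted insert ids into operator state.
        checkpointedIds.update(new ArrayList<>(pendingInsertIds));
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Phase 2: the checkpoint completed, so commit the batch (temporary -> formal).
        ClickHouseClient.commit(new ArrayList<>(pendingInsertIds));
        pendingInsertIds.clear();
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedIds = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("insert-ids", Long.class));
        if (ctx.isRestored()) {
            // Failover: retry committing the ids restored from the last checkpoint.
            List<Long> restored = new ArrayList<>();
            checkpointedIds.get().forEach(restored::add);
            ClickHouseClient.commit(restored);
        }
    }

    static class ClickHouseClient {
        static long writeTemporary(Row row) { /* temporary insert, returns insert id */ return 0L; }
        static void commit(List<Long> insertIds) { /* convert temporary data to formal */ }
    }
    public static class Row {}
}
```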
With this mechanism, data flows from Kafka through Flink into ClickHouse with end-to-end exactly-once semantics across the whole link; data is neither duplicated nor lost.
2.2.3 Platform Construction
To better manage the real-time computing tasks of the BIGO big data platform, the company built the BIGO real-time computing platform Bigoflow, which gives users a unified entry point for Flink real-time tasks. The platform offers:
- Support for Flink JAR, SQL, Python, and other job types, and for multiple Flink versions, covering most of the company's internal real-time computing businesses;
- One-stop management: job development, submission, operation, history, monitoring, and alerting are integrated, making it easy to check job status and spot problems at any time;
- Lineage: it is easy to query each job's data sources, data destinations, and the full context of its computation.
3. Application scenarios
3.1 Application Scenario of the OneSQL OLAP Analysis Platform
Within the company, the OneSQL OLAP analysis platform is applied to ad hoc queries, as follows:
SQL submitted by a user through the Hue page is forwarded by the OneSQL backend to the Flink SQL Gateway and submitted to a Flink Session cluster, which executes the query. The Flink SQL Gateway obtains the query's execution progress, reports it back to the Hue page, and returns the query result.
3.2 Real-time data warehouse application scenarios
At present, the main application scenario of the real-time data warehouse is the ABTest business, as follows:
The user's raw behavior log data is aggregated by a Flink job into user detail data, which is joined with dimension table data in a stream-dimension table JOIN and written to ClickHouse to form a multi-dimensional detail wide table; this is then aggregated along different dimensions and applied to different businesses. Through this transformation, the ABTest business produces its result indicators 8 hours earlier than before, while the resources it uses have been reduced by more than half.
4. Future planning
To build the OneSQL OLAP analysis platform and the BIGO real-time data warehouse further, the real-time computing platform's roadmap is as follows:
- Improve the Flink OLAP analysis platform: broaden Hive SQL syntax support and solve JOIN data skew during computation;
- Improve the real-time data warehouse: introduce data lake technology to support small-scope reruns and traceability of task data in the real-time warehouse;
- Build a unified stream-batch computing platform based on Flink.