Abstract: This article is compiled from a talk by Zhang Jing and Zhang Mang, technical experts on the Kuaishou real-time computing team, at Flink Forward Asia 2021. The main contents include:
- Flink SQL at Kuaishou
- Function extensions
- Performance optimization
- Stability improvements
- Future outlook
1. Flink SQL at Kuaishou
After more than a year of promotion, internal recognition of Flink SQL at Kuaishou has grown steadily. SQL jobs account for 60% of the Flink jobs added this year, double last year's share, with a peak throughput of 6 billion records per second.
2. Function extensions
To support internal business needs, Kuaishou has built many functional extensions. This article focuses on two window-related ones: the Group Window Aggregate extension, and the Window Table-valued Function extension proposed in FLIP-145.
First, the difference and connection between the two:
- Group Window Aggregate is the window aggregation mechanism in Flink 1.12 and earlier. It has two limitations. First, its syntax does not conform to the SQL standard: it relies on special window functions, together with window auxiliary functions, to complete the aggregation. Second, it restricts the window function to the GROUP BY clause, so it can only be used for aggregation.
- Flink therefore proposed Window TVF in FLIP-145, based on the polymorphic table function syntax introduced into the SQL standard in 2017. Besides aggregation over windows, it also supports window joins, TopN, deduplication, and so on. A minimal side-by-side sketch of the two syntaxes follows.
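To make the contrast concrete, here is a minimal sketch of both syntaxes, using a hypothetical orders table with an order_time event-time attribute (table and column names are illustrative, not from the talk):

```sql
-- Legacy Group Window Aggregate (Flink 1.12 and earlier): the TUMBLE group
-- window function may only appear in GROUP BY, and auxiliary functions such
-- as TUMBLE_END are needed to extract window attributes.
SELECT
  item,
  TUMBLE_END(order_time, INTERVAL '10' MINUTE) AS window_end,
  SUM(price) AS total
FROM orders
GROUP BY item, TUMBLE(order_time, INTERVAL '10' MINUTE);

-- Window TVF (FLIP-145, Flink 1.13+): the window is a table-valued function
-- in the FROM clause, so its output can also feed joins, TopN, or dedup.
SELECT item, window_start, window_end, SUM(price) AS total
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
GROUP BY item, window_start, window_end;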
2.1 Group Window Aggregate extension
You may ask: since Window TVF already exists, why extend Group Window Aggregate at all? Because Kuaishou only began upgrading from version 1.10 to 1.13 in the second half of this year, most businesses were still on version 1.10.
Kuaishou made two extensions to Group Window Aggregate: support for multi-dimensional aggregation, and the introduction of advanced window functions.
2.1.1 Supporting multi-dimensional analysis
Flink SQL has long supported multi-dimensional aggregation on unbounded streams. Kuaishou added the same multi-dimensional analysis capability to Group Window Aggregate, supporting the standard GROUPING SETS, ROLLUP, and CUBE clauses, across all window types such as tumbling, sliding, and session windows.
For example, in the case above we need to compute cumulative UVs for both the topic dimension and the total dimension. The GROUP BY clause of the SQL contains two parts: a CUMULATE window function and a GROUPING SETS clause whose parentheses contain two elements, one for the total dimension and one for the topic dimension. A sketch of this query is shown below.
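A minimal sketch of such a query. Note that CUMULATE as a group window function is a Kuaishou-internal extension to Flink 1.10, so the exact syntax here is an assumption modeled on the built-in TUMBLE group window; the table and column names (events, event_time, topic, device_id) are likewise illustrative:

```sql
-- Cumulative UV per topic plus the grand total, one window family per day.
SELECT
  topic,
  COUNT(DISTINCT device_id) AS uv
FROM events
GROUP BY
  CUMULATE(event_time, INTERVAL '1' MINUTE, INTERVAL '1' DAY),  -- assumed internal syntax
  GROUPING SETS ((topic), ());  -- topic dimension and total dimension
```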
2.1.2 Introducing advanced window functions
Data analysts often need to draw a curve in which each point represents a cumulative metric from 00:00 of the day up to that time point: the horizontal axis is time, the vertical axis is the cumulative metric. There are two conventional solutions to this requirement:
- The first solution is unbounded stream aggregation: normalize the time to minute granularity and make it part of the group key. But the business requires that the curve, once output to the screen, does not change, while unbounded stream aggregation produces an update stream, so it does not meet the requirement.
- The second solution is a one-day tumbling window. To emit results early, a trigger must fire ahead of the window end, based on either the current machine time or the largest timestamp seen in the input so far. This scheme has several drawbacks. The ordinate of each point is not the true cumulative value at that time point, which leads to anomalies: when a job fails over or the business deliberately backfills history, the historical curve cannot be fully restored. Moreover, the per-point cumulative values of the sub-dimensions do not add up to the value of the total dimension. Finally, when computing UV we often use two-stage aggregation to avoid skew on distinct keys, but with this scheme the otherwise monotonically increasing curve can show dips.
The figure above shows some of the abnormal curves produced by the second scheme:
- The first curve shows a historical backfill: the curve only becomes normal after the lag is eliminated; while lag remains, the curve is not smooth, and the historical portion cannot be restored.
- The second curve shows a dip in a curve that should be monotonically increasing.
Because the output of the first-stage aggregation is an update stream, and Flink's current update mechanism sends retraction and update as two separate messages rather than one atomic message, the second-stage aggregation may first receive the retraction messages sent by multiple concurrent upstream tasks. The accumulated value then drops before rising again, forming a dip.
We introduce the CUMULATE window to solve these problems.
This window function coincides with the CUMULATE window proposed in FLIP-145, but here it is introduced syntactically on Group Window Aggregate. It has three required parameters: the time attribute column, the window step size, and the window max size, plus an optional parameter specifying the offset at which window division starts.
As for the division logic of the CUMULATE window: suppose the step size is one minute and the max size is three minutes. Then window1 covers 0-1 minutes, window2 covers 0-2 minutes, window3 covers 0-3 minutes, window4 covers 3-4 minutes, window5 covers 3-5 minutes, and so on. A record with a timestamp of 0 minutes 30 seconds is assigned to three windows: window1, window2, and window3.
For example, suppose we need a data curve with one point per minute, each point representing the cumulative UV of each sub-page that day. The query uses event time; the CUMULATE window function has a step size of one minute and a max size of one day; the business group key is the sub-page ID; and the timestamp is the window end time. A sketch is shown below.
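A sketch under the same assumed internal syntax; CUMULATE_END is presumed to exist by analogy with the built-in TUMBLE_END auxiliary function, and the table and column names are illustrative:

```sql
-- One point per minute: cumulative daily UV per sub-page, stamped with the
-- window end time so each point's ordinate matches its abscissa exactly.
SELECT
  page_id,
  CUMULATE_END(event_time, INTERVAL '1' MINUTE, INTERVAL '1' DAY) AS ts,
  COUNT(DISTINCT device_id) AS uv
FROM page_views
GROUP BY
  page_id,
  CUMULATE(event_time, INTERVAL '1' MINUTE, INTERVAL '1' DAY);
```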
As the figure above shows, the curve drawn with the CUMULATE scheme is smooth both during normal consumption and during historical backfill.
Advantages of the CUMULATE window
- The first advantage is that the window end time serves as the abscissa of each point, and the ordinate is the accumulated value at exactly that time point. The curve can therefore be fully restored whether the job is backfilling history or recovering from failover, and the per-point values of the sub-dimensions always add up to the value of the total dimension.
- The second advantage is that two-stage aggregation can still be used to prevent distinct-key skew. Since data is only emitted at the end of each window, there is no retraction and the output is an append stream, so the increasing curve shows no dips.
Dynamic cumulate window
The Dynamic cumulate window also targets curve-style requirements, such as computing cumulative metrics for a live broadcast room since the stream started. The difference from the previous requirement is that how long each room stays live is uncertain, but the computation is still a cumulative metric. The window has two mandatory parameters, the time attribute column and the window step size, and one optional parameter, the window gap, which defines how long a window with no input data is kept before it is considered ended. Note that the end of a window triggers the output of its result and clears its state. If another record with the same key arrives afterwards, late data is discarded, while non-late data is assigned to a new window whose accumulated value starts again from zero.
In the example above, we need a curve where each point represents the cumulative UV of a live broadcast room since the broadcast started, and a room with no incoming data for a full hour is considered closed. The window function is the Dynamic cumulate window, with a step size of 1 minute and a gap of 60 minutes; the group key is the live room ID, and the timestamp again uses the window end time. A purely hypothetical sketch follows.
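Purely illustrative: DYNAMIC_CUMULATE is a Kuaishou-internal window function, so the function names and parameter order (time attribute, step, gap) below are assumptions, as are the table and column names:

```sql
-- Cumulative UV per live room since broadcast start; a room with no data for
-- 60 minutes is considered closed, and its window state is cleared.
SELECT
  room_id,
  DYNAMIC_CUMULATE_END(event_time, INTERVAL '1' MINUTE, INTERVAL '60' MINUTE) AS ts,
  COUNT(DISTINCT device_id) AS uv
FROM live_events
GROUP BY
  room_id,
  DYNAMIC_CUMULATE(event_time, INTERVAL '1' MINUTE, INTERVAL '60' MINUTE);
```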
2.2 Window Table-valued Function extension
2.2.1 Enriching Window TVF operators
In FLIP-145 the community proposed the Window Table-valued Function (Window TVF) syntax and implemented window aggregation on top of it. On this basis, we enriched the window operators with TopN, join, and deduplication, and also supported standalone Window TVF query statements. These features have been contributed back to the community across several versions. With these window operators, users can implement more complex business logic in Flink SQL.
As shown in the figure above, we need to compute, for the 100 hottest items of the day, their sales and buyer counts, as well as the sales of the streamers those items belong to. First, a window aggregation computes each item's sales and buyer count since 00:00; a second window aggregation computes the buyer count over all items of each streamer; a window join associates the two results; and finally a window TopN picks the top 100 hot items with their cumulative metrics. The TopN step is sketched below.
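As one concrete piece of this pipeline, here is a window TopN over a cumulative window aggregation, written in the community Window TVF syntax (Flink 1.13+); the item_orders table and its columns are illustrative:

```sql
-- Top 100 items per cumulative window by sales, with buyer counts.
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY window_start, window_end
      ORDER BY sales DESC) AS rownum
  FROM (
    SELECT window_start, window_end, item_id,
           SUM(price) AS sales,
           COUNT(DISTINCT buyer_id) AS buyers
    FROM TABLE(
      CUMULATE(TABLE item_orders, DESCRIPTOR(order_time),
               INTERVAL '1' MINUTE, INTERVAL '1' DAY))
    GROUP BY window_start, window_end, item_id
  )
) WHERE rownum <= 100;
```

Because the ranking is partitioned by window_start and window_end, the TopN is recomputed per window and emitted as an append stream once each window fires.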
2.2.2 Supporting window offset
The window offset is mainly used to adjust the window division logic. It is an optional parameter with a default of 0, meaning the unix epoch is used as the starting point of window division. A positive value shifts the starting point to the right, and a negative value shifts it to the left, but the offset only affects how windows are divided, not the watermark. Also, different offsets can produce the same division: for a 10-minute tumbling window, shifting the starting point 4 minutes to the left or 6 minutes to the right divides the windows identically.
In the example above, we need a data curve with one point per minute representing each page's cumulative UV since the start of the week. We can use the CUMULATE window function with event time, a step size of 1 minute, and a max size of 7 days. Because the unix epoch falls on a Thursday, the default offset would divide windows from Thursday to Thursday, so we set the offset to 4 days, a shift to the right, to divide windows from Monday to Monday. A sketch follows.
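A sketch of the weekly query. Window offset on CUMULATE was an extension at the time, so passing the offset as a trailing INTERVAL parameter is an assumption; the table and column names are illustrative:

```sql
-- Weekly cumulative UV per page, with windows aligned Monday-to-Monday via a
-- 4-day offset (the unix epoch, 1970-01-01, was a Thursday).
SELECT window_start, window_end, page_id,
       COUNT(DISTINCT device_id) AS uv
FROM TABLE(
  CUMULATE(TABLE page_views, DESCRIPTOR(event_time),
           INTERVAL '1' MINUTE, INTERVAL '7' DAY, INTERVAL '4' DAY))
GROUP BY window_start, window_end, page_id;
```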
2.2.3 Supporting batch mode
We also added support for batch mode. The principle is to introduce a windows operator that attaches window attributes to input records and sends them downstream, where the existing batch operators are reused: HashAgg or SortAgg for aggregation, HashJoin or SortMergeJoin for joins. These batch operators do not require state, so they also deliver better throughput.
3. Performance optimization
This section mainly introduces two optimizations: state optimization for aggregations, and batching optimization for dimension table joins.
3.1 State optimization on aggregates
Let's use an example to understand distinct-state reuse in an aggregation scenario: counting the UV of each sub-channel of an application. This use case has two characteristics: the channels are enumerable, and the visitor sets of different channels overlap heavily.
The original query statement is shown in the figure above: the group key is the channel, and a COUNT DISTINCT computes each channel's UV. The device set is stored in a map state. Suppose there are only three enumerable channels, A, B, and other. The group key is the channel ID, the key of the map state is the device ID, and the value is a 64-bit long in which each bit indicates whether the device appeared in that channel; in this simple scenario the value is just 1.
In the figure above, channel A has two devices, with IDs 1 and 2; the device with ID 1 also visited channel B, and the device with ID 2 also visited channel other. The maps of different channels can therefore overlap substantially. To reuse these keys, the SQL can be rewritten manually using facilities the community already provides.
First, do a pivot (row-to-column) by moving the three channel values into the FILTER conditions of three COUNT DISTINCT aggregate functions, then use a custom table function to unpivot (column-to-row) before output, as the sketch below shows.
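A sketch of the manual rewrite: the pivot uses FILTER clauses, which are standard Flink SQL, while the unpivot goes through a custom table function. The table, column, and UDTF names (events, device_id, channel, unpivot_udtf) are illustrative:

```sql
-- Pivot: one shared distinct map state instead of one per channel.
SELECT T.channel, T.uv
FROM (
  SELECT
    COUNT(DISTINCT device_id) FILTER (WHERE channel = 'A')     AS uv_a,
    COUNT(DISTINCT device_id) FILTER (WHERE channel = 'B')     AS uv_b,
    COUNT(DISTINCT device_id) FILTER (WHERE channel = 'other') AS uv_other
  FROM events
) AS agg,
-- Unpivot back to one row per channel via a custom table function.
LATERAL TABLE(unpivot_udtf(agg.uv_a, agg.uv_b, agg.uv_other)) AS T(channel, uv);
```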
After the rewrite, the state storage of the device set is as shown in the figure above: the group key is empty, the map state key is the device ID, and the map state value is a 64-bit long in which each bit indicates whether the device appeared in the corresponding channel. For example, the value of the device with ID 1 is 110, meaning it visited channels A and B, while the device with ID 3 visited channels B and other.
This scheme greatly reduces state size, but it has two disadvantages. First, the SQL must be rewritten by hand; if a dimension has many values, or there are several enumerable dimensions, the rewritten SQL becomes very long. Second, a custom table function is needed for the column-to-row conversion.
We therefore proposed a simplified SQL form that keeps the state benefits while reducing the burden on data developers. The user only needs to declare the group key's enumeration values in the query, and the optimizer automatically rewrites the plan, performing the pivot and unpivot and reusing the distinct map state after the rewrite. In the equivalent rewritten query, the enumeration values are simply specified in the filter condition, using IN or OR expressions, as shown below.
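A sketch of the simplified, equivalent form. Declaring the enumeration in the filter condition is what lets the (Kuaishou-internal) optimizer perform the rewrite; the community engine runs this query without the state-sharing optimization:

```sql
-- The IN list tells the optimizer the channel values are enumerable, so it can
-- pivot/unpivot internally and share one distinct map state across channels.
SELECT channel, COUNT(DISTINCT device_id) AS uv
FROM events
WHERE channel IN ('A', 'B', 'other')
GROUP BY channel;
```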
This optimization can be used for both unbounded stream aggregation and window aggregation, with one or several enumerable dimensions, and in both simple aggregation queries and multi-dimensional aggregations.
Its limitations: at least one key in the group key must be enumerable, and the enumeration values must be static and explicitly written in the filter condition. In addition, the distinct keys of the dimensions must overlap for the state saving to pay off; when counting per-province UVs, for example, visitors from different provinces can be assumed disjoint, so reusing the distinct key brings no benefit. Finally, for window aggregation the window function must have row semantics, not set semantics. With row semantics, the window a record belongs to depends only on the record itself; with set semantics, it also depends on the historical data set the window has received. This optimization adjusts the group key of the aggregation operator, which changes the data set each window receives, so it does not apply to windows with set semantics.
Finally, some users may ask why we did not use Calcite's pivot/unpivot syntax to express the row-to-column and column-to-row conversions explicitly. First, the prerequisites were not met: Calcite introduced pivot in version 1.26 and unpivot in 1.27, while Flink has depended on Calcite 1.26 since version 1.12. Second, with pivot/unpivot syntax the SQL would be much longer than with the current form.
3.2 Batching optimization for dimension table joins
The batching optimization for dimension table joins reduces the number of RPC calls: a batch of input records is buffered, then the dimension table's batch query interface is called once. Syntactically, we introduced a general Mini-Batch hint with two parameters: how long to buffer a batch, and how many records to buffer per batch. A valid Mini-Batch hint contains at least one of the two. We designed the hint to be generic, hoping it can be used not only for dimension table joins but also for batching optimization of aggregations.
In another example, the order table needs to be widened by joining the customer information of each order. The hint attached to the customers dimension table in the query indicates that a batch is flushed every 5 seconds or every 10,000 records. In the underlying operator implementation and design, this optimization is far more complicated than its SQL surface suggests. A sketch follows.
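A sketch of the hinted lookup join. The Mini-Batch hint is Kuaishou-internal, so its name and parameter keys below are assumptions; the FOR SYSTEM_TIME AS OF lookup join itself is standard Flink SQL, and the table and column names are illustrative:

```sql
-- Buffer up to 5 seconds or 10,000 rows, then issue one batched lookup against
-- the customers dimension table instead of one RPC per row.
SELECT /*+ MINI_BATCH('interval'='5s', 'size'='10000') */
       o.order_id, o.amount, c.customer_name
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
  ON o.customer_id = c.customer_id;
```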
4. Stability improvements
On the stability front, this section introduces Kuaishou's optimizations and improvements for two problems: data skew in Group Window Aggregate, and state compatibility after the aggregation metrics of a Flink SQL job are changed.
4.1 Data skew in Group Window Aggregate
Window computation is used widely at Kuaishou, and Kuaishou's business scenarios are prone to data skew, such as the live broadcasts of big streamers and major events. When a real-time computation hits data skew, metrics are delayed and the output becomes erratic. We therefore support the Mini-Batch, Local-Global, and Split Distinct optimizations on the tumble window, and similar optimizations on the other commonly used windows. Once launched, these optimizations not only help businesses avoid data skew but also bring solid performance gains. The community's analogous configuration is sketched below.
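For reference, the community exposes the same three ideas for unbounded aggregations through configuration options; extending them to window aggregations is the Kuaishou work described above, so applying this configuration to windows is not stock behavior:

```sql
-- Mini-Batch: buffer inputs to reduce per-record state access.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '5s';
SET 'table.exec.mini-batch.size' = '5000';
-- Local-Global: pre-aggregate locally before the keyed shuffle.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';
-- Split Distinct: split skewed COUNT DISTINCT into two aggregation levels.
SET 'table.optimizer.distinct-agg.split.enabled' = 'true';
```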
4.2 Aggregate state compatibility
First, let's look at the business scenarios that require aggregate state compatibility in Flink SQL. As the business develops, a daily job may need new metrics added or unneeded metrics removed. During major activities, the business needs to add new day-level cumulative metrics, or metrics that keep accumulating over the whole activity period.
For a change to a day-level metric, developers previously could only discard the state, upgrade the job after 00:00, and then make the job consume from the 00:00 data, to keep the metric continuous. For a metric that accumulates over a whole activity, the only option, to avoid affecting the existing metrics, was to launch a separate new job for the new metric, which leads to resource redundancy.
Such convoluted operations are needed because Flink SQL's strategy for judging state compatibility is quite simple: it only checks whether the data type of the state required by the engine exactly matches that of the state saved in the savepoint. An exact match means compatible, otherwise incompatible. This check has a loophole: if the state type is unchanged but the aggregate functions in the SQL have changed, Flink still considers the state compatible.
Against this background, we built aggregate state compatibility. The goal is that users can learn the state-compatibility workflow at very low (or zero) cost, can upgrade jobs at any time without being pinned to the midnight window, and can add and remove aggregate functions.
The supported operations are: appending aggregate functions at the end of the aggregate function list, deleting aggregate functions at any position, and reordering aggregate functions. Adding and deleting in the same upgrade is not allowed; that must be done as two separate upgrades.
The right-hand table in the figure above shows the mapping between metric IDs and state types. To make the compatibility check easy, we store the mapping between metric identifiers and state types in the state's meta information, because some aggregate functions may have more than one state: the avg function, for example, needs two auxiliary states, sum and count, for its computation. This mapping is therefore essential.
When an aggregate function is added, its new state must be filled with an initial value, which differs per function: the initial value of count is 0, for example, while the initial value of sum must be null.
The early-fire and late-fire scenarios of windows introduce retract messages, which add an extra state recording the data already sent downstream. It has one more time field than the original window state, which must be handled in the compatibility check and the state migration.
As mentioned earlier, we store the mapping between metric identifiers and state types in the state's meta information, which creates a cross-version compatibility problem: a new version of the program cannot correctly read a savepoint written by the previous version. To solve this, we modified the meta's version information and use it to distinguish old and new state formats, so that older savepoints remain readable.
In aggregation scenarios, users may control the cleanup of stale state by setting a state TTL. Aggregate state compatibility must also handle this case, ensuring that the TTL timestamps of the migrated state stay consistent with the original data and are not altered.
The advantage of this aggregate state compatibility solution is that the learning and usage cost for users is very low, almost imperceptible, and it does not depend on any external service architecture; its limitation is that it only covers aggregate computation scenarios.
Finally, a look at the ultimate state compatibility solution Kuaishou is working on. Users will be able to add and delete state at any position in a savepoint, even modify state contents, and also define a complete state from scratch, for example to initialize state for a Flink on Hudi job.
The advantages of the ultimate solution are that it does not intrude into Flink's source code, which eases Flink version upgrades; that users can operate through the platform UI without writing code; and that it supports state compatibility in all scenarios rather than being limited to specific ones. The disadvantage is a relatively high learning cost: users need to understand fairly specialized concepts such as Operator State and Keyed State, and must also know whether an operator contains state.
5. Future outlook
In the future, Kuaishou will continue to extend functionality and improve performance in the Stream SQL direction, aiming to reduce costs and increase efficiency, and will explore state compatibility in more scenarios. For stream-batch unification, Kuaishou will improve the capabilities of Flink Batch SQL, add optimizations such as speculative execution and adaptive query execution to improve Batch SQL's stability and performance, and keep expanding its business application scenarios. For data lakes and real-time data warehouses, it will continue to push their adoption in more business scenarios.