This article introduces some of Autohome's practical experience and exploration in building real-time materialized views on Flink, and explores letting users develop Flink Streaming SQL tasks directly with a batch-SQL mindset. The main contents are:
- System analysis and problem disassembly
- Problem solving and system implementation
- Real-time materialized view practice
- Limitations and deficiencies
- Summary and outlook
Preface
Materialized views should be familiar to most readers: by pre-defining complex SQL query logic, the system maintains the result set through incremental, transactional updates as the underlying data changes, avoiding the heavy cost of re-running the query every time and saving both time and computing resources. Indeed, many database systems and OLAP engines support materialized views to varying degrees. On the other hand, Streaming SQL itself is deeply connected to materialized views, so it is quite natural to build a real-time materialized view system on top of Apache Flink (hereinafter "Flink") SQL.
This article introduces some of Autohome's practical experience and exploration in building real-time materialized views on Flink, and tries to let users develop Flink Streaming SQL tasks directly with a batch-SQL mindset. We hope it brings you some inspiration, and that we can explore this field together.
1. System analysis and problem disassembly
Flink has invested a great deal of work in the Table & SQL module, and Flink SQL has grown into a mature and fairly complete SQL system. At the same time, we have considerable technology and product accumulation around Flink SQL. Building directly on Flink SQL already solves most of the problems of a real-time materialization system; the only problem we still need to solve is how to generate, without gaps, a semantically complete Changelog DataStream for a source table, covering two parts: the full history and the increments.
Although the problem has been reduced to just this one, it is still fairly difficult to solve. We therefore break it down further into the following sub-problems:
1. Loading the full (historical) data;
2. Loading the incremental data;
3. Integrating the incremental data with the full data.
2. Problem solving and system implementation
Problem 1: Reading incremental data via the data transmission platform
Incremental data loading is relatively easy to solve. We directly reuse the infrastructure of our real-time data transmission platform [1], which already writes incremental data from MySQL / SQL Server / TiDB and other databases into specific Kafka topics in a unified data format; we only need to subscribe to the corresponding Kafka topic to read it.
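To make this concrete, here is a minimal sketch of registering such a Kafka changelog topic as a table. The topic name, columns, and format are assumptions; the platform's unified format would in practice need its own deserializer, so the built-in `debezium-json` changelog format is used purely as a stand-in.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IncrementalSourceSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical topic / schema; 'debezium-json' stands in for the
        // platform's unified changelog format.
        tEnv.executeSql(
            "CREATE TABLE orders_changelog (" +
            "  id BIGINT," +
            "  amount DECIMAL(10, 2)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dts.orders'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json'" +
            ")");
    }
}
```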
Problem 2: Full data loading with checkpoint support
For full data loading, we implemented two versions.

In the first version, we wrote a BulkLoadSourceFunction based on the legacy Source API. The idea was simple: query everything from the source table. This version could indeed complete the full data load, but its problem was obvious: if the job restarted during the bulk-load phase, the full data had to be reloaded from scratch. For tables with large data volumes, the consequences were quite serious.
We had no particularly good countermeasure for this inherent problem of the first version until the release of Flink-CDC [2]. Borrowing Flink-CDC's idea of supporting checkpoints during the full data loading phase, we developed a new BulkLoadSource based on FLIP-27. The second version improved greatly over the first in both performance and usability. The sketch below illustrates the core idea.
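This is a minimal sketch (not the actual Estuary code) of the idea that makes a FLIP-27 style bulk load checkpointable: the table is cut into primary-key-range splits up front, and the enumerator checkpoints the splits that have not been finished yet, so a restart resumes from the remaining splits instead of reloading everything. The class and field names are illustrative.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class BulkLoadSplitPlanner {

    /** One checkpointable unit of work: a half-open primary-key range. */
    public static class KeyRangeSplit implements Serializable {
        final long lowKey;   // inclusive
        final long highKey;  // exclusive
        KeyRangeSplit(long lowKey, long highKey) {
            this.lowKey = lowKey;
            this.highKey = highKey;
        }
        /** The query a reader would run for this split. */
        String toQuery(String table) {
            return "SELECT * FROM " + table
                    + " WHERE pk >= " + lowKey + " AND pk < " + highKey;
        }
    }

    private final Queue<KeyRangeSplit> pending = new ArrayDeque<>();

    /** Cut [minKey, maxKey) into fixed-size chunks before reading starts. */
    public BulkLoadSplitPlanner(long minKey, long maxKey, long chunkSize) {
        for (long lo = minKey; lo < maxKey; lo += chunkSize) {
            pending.add(new KeyRangeSplit(lo, Math.min(lo + chunkSize, maxKey)));
        }
    }

    /** Called by the enumerator when a reader asks for work. */
    public KeyRangeSplit nextSplit() {
        return pending.poll();
    }

    /** State stored in the Flink checkpoint: the splits not yet finished. */
    public List<KeyRangeSplit> snapshotState() {
        return new ArrayList<>(pending);
    }
}
```

In a real FLIP-27 Source, this planner logic would live inside the SplitEnumerator, with the split list serialized as the enumerator checkpoint state.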
Problem 3: A lightweight CDC data integration algorithm based on a global version
Of the three sub-problems, the third is far more difficult than the first two. The naive idea seems very simple: cache all the data by key, then trigger Changelog DataStream updates from the incremental data stream.
We did in fact develop a version of the integration operator along exactly these lines (a sketch follows). This operator worked passably for small tables, but for large tables the inherent overhead of the approach became unacceptable. We tested it with a SQL Server table of 1.2 billion rows, about 120 GB in size; the sheer volume of data, plus the inevitable expansion inside the JVM, made the state size even more extreme. After this test we agreed that such a brute-force strategy was not fit for a production release, and we had to rethink the algorithms and strategies for data integration.
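The following is a minimal sketch of that naive strategy, not the production operator: keep the latest image of every row in keyed state and derive changelog events from it. Delete handling and ordering subtleties are omitted. It is correct in spirit, but the state grows with the entire table, which is exactly the problem described above.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

public class NaiveMergeFunction extends KeyedCoProcessFunction<Long, Row, Row, Row> {

    private transient ValueState<Row> latest; // one full row image per key

    @Override
    public void open(Configuration parameters) {
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-row", Row.class));
    }

    /** Full (historical) data: remember it and emit it as an insert. */
    @Override
    public void processElement1(Row full, Context ctx, Collector<Row> out) throws Exception {
        if (latest.value() == null) {
            latest.update(full);
            full.setKind(RowKind.INSERT);
            out.collect(full);
        }
    }

    /** Incremental data: retract the cached image, then emit the new one. */
    @Override
    public void processElement2(Row change, Context ctx, Collector<Row> out) throws Exception {
        Row old = latest.value();
        if (old != null) {
            old.setKind(RowKind.UPDATE_BEFORE);
            out.collect(old);
            change.setKind(RowKind.UPDATE_AFTER);
        } else {
            change.setKind(RowKind.INSERT);
        }
        latest.update(change);
        out.collect(change);
    }
}
```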
Before presenting our design, I must mention the algorithm design of DBLog [3]. Its core idea is to use watermarks to delimit chunks of historical data and merge each chunk with the corresponding incremental data, so the full and incremental data can be integrated without locking. Flink-CDC also implements and improves on this idea. While collecting and analyzing related material, we found that our own idea is very close to the core of DBLog's algorithm, but designed and specialized for our scenario and circumstances.
Let's first analyze our situation:
- Incremental data comes from the Kafka topics of the data transmission platform;
- Incremental data is delivered at-least-once;
- Incremental data carries a totally ordered serial version number.
Based on this analysis, we set out the goals the algorithm must achieve:
- Ensure the resulting Changelog Stream is complete, with fully correct Event (RowKind) semantics;
- Ensure the algorithm's overhead is controllable;
- Ensure the algorithm's processing performance is sufficiently efficient;
- Ensure the implementation relies on no system or capability outside of Flink.
After collective analysis and discussion, we designed a data integration algorithm named the Global Version Based Pause-free Change-Data-Capture Algorithm.
3.1 Algorithm principle
We read the full data through BulkLoadSource and the incremental data through RealtimeChangelogSource simultaneously, Connect the two streams, KeyBy the primary key, and complete the algorithm's core logic in the KeyedCoProcess stage that follows. Several key field values are involved:
- SearchTs: the timestamp at which the full data was queried from the data source;
- Ts: the timestamp at which the incremental data was generated in the source database;
- Version: the totally ordered serial version number; full data is fixed at 0, i.e., a guaranteed minimum version.
When KeyedCoProcess receives full data, it does not emit it immediately; it caches it first, and only emits the Version-0 data and clears the cache after the watermark exceeds the row's SearchTs. If corresponding Changelog Data arrives during the wait, all cached Version-0 data is discarded, and the Changelog Data is processed and emitted instead. Throughout the process, full and incremental data are consumed and processed at the same time; no pause phase needs to be introduced for data integration. Two cases are worth spelling out:
- If incremental data for a key arrives before the full data's watermark, only the incremental data is emitted and the cached full data is discarded;
- If the watermark arrives and no corresponding incremental data has appeared, the full data is emitted directly.
3.2 Algorithm implementation
We decided to implement the algorithm as a Flink Connector, which we named Estuary. We use DataStreamScanProvider to wire up the Source's internal operators. The Source's operator organization is shown in the figure below (chained operators are shown broken apart).
- BulkLoadSource / ChangelogSource are mainly responsible for reading the data and normalizing it into a unified format;
- BulkNormalize / ChangelogNormalize are mainly responsible for attaching and overriding runtime information and for primary-key semantic processing;
- WatermarkGenerator is an operator whose watermark generation logic is customized for the algorithm's requirements;
- VersionBasedKeyedCoProcess is the core operator handling the merge logic and RowKind semantic completeness.
Many points needed optimization or trade-offs during implementation. When full data enters the CoProcess, the operator first checks whether data of a higher version has already been processed for the key; only if not is the row handled further. The row is first stored in State, and an EventTimeTimer is registered at SearchTs + T (where T is the inherent delay we configure). If no higher-version data arrives before it fires, the timer triggers emission of the Version-0 data; otherwise the cached row is discarded and replaced by the higher-version incremental data, processed with correct RowKind semantics. A simplified sketch of this merge logic follows.
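This sketch shows the version-based, pause-free merge in simplified form. The envelope type, the delay value, and the RowKind handling are illustrative assumptions, not the production Estuary operator: stream 1 carries full data (Version 0, tagged with SearchTs), stream 2 carries incremental changelog data with higher versions.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

public class VersionBasedMergeSketch
        extends KeyedCoProcessFunction<Long, VersionedRow, VersionedRow, Row> {

    private static final long INHERENT_DELAY_MS = 5_000L; // the "T" in SearchTs + T

    private transient ValueState<VersionedRow> pendingFull; // cached Version-0 row
    private transient ValueState<Boolean> visible;          // has downstream seen this key?

    @Override
    public void open(Configuration parameters) {
        pendingFull = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pending-full", VersionedRow.class));
        visible = getRuntimeContext().getState(
                new ValueStateDescriptor<>("visible", Boolean.class));
    }

    /** Full data: never emit directly; cache it and wait for SearchTs + T. */
    @Override
    public void processElement1(VersionedRow full, Context ctx, Collector<Row> out)
            throws Exception {
        if (Boolean.TRUE.equals(visible.value())) {
            return; // a higher version already reached downstream: drop the full row
        }
        pendingFull.update(full);
        ctx.timerService().registerEventTimeTimer(full.searchTs + INHERENT_DELAY_MS);
    }

    /** Incremental data: supersedes any cached, never-emitted full image. */
    @Override
    public void processElement2(VersionedRow change, Context ctx, Collector<Row> out)
            throws Exception {
        pendingFull.clear(); // the cached Version-0 image is now stale
        boolean seen = Boolean.TRUE.equals(visible.value());
        RowKind kind = change.row.getKind();
        if (!seen) {
            // Downstream never saw this key: swallow retractions / deletes and
            // promote the first surviving image to a plain INSERT.
            if (kind == RowKind.UPDATE_BEFORE || kind == RowKind.DELETE) {
                return;
            }
            change.row.setKind(RowKind.INSERT);
            visible.update(true);
        } else if (kind == RowKind.DELETE) {
            visible.update(false);
        }
        out.collect(change.row);
    }

    /** Timer fired: no incremental data arrived in time, so emit the full row. */
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Row> out)
            throws Exception {
        VersionedRow full = pendingFull.value();
        if (full != null) {
            full.row.setKind(RowKind.INSERT);
            out.collect(full.row);
            pendingFull.clear();
            visible.update(true);
        }
    }
}

/** Hypothetical envelope carrying the algorithm's key fields. */
class VersionedRow {
    Row row;
    long key;      // primary key used for KeyBy
    long searchTs; // meaningful for full data only
    long version;  // 0 for full data
}
```

Wired up roughly as `fullSide.connect(changeSide).keyBy(r -> r.key, r -> r.key).process(new VersionBasedMergeSketch())`, with the event-time watermarks supplied by the custom WatermarkGenerator described above.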
Separately, to prevent unbounded state growth, once the system determines that the bulk-load phase has ended it stops using the related Flink State; the existing State then simply waits for its TTL to expire.
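A minimal sketch of that TTL-based cleanup using Flink's state TTL facility; the one-hour value is an assumption, not the production setting.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class TtlCleanupSketch {
    /** Descriptor whose entries age out on their own after the bulk-load phase. */
    static ValueStateDescriptor<Long> ttlDescriptor() {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1)) // assumed TTL
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        ValueStateDescriptor<Long> desc =
                new ValueStateDescriptor<>("max-version", Long.class);
        desc.enableTimeToLive(ttl); // expired entries are dropped lazily
        return desc;
    }
}
```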
In addition, for pure data-synchronization scenarios where the downstream Sink supports Upsert, we developed a specially optimized ultra-lightweight mode that completes full + incremental data synchronization at very low overhead.
After development was complete, we went through repeated rounds of testing, fixing, and verification to finish the MVP version.
3. Real-time materialized view practice
After the MVP version was released, we ran a pilot of Flink-based materialized views together with colleagues on the business side.
1. A real-time data pipeline with complex logic over multiple data sources
The following is a real production requirement from a user: there are three tables, from TiDB, SQL Server, and MySQL respectively, with row counts in the tens of millions, the hundreds of millions, and the tens of millions. The computation logic is fairly complex, involving deduplication and multi-table joins. The original T+1 result table was produced by offline batch processing, and the user hoped to cut the pipeline's latency as much as possible.
Since the TiCDC Update data we consume does not yet contain the -U (UPDATE_BEFORE) part, the integration of the TiDB table still uses Legacy Mode for loading.
We discussed with the user and suggested writing Flink SQL directly, outputting the detailed result data to StarRocks. With our assistance, the user completed the SQL development fairly quickly. The task's computation topology is shown in the figure below; a hedged sketch of what such a job might look like follows.
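In this sketch the table names, columns, and connector options are invented (the real business SQL is not public): three changelog sources are deduplicated and joined, and the detailed result is continuously upserted into StarRocks.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MaterializedViewJobSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source DDLs omitted: assume orders / users / items are registered
        // through the Estuary-style connector described in section 2.

        tEnv.executeSql(
            "CREATE TABLE result_sink (" +
            "  order_id BIGINT," +
            "  user_name STRING," +
            "  item_name STRING," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'starrocks'," +
            "  'jdbc-url' = 'jdbc:mysql://fe:9030'," +
            "  'load-url' = 'fe:8030'," +
            "  'database-name' = 'dw'," +
            "  'table-name' = 'order_detail'," +
            "  'username' = 'user'," +
            "  'password' = '***'" +
            ")");

        // Deduplicate orders by primary key (keep the latest row), then join
        // the other tables; Flink maintains the result incrementally, exactly
        // as a materialized view would.
        tEnv.executeSql(
            "INSERT INTO result_sink " +
            "SELECT o.order_id, u.user_name, i.item_name " +
            "FROM (" +
            "  SELECT * FROM (" +
            "    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id " +
            "           ORDER BY update_time DESC) AS rn FROM orders" +
            "  ) WHERE rn = 1" +
            ") o " +
            "JOIN users u ON o.user_id = u.user_id " +
            "JOIN items i ON o.item_id = i.item_id");
    }
}
```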
The result was quite gratifying! While guaranteeing data accuracy, we reduced the original pipeline's latency to about 10 seconds. Queries also moved from Hive to StarRocks. From data access through data pre-computation to data computation and query, everything became fully real-time. On the other hand, the peak increment across the three tables is no more than 300 rows per second, and the task has no update amplification problem, so resource usage is quite small. According to monitoring, after the initialization phase the TM part of the whole task needs only 1 CPU (on YARN), with normal CPU usage below 20%. Compared with the original batch job's resource usage, this is undoubtedly a huge improvement.
2. Data lake scenario optimization
As mentioned above, we made special optimizations for data synchronization: with the dedicated Source table, full (historical) + incremental synchronization starts with one click, greatly simplifying the data synchronization pipeline. We are currently trying to use this capability to synchronize data into our Iceberg-based data lake, greatly improving data freshness at the synchronization level. A hedged sketch of such a job follows.
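In this sketch the `estuary` connector name and its options are pure assumptions for illustration; the Iceberg sink options loosely follow the public Flink-Iceberg connector but should also be treated as illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LakeSyncJobSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical dedicated source: bulk-load phase plus changelog phase
        // behind a single table, as described in section 2.
        tEnv.executeSql(
            "CREATE TABLE src_orders (" +
            "  id BIGINT, amount DECIMAL(10,2), update_time TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'estuary'," +       // assumed connector name
            "  'mode' = 'upsert-sync'" +         // assumed lightweight upsert mode
            ")");

        tEnv.executeSql(
            "CREATE TABLE lake_orders (" +
            "  id BIGINT, amount DECIMAL(10,2), update_time TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'hive_prod'," +
            "  'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore:9083'," +
            "  'warehouse' = 'hdfs://nn:8020/warehouse'" +
            ")");

        // One statement: history is bulk-loaded, then increments stream in.
        tEnv.executeSql("INSERT INTO lake_orders SELECT * FROM src_orders");
    }
}
```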
4. Limitations and shortcomings
Although we have achieved some results in this direction, certain limitations and shortcomings remain.
1. Implicit dependence on server clocks
Reading the algorithm principle above carefully, you will find that both the generation of SearchTs and the generation of the watermark ultimately depend on the servers' system clocks, rather than on something like a Time Oracle mechanism. Although we introduce an inherent delay in the implementation to mitigate this, if the servers' clocks diverge so severely that the skew exceeds the inherent delay, the watermark becomes unreliable, and the processing logic may produce errors.
We have confirmed that Autohome's server clocks are calibrated regularly.
2. Consistency and transactions
In fact, our current implementation provides no transaction-related guarantee mechanism and can only promise eventual consistency of the result, which is a rather weak guarantee. Take the example above: if one table's consumption lags by two hours while another table has essentially no lag, the join result produced at that moment is actually an intermediate state; in other words, it arguably should not be visible to external systems.
To achieve stronger consistency guarantees and avoid such problems, we naturally thought of introducing a transaction commit mechanism. We have not yet found a good implementation approach, but we can share our current thinking.
2.1 How to define a transaction
Most readers know the concept of a transaction to some degree, so we will not repeat it here. Defining transactions within a single database system is natural and necessary, but defining them across data sources is genuinely hard. Continuing the example above, the sources come from several different databases; we do record the transaction information of each single table, but there is simply no way to define a unified transaction across different data sources. Our current naive idea is to use the time at which data was generated, combined with checkpoints, to carve out unified epochs and implement an Epoch-based Commit style mechanism. But this runs back into the problem above: it depends on server time and cannot guarantee correctness at the root.
2.2 Cross-table transactions
On the consistent commit of Flink materialized views, TiFlink [4] has done a lot of related work. But our sources come from different databases and are read from Kafka, so the problem is more complicated. As in the example above, once two tables are joined, guaranteeing consistency is not just a matter of the Source and Sink operators; the entire relational operator graph needs to incorporate the concept and mechanism of transactional commit in order to avoid publishing intermediate states to external systems.
3. Update amplification
This problem is easy to understand. Suppose two tables are joined, and each row in the left table matches n (n > 100) rows in the right table. Updating any single row of the left table then produces 2n changelog records (a retraction plus a new image for each of the n join results), i.e., an update amplification of 2n.
4. State size
Although the overhead of the whole algorithm during the full synchronization stage is controllable, there is still room for optimization. Our current measurements show that for a table of about 100 million rows, the full-data stage requires State peaking at about 1.5 GB. We plan to keep optimizing state size in the next version. The most direct idea is to have BulkLoadSource notify KeyedCoProcess which primary-key sets have already been processed, so the corresponding keys can enter the full-phase-completed mode early, further reducing state size.
5. Summary and outlook
This article analyzed the problems and challenges of building materialized views on Flink, focused on the algorithm and implementation for generating a complete Changelog DataStream and the benefits it brought to the business, and laid out the current limitations and shortcomings in full.
Although the results of this practice are not yet complete and some problems urgently need solving, we have already seen major breakthroughs and progress, both technically and in business use. We fully believe this technology will mature and be recognized and adopted by more and more people; this exploration has fully validated for us the unity of stream processing and batch processing.
Our current implementation is still an early version, and there is room and work left for engineering optimization and bug fixes (for example, when the progress skew between the two tables mentioned above is too large, a Coordinator could be introduced for adjustment and alignment). We believe that with continued iteration and development, this work will become more and more stable, supporting more business scenarios and fully improving the quality and efficiency of data processing!
Special thanks to Zhang Qizi and Teacher Yun Xie for their help and corrections.
References
[1] http://mp.weixin.qq.com/s/KQH-relbrZ2GUqdmaTWx6Q
[2] http://github.com/ververica/flink-cdc-connectors
[3] http://arxiv.org/pdf/2010.12597.pdf
[4] http://zhuanlan.zhihu.com/p/422931694