Author introduction: Lin Jia, head of real-time business at the NetEase Interactive Entertainment Billing Data Center, lead programmer of the real-time development framework JFlink-SDK and the real-time business platform JFlink, and Flink Code Contributor.
This article is based on a talk by Lin Jia, head of real-time business at the NetEase Interactive Entertainment Billing Data Center. It mainly explains why the NetEase data center chose Flink and TiDB for its real-time business and how the two are used together.
Today I will mainly explain, from a developer's perspective, why the NetEase data center chose Flink and TiDB for processing real-time business.
First of all, TiDB is a hybrid HTAP distributed database. It offers one-click horizontal scaling, strong consistency across multiple data replicas, distributed transactions, real-time OLAP, and other important features. It is also compatible with the MySQL protocol and ecosystem, which makes migration and maintenance easy and keeps costs low. Flink is currently the most popular open-source computing framework for real-time data: its high throughput, low latency, excellent performance, and Exactly Once semantics provide convenient support for the real-time business processing of NetEase games.
What kind of business value can Flink on TiDB create? Let me share a story about real-time accumulated values.
Starting from a story about real-time accumulated values
Anyone who has worked with online business data will find the data above very familiar. It is a classic online real-time business table, which can also be understood as a log or other monotonically increasing data; it contains a timestamp, an account, the purchased item, the purchase quantity, and so on. To analyze this type of data with a real-time computing framework such as Flink, it can be processed by bucketing, for example grouping by user ID or by item and then bucketing by time, which eventually produces the following continuous data.
If this continuous data lands in TiDB, and TiDB also holds the existing online dimension tables such as account information and item information, then a JOIN between the table holding the time-series data and the dimension tables quickly yields analysis over facts and statistics. Connect a visualization application on top, and you can discover many interesting things.
The whole process looks simple and perfect: Flink solves the computation problem, and TiDB solves the massive storage problem. But is it really that simple?
Anyone who actually works with online data may run into problems such as:
- Multiple data sources: logs come from the external systems of various business parties; some data sits in databases, some has to be collected as logs, and some has to be fetched through REST interfaces.
- Multiple data formats: each business or channel has a completely different data format; some are JSON, some are URL-encoded.
- Out-of-order data: records do not arrive in order.
To address these problems we introduced Flink. Inside the data center we built a framework called JFlink-SDK, which wraps Flink and turns common requirements such as ETL, out-of-order handling, and grouped aggregation into configurable modules. Online data sources are configured, factual statistics or fact data are computed, and the results finally land in TiDB, which can hold massive amounts of data.
However, while Flink is processing this data, it saves the current computation state through checkpoints so that it can recover from failures. If a commit happens between two checkpoints, that is, part of the results has already been flushed to TiDB, and a failure then occurs, Flink automatically rolls back to the previous checkpoint, the last correct state. The 4 records shown in the figure are then recomputed and may be written to TiDB again.
If the value is an accumulated value, you can see that it has now been accumulated twice by mistake. This is one of the problems you may encounter when using Flink on TiDB.
Flink's accuracy guarantee
How does Flink guarantee accuracy? First, you need to understand Flink's checkpoint mechanism. A checkpoint is similar to a MySQL transaction savepoint: it persists the intermediate state of real-time data processing.
Checkpointing comes in At Least Once and Exactly Once modes, but even choosing Exactly Once does not solve the repeated accumulation problem described above. For example, after reading from Kafka, based on the fact table above, account 1000 buys item a with quantities 1 and 2. Flink puts this data into a bucket with keyBy, which is equivalent to a MySQL GROUP BY, while records with other keys go into other buckets; each bucket is computed through an aggregate function and then flushed to the TiDB sink.
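To make the shape of such a job concrete, here is a minimal Flink DataStream sketch of this kind of bucketed aggregation: it keys by account and item, then sums the purchase quantity per one-minute window. The tuple layout, window size, and inline sample records are illustrative assumptions, not the actual JFlink-SDK job.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PurchaseAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (timestampMillis, account, item, quantity) -- sample records standing in for the Kafka source
        DataStream<Tuple4<Long, Long, String, Integer>> purchases = env
                .fromElements(
                        Tuple4.of(1_000L, 1000L, "a", 1),
                        Tuple4.of(2_000L, 1000L, "a", 2))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple4<Long, Long, String, Integer>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, ts) -> event.f0));

        // keyBy account + item (the "bucket"), then sum quantities per 1-minute window
        purchases
                .keyBy(e -> e.f1 + "|" + e.f2)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sum(3)            // field 3 = quantity
                .print();          // in the real job this would go to a TiDB sink

        env.execute("purchase-aggregation-sketch");
    }
}
```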
Calculation state preservation
Flink uses the checkpoint mechanism to guarantee Exactly Once for its data. Suppose we have a relatively simple execution plan (DAG) with a single source that goes through a MAP and flushes into the TiDB sink. The process is linear: Flink inserts checkpoint barriers into the data stream, and wherever a barrier passes, it triggers the state snapshot of that operator in the linear execution plan.
Starting from the source, the source's state is saved first; for Kafka this means saving the current consumption offsets. After that node is saved, the state of the next operator is saved. Assuming the MAP here performs the bucketing computation, it stores the data accumulated in each bucket.
After that, the checkpoint barrier reaches the sink, and the sink saves its state as well. When this is done, it reports to the Job Manager (equivalent to the Master) that its part of the checkpoint is complete.
When the Master confirms that all subtasks of the distributed job have completed the checkpoint, it distributes a Complete message. As the model above shows, you can think of this as 2PC, a distributed two-phase commit: each distributed subtask commits its own transaction, and then the whole transaction is committed as a unit. The saved state goes into RocksDB; when a failure occurs, data is recovered from RocksDB and the whole process is recomputed from that breakpoint.
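For reference on how an operator hooks into this snapshot, Flink exposes the CheckpointedFunction interface. The sketch below is a simplified stand-in for the MAP stage described above: it keeps a running count in memory, persists it when the checkpoint barrier triggers snapshotState, and restores it in initializeState after a failure.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Counts records flowing through the MAP stage; the running count is written into
// Flink managed state whenever a checkpoint barrier reaches this operator, and is
// restored from the last completed checkpoint after a failure.
public class CountingMap implements MapFunction<String, String>, CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count = 0L;

    @Override
    public String map(String value) {
        count++;
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Called when the barrier arrives: persist the in-memory count.
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedCount = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("count", Long.class));
        for (Long restored : checkpointedCount.get()) {
            count = restored;  // restore from the last successful checkpoint
        }
    }
}
```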
Exactly Once semantic support
Looking back at Exactly Once: can the method above really achieve it? Actually, no. So why does Flink officially call it Exactly Once? The reasons are detailed below.
As the code in the figure above shows, the Exactly Once of checkpointing is not end-to-end; it only covers Flink's internal operators. Therefore, when the computed results are written to TiDB, if TiDB cannot coordinate with Flink, end-to-end Exactly Once cannot be guaranteed.
For comparison, Kafka does support end-to-end semantics, because it exposes its 2PC interface and lets users manually control the two-phase commit of a Kafka transaction. The checkpoint mechanism can therefore be used to avoid miscalculation.
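As a hedged example of what "Kafka exposes 2PC" means in practice, Flink's Kafka producer can be configured with EXACTLY_ONCE semantics, so that it opens a Kafka transaction per checkpoint interval, pre-commits it on the barrier, and commits it only after the checkpoint completes. The broker address, topic name, and timeout below are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceKafkaSink {
    public static FlinkKafkaProducer<String> build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder address
        props.setProperty("transaction.timeout.ms", "900000");  // must not exceed the broker's limit

        // Semantic.EXACTLY_ONCE makes the sink open a Kafka transaction per checkpoint,
        // pre-commit it on the barrier, and commit it once the checkpoint completes (2PC).
        return new FlinkKafkaProducer<>(
                "stat-results",                                  // placeholder topic
                (element, timestamp) -> new ProducerRecord<>(
                        "stat-results", element.getBytes(StandardCharsets.UTF_8)),
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
```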
But what if it cannot be controlled manually?
Let's look at the following example. Suppose the record of account 1000 purchasing item a is written to an accumulation table in TiDB, generating SQL of the form INSERT ... VALUES ... ON DUPLICATE KEY UPDATE. When a checkpoint occurs, can we guarantee whether this statement has been executed on TiDB?
If you simply execute this SQL without special handling, there is no such guarantee. If it was not executed, an error is reported, we roll back to the previous checkpoint, and all is well: nothing was computed, nothing was accumulated, nothing is double-counted, so the result is correct. But if it had already been written out and we still roll back to the previous checkpoint, the value 3 is accumulated again.
To solve this problem, Flink provides an interface, a SinkFunction you implement manually, to control transaction begin, pre-commit, commit, and rollback.
The checkpoint mechanism is essentially a 2PC. While each distributed operator executes its internal transaction, the operator's pre-commit is tied to the checkpoint; in the Kafka case, the Kafka transaction can be pre-committed at this point. When the operator is told by the Job Manager (the Master) that all operators have finished saving their checkpoint state, it commits, and the transaction is then guaranteed to succeed.
If any other operator fails, a rollback is required to ensure the transaction is not committed to the remote end. With a 2PC SinkFunction plus XA transaction semantics, Exactly Once in the strict sense can indeed be achieved.
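A minimal skeleton of such a 2PC sink, built on Flink's TwoPhaseCommitSinkFunction, might look like the following. The transaction object here is just an in-memory buffer and the calls to the external system are left as comments, so this is a sketch of the hook points rather than a working TiDB sink.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// Each checkpoint interval gets its own "transaction" (here just a buffer); preCommit is
// called when the barrier arrives, and commit only after the whole checkpoint completes.
public class TwoPhaseCommitSketch
        extends TwoPhaseCommitSinkFunction<String, TwoPhaseCommitSketch.Txn, Void> {

    public static class Txn {
        public final List<String> buffer = new ArrayList<>();
    }

    public TwoPhaseCommitSketch() {
        super(new KryoSerializer<>(Txn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected Txn beginTransaction() {
        return new Txn();                  // open a new transaction scope
    }

    @Override
    protected void invoke(Txn txn, String value, Context ctx) {
        txn.buffer.add(value);             // stage records inside the transaction
    }

    @Override
    protected void preCommit(Txn txn) {
        // Flush staged data to the external system, but do not make it visible yet.
    }

    @Override
    protected void commit(Txn txn) {
        // Make the pre-committed data visible; called after the checkpoint completes.
    }

    @Override
    protected void abort(Txn txn) {
        // Roll back the pre-committed data if the checkpoint fails.
    }
}
```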
But not all sinks support the two-phase commit protocol. TiDB, for example, uses two-phase commit internally to manage and coordinate its transactions, but it currently does not expose this two-phase commit protocol for users to control manually.
Idempotent calculation
So how do we ensure that Exactly Once business results land in TiDB? It is actually very simple: At Least Once semantics plus a unique key, that is, idempotent calculation.
How do you choose the unique key? If a record carries a unique identifier, we naturally choose it. For example, when a record has a unique ID and one table is synchronized to another through Flink, this is the classic case of using the primary key with insert-ignore or replace-into semantics to deduplicate. For logs, you can pick the log file's unique attributes. For aggregation results computed by Flink, you can use the aggregation key plus the window boundary value, or some other idempotently computed value, as the unique key of the final result.
In this way the result becomes reentrant, and reentrancy combined with the checkpoint rollback mechanism allows Flink and TiDB together to write out accurate, Exactly Once results.
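Here is a minimal sketch of such an idempotent write, assuming a hypothetical purchase_stats table with a unique key on (account, item, window_start): replaying the same window after a checkpoint rollback overwrites the row with the same value instead of accumulating it twice.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentUpsert {
    // Writes one window result with (account, item, window_start) as the unique key.
    public static void upsert(long account, String item, long windowStart, long quantity) throws Exception {
        String sql = "INSERT INTO purchase_stats (account, item, window_start, quantity) "
                   + "VALUES (?, ?, ?, ?) "
                   + "ON DUPLICATE KEY UPDATE quantity = VALUES(quantity)";
        // JDBC URL, table, and column names are illustrative placeholders
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://tidb:4000/stats", "user", "pwd");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, account);
            ps.setString(2, item);
            ps.setLong(3, windowStart);
            ps.setLong(4, quantity);
            ps.executeUpdate();
        }
    }
}
```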
Flink on TiDB
In the Flink on TiDB part, our internal JFlink framework wraps Flink; so what did we do to make it work with TiDB? The details follow.
Data connector design
First, the design of the data connector. Flink's support for TiDB, and for relational databases in general, came relatively late: the Flink JDBC connector (flink-connector-jdbc) only appeared in Flink 1.11, so it has not been around for long.
At present, we mainly use TiDB as the data source and process its data in Flink through TiDB's official CDC tool, which watches TiDB's changes and sends the data to Kafka. Kafka is a very classic streaming data pipeline, so the data is consumed from Kafka and then processed by Flink.
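Here is a minimal sketch of the consuming side, assuming the CDC tool delivers row-change events into a hypothetical tidb-changelog topic; the broker address, group id, and topic name are placeholders.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CdcKafkaSource {
    public static DataStream<String> build(StreamExecutionEnvironment env) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        props.setProperty("group.id", "tidb-cdc-consumer");     // placeholder

        // The CDC tool writes row-change events of the watched TiDB tables into this topic;
        // Flink consumes them as strings and parses them in downstream operators.
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("tidb-changelog", new SimpleStringSchema(), props);
        consumer.setStartFromGroupOffsets();
        return env.addSource(consumer);
    }
}
```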
But not every business can use the CDC mode, for example when writing out the data requires more complicated filter conditions, requires reading certain configuration tables periodically, or requires knowing some external configuration items first in order to split the data. In such cases you may need to hand-build a custom source.
What JFlink encapsulates is slice-based reading of a "monotonic table" on some business field. Monotonic means the table has a field that changes monotonically, or that the table is append-only.
In the implementation, a JFlink TiDB Connect layer sits between TiDB and Flink. It creates a connection to TiDB, fetches data with an asynchronous thread, and pushes it through a blocking queue, which serves mainly as flow control.
The main Flink thread just waits on the blocking queue's non-empty signal. When the signal arrives, it pulls the data out, turns it into the stream object of the whole real-time processing framework through a deserializer, and from there any of the modular UDFs can be attached. With Flink's checkpoint mechanism, implementing the source's At Least Once semantics becomes very simple.
Because we have the premise that the table is monotonic on some field, we can record the current slice position whenever the monotonic table is sliced. If a failure occurs and the whole stream falls back to the previous checkpoint, the source also falls back to the last saved slice position. This guarantees that no data is missed, that is, the source achieves At Least Once.
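The following is a simplified sketch of such a slice-reading source. It omits the separate asynchronous thread and blocking queue and simply polls a hypothetical append-only events table, but it shows how the slice position is saved with each checkpoint and restored after a failure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Reads a "monotonic" TiDB table (append-only, with a monotonically increasing id column)
// in slices. The position of the last emitted slice is stored in checkpointed state, so after
// a failure the source resumes from the last completed checkpoint (At Least Once).
public class MonotonicTableSource extends RichSourceFunction<String> implements CheckpointedFunction {

    private volatile boolean running = true;
    private long sliceOffset = 0L;                       // last id already read
    private transient ListState<Long> offsetState;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://tidb:4000/db", "user", "pwd")) {
            while (running) {
                String sql = "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT 1000";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setLong(1, sliceOffset);
                    try (ResultSet rs = ps.executeQuery()) {
                        synchronized (ctx.getCheckpointLock()) {  // keep offset and emission atomic w.r.t. checkpoints
                            while (rs.next()) {
                                sliceOffset = rs.getLong("id");
                                ctx.collect(rs.getString("payload"));
                            }
                        }
                    }
                }
                Thread.sleep(500);                       // simple poll interval; the real source blocks on a queue
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(sliceOffset);                    // persist the slice position with the checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("slice-offset", Long.class));
        for (Long restored : offsetState.get()) {
            sliceOffset = restored;                      // roll back to the last checkpointed slice
        }
    }
}
```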
For the sink, Flink does officially provide a JDBC sink (and a JDBC source as well). However, the official JDBC sink implementation is fairly simple and uses synchronous batch insertion.
Synchronous batch insertion is rather conservative. When the data volume is large and there is no strict first-come-first-committed requirement, synchronous commits do not perform particularly well, while asynchronous commits improve performance a lot: they make full use of TiDB's nature as a distributed database that handles small transactions with high concurrency, which helps raise QPS.
The principle of our own sink is actually very simple. First, how does the official version work? The official implementation has the main Flink thread write into a buffer, switch pages when the buffer is full, and then pull up a thread that synchronously writes the data to TiDB.
Our improvement is to apply flow control through a blocking queue and write data into a buffer page; as soon as a page is full, an asynchronous thread is immediately pulled up to flush it out. Under non-FIFO semantics this raises the QPS: in practice it took the official implementation's roughly 30,000 QPS to nearly 100,000.
Implementing the sink's At Least Once semantics, however, is more involved. Recall the checkpoint mechanism: to achieve At Least Once for the sink, we must ensure the sink is "clean" when a checkpoint completes, that is, all data has been flushed out. This means coordinating the checkpoint thread, the main thread that flushes normally, and the other page-switching threads: when a checkpoint is triggered, it is only completed after all data has been flushed synchronously. Then, once a checkpoint is complete, the sink must be clean, which means all data that arrived before it has been correctly written to TiDB.
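A simplified sketch of this kind of buffering sink is shown below. It is not the actual JFlink implementation: the batch write to TiDB is left as a placeholder, and the page size and thread-pool size are arbitrary. What it shows is that snapshotState waits for all outstanding pages, so the sink is clean when the checkpoint completes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Buffers rows in a page; when the page is full it is handed to an async thread that writes
// a batch into TiDB. On snapshotState (checkpoint) every outstanding page is flushed and
// awaited, so a completed checkpoint implies all earlier data is already in TiDB.
public class BufferedTiDBSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private static final int PAGE_SIZE = 500;

    private transient ExecutorService flusher;
    private transient List<Future<?>> inFlight;
    private List<String> page = new ArrayList<>();

    @Override
    public void open(Configuration parameters) {
        flusher = Executors.newFixedThreadPool(4);
        inFlight = new ArrayList<>();
    }

    @Override
    public synchronized void invoke(String row, Context ctx) {
        page.add(row);
        if (page.size() >= PAGE_SIZE) {
            rotatePage();
        }
    }

    private synchronized void rotatePage() {
        if (page.isEmpty()) {
            return;
        }
        List<String> full = page;
        page = new ArrayList<>();
        inFlight.add(flusher.submit(() -> writeBatchToTiDB(full)));  // asynchronous batch write
    }

    private void writeBatchToTiDB(List<String> batch) {
        // Placeholder: execute a multi-row INSERT ... ON DUPLICATE KEY UPDATE via JDBC here.
    }

    @Override
    public synchronized void snapshotState(FunctionSnapshotContext context) throws Exception {
        rotatePage();                                    // flush the partially filled page
        for (Future<?> f : inFlight) {
            f.get();                                     // wait until every batch has reached TiDB
        }
        inFlight.clear();                                // the sink is "clean" when the checkpoint completes
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // No state to restore: the source replays from its checkpoint, and idempotent
        // writes make re-flushed batches safe.
    }

    @Override
    public void close() {
        if (flusher != null) {
            flusher.shutdown();
        }
    }
}
```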
After this optimization we reached roughly 100k QPS, in a test environment of about three physical machines running mixed PD, TiKV, and TiDB nodes.
Business scenarios
Our billing data center currently uses the combination of TiDB and Flink in many application scenarios, such as:
- Real-time formatting and storage of massive business log data;
- Analysis and statistics based on massive data;
- Real-time payment-link analysis with TiDB / Kafka dual-stream joins;
- Integration with the data map;
- Time series data.
So you can see that Flink on TiDB is blossoming everywhere in the business layer of the NetEase data center. As the saying goes, "The peach and the plum do not speak, yet a path forms beneath them." The fact that it can be applied so widely proves that this approach is genuinely valuable.