Editor's note:
This article introduces the latest progress of the TiFlink project from the TiDB Hackathon 2020 competition, which uses TiKV and Flink to implement strongly consistent materialized views.
The author, Zhang Qizi, is a fan of algorithms, distributed technology, and functional programming. Personal blog: https://io-meter.com/
At the TiDB Hackathon earlier this year, my teammates and I tried to use Flink to add materialized views to TiDB and won the "Best Popularity Award". Materialized views were clearly a hot topic in this competition: three or four teams alone combined Flink to implement related functionality. It must be admitted that by the end of the competition the completeness of our project was quite low. Although the basic ideas had been settled, the final result was far from what we expected. After more than half a year of intermittent patching, we can finally release a preview version for everyone to try. This article is an introduction to our ideas and results.
Compared with other teams, our main goal was to build strongly consistent materialized view maintenance: that is, to ensure that queries against the materialized view reach an isolation level close to Snapshot Isolation, rather than the eventual consistency of typical stream processing systems. The considerations behind achieving consistency are discussed in detail below.
Introduction
Although it is an experimental project, we still explored some convenient and practical features, including:
- Zero external dependencies: Apart from the TiDB cluster and the Flink deployment environment, no other components (including a Kafka cluster or TiCDC) need to be maintained. This is because TiFlink reads and writes data directly from TiKV without going through any intermediate layer, which opens up the possibility of higher throughput, lower latency, and easier maintenance.
- Easy-to-use interface: Although TiFlink introduces some new concepts in order to achieve strong consistency, the specially written TiFlinkApp interface lets users start a task quickly without manually creating and writing to the target table.
- Combined batch and stream processing: After a task starts, the existing data in the source tables is first consumed as a batch, after which the task automatically switches to consuming CDC logs. This process also guarantees the consistency of the view.
For details on how to use TiFlink, please refer to the README. Below is a code snippet to quickly start a task:
```java
TiFlinkApp.newBuilder()
    .setJdbcUrl("jdbc:mysql://root@localhost:4000/test") // Please make sure the user has correct permission
    .setQuery(
        "select id, "
            + "first_name, "
            + "last_name, "
            + "email, "
            + "(select count(*) from posts where author_id = authors.id) as posts "
            + "from authors")
    // .setColumnNames("a", "b", "c", "d") // Override column names inferred from the query
    // .setPrimaryKeys("a") // Specify the primary key columns, defaults to the first column
    // .setDefaultDatabase("test") // Default TiDB database to use, defaults to that specified by JDBC URL
    .setTargetTable("author_posts") // TiFlink will automatically create the table if not exist
    // .setTargetTable("test", "author_posts") // It is possible to specify the full table path
    .setParallelism(3) // Parallelism of the Flink Job
    .setCheckpointInterval(1000) // Checkpoint interval in milliseconds. This interval determines data refresh rate
    .setDropOldTable(true) // If TiFlink should drop old target table on start
    .setForceNewTable(true) // If to throw an error if the target table already exists
    .build()
    .start(); // Start the app
```
Consistency of the materialized view (stream processing system)
Most of today's mainstream materialized view (stream processing) systems provide eventual consistency. That is, although the final result converges to a consistent state, end users may still query inconsistent intermediate results while processing is in progress. Eventual consistency has proven sufficient in many applications, so is stronger consistency really needed? And what does the consistency discussed here have to do with Flink's Exactly Once semantics? Some introduction is necessary.
ACID
ACID is a basic concept in databases. Generally speaking, the database acting as the source of the CDC logs already guarantees these four properties. However, when the CDC data is used for stream processing, some of these constraints may be broken.
The most typical case is the loss of atomicity. In a CDC log, the modifications of a single transaction may span multiple records, and a stream processing system that processes these records row by row can destroy the transaction's atomicity. In other words, a user querying the result set may see a transaction that has only been partially applied.
A typical case is as follows:
Change Log and the atomicity of transactions
In the case above, we have an accounts table, and transfers take place between accounts. Since a transfer modifies multiple rows, it often produces multiple CDC records. Suppose we define the following materialized view in SQL to compute the sum of all account balances:
```sql
SELECT SUM(balance) FROM ACCOUNTS;
```
Obviously, if the table only ever sees transfers between accounts, the result of this query should always be a constant. However, because current general-purpose stream processing systems cannot handle transaction atomicity, the result of this query may fluctuate constantly. In fact, on a source table under continuous concurrent modification, the fluctuation can even be unbounded.
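As a toy illustration (this is not TiFlink code; the accounts and amounts are made up), here is what a reader can observe when the two row changes of a single transfer are applied one at a time:

```java
import java.util.HashMap;
import java.util.Map;

public class AtomicityToyExample {
    public static void main(String[] args) {
        // Two accounts with 100 each; the true total in the source database is always 200.
        Map<Integer, Long> balances = new HashMap<>(Map.of(1, 100L, 2, 100L));

        // One upstream transaction "transfer 50 from account 1 to account 2"
        // arrives as two separate change-log records.
        balances.merge(1, -50L, Long::sum);                // first record applied
        long observed = balances.values().stream()
                .mapToLong(Long::longValue).sum();
        System.out.println("SUM(balance) = " + observed);  // prints 150, a state that never
                                                           // existed in the source database
        balances.merge(2, +50L, Long::sum);                // second record applied, total back to 200
    }
}
```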
Although under an eventually consistent model the result of the above query converges to the correct value after some time, a materialized view without atomicity guarantees still limits its applications. For example, suppose I want to build a tool that raises an alarm whenever the above query result deviates too far: I may then receive many false alarms, because nothing is actually abnormal on the database side and the deviation comes entirely from inside the stream processing system.
In a distributed system there is another situation that breaks atomicity: the side effects of a transaction's modifications are spread across multiple nodes. If 2PC or another distributed commit method is not used, the changes on some nodes (partitions) take effect before those on others, and atomicity is again destroyed.
Linear consistency
Unlike the CDC logs produced by a standalone database (such as MySQL's binlog), the logs produced by a distributed database such as TiDB raise linear consistency issues. In our scenario the problem can be described as follows: for operations that are performed sequentially from the user's point of view, the side effects (logs) they produce may be processed by the stream processing system in a different order because of delays in message delivery.
Suppose we have two tables, ORDERS and PAYMENTS, and a user must create an order before making a payment. Therefore, the result of the following query should never be negative:
```sql
WITH order_amount AS (SELECT SUM(amount) AS total FROM ORDERS),
     payment_amount AS (SELECT SUM(amount) AS total FROM PAYMENTS)
SELECT order_amount.total - payment_amount.total
FROM order_amount, payment_amount;
```
However, because the ORDERS table and the PAYMENTS table are stored on different nodes, the stream processing system may consume them at different rates. In other words, the stream processing system may have already seen a payment record while the corresponding order record has not yet arrived. It is therefore possible to observe a negative result for the above query.
Stream processing systems have the concept of a Watermark, which can be used to synchronize the processing progress of data from different tables, but it cannot avoid the linear consistency problem described above. A Watermark only requires that all records with a timestamp smaller than it have arrived; it does not require that no records with a timestamp larger than it have arrived. In other words, even if the ORDERS and PAYMENTS tables appear to have the same Watermark, the latter may still contain records with larger timestamps that have already arrived and taken effect.
It can be seen that a Watermark alone cannot handle the linear consistency problem; it has to work together with the timestamp assignment and message delivery mechanisms of the source database.
Need for stronger consistency
Although eventual consistency is sufficient in many scenarios, it still has many problems:
- Misleading users: Because many users do not understand consistency, or hold misconceptions about it, they may make decisions based on query results that have not yet converged. Given that most relational databases default to strong consistency, this situation should be avoided.
- Poor observability: Since eventual consistency does not bound the convergence time, and given the linear consistency problems above, it is hard to define metrics such as latency, data freshness, and throughput for a stream processing system. For example, the JOIN result a user sees may be a join between the current snapshot of table A and a snapshot of table B from ten minutes ago. How should the latency of such a query result be defined?
- Restricting some requirements: As mentioned above, because of inconsistent internal state, some alerting requirements either cannot be implemented or have to be delayed for a period of time; otherwise users have to accept a higher false alarm rate.
In fact, the lack of stronger consistency also makes some operational tasks difficult, especially DDL operations that would like to reuse previously computed results. Looking at the development history of relational and NoSQL databases, we believe that today's mainstream eventual consistency is merely an expedient imposed by the current state of the technology; as the relevant theory and research progress, stronger consistency will gradually become the mainstream of stream processing systems.
A brief introduction to the technical solution
This section describes TiFlink's technical design considerations in detail and how it implements strongly consistent materialized view (StreamSQL) maintenance.
TiKV and Flink
Since this is a TiDB Hackathon project, TiDB/TiKV-related components were naturally going to be chosen. But in my opinion, TiKV as the intermediate storage layer of a materialized view system has many outstanding advantages on its own:
- TiKV is a fairly mature distributed KV store, and a distributed environment is something the next generation of materialized view systems must support. Using the Java client provided by TiKV, we can operate on it conveniently. At the same time, TiDB itself, as an HTAP system, provides an ideal playground for materialized view requirements.
- TiKV provides transactions and MVCC based on the Percolator model, which is the foundation on which TiFlink achieves strongly consistent stream processing. As described below, TiFlink writes to TiKV mainly in the form of a series of continuous transactions.
- TiKV natively supports CDC log output. In fact, the TiCDC component relies on this capability to export CDC logs. In TiFlink, to unify batch and stream processing and simplify the architecture, we chose to call TiKV's CDC GRPC interface directly, and therefore gave up some of the features TiCDC provides.
Our initial idea was to integrate the computation directly into TiKV; choosing Flink was a conclusion we reached after further thought during the competition. The main advantages of choosing Flink are:
- Flink is currently the most mature stateful stream processing system on the market. It is highly expressive for processing tasks, supports rich semantics, and in particular offers a StreamSQL implementation that unifies batch and stream processing, so we could concentrate on the capabilities we care most about, such as strong consistency.
- Flink has fairly complete Watermark support, and we found that its checkpoint-based Exactly Once delivery semantics can easily be combined with TiKV to implement transactional processing. In fact, some of the Flink sinks that support two-phase commit also commit in conjunction with checkpoints.
- Flink's stream processing (especially StreamSQL) is itself based on the theory of materialized views. Its newer versions provide the DynamicTable interface precisely to make it easy to bring external change logs into the system, with built-in support for CDC operations such as INSERT, DELETE, and UPDATE.
Of course, choosing a heterogeneous architecture such as TiKV plus Flink also introduces problems, such as mismatched SQL dialects and the inability to share UDFs. In TiFlink, we take Flink's SQL system and UDFs as the standard, in effect using Flink as a plug-in system on top of TiKV, while also providing a convenient table creation feature.
The realization of strongly consistent materialized views
This part describes how TiFlink achieves a relatively strong consistency level, Stale Snapshot Isolation, on top of TiDB/TiKV. Under this isolation level, a querier always sees a consistent snapshot state from some point in history. Traditional snapshot isolation requires that a querier at time T observe all transactions whose commit time is less than T; stale snapshot isolation only guarantees that all transactions committed before T−Δt are observed.
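To pin this down a little more formally (the notation here is my own paraphrase, not taken from the original design), writing commit_ts(x) for the commit timestamp of a transaction x:

```latex
\text{Snapshot Isolation: a read at time } T \text{ observes exactly } \{\, x \mid \mathrm{commit\_ts}(x) \le T \,\}
\text{Stale Snapshot Isolation: } \exists\, T' \in [\, T - \Delta t,\ T \,] \text{ such that the read observes exactly } \{\, x \mid \mathrm{commit\_ts}(x) \le T' \,\}
```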
On a distributed database that supports transactions, such as TiDB, the simplest way to implement a strongly consistent materialized view is to update the view with one transaction after another. Each transaction reads a consistent snapshot when it starts, and updating the materialized view through a distributed transaction is itself a strongly consistent operation with ACID properties, so consistency is guaranteed.
Use continuous transactions to update materialized views
In order to combine Flink with such a mechanism and achieve incremental maintenance, we take advantage of some of the features already provided by TiKV itself:
- TiKV uses a Timestamp Oracle (TSO) to assign timestamps to all operations. Therefore, although it is a distributed system, the transaction timestamps in the CDC logs it produces are actually ordered.
- Each TiKV node (Region) can produce a continuous incremental change log, which contains the original information of the transactions together with their timestamp information.
- TiKV's incremental log periodically emits a Resolved Timestamp, declaring that the Region will no longer produce messages with older timestamps. This makes it very suitable for use as a Watermark.
- TiKV provides distributed transactions, allowing us to control the visibility of a batch of changes.
Therefore, the basic realization idea of TiFlink is:
- Using the unified batch and stream features, read the source tables as a snapshot at a certain global timestamp, at which point a consistent view of all source tables is obtained.
- Switch to incremental log consumption and use Flink's DynamicTable related interfaces to achieve incremental maintenance and output of materialized views.
- Commit modifications at a certain cadence, so that each batch of modifications is written to the target table in one atomic transaction, thereby presenting the materialized view as a series of successive updated views.
The key to the points above is coordinating the various nodes to complete distributed transactions together, so we first need to introduce how TiKV's distributed transactions work.
TiKV's distributed transaction
TiKV's distributed transactions are based on the well-known Percolator model. The Percolator model itself requires the underlying KV store to provide MVCC and atomic single-row reads and writes, and it uses optimistic concurrency control (OCC). On this basis, it completes a transaction with the following steps:
- Choose a transaction primary key (Primary Key), obtain a start timestamp, and write the primary key row.
- The other rows are written as secondary keys (Secondary Key) during Prewrite. Each secondary key points to the primary key and carries the start timestamp above.
- After Prewrite completes on all nodes, the transaction can be committed. The primary key is committed first and is assigned a commit timestamp.
- Once the primary key commit succeeds, the transaction has effectively been committed. However, to make reads convenient, multiple nodes can then commit the secondary keys and perform cleanup concurrently, after which the written rows become visible.
The distributed transaction above works because the commit of the primary key is atomic, and whether the secondary keys scattered across different nodes are committed successfully depends entirely on the primary key. Therefore, when another reader encounters a row that has been Prewritten but not yet committed, it checks whether the primary key has been committed. Readers also use the commit timestamp to decide whether a row of data is visible. If a cleanup operation fails partway through, subsequent readers can complete it as well.
To achieve snapshot isolation, Percolator requires writers to check for concurrent Prewrite records when writing, and the transaction can only be committed if their timestamps satisfy certain constraints. Essentially, transactions with overlapping write sets cannot be committed concurrently. In our scenario, we assume the materialized view has a single writer and its transactions are sequential, so this is not a concern.
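As a rough sketch of the flow above (this is pseudocode against a hypothetical client, not the real TiKV Java client API; Tso, KvClient, and Mutation are simplified placeholders):

```java
import java.util.List;

interface Tso { long getTimestamp(); }   // timestamp oracle, e.g. backed by PD

interface KvClient {
    // Prewrite: write the value at startTs and lock the key, recording which key is the primary.
    void prewrite(byte[] key, byte[] value, byte[] primaryKey, long startTs);
    // Commit: replace the lock with a write record visible at commitTs.
    void commit(byte[] key, long startTs, long commitTs);
}

record Mutation(byte[] key, byte[] value) {}

class PercolatorSketch {
    static void commitTransaction(Tso tso, KvClient kv, List<Mutation> mutations) {
        long startTs = tso.getTimestamp();                   // 1. start timestamp
        Mutation primary = mutations.get(0);                 //    pick one row as the primary key

        for (Mutation m : mutations) {                       // 2. prewrite primary and secondaries;
            kv.prewrite(m.key(), m.value(), primary.key(), startTs); // secondaries point at the primary
        }

        long commitTs = tso.getTimestamp();                  // 3. commit timestamp
        kv.commit(primary.key(), startTs, commitTs);         //    committing the primary decides the txn

        for (Mutation m : mutations.subList(1, mutations.size())) {
            kv.commit(m.key(), startTs, commitTs);           // 4. secondaries are committed/cleaned up
        }                                                    //    lazily; readers can resolve them too
    }
}
```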
Having understood how TiKV's distributed transactions work, what we need to consider next is how to combine them with Flink. In TiFlink, we use the Checkpoint mechanism to achieve globally consistent transaction commits.
Use Flink for distributed transaction commit
As the introduction above shows, TiKV's distributed transaction commit can be abstracted as a 2PC. Flink itself provides a sink that implements 2PC, but it cannot be used directly in our scenario, because the Percolator model needs a globally consistent start timestamp and commit timestamp at commit time. Moreover, implementing 2PC only on the sink side is not enough to achieve a strongly consistent isolation level: we also need the source side to cooperate so that each transaction reads exactly the incremental logs it needs.
Fortunately, Flink's 2PC commit mechanism is in fact driven by checkpoints: when a sink receives a checkpoint request, it completes the work necessary for the commit. Inspired by this, we can implement a pair of source and sink that use the checkpoint ID to share transaction information and complete the 2PC in step with the checkpoint process. For different nodes to agree on the transaction information (timestamps, primary key, and so on), a global coordinator has to be introduced. The interfaces of the transaction and the global coordinator are defined as follows:
```java
public interface Transaction {
    enum Status {
        NEW, PREWRITE, COMMITTED, ABORTED
    }

    long getCheckpointId();

    long getStartTs();

    long getCommitTs();

    byte[] getPrimaryKey();

    Status getStatus();
}

public interface Coordinator extends AutoCloseable, Serializable {
    Transaction openTransaction(long checkpointId);

    Transaction prewriteTransaction(long checkpointId, long tableId);

    Transaction commitTransaction(long checkpointId);

    Transaction abortTransaction(long checkpointId);
}
```
With the interfaces above, each source and sink node can use the CheckpointID to start a transaction or obtain the corresponding transaction information, and the coordinator is responsible for assigning the primary key and maintaining the transaction status. For convenience, the commit of the primary key during the transaction's commit phase is also executed in the coordinator. There are many ways to implement the coordinator; at the moment TiFlink uses the simplest one: a GRPC service started in the process where the JobManager lives. It would also be possible to implement a distributed coordinator based on TiKV's PD (ETCD) or on TiKV itself.
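For illustration, here is the call sequence of one checkpoint collapsed into a single method (a sketch only: in reality the source, sink, and coordinator callbacks run on different nodes and at different times, and checkpointId and tableId would come from Flink's checkpoint barrier and the target table):

```java
// Hypothetical walk-through of one checkpoint cycle against the Coordinator interface above.
void onCheckpoint(Coordinator coordinator, long checkpointId, long tableId) {
    // Source side: any task may ask for the transaction; the coordinator opens it only once
    // per checkpoint and hands every caller the same start timestamp.
    Transaction txn = coordinator.openTransaction(checkpointId);
    long startTs = txn.getStartTs(); // sources emit only cached CDC events with commit_ts <= startTs

    // Sink side: register prewrites for this checkpoint and target table
    // before the checkpoint barrier is acknowledged.
    coordinator.prewriteTransaction(checkpointId, tableId);

    // Once all sinks have finished the checkpoint, a single callback commits the
    // primary key on behalf of everyone; secondary keys are cleaned up afterwards.
    coordinator.commitTransaction(checkpointId);
}
```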
Coordinated execution of transaction and Checkpoint
The figure above shows how the execution of distributed transactions and checkpoints is coordinated in Flink. The specific process of a transaction is as follows:
- Source first receives incremental logs from TiKV, caches them according to the timestamp, and waits for the start of the transaction.
- When the Checkpoint process starts, Source will receive the signal first. The Checkpoint on the Source side and the log receiving service run in different threads.
- The Checkpoint thread first obtains the current transaction information (or starts a new transaction) through the global coordinator. In a distributed case, the transaction corresponding to a CheckpointID will only be started once.
- After obtaining the transaction's start timestamp, the Source node emits the committed modifications in its cache whose timestamps are smaller than the start timestamp to the downstream compute nodes for consumption. At this point the Source node also emits some Watermarks.
- When all Source nodes have completed the above, the Checkpoint completes successfully on the Source side and continues to propagate downstream. By Flink's mechanism, the Checkpoint guarantees that all events that precede it have already been consumed at every node it reaches.
- When the Checkpoint reaches the Sink, all events that propagated to the Sink before it have already been Prewritten, and the transaction commit process can begin. The Sink persists the transaction information in its internal state so it can recover after a failure. After all Sink nodes complete this operation, the Coordinator's commit method is invoked in a callback to commit the transaction.
- After committing the transaction, Sink will start the thread to clean up the Secondary Key and start a new transaction at the same time.
Note that before the first Checkpoint starts, the Sink may already have received data to write, and at that point it has no transaction information. To solve this, TiFlink starts an initial transaction as soon as the task starts, with a corresponding CheckpointID of 0, which is used to commit these initial writes. Under this scheme, when the Checkpoint with CheckpointID=1 completes, it is actually transaction 0 that gets committed. Transactions and Checkpoints thus execute in a staggered fashion.
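A condensed sketch of how a sink could hook into Flink's checkpoint lifecycle under this staggered scheme follows (illustrative only, assuming a recent Flink version; the class, the tableId field, and the prewriteToTikv helper are made up, and state persistence, error handling, and the actual TiKV writes are omitted):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

public class TransactionalSinkSketch extends RichSinkFunction<Row>
        implements CheckpointedFunction, CheckpointListener {

    private final Coordinator coordinator;  // the interface shown earlier
    private final long tableId;             // hypothetical id of the target table
    private long currentTxnId = 0;          // writes before the first checkpoint go to "transaction 0"
    private final Map<Long, Long> txnForCheckpoint = new HashMap<>(); // checkpointId -> txn to commit

    public TransactionalSinkSketch(Coordinator coordinator, long tableId) {
        this.coordinator = coordinator;
        this.tableId = tableId;
    }

    @Override
    public void invoke(Row change, Context context) {
        // Prewrite each incoming change as a secondary key of the current transaction.
        Transaction txn = coordinator.prewriteTransaction(currentTxnId, tableId);
        prewriteToTikv(txn, change);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) {
        // The checkpoint barrier reached this sink: everything before it has been prewritten
        // into currentTxnId. Remember which transaction this checkpoint seals, then move on.
        txnForCheckpoint.put(ctx.getCheckpointId(), currentTxnId);
        currentTxnId = ctx.getCheckpointId();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // All sinks finished the checkpoint: the coordinator commits the primary key of the
        // sealed transaction, after which secondary keys can be cleaned up in the background.
        Long txnId = txnForCheckpoint.remove(checkpointId);
        if (txnId != null) {
            coordinator.commitTransaction(txnId);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) { /* recovery omitted */ }

    private void prewriteToTikv(Transaction txn, Row change) { /* TiKV client calls omitted */ }
}
```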
The following figure shows the structure of the entire TiFlink task including the coordinator:
TiFlink system architecture
Based on the system design above, we obtain a materialized view on TiKV that provides stale snapshot isolation.
Other design considerations
As we all know, KSQL is another popular stream processing system besides Flink. It is directly integrated with the Kafka message queue, so users do not need to deploy two separate systems, which makes it attractive to some users; many also use KSQL to implement requirements such as materialized views. However, in my opinion, a stream processing system so strongly coupled to a message queue is not well suited to materialized view use cases.
KSQL can be said to be representative of Log Oriented data processing systems. In such systems, the source of truth is the log, and all tables are views constructed by consuming the log for convenient querying. These systems have the advantages of a simple model, easy implementation, long-term retention of log records, and so on.
In contrast, MySQL and TiDB/TiKV belong to the class of Table Oriented data processing systems. In these systems, all modification operations act on the table data structure; although logs are produced along the way, the modifications to the table and the log are usually coordinated together. Here the log mainly serves persistence and transactions, and it is usually not retained for long. Compared with Log Oriented systems, this class of systems is somewhat more complex on the write and transaction path, but it scales better.
In the final analysis, this is because data in a Log Oriented system is stored in the form of logs, so scaling out often requires expensive rehashing and rebalancing is harder to achieve. In a Table Oriented system, data is mainly stored as tables, so it can be kept ordered by certain columns, which makes it easy to split, merge, and rebalance data at the granularity of Ranges.
Personally, I think that in a batch-and-stream-unified materialized view scenario, keeping logs around for a long time does not make much sense (it is always possible to recover the data from a snapshot of the source tables). What matters more is being able to keep scaling the data processing tasks and views as the business grows. From this perspective, Table Oriented systems seem more suitable as the storage medium for materialized view requirements.
Of course, Region (partition) merges and splits that happen while the incremental log is being consumed in real time are a harder problem to handle; TiKV throws a GRPC error in this case. TiFlink currently uses a relatively simple static mapping to handle the relationship between tasks and Regions, and more reasonable solutions can be explored in the future.
Summary
This article has introduced the basic principles of using Flink to implement strongly consistent materialized views on TiKV. These principles have largely been implemented in the TiFlink system, and readers are welcome to try it out. All of the discussion above rests on the premise of Flink's eventually consistent computation model, namely: the result of a stream computation depends only on the consumed events and their order within their own streams, and not on the order in which they arrive in the system or on the relative order between different streams.
The current TiFlink system has many points worthy of improvement, such as:
- Support for non-integer primary keys and composite primary keys
- Better mapping from TiKV Region to Flink tasks
- Better Fault Tolerance and cleanup of TiKV transactions when tasks are interrupted
- Complete unit testing
If you are interested in TiFlink, please give it a try and leave feedback. It would be great if you could also contribute code to help improve the system.
Thinking about the consistency of materialized view systems has been one of my main gains this year. We did not pay attention to this aspect at first, but through constant discussion we came to realize that it is a valuable and very challenging problem. Through the implementation of TiFlink, the feasibility of the approach above for achieving stale snapshot consistency has been basically verified. Of course, my personal abilities are limited, so if there are any mistakes, you are welcome to point them out and discuss them.
Finally, if we assume the above account of stale snapshot consistency is correct, then the way to achieve true snapshot isolation suggests itself. Can you, the reader, think of it?