
1. Background

Percolator is a distributed transaction solution proposed by Google in the paper "Large-scale Incremental Processing Using Distributed Transactions and Notifications". In the paper, the scheme is used to solve the incremental indexing problem of Google's search engine.

Percolator supports ACID semantics and implements the Snapshot Isolation transaction isolation level, so it can be regarded as a general-purpose distributed transaction solution. Percolator is built on Google's Bigtable and is essentially a two-phase commit protocol that leverages Bigtable's single-row transactions.

2. Architecture

Percolator consists of three components:

  • Client : the control center of the entire protocol and the coordinator of the two-phase commit;
  • TSO : a global timestamp service that provides globally unique, monotonically increasing timestamps;
  • Bigtable : the distributed storage that actually persists the data.

2.1 Client

There are two roles in the two-phase commit algorithm: the coordinator and the participants. In Percolator, the Client acts as the coordinator, responsible for initiating and committing transactions.

2.2 Timestamp Oracle (TSO)

Percolator relies on the TSO to provide globally unique, monotonically increasing timestamps in order to implement Snapshot Isolation. At the beginning of a transaction and at commit time, the Client needs to obtain a timestamp from the TSO.

2.3 Bigtable

From the data-model perspective, Bigtable can be understood as a multi-dimensional ordered map with key-value pairs of the form:

(row:string, column:string, timestamp:int64) -> string

The key is the triple (row, column, timestamp), and the value can be regarded as an uninterpreted byte array.
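
A minimal sketch of this logical model (the class and method names are illustrative, not Bigtable's real API):

class MultiDimMap:
    """In-memory stand-in for Bigtable's (row, column, timestamp) -> value map."""
    def __init__(self):
        self.store = {}  # (row, column, timestamp) -> value

    def write(self, row, col, ts, value):
        self.store[(row, col, ts)] = value

    def read_latest(self, row, col, max_ts):
        # Return the value with the largest timestamp <= max_ts, if any.
        versions = [t for (r, c, t) in self.store if r == row and c == col and t <= max_ts]
        return self.store[(row, col, max(versions))] if versions else None

Real Bigtable additionally keeps the versions of a cell sorted by timestamp, so reading the newest version is efficient.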

In Bigtable, a row can contain multiple columns, and Bigtable provides single-row transactions across multiple columns. Percolator uses this feature to ensure that operations on multiple columns of the same row are atomic. Percolator's metadata is stored in special columns, as follows:

(Picture from: https://research.google )

We mainly need to pay attention to three columns: c:lock, c:write, and c:data:

  • c:lock , during Prewrite, the transaction inserts a lock record into this column;
  • c:write , when the transaction commits, a commit record is inserted into this column;
  • c:data , stores the data itself.

2.4 Snapshot Isolation

  • All reads in a transaction see one consistent snapshot, comparable to the Repeatable Read isolation level;
  • When two concurrent transactions write to the same cell, at most one of them can commit successfully;
  • When a transaction commits, if any data it updated has also been modified by another transaction whose commit timestamp is greater than this transaction's start timestamp, the transaction is rolled back; otherwise it commits;
  • Write skew is still possible: two transactions whose read sets overlap but whose write sets do not can both commit successfully, yet neither sees the data written by the other. This falls short of the Serializable isolation level. In exchange, Snapshot Isolation offers better read performance than Serializable, because reads only access snapshot data and take no locks.

3. Transaction Processing

3.1 Write logic

Percolator uses a two-phase commit algorithm (2PC) to commit transactions. The two phases are Prewrite and Commit.

In the Prewrite phase:

1) Obtain a timestamp from the TSO and use it as the transaction's start_ts;

2) For each row the transaction writes, write the transaction's start_ts into the lock column, and write the new data into the data column tagged with start_ts, such as 14: "value2" in the figure above. One of these locks is chosen as the primary lock; the others are secondary locks, and each secondary lock contains a pointer to the primary lock.

The Prewrite of a row fails, and the current transaction must roll back, if either of the following holds:

1. the data to be written already contains a version newer than start_ts;

2. the row to be locked already holds a lock, at any timestamp.

In the Commit phase:

1) Obtain a timestamp from the TSO and use it as the transaction's commit_ts;

2) Remove the primary lock, and write a commit record with commit_ts into the write column. These two operations must be atomic; they run in a single Bigtable row transaction. If the primary lock no longer exists, the commit fails;

3) Repeat the above step for all secondary locks.
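
To make the two phases concrete, here is a minimal single-process Python sketch; the in-memory table stands in for Bigtable, and row_of, prewrite, and commit are illustrative names rather than a real API. The two checks in prewrite mirror the two rollback conditions above, and the lock check in commit mirrors step 2).

class ConflictError(Exception):
    pass

# table[row] = {"data": {start_ts: value},
#               "lock": {start_ts: primary_row},
#               "write": {commit_ts: start_ts}}
table = {}

def row_of(key):
    return table.setdefault(key, {"data": {}, "lock": {}, "write": {}})

def prewrite(key, value, start_ts, primary):
    r = row_of(key)
    # Rollback condition 1: a committed version newer than start_ts exists.
    if any(commit_ts >= start_ts for commit_ts in r["write"]):
        raise ConflictError(f"{key}: write conflict")
    # Rollback condition 2: some lock (at any timestamp) is already present.
    if r["lock"]:
        raise ConflictError(f"{key}: locked")
    r["lock"][start_ts] = primary      # secondary locks point at the primary
    r["data"][start_ts] = value

def commit(key, start_ts, commit_ts):
    r = row_of(key)
    if start_ts not in r["lock"]:      # lock is gone: the commit must fail
        raise ConflictError(f"{key}: lock missing")
    # In the real system the next two writes are one Bigtable row transaction.
    r["write"][commit_ts] = start_ts   # publish the version
    del r["lock"][start_ts]            # release the lock

The bank-transfer walkthrough below maps onto this sketch as prewrite("Bob", "$3", 7, "Bob") and prewrite("Joe", "$9", 7, "Bob"), followed by commit("Bob", 7, 8) (the commit point) and commit("Joe", 7, 8).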

Let's look at a concrete example, the classic bank transfer: moving $7 from account Bob to account Joe.

1. Before the transaction starts, the two accounts Bob and Joe hold $10 and $2 respectively.

(Picture from: https://research.google )

2. In the Prewrite phase, write a lock (7: I am primary) into Bob's lock column; this lock is the primary lock. At the same time, write the data 7: $3 into the data column.

(Picture from: https://research.google )

3. In the Prewrite phase, continue writing the secondary lock. Write the lock (7: primary@Bob.bal) into Joe's lock column; this lock points to the primary lock written earlier. At the same time, write 7: $9 into the data column.

(Picture from: https://research.google )

4. In the commit phase, first commit the primary: write a new record at the new timestamp (that is, commit_ts) into the write column and, in the same row transaction, clear the primary lock from the lock column.

(Picture from: https://research.google )

5. In the commit phase, clear the secondary locks and, likewise, write new records at the new timestamp into the write column.

(Picture from: https://research.google )

3.2 Read logic

1) Obtain a timestamp ts from the TSO.

2) Check whether the data we want to read holds a lock with a timestamp in the range [0, ts].

  • If such a lock exists, the data is locked by a transaction that started before ours but has not yet committed. Since we cannot tell whether that transaction will commit, the read cannot proceed; we must wait for the lock to be released before reading.
  • If there is no lock, or the lock's timestamp is greater than ts, the read can proceed.

3) From the write column, get the record with the largest commit_ts in the range [0, ts]; it contains the corresponding start_ts.

4) Using that start_ts, read the corresponding record from the data column.
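
Continuing the sketch from section 3.1 (reusing its table and row_of; names remain illustrative), the read path maps onto these four steps, with lock-waiting reduced to raising an exception:

class MustWaitForLock(Exception):
    pass

def get(key, ts):
    r = row_of(key)
    # Step 2: a lock in [0, ts] may belong to an uncommitted earlier
    # transaction, so visibility cannot be decided yet; we must wait.
    if any(lock_ts <= ts for lock_ts in r["lock"]):
        raise MustWaitForLock(key)
    # Step 3: the latest commit record in [0, ts] yields the start_ts ...
    commits = [c for c in r["write"] if c <= ts]
    if not commits:
        return None                    # nothing visible in this snapshot
    start_ts = r["write"][max(commits)]
    # Step 4: ... which locates the value in the data column.
    return r["data"][start_ts]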

3.3 Handling Client Crash Scenarios

Percolator's transaction coordinator lives on the Client side, and the Client may crash. If the Client fails during the commit process, the locks written by the transaction are left behind. If these locks are not cleaned up promptly, subsequent transactions will block on them indefinitely.

Percolator cleans up locks lazily. When transaction A encounters a lock left by transaction B, and A judges that B has failed, A cleans up B's lock. However, A can rarely be 100% certain that B has really failed; A might then clear locks belonging to a transaction B that is in fact alive and in the middle of committing.

To avoid this anomaly, Percolator designates one of the locks written by each transaction as the primary lock, serving as the synchronization point for both cleanup and commit. Both cleanup and commit modify the state of the primary lock, and because lock modifications are performed inside a Bigtable row transaction, at most one of the competing cleanup and commit operations can succeed, which prevents the concurrency anomaly described above.

The status of the primary lock tells us whether the transaction committed:

If the primary lock no longer exists and a commit_ts has been written into the write column, the transaction committed successfully;

If the primary lock still exists, the transaction has not reached the commit phase, i.e., it has not committed.

When transaction A encounters a lock record left by transaction B during its commit, it acts according to the state of B's primary lock.

If transaction B's primary lock does not exist and there is a commit_ts in the write column, transaction A needs to roll forward B's lock records. Roll-forward is the reverse of rollback: clear the lock record and write commit_ts into the write column.

If transaction B's primary lock still exists, transaction A can conclude that B has not committed. Transaction A may then choose to clear the lock records left by B, but it must clear B's primary lock first.

If transaction B's primary lock does not exist and there is no commit_ts in the write column, transaction B has been rolled back; transaction A only needs to clear the locks B left behind.

Although the above logic never produces inconsistency, transaction A may still clean up the primary lock of a live transaction B, forcing B to roll back, which hurts overall system performance.

To mitigate this, Percolator uses the Chubby lock service to store the liveness of each Client that is committing transactions, so it can determine whether a Client has really died. Only after the Client has truly died will a conflicting transaction clear its primary lock and conflicting lock records. However, a Client may be alive yet stuck, making no progress on its commit; in that case, if its lock records are never cleaned up, other conflicting transactions can never commit.

To handle this scenario, a wall time is also stored in each liveness record. If the wall time is judged too old, the conflicting lock records are processed anyway. Long-running transactions must therefore refresh their wall time periodically to avoid being rolled back.

The final conflict-handling logic is as follows:

If transaction B's primary lock does not exist and there is a commit_ts in the write column, transaction A needs to roll forward B's lock records. Roll-forward is the reverse of rollback: clear the lock record and write commit_ts into the write column.

If transaction B's primary lock does not exist and there is no commit_ts in the write column, transaction B has been rolled back; transaction A only needs to clear the locks B left behind.

If transaction B's primary lock exists and its TTL has expired, transaction A may clear the lock records left by B, clearing B's primary lock first.

If transaction B's primary lock exists and its TTL has not expired, transaction A must wait for B to commit or roll back before proceeding.
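
This decision procedure can be condensed into a small sketch; the inputs describe transaction B's observed state, and the returned action names are illustrative rather than real Percolator or TiKV APIs:

def resolve_conflict(primary_lock_exists, commit_ts, ttl_expired):
    """How transaction A handles a lock left by transaction B.
    commit_ts is B's commit timestamp from the write column, or None."""
    if not primary_lock_exists:
        if commit_ts is not None:
            return "roll_forward"            # clear B's locks, write commit_ts
        return "clean_locks"                 # B rolled back: clear leftover locks
    if ttl_expired:
        return "clear_primary_then_clean"    # primary first (the sync point)
    return "wait"                            # B may be alive: await commit/rollback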

4. Implementation and Optimization in TiKV

4.1 Implementation of Percolator in TiKV

TiKV's underlying storage engine is RocksDB. RocksDB provides atomic write batches, which satisfy Percolator's requirement for row transactions.

RocksDB provides a feature called Column Family (CF). A RocksDB instance can have multiple CFs; each CF is an isolated key space with its own LSM-tree. However, all CFs in the same RocksDB instance share one WAL, which makes writes across multiple CFs atomic.

In TiKV, there are three CFs in a RocksDB instance: CF_DEFAULT, CF_LOCK, and CF_WRITE, which correspond to the data column, lock column and write column of Percolator respectively.

We also need to store multiple versions of each key. How is the version information represented? In TiKV, the key and the timestamp are simply combined into an internal key for storage in RocksDB. The contents of each CF are:

  • CF_DEFAULT: (key, start_ts) -> value
  • CF_LOCK: key -> lock_info
  • CF_WRITE: (key, commit_ts) -> write_info

The key and timestamp are combined as follows:

  • Encode the user key as memcomparable;
  • Invert the timestamp bit by bit, then encode it in big-endian form;
  • Append the encoded timestamp to the encoded key.

For example, key "key1" with timestamp 3 is encoded as "key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc". This way, different versions of the same key are adjacent in RocksDB, and newer versions sort before older ones.
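
A runnable sketch of this encoding; the padding rule used here (each 8-byte group zero-padded, followed by a marker byte of 0xFF minus the number of pad bytes) is our reading of the memcomparable format, so treat the details as illustrative:

def encode_memcomparable(key: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(key) + 1, 8):   # walk the key in 8-byte groups
        group = key[i:i + 8]
        pad = 8 - len(group)
        out += group + b"\x00" * pad      # zero-pad the group to 8 bytes
        out.append(0xFF - pad)            # marker byte: 0xFF minus pad length
        if pad > 0:
            break
    return bytes(out)

def encode_key(key: bytes, ts: int) -> bytes:
    inverted = ~ts & 0xFFFFFFFFFFFFFFFF   # bitwise NOT, kept within 64 bits
    return encode_memcomparable(key) + inverted.to_bytes(8, "big")

# encode_key(b"key1", 3)
# == b"key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc"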

The implementation of Percolator in TiKV differs slightly from the paper. In TiKV, CF_WRITE holds four different record types:

  • Put , CF_DEFAULT holds the corresponding data; the write was a Put operation;
  • Delete , indicating that the write was a Delete operation;
  • Rollback , when rolling back a transaction, instead of simply deleting the record in CF_LOCK, we insert a Rollback record in CF_WRITE;
  • Lock , indicating the key was locked by a transaction that did not modify its value.

4.2 Optimization of Percolator in TiKV

4.2.1 Parallel Prewrite

For a transaction, we do not perform Prewrite row by row, one after another. When there are multiple TiKV nodes, Prewrite is executed on multiple nodes in parallel.

In TiKV's implementation, when a transaction commits, the keys involved are divided into multiple batches, and each batch is prewritten in parallel; there is no need to ensure that the primary key's Prewrite succeeds first.

If a conflict occurs during the Prewrite phase, the transaction is rolled back. When rolling back, we insert a Rollback record into CF_WRITE instead of deleting the corresponding lock record as the Percolator paper describes. The Rollback record marks the transaction as rolled back, so a Prewrite request that arrives later (for example, delayed by network problems) will not succeed. Letting such a late Prewrite succeed would still be correct, but the key would stay locked until the lock record expires, blocking other transactions from locking it.
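
A sketch of the rollback marker, on top of the in-memory model from section 3.1 (the record shape is simplified; real TiKV encodes a record type inside write_info):

def rollback(key, start_ts):
    r = row_of(key)
    r["lock"].pop(start_ts, None)        # drop the transaction's lock, if any
    r["write"][start_ts] = "Rollback"    # tombstone: start_ts was rolled back

def prewrite_allowed(key, start_ts):
    # A late-arriving Prewrite from the same transaction fails on the marker.
    return row_of(key)["write"].get(start_ts) != "Rollback"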

4.2.2 Short Value in Write Column

When we read a value, we first find the latest version's start_ts from CF_WRITE, and then use it to fetch the actual record from CF_DEFAULT. For small values, paying two RocksDB lookups is relatively expensive.

To avoid two RocksDB lookups for short values, the implementation makes an optimization: if the value is small, the Prewrite phase stores it in CF_LOCK instead of CF_DEFAULT, and the Commit phase then moves it from CF_LOCK into CF_WRITE. Reading such a short value then only touches CF_WRITE, saving one RocksDB lookup.
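
A sketch of the idea on the same in-memory model; SHORT_VALUE_LIMIT is an assumed threshold, not TiKV's actual constant:

SHORT_VALUE_LIMIT = 64  # assumed cut-off for "short" values

def prewrite_short(key, value, start_ts, primary):
    r = row_of(key)
    if len(value) <= SHORT_VALUE_LIMIT:
        r["lock"][start_ts] = (primary, value)   # embed the value in lock_info
    else:
        r["lock"][start_ts] = (primary, None)
        r["data"][start_ts] = value              # large values go to CF_DEFAULT

def commit_short(key, start_ts, commit_ts):
    r = row_of(key)
    primary, short_value = r["lock"].pop(start_ts)
    # Carry a short value into write_info so reads touch only CF_WRITE.
    r["write"][commit_ts] = (start_ts, short_value)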

4.2.3 Point Read Without Timestamp

For each transaction, we first allocate a start_ts and then ensure the transaction sees only data committed before start_ts. But if a transaction only reads a single key, do we need to allocate a start_ts for it? The answer is no: we can simply read the latest version of that key.

4.2.4 Calculated Commit Timestamp

To guarantee Snapshot Isolation, all of a transaction's reads must be repeatable. The commit_ts must be large enough to ensure that no transaction commits at a timestamp earlier than a read that has already been performed; otherwise repeatable read is broken. For example:

Txn1 gets start_ts 100

Txn2 gets start_ts 200

Txn2 reads key "k1" and gets value "1"

Txn1 prewrites "k1" with value "2"

Txn1 commits with commit_ts 101

Txn2 reads key "k1" and gets value "2"

Txn2 read "k1" twice but got different results. If commit_ts were allocated from PD, this problem could not occur: Txn2's first read happens before Txn1's Prewrite, and Txn1's commit_ts is allocated only after its Prewrite completes, so Txn1's commit_ts would necessarily be greater than Txn2's start_ts, and Txn2 would not see Txn1's write.

However, commit_ts cannot be arbitrarily large. If commit_ts runs ahead of the actual time, the committed data may be invisible to new transactions whose start_ts is taken from the real current time, which breaks integrity. Without asking PD, we cannot be sure whether a given timestamp exceeds the current real time.

In order to ensure the semantics of Snapshot Isolation and data integrity, the valid range of commit_ts should be:

max(start_ts, max_read_ts_of_written_keys) < commit_ts <= now

The commit_ts is calculated as:

commit_ts = max(start_ts, region_1_max_read_ts, region_2_max_read_ts, ...) + 1

where region_N_max_read_ts is the maximum timestamp of all read operations performed on Region N, over all Regions involved in the transaction.
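
A one-line sketch of this calculation; how each TiKV node tracks its Regions' maximum read timestamps is elided, and the names are illustrative:

def calculated_commit_ts(start_ts, region_max_read_ts):
    # Strictly greater than the transaction's own start_ts and than every
    # timestamp at which one of its written keys has been read.
    return max(start_ts, *region_max_read_ts) + 1

# e.g. calculated_commit_ts(100, [150, 120]) == 151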

4.2.5 Single Region 1PC

For a non-distributed database, guaranteeing the ACID properties of transactions is relatively easy. For a distributed database, however, 2PC is normally required, and the Percolator algorithm used by TiKV is such a 2PC algorithm.

Within a single Region, a write batch is guaranteed to execute atomically. Therefore, if all data touched by a transaction lives in one Region, 2PC is unnecessary: as long as the transaction has no write conflicts, it can be committed directly in a single phase.

5. Summary

Advantages:

  • Transaction management is built on top of the storage system; the overall architecture is clear, scales well, and is simple to implement;
  • In scenarios with few transaction conflicts, read and write performance is decent;

Drawbacks:

  • In scenarios with many transaction conflicts, performance is poor: conflicting transactions must retry repeatedly, and the overhead is high;
  • Even with MVCC concurrency control, reads may still have to wait: a read blocked on a lock under a read-write conflict has a noticeable impact on read performance;

Overall, the Percolator model is well designed, with a clear structure and a simple implementation; in scenarios with few read-write conflicts it delivers decent performance.

6. References

1. The author of Codis reveals the TiKV transaction model, an open-source implementation of Google Spanner

2. An analysis of the pros and cons of the Google Percolator transaction model

3. Large-scale Incremental Processing Using Distributed Transactions and Notifications – Google Research

4. Database · Principles · An interpretation of the Google Percolator distributed transaction implementation (taobao.org)

Author: Wang Xiang, vivo Internet Database Team
