Author: Tong Mu

Introduction

Around the time distributed databases based on the Percolator commit protocol were being proposed, a technology called the deterministic database appeared in academic research, and several schools of thought emerged as the idea developed. This article explains the different academic deterministic databases and their characteristics, and discusses their advantages and problems.

This article covers, in order:

  • Definition of deterministic database;
  • Scalable deterministic database Calvin;
  • BOHM & PWV, deterministic databases based on dependency analysis;
  • Aria, a practice-oriented deterministic database.

Definition of deterministic database

The determinism of a deterministic database refers to the determinism of its execution results. In a word: given a set of input transactions, the database produces a unique result after executing them.

Figure 1-Nondeterminism when there is no partial order

But this determinism depends on a partial order, which represents the order in which transactions are executed in the database system. In Figure 1, two transactions execute concurrently, but no partial order between them has been established, so their execution order is unconstrained, and different execution orders produce different results.

Figure 2-Use the transaction manager to sort input transactions

The example in Figure 1 shows that to achieve deterministic results, we need to order transactions before they execute. In Figure 2, a transaction manager is added: before a transaction executes, it applies to the transaction manager for an id, and global transaction execution can then be seen as proceeding in id order. In the figure, T2 reads the write of T1, and T3 reads the write of T2, so these three transactions must execute in the order T1 -> T2 -> T3 to produce the correct result.
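
To make this concrete, here is a minimal Python sketch (our own illustration, not code from any of the systems discussed): once every transaction carries an id from the transaction manager, replaying the input in id order admits exactly one final state.

```python
# A minimal sketch: a replica that applies transactions strictly in
# id order always produces the same final state for the same input.

def run_replica(transactions):
    """Apply transactions in id order; the result is fully determined."""
    state = {}
    for txn in sorted(transactions, key=lambda t: t["id"]):
        for key, value in txn["writes"].items():
            state[key] = value
    return state

# T1 writes x=1 and T2 overwrites x=2; with ids assigned, every
# replica given the same input set converges to {"x": 2}.
txns = [
    {"id": 2, "writes": {"x": 2}},
    {"id": 1, "writes": {"x": 1}},
]
print(run_replica(txns))  # {'x': 2} on every replica, every time
```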

Figure 3-The occurrence of deadlock

Another advantage of a deterministic database is that it avoids deadlocks. Deadlocks arise because interactive transactions have intermediate states. Figure 3 illustrates the process: T1 and T2 each write a key, then T1 tries to write T2's key while T2 tries to write T1's key, producing a deadlock that forces one of the transactions to be aborted. Now imagine that the transaction input is complete, so the database knows at the start of a transaction every operation it will perform. In the example above, the database would know as soon as T2 arrives that T2 has a write dependency on T1, and would simply wait for T1 to finish before executing T2, so no deadlock can arise during execution. Moreover, when a deadlock does occur, there is no fixed rule about which transaction gets aborted, so in some databases deadlock handling is itself a source of nondeterministic results.
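
The sketch below shows one way this plays out. It assumes, as a deterministic database can, that each transaction's full write set is known before execution, and it takes locks in a single fixed global order, so the cyclic wait of Figure 3 can never form. The helper names are purely illustrative.

```python
# Deadlock avoidance with complete inputs: every transaction takes its
# locks in one fixed global order (sorted key order), so no cycle of
# waiting transactions can arise.

import threading

locks = {key: threading.Lock() for key in ("x", "y")}

def acquire_in_order(write_set):
    """Take all locks a transaction needs, in sorted key order."""
    held = []
    for key in sorted(write_set):    # the fixed global order
        locks[key].acquire()
        held.append(key)
    return held

def release(held):
    for key in reversed(held):
        locks[key].release()

# T1 writes {x, y} and T2 writes {y, x}; both sort to (x, y), so the
# interleaving that deadlocked in Figure 3 cannot happen.
for write_set in ({"x", "y"}, {"y", "x"}):
    held = acquire_in_order(write_set)
    release(held)
```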

Nondeterminism in a database system can also come from many other sources, such as network errors, IO errors, and other unforeseen conditions, which usually manifest as the failure of some transaction. In discussing deterministic databases, we will also look at how they avoid nondeterministic results caused by such conditions.

Determinism is a strongly constrained protocol: once the order of transactions is determined, the result is determined. Based on this property, a deterministic database can optimize away much of the overhead of the replica replication protocol. Because it can guarantee that writes succeed, in some implementations the results of reads can even be predicted. But determinism is not a silver bullet; the protocol has corresponding costs. This article analyzes its flaws in concrete cases and the difficulties deterministic databases face.

Scalable deterministic database Calvin

Calvin was proposed in 2012, in the same period as Spanner, and tried to use the properties of a deterministic database to solve the database scalability problems of the time. This research later evolved into the commercial database FaunaDB.

Figure 4-Calvin's architecture diagram

Figure 4 shows the architecture of the Calvin database. Although it looks complicated, there are two main problems to solve:

  • How to ensure certainty in a replica?
  • How to ensure consistency between replicas?

First, the first question. Within a replica, nodes are distributed by partition, and each node can be divided into three parts: sequencer, scheduler, and storage:

  • The sequencer is responsible for replica replication; it packs received transactions into a batch every 10 ms and sends them to the corresponding schedulers;
  • The scheduler is responsible for executing transactions and guaranteeing deterministic results;
  • Storage is a standalone storage engine that only needs to support a KV CRUD interface.

Figure 5-Calvin execution process one

We illustrate with a series of transaction inputs. Suppose there are three sequencers, each receiving some transactions, and transactions are packed into batches every 10 ms. In the first 10 ms, T1 on sequencer1 conflicts with T10 on sequencer3; by the requirements of the deterministic protocol, these two transactions must execute in the order T1 -> T10. These transaction batches are replicated through the Paxos algorithm before being sent to the schedulers; we will return to replica replication later.

Figure 6-Calvin execution process two

In Figure 6, the batches are sent to the corresponding schedulers. Because T1's id is smaller than T10's, T1 should execute earlier. Calvin allocates locks on the scheduler, and once a batch's lock allocation finishes, the transactions holding locks can execute. In our example, lock allocation can go two ways: either T1 tries to acquire the lock on x and finds it already held by T10, or T10 tries to acquire the lock held by T1 and fails. In either case the outcome is that T1 executes first and T10 executes later.
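
A rough sketch of the scheduler's idea follows (heavily simplified from the Calvin paper; the queue structure is our own illustration): lock requests within a batch are handled in transaction-id order, so T1 always ends up ahead of T10 in the queue for x, whatever the arrival order.

```python
# Deterministic lock scheduling: requests are enqueued in txn-id
# order, and a transaction runs once it heads every queue it needs.

from collections import defaultdict, deque

lock_queues = defaultdict(deque)   # key -> FIFO of waiting txn ids

def request_locks(batch):
    """Enqueue lock requests for a batch, smallest txn id first."""
    for txn in sorted(batch, key=lambda t: t["id"]):
        for key in txn["keys"]:
            lock_queues[key].append(txn["id"])

def runnable(txn):
    """A txn may execute once it is at the head of all its queues."""
    return all(lock_queues[k][0] == txn["id"] for k in txn["keys"])

batch = [{"id": 10, "keys": ["x"]}, {"id": 1, "keys": ["x"]}]
request_locks(batch)
print([t["id"] for t in batch if runnable(t)])  # [1]: T1 before T10
```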

Figure 7-Calvin's nondeterminism problem

Thinking through Calvin's execution flow, you may wonder whether the problem shown in Figure 7 can occur: if T1 is sent to the scheduler only after T10 has already finished executing, the execution order of T1 and T10 would again be nondeterministic. To solve this, Calvin has a global coordinator role responsible for coordinating the work phases of all nodes: as long as some node in the cluster has not finished sending its batch to the schedulers, no node may enter the next execution phase.

At the SQL level, the read/write sets of some predicate statements are not known before execution. In such cases Calvin cannot analyze the transaction; for example, the scheduler does not know which nodes to send read and write requests to, or which locks to take. Calvin uses the OLLP strategy to solve this problem: when a transaction enters the sequencer, Calvin issues a tentative read to determine its read/write set, and if the pre-read read/write set turns out to have changed during execution, the transaction must be restarted.
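
Here is a hedged sketch of the OLLP flow. The function names are ours, and the real protocol runs the reconnaissance step before sequencing and re-checks under the locks actually taken, but the restart-on-changed-read-set logic is the essence.

```python
# OLLP in miniature: read the predicate's read/write set tentatively,
# then verify it at execution time and restart if it has changed.

def discover_read_write_set(txn, db):
    """Reconnaissance: evaluate the predicate against current data."""
    return {k for k, v in db.items() if txn["predicate"](k, v)}

def execute_with_ollp(txn, db):
    assumed = discover_read_write_set(txn, db)     # before sequencing
    while True:
        actual = discover_read_write_set(txn, db)  # at execution time
        if actual == assumed:
            return actual    # the locks taken on `assumed` were right
        assumed = actual     # set changed: restart with the new set

db = {"a": 1, "b": 5}
txn = {"predicate": lambda k, v: v > 2}
print(execute_with_ollp(txn, db))  # {'b'}
```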

We now consider Calvin's other problem: how to guarantee consistency between replicas. Under a deterministic protocol, consistent input is all that is needed to guarantee consistent execution results across replicas, where consistent input includes the order of the input.

Figure 8-Calvin's inconsistency problem

Figure 8 describes the inconsistency problem in Calvin. If T2 is synchronized to a replica before T1 and executed there, consistency between the replicas is destroyed. To solve this, all replica synchronization must go through a single Paxos group to guarantee a global order. This may become a bottleneck, but Calvin claims it can reach a synchronization throughput of 500,000 transactions per second.

Overall, Calvin appeared in the same era as Spanner and tried to implement a scalable distributed database through a deterministic protocol, with good results. This article sees two problems in Calvin:

  • The global consensus algorithm may become a bottleneck or single point;
  • Using a coordinator to align the nodes' work phases means one node's problem can affect the whole cluster.

BOHM & PWV, deterministic databases based on dependency analysis

Before we start on BOHM and PWV, let's review dependency analysis. Serializability theory uses dependencies (write-after-write, write-after-read, and read-after-write) to define the order between transactions, and judges whether an execution violates isolation by whether the dependency graph contains a cycle. A database kernel can also use this idea in reverse: as long as execution avoids cycles in the dependency graph, the execution satisfies the isolation level given in advance. Starting from this idea, transactions that normally could not run concurrently can be allowed to run concurrently.

Example 1-Concurrent transactions that cannot be parallelized

Example 1 gives concurrent transactions that cannot be executed in parallel: T1 and T3 both write x, and T2 needs to read the x=1 written by T1. In an ordinary database system these three transactions must execute in the order T1 -> T2 -> T3, which reduces concurrency.

Figure 9-BOHM's MVCC

BOHM solves this problem by modifying MVCC: each version of a record carries a validity interval and a pointer to the previous version. In the figure, the validity of the data written by T100 is 100 <= id < 200, and the validity of T200's data is 200 <= id. MVCC makes concurrency between write-conflicting transactions possible, and because deterministic transactions know the complete state of each transaction in advance, BOHM achieves concurrent execution of writes.
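
The toy version chain below mirrors that structure: each version carries a validity interval [begin, end) and a pointer to its predecessor. It is our illustration of the idea, not BOHM's actual data structure.

```python
# BOHM-style MVCC sketch: a reader with id 150 finds the version
# written by T100 without blocking the concurrent writer T200.

class Version:
    def __init__(self, value, begin, end=float("inf"), prev=None):
        self.value, self.begin, self.end, self.prev = value, begin, end, prev

def install(head, value, writer_id):
    """Close the old head's interval and link a new head version."""
    if head is not None:
        head.end = writer_id
    return Version(value, begin=writer_id, prev=head)

def read(head, reader_id):
    """Walk back to the version whose interval covers reader_id."""
    v = head
    while v is not None and not (v.begin <= reader_id < v.end):
        v = v.prev
    return v.value if v else None

head = install(None, "v1", 100)  # valid for 100 <= id < 200
head = install(head, "v2", 200)  # valid for 200 <= id
print(read(head, 150), read(head, 250))  # v1 v2
```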

PWV is a read-visibility optimization on top of BOHM that lets a transaction's writes be read earlier, before the full commit. To achieve this, PWV analyzes transaction visibility and the reasons for aborts.

There are two types of transaction visibility:

  • Committed write visibility, the strategy used by BOHM, which has high latency;
  • Speculative write visibility, which carries the risk of cascading aborts.

Figure 10-Cascading aborts

Figure 10 shows the cascading aborts of speculative visibility. The x written by T1 is read by T2, and T2's write is in turn read by T3. Then T1's write on y is found to violate a constraint (value < 10), so T1 must abort. By the atomicity rule of transactions, T1's write to x must also be rolled back, so T2, which read x, must abort, and T3, which read T2's write, must abort along with it.

There are two causes of abort in a database system:

  • Logic-induced aborts, i.e., constraint violations;
  • System-induced aborts, such as deadlocks, system errors, or write conflicts.

Fortunately, a deterministic database can rule out system-induced aborts: as long as no logic-induced abort occurs, a transaction is guaranteed to commit successfully in a deterministic database.

Figure 11-Splitting a transaction into pieces

Figure 11 shows how PWV splits a transaction into small units called pieces and then looks for the Commit Point: after the Commit Point, no logic-induced abort is possible. In the figure, T2 needs to read T1's write; it only has to wait until T1 has executed past the Commit Point before reading, instead of waiting for T1 to commit completely.
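
The sketch below illustrates the piece/Commit Point mechanism under our own simplifying assumption that each piece declares up front whether it can fail a logical constraint check:

```python
# PWV-style pieces: once the last piece that can abort has passed its
# check (the Commit Point), buffered writes can be exposed to readers
# before the transaction finishes.

def run_pieces(pieces, state):
    visible = {}
    commit_point = max(i for i, p in enumerate(pieces) if p["can_abort"])
    for i, piece in enumerate(pieces):
        if piece["can_abort"] and not piece["check"](state):
            return None            # logic-induced abort
        state.update(piece["writes"])
        if i >= commit_point:
            visible.update(state)  # safe: no abort can follow now
    return visible

# T1's constrained write on y runs first; a reader such as T2 can
# consume x as soon as the y-check succeeds, without waiting for T1
# to commit completely.
pieces = [
    {"can_abort": True,  "check": lambda s: 5 < 10, "writes": {"y": 5}},
    {"can_abort": False, "check": None,             "writes": {"x": 1}},
]
print(run_pieces(pieces, {}))  # {'y': 5, 'x': 1}
```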

Figure 12-PWV performance

By further subdividing transaction execution, PWV reduces read latency and improves concurrency beyond BOHM. In Figure 12, RC is BOHM's strategy of not reading early. The performance results show that PWV yields very large gains under high contention.

BOHM and PWV obtain high performance in conflict scenarios through the analysis of inter-transaction dependencies, but this approach requires global transaction information, and the computing node is a single point that cannot scale out.

Aria, a practice-oriented deterministic database

Finally, Aria. Aria's view is that existing deterministic databases all have problems: Calvin's design is scalable, but BOHM and PWV, based on dependency analysis, are not; conversely, thanks to dependency analysis, BOHM and PWV are better at preventing performance regressions in conflict scenarios, where Calvin's performance is not ideal.

It is difficult to perform dependency analysis for concurrent execution in a distributed system, so Aria uses a reservation mechanism instead. The complete execution flow is as follows (a condensed sketch in code follows the list):

  • A sequencing layer assigns a globally increasing id to each transaction;
  • The input transactions are persisted;
  • Transactions are executed, with their writes buffered in the memory of the executing node;
  • Reservations are sent to the nodes that own the keys involved;
  • Conflict detection runs in the commit phase to decide whether commit is allowed; transactions without conflicts are reported back to the client as successful;
  • Data is written asynchronously.
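
The sketch below condenses these phases into one function. It is our single-node simplification: conflicts are detected against the smallest reserving id, and aborted transactions are simply reported for the next batch.

```python
# Aria in miniature: execute against the batch-start snapshot,
# reserve keys, then commit only the transactions with no conflict
# against an earlier (smaller-id) reservation.

def run_batch(batch, snapshot):
    # Execution phase: all txns read the same snapshot; writes are
    # buffered per transaction, not yet applied.
    buffered = {t["id"]: {k: t["id"] for k in t["writes"]} for t in batch}

    # Reservation phase: each key remembers the smallest writer id.
    reservations = {}
    for t in batch:
        for k in t["writes"]:
            reservations[k] = min(reservations.get(k, t["id"]), t["id"])

    # Commit phase: abort a txn whose read or write set touches a key
    # reserved by an earlier txn; it is re-queued into the next batch.
    committed, aborted = [], []
    for t in batch:
        conflict = any(reservations.get(k, t["id"]) < t["id"]
                       for k in t["reads"] | t["writes"])
        (aborted if conflict else committed).append(t["id"])
    for tid in committed:
        snapshot.update(buffered[tid])  # applied asynchronously in Aria
    return committed, aborted

batch = [
    {"id": 1, "reads": set(),  "writes": {"x"}},
    {"id": 2, "reads": {"x"},  "writes": {"y"}},  # reads T1's key
]
print(run_batch(batch, {}))  # ([1], [2]): T2 moves to the next batch
```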

Figure 13-Aria's architecture diagram

Figure 13 shows Aria's architecture; each node is responsible for storing part of the data. Aria's paper does not specify at which layer the replication protocol should sit: it can be implemented at the sequencer layer or at the storage layer. Implementing it at the sequencer layer exploits the full advantage of a deterministic database, while implementing it at the storage layer simplifies the sequencer layer's logic.

Figure 14-Aria execution process one

In Figure 14, input transactions are assigned globally increasing transaction ids as they pass through the sequencer layer; at this point the execution result is already determined. After the sequencer layer, transactions are sent to the nodes: T1 and T2 to node1, T3 and T4 to node2.

Figure 15-Aria execution process two

In Figure 15, assume T1 and T2 are packed into a batch on node1, and T3 and T4 into a batch on node2. During execution, the results of the transactions in a batch are kept in the memory of the node they belong to before moving to the next step.

Figure 16-Aria execution process three

Figure 16 shows the result of the reservations for the transactions in batch1. Note that the node executing a transaction does not necessarily own the transaction's data; reservation requests are sent to the nodes that own the data, so each node can know and store all reservation information for its own keys. In the commit phase, node1 finds that T2's read set conflicts with T1's write set, so T2 must be aborted and placed into the next batch for execution. T1, T3, and T4, which have no conflicts, enter the write phase. Because the inputs were already persisted at the sequencer layer, Aria first returns success to the client and then writes asynchronously.

Figure 17-Aria execution process four

Figure 17 shows the delayed execution of T2, which is added to batch2. Within batch2, T2 has the highest execution priority (the smallest id in the batch), so it cannot be postponed indefinitely by conflicts, and this strategy still guarantees a unique result.

Figure 18-Aria's nondeterminism problem

It is natural to suspect that Aria has the same nondeterminism problem as Calvin. In Figure 18, T1 and T2 conflict, and T1 should execute before T2. If T2 tries to commit before T1 has even begun its reservation, the conflict between them will not be detected, the execution order becomes T2 -> T1, and the determinism requirement is broken. To solve this, Aria has the same coordinator role as Calvin, using the coordinator to ensure that all nodes are in the same phase. In Figure 18, node2 cannot enter the commit phase before node1, where T1 resides, has finished its reservations.

Figure 19-Aria's reordering

One advantage of a deterministic database is that it can reorder the input transactions, and Aria makes use of this. First, Aria observes that WAR dependencies (write-after-read) can be safely parallelized; going further, during conflict detection on the reservation results in the commit phase, a RAW (read-after-write) dependency can be converted into a WAR dependency. In the upper half of Figure 19, executing in the order T1 -> T2 -> T3 would force these three transactions to run serially. But in the reordered parallel execution, the values read by T2 and T3 are all from before the batch started; in other words, the effective execution order becomes T3 -> T2 -> T1. In the lower half of Figure 19, even after converting RAW dependencies to WAR dependencies the dependency graph still contains a cycle, so one transaction must still be aborted.
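
A tiny sketch of the resulting commit check, following our reading of the paper's reordering rule (abort on a WAW conflict, or when a transaction has both a RAW and a WAR conflict, i.e., the cycle case):

```python
# Aria's commit check with deterministic reordering enabled.

def must_abort(waw, raw, war):
    """True if the transaction cannot commit in this batch."""
    return waw or (raw and war)

# Upper half of Figure 19: T1 -> T2 -> T3 chained only by RAW; each
# read is served from the batch-start snapshot, nobody aborts, and
# the effective serial order becomes T3 -> T2 -> T1.
print(must_abort(waw=False, raw=True, war=False))  # False: commits
# Lower half of Figure 19: converting RAW to WAR closes a cycle, so
# a transaction with both kinds of conflict must still abort.
print(must_abort(waw=False, raw=True, war=True))   # True: aborts
```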

Compared with Calvin, Aria's design has a higher degree of parallelism in the execution and reservation phases, and it needs no extra OLLP-style exploratory reads. Aria also provides a fallback strategy that can be enabled in high-conflict scenarios, adding an extra phase to handle conflicting transactions; this article does not describe it in detail, and interested readers can refer to the related article on Zhihu.

Figure 20-Aria's barrier limit

Figure 20 shows Aria's barrier limitation: if one transaction in a batch executes very slowly, for example a large transaction, it slows down the whole batch. This is something we are quite unwilling to see; in a large-scale distributed database especially, it easily becomes a threat to stability.

Summary

Determinism is a strong protocol, but implementing it requires global transaction information. The deterministic databases discussed above can be summarized as follows.

BOHM and PWV, based on dependency analysis:

  • Advantage: dependency analysis keeps performance high in conflict scenarios;
  • Problem: they require global transaction information, so the computing node is a single point that cannot scale out.

Calvin and Aria, with distributed designs:

  • Advantage: the deterministic protocol is implemented in a scalable, partitioned architecture;
  • Problem: the global consensus layer may become a bottleneck, and the coordinator allows a single node's problem to affect the whole cluster.

By contrast, a distributed database based on the Percolator commit protocol needs only a monotonically increasing clock to implement distributed transactions, and it decouples transactions better.

Figure 21-Consensus algorithm level comparison

The layer at which the consensus algorithm sits is a very important performance point for deterministic databases. The lower half of Figure 21 shows the practice we most often encounter, running the consensus algorithm at the storage engine layer; although this keeps the system simple for the layers above, it suffers from write amplification. A deterministic protocol guarantees that the same input yields a unique output, so the consensus algorithm can instead live at the sequencer layer, as in the upper half of Figure 21, greatly improving the efficiency of a replica.

In summary, the main problems deterministic databases currently face are the global impact of the coordinator role in Calvin and Aria, and the fact that relying on stored procedures is not friendly enough. Their advantage is that the deterministic protocol is an alternative to two-phase commit and can use single-version data to improve performance.

