Author | You Hee
Source | Alibaba Technical Official Account
I. Introduction
In the towering edifice that is a database system, the query optimizer and the transaction system are two load-bearing walls. Both are so important that a large number of data structures, mechanisms, and features across the entire database architecture are built around them. One is responsible for querying data faster and organizing the underlying data more effectively; the other is responsible for storing data safely, stably, and durably, and for presenting users with a logically sound view of concurrent reads and writes. The subject we explore today is the transaction system. Since the topic is too large for one article, we will cover it in several installments; this article analyzes atomicity in the PolarDB transaction system.
II. Questions
Before reading on, let us first pose a few important questions. They may have puzzled you when you first encountered databases, yet they are often brushed aside with stock answers such as "the write-ahead log" or "the crash recovery mechanism". This article digs one level deeper into how these mechanisms are implemented and why they work.
- How is database atomicity guaranteed? What special data structures are used, and why those?
- Why is data that I have successfully written guaranteed not to be lost?
- Why can data that I have logically committed be recovered completely after the database crashes?
- Furthermore, what does "logically committed" mean? Which step counts as the real commit?
III. Background
1 The position of atomicity in ACID
Since the famous ACID properties were proposed, the concept has been cited continuously (it was written into the SQL-92 standard). These four properties roughly summarize the core demands people place on databases. Atomicity, the subject of this article, is the first of them. Let us begin by situating atomicity within transaction ACID.
What follows is my personal understanding of how the ACID properties of a database relate to one another. I think they can be defined from two perspectives: the A, I, and D properties (atomicity, isolation, durability) are defined from the perspective of the transaction itself, while C (consistency) is defined from the user's perspective. I explain each below.
- Atomicity: Start from the definition: a transaction either executes entirely or not at all — all or nothing. We can pin the property down with a minimal transaction model: assume a single transaction, and a set of mechanisms that can truly commit it or roll it back, driven by the user through one commit. The focus of atomicity is not on whether the transaction itself succeeds or fails; it is that the transaction system admits only the two outcomes of success and failure, with strategies in place to ensure that the physical and logical results agree in either case. Atomicity, defined on this smallest transaction unit, is the cornerstone of the whole transaction system.
- Durability: Durability means that once a transaction is committed, its effects are stored permanently in the database. Its scope and perspective are almost the same as atomicity's, and the two are closely connected in both concept and implementation: each guarantees, in its own sense, the consistency and recoverability of data, and the boundary between them is the moment of commit. For example, suppose the current data state is T and transaction A tries to advance it to T+1. If A fails, the database returns to T — guaranteed by atomicity. If A commits successfully, the instant the state becomes T+1 is likewise guaranteed by atomicity; but once the state is T+1 and the commit has succeeded, the transaction is over, atomicity no longer applies, and the T+1 state is guaranteed by durability. From this we can infer that atomicity guarantees crash recovery of the data before the transaction commits, and durability guarantees crash recovery after it commits.
- Isolation: Isolation is likewise defined at the transaction level; it provides a degree of protection for concurrently running transactions. Its essence is preventing concurrent transactions from producing inconsistent states. Since it is not the focus of this article, I will not go into detail here.
- Consistency: Compared with the other properties, consistency means that the database must remain in a consistent state after one or more transactions. Viewed from the transaction side, guaranteeing A, I, and D makes transactions serializable, recoverable, and atomic — but is that "transaction consistency" true consistency? If AID is violated, C is certainly violated; but conversely, does guaranteeing AID guarantee C? If the answer were yes, the concept of C would be redundant. We can guarantee AID so that transactions behave consistently, but can we prove that transaction consistency implies data consistency? Moreover, data consistency is hard to define precisely in terms of transactions, yet easy to define at the user level: it means that at any moment the user believes the state of the data satisfies their business logic. For example, a bank balance must never go negative, so the user defines a non-negativity constraint. I think the designers of the concept deliberately left this open, treating consistency as a higher-level goal.
This article revolves around atomicity; where crash recovery comes up, durability is involved as well. Isolation and consistency are not discussed here. In the visibility discussion we assume the database provides complete isolation, i.e. the serializable isolation level.
2 Intrinsic requirements of atomicity
Having talked at length about the transaction properties, let us enter our topic: atomicity. We continue with the earlier example: the current state of the database is T, and we want transaction A to advance the data to state T+1. We examine the atomicity of this process.
For this transaction to be atomic, we can state three requirements; only when all of them hold can we call the transaction atomic:
- There is a point in time at which the transaction is truly, successfully committed.
- Transactions opened (or snapshots taken) before that point see only state T; transactions opened (or snapshots taken) after it see only state T+1.
- At any moment before that point, the database can be restored to state T; at any moment after it, the database can be restored to state T+1.
Note that we have not pinned down this point in time — we are not even sure whether the points in requirements 2 and 3 are the same point. What we can be sure of is that such a point must exist; otherwise there is no sense in which the transaction is atomic. Atomicity dictates that there is a definite moment of commit or rollback. Following the descriptions above, the point in requirement 2 can be defined as the atomicity point: commits before it are invisible to us and commits after it are visible, so it is the commit moment as seen by the other transactions in the database. The point in requirement 3 can be defined as the durability point, because it matches durability's definition in terms of crash recovery: as far as durability is concerned, the transaction is committed once point 3 has passed.
IV. Discussion of atomicity schemes
1 Starting from two simple schemes
Let us first approach atomicity through two simple schemes. The purpose of this step is to explain why each data structure we introduce along the way is necessary for atomicity.
Simple Direct IO
Imagine a database in which every user operation writes data straight to disk. We call this scheme simple direct IO, meaning that we record no log of any kind, only the data itself. Suppose the initial data version is T. If the database crashes partway through inserting some data, a data page of version "T+0.5" is left on disk, and we have no way either to roll back or to carry the operation forward. Such a failure plainly breaks atomicity: the resulting state is neither a commit nor a rollback but something in between. This attempt fails.
Simple Buffer IO
Next, a new scheme, which we call simple buffer IO. Again there is no log, but we add a new data structure: the shared buffer pool. Now each data page is written not directly to disk but into the shared buffer pool. This has obvious advantages. First, read and write efficiency improves greatly: a write need not wait for the data page to actually reach disk, since that can happen asynchronously. Second, if the database rolls back or crashes before the transaction commits, we simply discard the data in the shared buffer pool; only when the transaction commits successfully is the data actually flushed to disk. From the standpoint of visibility and crash recovery, this seems to meet our requirements.
However, this scheme still has a hard problem: flushing data to disk is not as simple as it looks. Suppose there are 10 dirty pages in the shared buffer pool. Storage technology can make the flush of a single page atomic, but the database may crash at any moment in the middle of those 10 pages. Then no matter when we decide to flush, if the machine crashes during the flush, the disk may again hold a "T+0.5" version of the data, with no way to redo or roll back after restart. The short simulation below makes this concrete.
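Here is a minimal Python simulation of that failure mode (the page contents, page count, and crash point are all invented for illustration): flushing several dirty pages is not atomic, so a crash mid-flush strands the on-disk state between T and T+1.

```python
# Minimal simulation of the torn-flush problem: no log, several dirty
# pages, and a crash that interrupts the flush partway through.

disk = {page_no: f"T_page{page_no}" for page_no in range(10)}           # state T on disk
buffer_pool = {page_no: f"T+1_page{page_no}" for page_no in range(10)}  # dirty pages

def flush_all(crash_after: int) -> None:
    """Flush dirty pages one by one; simulate a crash partway through."""
    for i, (page_no, page) in enumerate(buffer_pool.items()):
        if i == crash_after:
            raise RuntimeError("simulated crash mid-flush")
        disk[page_no] = page          # a single page write is assumed atomic

try:
    flush_all(crash_after=4)
except RuntimeError:
    pass

# The disk now holds a mix of T and T+1 pages -- the "T+0.5" state.
# With no log, there is nothing to replay and nothing to roll back to.
print(sorted({v.split("_page")[0] for v in disk.values()}))   # ['T', 'T+1']
```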
These two examples seem to doom any attempt to guarantee data consistency without relying on additional structures (one popular alternative is SQLite's shadow paging technique, which we do not discuss here). To solve these problems we must introduce the next important data structure: the data log.
2 Write-ahead log + Buffer IO solution
Scheme overview
On top of buffer IO, we introduce one more data structure, the data log, to solve the inconsistency problem.
Data caching works as before; the difference is that before writing any data we additionally write a record into the xlog buffer. These xlog records form an ordered log whose sequence number is called the LSN (log sequence number), and on every data page we record the LSN of the latest log record that updated that page. This is the feature that ties the log and the data together consistently.
Imagine that the log and the data can be kept exactly in step, and that the log is always persisted before the data it describes. Then whenever the database crashes, we can recover from this consistent log. The corruption problem described earlier is solved: whether the crash happens before or after commit, replaying the log reproduces the correct data version, giving us atomic crash recovery. The visibility half is handled by multi-version snapshots. Keeping the log and the data consistent is not trivial, so below we look in detail at how that is guaranteed and how data is recovered after a crash.
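The rule just stated can be condensed into a small sketch (the names `flushed_wal_lsn`, `can_flush`, and so on are illustrative, not PolarDB's actual symbols): every page remembers the LSN of the last record that touched it, and a page may be flushed only after the WAL is durable at least up to that LSN.

```python
# Sketch of the WAL-before-data invariant under the naming assumptions
# stated above.

class Page:
    def __init__(self) -> None:
        self.lsn = 0          # LSN of the last WAL record that modified this page
        self.dirty = False

flushed_wal_lsn = 0           # WAL is durable up to (and including) this LSN

def modify(page: Page, record_lsn: int) -> None:
    page.lsn = max(page.lsn, record_lsn)   # stamp the page with the newest LSN
    page.dirty = True

def can_flush(page: Page) -> bool:
    # Never write out a page whose latest change is not yet covered by
    # durable WAL; otherwise a crash could leave data without its log.
    return page.lsn <= flushed_wal_lsn

p = Page()
modify(p, record_lsn=1)
assert not can_flush(p)       # the WAL record at LSN 1 is not yet on disk
flushed_wal_lsn = 1           # persist the WAL first ...
assert can_flush(p)           # ... and only then may the page be flushed
```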
Transaction commit and dirty-page flushing
The WAL exists to guarantee the recoverability of data. To keep the WAL and the data consistent, whenever a data page is persisted from the cache to disk, the WAL records covering that page must be persisted first. That sentence is the essential meaning of dirty-page flush control. The flow, condensed into a timeline sketch after this list, runs as follows:
- A background process called the checkpoint process periodically performs checkpoints. When a checkpoint occurs, it writes a checkpoint record into the xlog containing the current REDO position. The checkpoint ensures that all dirty data so far has been flushed to disk.
- The first insert is performed. The page is not found in shared memory, so it is loaded from disk; the inserted row is written into it, an xlog record for the write is inserted into the xlog buffer, and the page's log marker is advanced from LSN0 to LSN1.
- At the moment of commit, the transaction writes a commit record, and then all the WAL this transaction produced in the WAL buffer pool is flushed to disk.
- When the second row B is inserted, an xlog record for the write is inserted into the xlog buffer, and the page's log marker is advanced from LSN1 to LSN2.
- The same operation as step 3.
Afterwards, if the database keeps running normally, the next bgwriter/checkpoint pass flushes the data pages to disk asynchronously. If instead the database crashes, the data records and commit records for both A and B are already on disk, so the data can be rebuilt in the shared buffer pool by log replay and then written out asynchronously.
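A condensed, runnable sketch of that timeline (LSN values and record shapes are simplified; unlike the figure's numbering, every record here gets its own LSN):

```python
# Condensed timeline of the flow above: two inserts, each followed by a
# commit that forces the transaction's WAL to disk.

wal_buffer: list[tuple[int, str]] = []   # in-memory xlog buffer
wal_disk: list[tuple[int, str]] = []     # durable WAL
next_lsn = 0

def wal_insert(record: str) -> int:
    global next_lsn
    next_lsn += 1
    wal_buffer.append((next_lsn, record))
    return next_lsn

def wal_flush() -> None:
    wal_disk.extend(wal_buffer)          # forced flush at commit time
    wal_buffer.clear()

page_lsn = 0                             # the page as loaded from disk
page_lsn = wal_insert("INSERT A")        # page LSN advances to A's record
wal_insert("COMMIT of A's transaction")  # commit record
wal_flush()                              # commit forces this WAL to disk
page_lsn = wal_insert("INSERT B")        # page LSN advances to B's record
wal_insert("COMMIT of B's transaction")
wal_flush()

# Crash now: the data page may never have been flushed, but wal_disk
# holds both data and commit records, so replay rebuilds the page in
# the shared buffer pool and bgwriter/checkpoint writes it out later.
print(wal_disk)
```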
The fullpage mechanism guarantees recoverability
WAL-based recovery looks flawless, but the scheme so far has a defect. Imagine that the database crashes while a bgwriter process is asynchronously writing data: some dirty pages are only partially written, and torn pages may be left on disk. (PolarDB data pages are 8 KB; in the extreme case only a 4 KB half of a page gets written.) WAL records cannot be replayed on top of a torn page. We therefore need another mechanism to let the database find the original data even in this extreme case, which brings us to an important mechanism: the fullpage (full-page write) mechanism.
The first time a page is modified after each checkpoint, PolarDB writes both the modification and the entire data page into the WAL buffer, to be flushed to disk with the rest of the WAL. A WAL record containing a whole data page is called a backup block. Backup blocks allow the WAL to replay a complete page under any circumstances. The complete process is as follows.
- A checkpoint occurs.
- The first insert is performed. The page is not found in shared memory, so it is loaded from disk and the inserted row is written into it. Unlike the previous section, the WAL record with sequence number LSN1 now also carries into the WAL buffer pool the entire page as read from disk, still marked LSN0.
- The transaction commits, and at this moment all of its WAL is forcibly flushed into the WAL segment on disk.
- Same as the previous section
- Same as the previous section
If the database crashes at this point, then during restart recovery, whenever a torn page is encountered, the original version of the page recorded in the WAL lets the correct data be replayed step by step. A sketch of the backup-block decision follows.
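A sketch of that decision under simplified, assumed page and record shapes (the condition and field names are illustrative):

```python
# The full-page-write decision: the first modification of a page after a
# checkpoint embeds the entire old page image (a "backup block") in the
# WAL record, so recovery never depends on a possibly torn on-disk page.

last_checkpoint_lsn = 100

def make_wal_record(page: dict, lsn: int, change: str) -> dict:
    if page["lsn"] <= last_checkpoint_lsn:
        # First touch since the checkpoint: log the whole old page.
        return {"lsn": lsn, "change": change, "backup_block": dict(page)}
    return {"lsn": lsn, "change": change, "backup_block": None}

page = {"lsn": 90, "data": "contents at checkpoint time"}
rec1 = make_wal_record(page, lsn=101, change="insert A")  # carries a page image
page["lsn"] = 101                                         # page stamped past the checkpoint
rec2 = make_wal_record(page, lsn=102, change="insert B")  # ordinary delta record

assert rec1["backup_block"] is not None and rec2["backup_block"] is None
```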
The WAL-based crash recovery mechanism
Building on the previous two sections, we can now demonstrate how data is replayed if the database crashes. We walk through the case in which a data page was torn by a partial write.
- When the database replays the WAL record for the write of row A, it reads the TABLE A page from disk. The WAL record here is a backup record, because after a CHECKPOINT the first WAL record replayed for each page is a backup record.
- A backup record has a special replay rule: it always overwrites the existing page wholesale and advances the page's LSN to the record's LSN. (To keep data consistent, ordinary replay applies only WAL records whose LSN is greater than the page's own LSN.) In this example, thanks to the backup block, the torn page is restored.
- PolarDB then replays the subsequent records under the normal replay rules.
Once replay succeeds, the data in the shared buffer pool can be flushed to disk asynchronously, replacing the previously damaged pages. The two replay rules are sketched below.
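Both rules in one runnable sketch (record and page shapes are the same simplified, assumed ones as above):

```python
# The two replay rules: a backup block unconditionally overwrites the
# page; an ordinary record applies only if its LSN is newer than the
# page's LSN.

def replay(page: dict, record: dict) -> None:
    if record["backup_block"] is not None:
        page.clear()                         # rule 1: overwrite wholesale,
        page.update(record["backup_block"])  # repairing even a torn page
        page["lsn"] = record["lsn"]
    elif record["lsn"] > page["lsn"]:
        page["data"] = record["change"]      # rule 2: apply only unseen records
        page["lsn"] = record["lsn"]
    # else: the page already contains this change -- skip it.

torn_page = {"lsn": 0, "data": "garbage from a half-written page"}
wal = [
    {"lsn": 1, "backup_block": {"data": "page image at checkpoint"}, "change": None},
    {"lsn": 2, "backup_block": None, "change": "insert A applied"},
]
for rec in wal:
    replay(torn_page, rec)
print(torn_page)   # fully rebuilt: {'data': 'insert A applied', 'lsn': 2}
```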
We have spent a great deal of space explaining how the database recovers from crashes through the write-ahead log, which accounts for the meaning of the durability point. Next we need to explain visibility.
3 Visibility mechanism
Our explanation of atomicity involves the concept of visibility, which PolarDB implements with a complex MVCC mechanism, most of which belongs under isolation. Only a brief description is given here; a more detailed one belongs in the article on isolation.
Transaction tuples
The first thing to discuss is transaction tuples. A tuple is the smallest unit of data — it is what actually stores a row. Here we pay attention to only a few header fields, sketched in code after this list.
- t\_xmin: the ID of the transaction that created the tuple
- t\_xmax: the ID of the transaction that modified it (deleted or locked the tuple)
- t\_cid: the sequence number of the operation on this tuple within its own transaction
- t\_ctid: a pointer composed of a segment number and an offset, pointing toward the latest version of the row
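For concreteness, the four fields as a small record type (a heavy simplification of the real tuple header; the example values anticipate the worked example later in this article and are otherwise invented):

```python
# Simplified sketch of the tuple header fields listed above.
from dataclasses import dataclass

@dataclass
class TupleHeader:
    t_xmin: int              # XID of the transaction that created the tuple
    t_xmax: int              # XID that deleted/locked it (0 if none)
    t_cid: int               # command sequence number within the transaction
    t_ctid: tuple[int, int]  # (segment, offset) pointer toward the newest version

# Tom inserted by transaction 1184, later updated to Susan by 1187:
old_version = TupleHeader(t_xmin=1184, t_xmax=1187, t_cid=0, t_ctid=(0, 2))
new_version = TupleHeader(t_xmin=1187, t_xmax=0, t_cid=0, t_ctid=(0, 2))
```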
Snapshots
The second thing to discuss is snapshots. A snapshot records the state of the database's transactions at a certain point in time.
We will not expand on snapshots here; it suffices to know that a snapshot, taken from the procArray, captures the possible state of every transaction in the database at one moment.
Current transaction status
The third point is the current transaction status: the mechanism that determines a transaction's running state within the database. In a concurrent environment, determining the transaction states you observe is critically important.
Determining the transaction status recorded in a tuple may involve three data structures: t\_infomask, the procArray, and the clog.
- infomask: cached flag bits in the tuple header, marking the running state of the tuple's xmin/xmax transactions. This status can be regarded as an asynchronous cache over the clog, used to speed up status lookups. It is set lazily: when a transaction commits, its tuples are not all updated immediately; the flags are set the first time a sufficiently new snapshot observes the tuple.
- procArray snapshot: the transaction states in a snapshot. Taking a snapshot really means reading, from the procArray, the state of every transaction at that instant. Once taken, a snapshot never changes unless it is taken again (whether repeated reads within one transaction see changes depends on the isolation level).
- clog: the actual transaction status, split into the clog buffer and the clog files. Every transaction's status is recorded in the clog buffer in real time.
In one visibility judgment, the access order of the three is [infomask -> snapshot, clog], while their order of authority is [snapshot -> clog -> infomask].
The infomask is the cheapest to consult, sitting right in the tuple header; under the right conditions it settles the visibility question by itself, without touching the other structures. The snapshot has the highest authority: it ultimately decides whether the xmin/xmax transactions count as running. The clog assists the judgment and backs the setting of the infomask. For example, if both the snapshot and the clog show the xmin transaction as committed, t\_infomask is set to committed; if the snapshot shows the xmin transaction as finished but the clog shows it as not committed, the system concludes that a crash or rollback occurred and sets the infomask to mark the transaction invalid. The sketch below follows this precedence.
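A drastically simplified sketch of the xmin side of this check (all names are illustrative stand-ins, not PolarDB's real routines):

```python
# Visibility of a tuple's xmin, following the precedence above: the
# infomask is consulted first as a cache, the snapshot has the final say
# on running/not-running, and the clog backs up the infomask.

COMMITTED, ABORTED = "committed", "aborted"

def xmin_visible(xmin: int, infomask: dict, snapshot_running: set, clog: dict) -> bool:
    # 1. Fast path: the infomask already caches a verdict for xmin.
    if infomask.get("xmin_status") == COMMITTED:
        # Even a committed xmin is invisible if this snapshot still saw
        # the transaction as running when the snapshot was taken.
        return xmin not in snapshot_running
    if infomask.get("xmin_status") == ABORTED:
        return False
    # 2. No cached verdict: the snapshot decides running/not-running ...
    if xmin in snapshot_running:
        return False
    # 3. ... and the clog separates committed from crashed/rolled back,
    #    letting us set the infomask cache as a side effect.
    if clog.get(xmin) == COMMITTED:
        infomask["xmin_status"] = COMMITTED
        return True
    infomask["xmin_status"] = ABORTED        # crash or rollback
    return False

# xmin 1184 committed, snapshot taken after it finished: visible.
print(xmin_visible(1184, {}, snapshot_running=set(), clog={1184: COMMITTED}))
```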
Transaction snapshot visibility
Having introduced tuples and snapshots, we can continue with the topic of snapshot visibility. PolarDB's visibility rules form a complex system defined by many combinations of information, but the most direct inputs are the snapshot and the tuple header. The following insert-and-update example illustrates how the two determine visibility.
This article does not discuss isolation, so we assume the serializable isolation level:
- At Snapshot1: transactions 1184 and 1187 have not started and there are no tuples; the student table is empty. The data obtainable through Snapshot1 is empty; we denote this version T.
- Between Snapshot1 and Snapshot2: any snapshot taken in this interval is still Snapshot1, so the visible data is still T.
- At Snapshot2: transaction 1184 has ended and 1187 has not yet started, so 1184's modification is visible to the user while 1187's is not. Concretely, a tuple with header (1184, 0) is visible, so the visible version is the row Tom; we denote this version T+1.
- Between Snapshot2 and Snapshot3: any snapshot taken in this interval is still Snapshot2, so the visible data is still T+1.
- At Snapshot3: transactions 1184 and 1187 have both ended and both are visible, so the tuples (1184, 1187) and (1187, 1187) are not visible, while (1187, 0) — Susan — is visible. We denote this version T+2.
From this analysis we can draw a simple conclusion: visibility in the database depends on the timing of the snapshot. The "different visible versions" in our atomicity discussion really refer to different snapshots. The snapshot decides whether an executing transaction counts as committed, and this judgment is independent of the transaction's commit mark and even of the clog's commit record. This is how we keep the snapshot we obtain consistent with transaction commit.
Transaction atomicity
Having briefly described PolarDB snapshot visibility, we turn to the concrete implementation at commit time.
The core idea behind the visibility mechanism is: "a transaction should see exactly the data versions it ought to see." How is "ought to" defined? A simple example: if a tuple's xmin transaction is not committed, other transactions most likely cannot see the tuple; if it is committed, they probably can. And how do we know whether xmin has committed? As discussed above, the snapshot decides. So the key mechanism at commit time is how new snapshots come to reflect the commit.
At commit, visibility involves two important data structures, the clog buffer and the procArray, whose relationship was explained above. Both play a part in judging transaction visibility, but the procArray is decisive, because taking a snapshot is precisely a traversal of the procArray.
In the actual third step of commit, the transaction's commit is written into the clog buffer; at that point the clog marks it committed, yet the transaction is still not truly committed. Only once the transaction is marked committed in the procArray is the commit real: snapshots taken after that point see the new data version.
V. The realization of atomicity in PolarDB
With PolarDB's crash recovery and visibility theory covered, we know that the write-ahead log + buffer IO scheme gives transactions consistent crash recovery and consistent visibility, and thereby atomicity. Next we examine the most important steps of transaction commit and pin down what the atomicity point we postulated at the start actually is.
1 Consistent crash recovery: the durability point
Simply put, four operations during transaction commit are the core of transaction atomicity. In this section we consider the first two:
- Write the commit record of the transaction (the Commit WAL record).
- Forcibly flush all of this transaction's WAL to persistent storage.
Mark the position at which this xlog (WAL) flush completes as point 2, and consider two situations:
- If the transaction crashes or rolls back before this point, the Commit record cannot have been flushed, whether or not the data records were. Because the WAL is sequential, the Commit record is necessarily the last one persisted. If we then replay, a transaction lacking its Commit record cannot be marked committed, and by the visibility rules the data tied to that state must be invisible; it is treated as dead data and cleaned up. We conclude that when a crash happens before this point, the transaction was in fact not committed: the database is essentially restored to state T.
- If the crash or rollback happens after this point, the Commit record is certainly on disk, whichever step the crash interrupted. And once the Commit record is on disk, the data written by this transaction is sure to be replayed and marked committed, and is therefore visible. The transaction has in fact committed, and the database recovers to T+1.
This shows that point 2 is the watershed of crash recovery: it decides whether recovery returns the database to T or to T+1. What should we call this point? Recall the concept of durability: once a transaction commits, its changes to the database are retained permanently. The two notions coincide, so we call point 2 the durability point.
One more note on xlog flushing: the flush and replay of a single WAL record are atomic, and the CRC in the WAL record header validates each record. If the disk is damaged while a WAL record is being written, that record is invalid as a whole, which guarantees that no partially written record is ever replayed. The recovery-side rule is sketched below.
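The rule the two cases express is simply: after a crash, a transaction counts as committed exactly when its Commit record reached durable WAL. A sketch (record shapes invented for illustration):

```python
# Recovery-side rule behind the durability point: a transaction is
# treated as committed after a crash iff its COMMIT record is in the
# durable WAL.

def committed_after_crash(durable_wal: list, xid: int) -> bool:
    return ("COMMIT", xid) in durable_wal

# Crash before the durability point: data records may be on disk, but
# the sequential WAL guarantees the COMMIT record is not.
wal_before = [("INSERT", 1184, "Tom")]
print(committed_after_crash(wal_before, 1184))   # False -> recover to T

# Crash after the durability point: the COMMIT record is durable, so
# replay rebuilds the data and marks 1184 committed.
wal_after = [("INSERT", 1184, "Tom"), ("COMMIT", 1184)]
print(committed_after_crash(wal_after, 1184))    # True -> recover to T+1
```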
2 Consistent visibility: the atomicity point
Next, we continue with operations 3 and 4:
- Write this transaction's commit into the clog buffer.
- Write the result of this transaction's commit into the procArray.
Operation 3 records the transaction's current status in the clog buffer, which can be regarded as a cache layer over the log. Operation 4 writes the commit into the procArray, and this step is crucial: as just explained, snapshots judge transaction status through the procArray, so this step determines the transaction's status as seen by every other transaction.
If the transaction crashes or rolls back before operation 4, every other transaction in the database sees data version T, which is equivalent to the transaction never having committed. This judgment follows the chain of authority: visibility -> snapshot -> procArray.
After operation 4, the transaction is committed in the eyes of all observers, because every snapshot taken after this point sees data version T+1.
Operation 4 thus fully matches the meaning of an atomic commit point, because whether it has happened determines whether the commit has taken effect. Before operation 4 the transaction may still roll back, since no other transaction has seen its T+1 state; after operation 4 it must not roll back, because any transaction that had already read version T+1 would then observe inconsistent data. Atomicity says a transaction either commits successfully or rolls back entirely; since rollback is impossible after operation 4, operation 4 can serve as the mark of successful commit.
In summary, we define operation 4 as the atomicity point of the transaction. Its effect on snapshots is sketched below.
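Why operation 4 is the switch, in sketch form (illustrative names; a snapshot here is just a copy of the running set):

```python
# Snapshots are taken from the procArray, so the version every new
# snapshot sees flips from T to T+1 at exactly the procArray write.

proc_array_running = {1184}            # transaction 1184 still "in progress"

def take_snapshot() -> set:
    return set(proc_array_running)     # a snapshot copies procArray state

snap_before = take_snapshot()
proc_array_running.discard(1184)       # operation 4: mark 1184 committed
snap_after = take_snapshot()

# A tuple with t_xmin = 1184 is invisible to the first snapshot and
# visible to the second: the visible version flips from T to T+1.
print(1184 in snap_before, 1184 in snap_after)   # True False
```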
3 The durability point and the atomicity point
What atomicity and durability require
Restating the two concepts:
- Atomicity: a transaction either succeeds entirely or fails entirely.
- Durability: once a transaction commits successfully, its effects are stored permanently in the database.
We marked operation 4 as the atomicity point because at that instant, objectively, all observers consider the transaction committed: the snapshot version jumps from T to T+1, and the transaction can no longer roll back. Does atomicity stop applying once the transaction has committed? I think so: atomicity at most guarantees data consistency up to the instant of successful commit; once the transaction is over, there is no more atomicity to speak of. Hence atomicity guarantees visibility and recoverability before the atomicity point.
We marked point 2 as the durability point because durability holds that a committed transaction is retained forever, and by the reasoning above that point is exactly point 2. Durability, then, must be guaranteed at every moment from point 2 onward.
How to understand the two points
Having explained points 2 and 4, we can finally define the two most important concepts in transaction commit, and answer the opening question: at what moment does the transaction actually commit? The answer: after the durability point, the transaction can always be recovered completely; and after the atomicity point, the transaction is truly regarded as committed by other transactions. But the two are not separable — how should we understand that?
I think this is a pragmatic compromise in implementing atomicity: we do not need to unify the two points; we only need to guarantee the key property that their ordering keeps the data consistent in every state. If it does, our definition of atomicity is satisfied.
- Crash or rollback before the durability point: the transaction fails, and the data version both before the crash and after recovery is T.
- Crash between the durability point and the atomicity point: at that moment the transaction's visible version is still T — every transaction in the database sees T — yet after recovery the data replays to T+1. After the restart, then, snapshots taken before the crash saw version T while snapshots taken after recovery see T+1, as if the transaction had been implicitly committed. This does not violate data consistency.
- Crash after the atomicity point: the transaction has committed, and every transaction both before and after the crash sees the T+1 version of the data.
Finally, consider why the two points are not merged into one. The durability point's work is flushing the WAL, which is disk IO; the atomicity point's work is writing the procArray, which contends heavily for the procArray's big lock and can be viewed as a high-frequency shared-memory write. Both lie on the critical path of transaction throughput. Binding them into a single atomic operation would make each wait on the other, with a serious impact on transaction performance. From this angle, separating the two behaviors is an efficiency decision.
Can the order of the two points be reversed?
Clearly not. From the analysis above, a window would appear in which neither the atomicity requirements nor the durability requirements are met.
Concretely, if the atomicity point came first and the durability point second, and the database crashed between them, other transactions would have seen the T+1 version before the crash but only T after it — seeing data from the future is obviously not allowed.
Defining the real commit
The real commit is the commit at the atomicity point.
Back to the most basic truth: the mark of a real commit is the data version advancing from T to T+1, and that point is the atomicity point. Before it, other transactions all see version T, so calling that the real commit would be inappropriate; after it, the transaction can no longer roll back. That is enough to show this is the transaction's real commit point.
Other operations
Finally, we turn to operations 1 and 3:
- Operation 1 writes the WAL commit record into the xlog buffer. The write alone is not critical for commit: if the record is written but never flushed to disk, it has no effect.
- Operation 3 marks the transaction committed in the clog buffer; this is not critical for commit either. While the database runs normally it does not affect snapshot visibility; and if the database crashes, the transaction's status can be replayed from the Commit/Abort records in the xlog regardless of whether the clog state was persisted.
VI. The atomicity flow in PolarDB
1 Transaction commit
In this section we return to the transaction commit function and locate the operations above within its call stack; an ordered sketch gathering these steps follows the list.
- The commit flow applies to transactions that have a transaction ID; a transaction without one skips it. A transaction without a transaction ID is almost certainly read-only and cannot affect data consistency in the database.
- Strict mode is entered before the xlog is written; any error in this mode is fatal, and the database crashes and restarts.
- The sequence "flush the xlog, then write the clog in memory" holds in synchronous commit mode. In asynchronous mode the xlog flush is not guaranteed, so data may be lost after a crash.
- Between steps 3 and 4 there is one key operation: the replication wait. At this point the transaction's xlog has been flushed but the transaction is not yet committed; in synchronous mode the primary waits for the standby to flush the xlog to disk and complete its application before proceeding.
- Writing the commit into the procArray: the transaction is now truly committed and can no longer roll back.
- Resource cleanup: this work is no longer related to the transaction itself.
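Gathering the steps into one ordered sketch (every function is an illustrative stand-in, not PolarDB's real symbol, and the exact extent of the strict-mode critical section is an assumption):

```python
# Ordered sketch of the commit path described above.

def commit_transaction(xid: int | None) -> None:
    if xid is None:
        return                        # read-only: no XID, nothing to commit
    begin_critical_section()          # strict mode: any error now is fatal
    write_commit_wal_record(xid)      # op 1: COMMIT record into the xlog buffer
    flush_wal(xid)                    # op 2: durability point (synchronous mode)
    mark_clog_committed(xid)          # op 3: clog buffer marked committed
    end_critical_section()
    wait_for_sync_replication(xid)    # standby confirms the flushed xlog
    remove_from_proc_array(xid)       # op 4: atomicity point -- no rollback after
    release_resources(xid)            # cleanup unrelated to the transaction

# Stub implementations so the sketch runs:
def begin_critical_section(): print("strict mode on")
def write_commit_wal_record(xid): print(f"WAL: COMMIT {xid} buffered")
def flush_wal(xid): print(f"WAL flushed through COMMIT {xid}")
def mark_clog_committed(xid): print(f"clog[{xid}] = committed")
def end_critical_section(): print("strict mode off")
def wait_for_sync_replication(xid): print("standby flush confirmed")
def remove_from_proc_array(xid): print(f"procArray: {xid} no longer running")
def release_resources(xid): print("resources released")

commit_transaction(1184)
```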
2 Transaction rollback
- Rolling back a transaction without a transaction ID is skipped outright.
- Before rolling back, the database first checks, via the clog, whether the transaction has already committed. How could a transaction be committed and yet rolled back? This is the window between operations 3 and 4 discussed earlier: if the clog records a commit and a rollback command then arrives, the database raises a fatal error and crashes and restarts.
- During rollback an xlog abort record is written, but it is flushed to disk asynchronously. As you might expect, even if the abort record is never written, the data remains invisible.
- When the transaction is marked rolled back in the procArray, the rollback truly takes effect in the process (in fact this state changes nothing for other transactions: the data version obtained before and after is T in both cases). A sketch of this path follows.
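An illustrative sketch of the rollback path, including the fatal case from the list above (names and shapes assumed):

```python
# Rollback path: check the clog first; a committed transaction must not
# be rolled back.

clog = {}                              # xid -> "committed" / "aborted"
proc_array_running = {1187}

def write_abort_wal_record(xid: int) -> None:
    print(f"WAL: ABORT {xid} buffered (flushed asynchronously)")

def abort_transaction(xid: int | None) -> None:
    if xid is None:
        return                              # no XID: nothing to roll back
    if clog.get(xid) == "committed":
        # The window between operations 3 and 4: committed in the clog
        # but asked to roll back -- treated as fatal, database restarts.
        raise SystemExit("FATAL: cannot roll back a committed transaction")
    write_abort_wal_record(xid)             # losing this record is harmless:
                                            # the data stays invisible anyway
    clog[xid] = "aborted"
    proc_array_running.discard(xid)         # other snapshots keep seeing T

abort_transaction(1187)
print(proc_array_running)                   # set(): 1187 gone, version stays T
```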
VII. Summary and outlook
Finally, a summary of the whole article. We centered on the question of how transaction atomicity is achieved, and explained the underlying principles of PolarDB atomicity from the twin angles of crash recovery and transaction visibility. In presenting the write-ahead log + buffer IO scheme, we also met the shared buffer, the WAL, the clog, and the procArray — the data structures that matter most for atomicity. Under the umbrella of the transaction, the database's modules interlock neatly, making full use of disk, cache, IO, and other machine resources to form a complete database system.
This recalls other models in computer science. In the ISO network model, for example, the transport layer's TCP provides reliable communication over an unreliable channel. Database transactions embody a similar idea: reliable storage of data on top of an unreliable operating system (which may crash at any time) and disk (which cannot atomically write large amounts of data). This simple but important idea can fairly be called the cornerstone of database systems — so central that most of a database's core data structures relate to it. Databases will keep evolving and more advanced architectures will appear, but we should not forget that atomicity and durability remain the core of database design.
VIII. Questions to ponder
That concludes our discussion of transaction atomicity. To finish, here are a few questions to think over against the points made in this article.
- How should we understand the atomicity and the durability of transaction commit?
- What is the relationship between the atomicity of a single transaction and the atomicity of many transactions? How do crash recovery and visibility fit together?
- PolarDB supports asynchronous commit, in which a committing transaction does not wait for its xlog to reach disk. Which transaction properties might this mode violate — atomicity, durability, or both?
Reference
https://www.interdb.jp/pg/