The "road" and "pit" of MySQL distributed transactions

1 database transaction

1.1 Ordinary local transactions

Distributed transactions are also transactions, and the basic ACID characteristics of transactions must still meet:

A: Atomic, atomic, all SQL in a transaction is executed as an atomic unit of work, either all succeed or all fail;

C: Consistent, consistency, after the transaction is completed, the state of all data is consistent. If A transfers 100 to B in a transaction, as long as A deducts 100, B must add 100 to the account;

I: Isolation, isolation, if multiple transactions are executed concurrently, the modifications made by each transaction must be isolated from other transactions;

D: Duration, persistence, that is, after the transaction is completed, the modification to the database data is stored persistently.

Ordinary non-distributed transactions, within a process, rely on snapshot reads and current reads based on locks, and it is better to implement ACID to ensure transaction reliability. However, the distributed transaction participants are usually on different instances of different machines. The original local transaction lock cannot guarantee the ACID characteristics of the distributed transaction, and a new transaction framework needs to be introduced. MySQL's distributed transaction is based on 2PC (two-phase commit) Implementation, the following describes the 2pc distributed transaction in detail.

1.2 Distributed transaction based on 2pc

There are various implementations of distributed transactions, such as 2PC (two-phase commit), 3PC (three-phase commit), TCC (compensation transaction), etc. MySQL is a distributed transaction based on 2PC. The following describes the implementation of 2PC distributed transaction.

Two-phase commit: Two-Phase Commit, 2PC for short, is an algorithm designed to keep all nodes consistent when committing transactions based on a distributed system architecture.

The algorithm idea of 2PC can be summarized as: the participants notify the coordinator of the success or failure of the operation, and then the coordinator decides whether each participant wants to submit the operation or abort the operation according to the feedback information of all participants. The participants here can be understood as Resource Manager (RM), and the coordinator can be understood as Transaction Manager (TM).

The following diagram illustrates how RM and TM work in a distributed transaction:

The first stage of submission: TM will send Prepare to all RMs to ask whether the operation can be submitted. RM receives the request, implements the preparations before the transaction is submitted, and returns the result.

The second stage of submission: According to the results returned by the RM, all RMs have returned and can be submitted, then the TM sends a commit command to the RM, each RM implements its own submission, and releases the lock and resources at the same time, and then the RM reports that the submission is successful, and the TM completes the entire process. Distributed transactions; if any one RM returns uncommitted, all RMs involved in the distributed transaction need to be rolled back.

2 MySQL distributed transaction XA

MySQL distributed transaction XA is implemented based on the above 2pc framework. The following describes MySQL XA in detail.

2.1 XA Transaction Standard

A set of distributed XA transaction standards defined by the organization X/Open defines specifications and API interfaces, which are then implemented by manufacturers.

The distributed transaction in the XA specification consists of AP, RM, and TM:

As shown in the figure above, the application AP defines the transaction boundary (defines the start and end of the transaction), and accesses resources within the transaction boundary. The resource manager RM manages shared resources, namely database instances. The transaction manager TM is responsible for managing global transactions, assigning unique identifiers of transactions, monitoring the execution progress of transactions, and responsible for committing, rolling back, and recovering from failures of transactions. MySQL implements the XA standard syntax and provides the above RMs capabilities, allowing upper-layer applications to quickly support distributed transactions based on it.

2.2 MySQL XA syntax

XA START xid: Start a distributed transaction xid.

XA END xid: The distributed transaction xid is placed in the IDLE state, indicating that the SQL operation within the transaction is completed.

XA PREPARE xid: The transaction xid is submitted locally, and the success status is placed in PREPARED, and the failure is rolled back.

XA COMMIT xid: The transaction is finally committed and persistence is completed.

XA ROLLBACK xid: Transaction rollback terminated.

XA RECOVER: View XA transactions in the PREPARED state that exist in MySQL.

(1) Grammar points

Instances participating in distributed transactions are not directly related to each other from the perspective of the database kernel, and they are not aware of each other's state, and the sub-transactions on each node in a distributed transaction can be executed independently without dependencies. The association between them is through the global transaction. No. is established at the application layer.

Compared with ordinary transactions, XA transactions have one more global transaction number when they are opened, and one more end action and prepare action when they end.

XA START, to start a distributed transaction, needs to specify the distributed transaction number.
XA END is only a state change internally. It declares that the current XA transaction is over, and no new SQL statements are allowed to be added. It has no other effect. Some people in the industry propose that the XA transaction framework removes this step to reduce a network interaction and improve performance.

XA PREPARE, write binlog and redo log, pre-commit transactions, and save distributed transaction information to the global memory structure, so that other connections can query, roll back, commit, and roll back if prepare fails.

XA COMMIT, actually commits the transaction, modifies the transaction state, and releases the lock resources. If XA PREPARE has succeeded on the instance, then its XA COMMIT must succeed.

Example of XA transaction: user 201 transfers 1000 yuan to user 202, which is simplified as follows:

The first step is to open a distributed transaction, xa_ts:10001 is the global transaction number defined by the application layer, and instance 1 and instance 2 use it to build distributed transactions.

Steps 2 and 3 are ordinary transaction statements.

Step 4, declare the end of the xa transaction. After that, you can no longer append statements such as update and insert queries, which do not belong to this distributed transaction and are not allowed. Other statements are placed after xa commit or xa rollback.

In step 5, after the prepare is successful, the upper-layer application can initiate the transaction in step 6 to commit. Note that all nodes participating in this distributed transaction must prepare successfully, that is, both instance 1 and instance 2 have completed the prepare, and the application can initiate the submission. The core point of the two-phase submission framework is here.

If any node fails in the first 5 steps, all nodes participating in the distributed transaction must roll back. If example 2 is to add 1,000 yuan to the account, basically any situation can be successful, and step 5 can be successfully executed, but it is not necessarily the case in example 1. The account needs to be deducted by 1,000 yuan, and the funds may not be enough, and an error will be rolled back. If example 1 Cannot execute to prepare, all distributed transaction participants must also be rolled back, so instance 2 must also be rolled back. If step 5 is all successful, and one node executes step 6 and submits the transaction, then all nodes must submit, otherwise data inconsistency will result. In xa prepare, if it is not submitted, it will occupy resources. The residual xa transaction is equivalent to the existence of a long transaction, which has an impact on dirty brushing and purge. It is best for the business layer to submit it immediately.

(2) How to deal with residual XA transactions

It is mentioned above that the xa transaction does not commit and is equivalent to a long transaction. Once the prepare is successful, it must be committed immediately, otherwise it will bring many problems. However, the database crash or the application system crash and other reasons may cause the XA transactions to fail to be fully committed. How to deal with these remaining XA transactions? This uses the above XA RECOVER syntax, execute xa recover to view uncommitted XA transactions, and select the corresponding rollback or commit. If only the gtrid_length field has a value, it can generally be rolled back or committed in the form of xa rollback/commit xid, where xid is the data in xa recover.

If both gtrid_length and bqual_length have values, rollback or commit is relatively complicated, and it needs to be committed or rolled back in the following ways:

xa rollback/commit gtrid, bqual,formatid ;

gtrid and bqual are spliced in the data field, and need to be divided according to their lengths. Take the first uncommitted xa transaction as an example, gtrid_length is 34, which means that the first 34 characters in data are gtrid, and bqual_length is 22, which means data The last 22 characters in the middle are bqual, then the rollback or commit method can be expressed as follows:

xa rollback '10.177.197.41.tm163721155374124700', '10.177.197.41.tm881366', 1096044365;
xa commit '10.177.197.41.tm163721155374124700','10.177.197.41.tm881366',1096044365;

If there are other special characters in data, they can also be converted into hexadecimal integers for processing. The execution statement is as follows:

XA recover convert xid;---- convert hexadecimal display

Because it is a hexadecimal number, the characters are converted, and the number of characters in the data will be doubled. The rollback or submission content should be adjusted synchronously, and the characters in the data should also be doubled and then split. For example, the length of grtrid is 34. 34*2 hexadecimal numbers are gtrid, bqual length is 22, then the last 44 hexadecimal numbers are bqual, and the rollback or commit syntax is as follows:

xa rollback 0x31302E3137372E3139372E34312E746D313633373231313535323835323234363035, 0x31302E3137372E3139372E34312E746D383831323038,1096044365;
xa commit,
0x31302E3137372E3139372E34312E746D313633373231313535323835323234363035, 0x31302E3137372E3139372E34312E746D383831323038,1096044365;

Note: The above commit or rollback may report that the xid does not exist. This is not necessarily because the xid is written incorrectly. It may also be that the connection that opened the XA transaction has not been disconnected, and other connections cannot process the XA transaction. Here is the MySQL error. Inaccurate.

(3) The basis for committing or rolling back

The method of how to commit or rollback is given above, but which one should you choose to commit or rollback?

Whether the residual XA transaction is committed or rolled back must be determined by the business. Whoever starts the XA transaction and builds the distributed transaction manager TM must be responsible for the transaction to the end.

From the perspective of a single database, it is impossible to determine whether the XA transaction should be committed or rolled back. No matter which one is selected, it may cause global data errors. The operation and maintenance students must confirm with the business side whether the transaction should be committed or rolled back. Operate after authorization. Taking the above transfer as an example, user 201 transfers 1000 yuan to 202, all prepare is successful, and commit is initiated. At this time, user instance 202 fails to restart, and commit is not completed. After restarting, there are residual XA transactions. If 201 submits successfully, then 202 It must be submitted. If 201 is unsuccessful, 202 can be submitted together with 201 or rolled back together, which is determined by the application layer transaction manager TM. If 201 is submitted successfully, 202 is rolled back and 1000 is deducted from 201, 202 is not received, and the money is less for reconciliation. If 201 is rolled back and 202 is submitted, 202 will be added by 1000, 201 will not be deducted, and the reconciliation will result in more money.

2.3 The "pit" in MySQL XA transaction design

(1) Defects in design

Binlog-based master-slave replication is the cornerstone of MySQL's high availability, which is also the most important factor for MySQL's widespread popularity. In MySQL, for ordinary transactions (non-XA transactions), engines such as innodb and binlog use 2PC to maintain data consistency. In order to distinguish 2PC from XA transactions, it is called internal two-phase commit. The internal 2pc uses binlog as the coordinator (TM). When the internal prepare is written, redo is written first and then binlog is written, both of which are persistent (affected by the brushing parameter strategy) and then submitted. When a crash restart occurs, all prepared transactions will be recovered first, the xid transaction number in it will be taken out, and then it will be found in the coordinator Binlog. If there is this xid in the binlog, it means that both innodb and binlog are executed successfully, etc. If the two participating nodes of the external xa transaction are successfully prepared, the submission will continue. If it is not found in the binlog, it is just explained that it is only completed in the engine layer and needs to be rolled back. If the xid of an ongoing transaction is not found in the prepare, then It means that the prepare is not completed, and it is rolled back directly. This order must be to write the Redo log first, and then write the Binlog last.

So what is the state of a distributed transaction in the XA prepare state? Distributed XA transactions are also implemented based on ordinary transactions. In fact, it is an ordinary transaction that supports suspension, allows other sessions to continue to commit or rollback, and supports recovery of this suspended state after crash or restart.

The prepare action of a common transaction occurs after an explicit commit, writing redo first and then writing binlog. The prepare of the XA transaction occurs before the explicit XA commit, it needs to generate the binlog, and then write the redo, which is the opposite of the ordinary transaction, which leads to the lack of a coordinator for the internal 2pc commit of the external 2pc transaction, in some cases will cause database inconsistency.

The binlog of an XA transaction consists of two parts, from xa start to xa prepare is an indivisible atomic statement block, and xa commit is an atomic statement block, and each has its own gtid, as shown in the following binlog:

After the distributed transaction prepare with transaction number X'7831',X'',1, many ordinary transactions are inserted in the middle, and then the xa commit is executed.

The binlog of an XA transaction is divided into two independent parts. If a crash occurs after the master node generates the XA prepare binlog, the engine layer has not yet been prepared. After the restart, the engine layer rolls back because the prepare action is not completed. However, in the master-slave architecture, as long as the binlog is generated normally, it may be synchronized to the slave machine. In this case, there will be more intermediate transactions of this xa prepare on the slave machine, and eventually replication problems will occur. This problem has been discovered for many years, the official confirmed the bug, and it has not been fixed ( https://bugs.mysql.com/bug.php?id=87560 ).

(2) How to deal with this problem

Although we want to avoid failures as much as possible, we should also be prepared to face any failures, and then we will move forward, and there will be no chaos!

In a regular connection, after the XA transaction of MySQL executes prepare, it usually cannot execute other non-xa statements, and an error will be reported to remind that it is currently in an xa transaction. However, in the replicated sql playback thread, after executing xa prepare, you can directly execute sql of other non-xa transactions, because the XA transaction Binlog generated on the master side may be separate, as shown in the example above. Therefore, after the sql thread of the slave machine executes the binlog of xa prepare, it is allowed to execute the binlog of other transactions normally. If a crash occurs on the master in the xa preapre process, the binlog is just generated, but the subsequent prepare action is not completed. The standby machine receives the binlog of the xa preare action. After the master restarts, the transaction will be rolled back and the xa will not be generated again. The transaction is followed by binlog, which will cause the standby machine to hang after executing xa prepare, and the occupied locks and other resources will not be released. The problem will not be exposed until the newly synchronized binlog conflicts with it and reports an error.

There are two cases to fix:

Case 1: For gtid-based replication, a gtid duplication error should be reported directly (presumably, it cannot be reproduced locally). Restarting on the master should roll back the first half of the XA transaction, and the latter transaction will regenerate the transaction with the same gtid, resulting in a replication error. Stop replication at this time, roll back the half XA transaction on the standby machine, and reset the gtid to the previous one. gtid, you can rebuild and copy. Note that there may be multiple XA transactions in the prepared state in Binlog, and you need to parse the binlog to carefully determine which transaction to roll back.

Situation 2: The replication of gtid is not opened, which is more troublesome than the above situation. There is no gtid to determine whether the binlog transaction is repeated. As long as the subsequent transactions do not involve the resources locked by the half xa transaction, the standby machine can maintain the replication system normally. Synchronize data all the time, wait until there is an error in conflicting data, and after the playback thread retries more than a certain number of times (controlled by the slave_transaction_retries retry parameter), the SQL thread reports the corresponding error, and the replication is interrupted before it can be sensed. Restoring the data is similar to the above, roll back the XA transaction, and rebuild the master-slave, but the binlog of this transaction may not be found, because there is no gtid, it will not report an error immediately, it may report an error in a few minutes, or an error in a few months, depending on When does the business generate conflicting data. And after this transaction, the slave has synchronized a lot of data, and whether the data is reliable needs to be evaluated. It is strongly recommended to open the Gtid copy mode online, and non-gtid copies are officially being eliminated!

3 Consistency of Distributed Transactions

When using distributed transactions, it is necessary to ensure the consistency of distributed transactions.

The consistency of distributed transactions is divided into write consistency and read consistency. The write consistency XA framework XA prepare and XA commit have been resolved. As long as it is guaranteed that all commits are committed, and all rollbacks are rolled back, write consistency can be guaranteed.

Read consistency is much more complicated. Let's take a look at MySQL's official "few words" on XA transactions on read consistency:

The above content is intercepted from the official documentation, which contains a brief introduction to XA read consistency: if the application is sensitive to read, the SERIALIZABLE isolation level is preferred, and the RR level is not enough for distributed transactions. explanation, but we can construct an example to analyze whether this "may not be sufficien" is appropriate to describe read consistency.

As shown in the figure below, there are two accounts A and B on two instances. Suppose each account is initially 100 yuan, and A transfers 20 to B. The left side of the timeline is the operation on the A account instance, and the right side is the B account instance. , the middle T1 to T6 are different time points.

Time T1: Initially 100.

Time T2: Accounts AB have completed the xa prepare operation, one minus 20 and the other plus 20.

Time T3: Account A node XA commit is successful.

Moment T5: Account B's XA commit is successful.

When in the RR or RC isolation level, a reconciliation operation is initiated to count the total amount of funds in the AB account. When only they transfer each other, the total amount should always be 200. At time T6, query A is 80, B is 120, and the general ledger is 200, no problem. When querying account A at T4, it is 80. When querying account B, due to the MVCC mechanism, the value of 100 in the last snapshot will be read, and the sum will be 180. The general ledger is incorrect. Because it operates on different instances, when XA commit is started, it may not be guaranteed that XA commits of all nodes reach all nodes at the same time due to network and other reasons. In a high concurrency scenario, the above problems are almost inevitable. Therefore, when using MySQL's native XA distributed transaction, if there is no other means to ensure read consistency, and the application has cross-node read application scenarios, the serialization (SERIALIZABLE) isolation level should be used, "may not be sufficien" obviously It is inappropriate. No business can accept that this kind of statistics is wrong.

If it is the serialized isolation level, when A is 80 when read at T4, it will wait when reading B. After the successful XA commit at time T5, it can read that B is 120 and the general ledger is 200, no problem. The serialization isolation level is only read-read without blocking, read-write, write-read, and write-write will block, while RC and RR only write-write blocking, so only the serialization isolation level can fully guarantee the read of MySQL XA transactions consistency. But it blocks too much and has the worst performance of all isolation levels, so this isolation level is usually not used unless necessary. There are many solutions in the industry to solve the problem of read consistency under RR and RC of distributed transactions to improve database performance, but native MySQL does not have this capability. Therefore, businesses using MySQL's native XA transactions need to choose the isolation level carefully.

4 Summary

As long as we face the residual XA transactions carefully, handle the redundant binlog data that may exist after the crash carefully, and carefully evaluate whether there is a read consistency read problem using the RR and RC isolation levels, there are basically no other problems with MySQL XA transactions. As RM fully provides cross-node distributed transaction capabilities, MySQL has implemented the syntax functions in the distributed transaction processing specification defined by the X/Open organization, and you can rest assured that your business can run on this road!

About the author

Flyfox Senior Backend Engineer

Engaged in database kernel work for more than ten years, deeply participated in several self-developed database projects based on PostgreSQL and MySQL, and is currently responsible for the RDS product R&D team.

For more exciting content, please scan the code and follow the [OPPO Digital Intelligence Technology] public account

The "road" and "pit" of MySQL distributed transactions

1 database transaction

1.1 Ordinary local transactions

1.2 Distributed transaction based on 2pc

2 MySQL distributed transaction XA

2.1 XA Transaction Standard

2.2 MySQL XA syntax

(1) Grammar points

(2) How to deal with residual XA transactions

(3) The basis for committing or rolling back

2.3 The "pit" in MySQL XA transaction design

(1) Defects in design

(2) How to deal with this problem

3 Consistency of Distributed Transactions

4 Summary

OPPO数智技术

引用和评论

OPPO云数据库访问服务技术揭秘

做到真正0丢失、0重复：Apache SeaTunnel 实现万亿级数据一致性全解密

MySQL慢查询日志：性能优化的终极指南

MySQL 备份 Shell 脚本：支持远程同步与阿里云 OSS 备份

《SQL应用场景解析：如何通过SQL解决实际业务问题》

Devin 发布 DeepWiki，2 星的项目直接装出万星的气场

好用的开源埋点方案-ClkLog埋点用户分析系统