As microservice architectures grow in popularity, the problem of cross-service distributed transactions is inevitably encountered. Distributed transactions are hard mainly because unexpected situations can occur at any node of a distributed system. This article first introduces the anomalies that occur in distributed systems, then the challenges they pose for distributed transactions, then points out the problems with common approaches, and finally gives a correct solution.
The biggest enemy of a distributed system may be NPC, which here stands for Network Delay, Process Pause, and Clock Drift. Let's look at what each of these problems is:
- Network Delay. Although the network works well in most cases, and although TCP guarantees that data arrives in order and is not lost, it cannot eliminate network delay.
- Process Pause. A process can be paused for many reasons: for example, the GC (garbage collection) of a language runtime may suspend all running threads; as another example, a cloud server is sometimes suspended so that it can be migrated from one host to another without a restart. The length of a pause cannot be predicted with certainty. You may think a few hundred milliseconds is already long, but in fact it is not uncommon for a process to pause for a few minutes.
- Clock Drift. In real life we usually think of time as passing smoothly and increasing monotonically, but this is not so in computers. A computer keeps time with clock hardware, usually a quartz oscillator, which has limited accuracy and is affected by the temperature of the machine. To keep the clocks of multiple machines on a network roughly in sync, NTP is usually used to align the local clock with a dedicated time server. A direct consequence is that the local time of a machine may suddenly jump forward or backward.
Distributed transactions run on distributed systems, so they naturally face the NPC problems. Because timestamps are not involved, the trouble comes mainly from N and P.
Null compensation and suspension in TCC
We take TCC as an example to look at the impact of NP (readers who are not yet familiar with TCC may want to review the basics of distributed transactions first).
Under normal circumstances, a TCC rollback executes Try first and then Cancel. Because of N, however, the Try request may be delayed so long in the network that Cancel is executed first and Try afterwards.
This situation introduces two problems in distributed transactions:
- Null compensation: Cancel is executed although Try has not been executed. The Cancel operation of the transaction branch must detect that Try was not executed, skip the business data update in Cancel, and return directly.
- Suspension: Try is executed after Cancel has already been executed. The Try operation of the transaction branch must detect that Cancel has been executed, skip the business data update in Try, and return directly.
Distributed transactions must also handle another common problem: repeated requests, which require the business logic to be idempotent. Because null compensation, suspension, and repeated requests are all related to NP, we collectively call them the sub-transaction disorder problem. All three must be handled carefully in business code; otherwise incorrect data will result.
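To make the disorder problem concrete, here is a minimal Python sketch of a TCC branch with no protection at all. The class NaiveBranch, its methods, and the amounts are hypothetical names invented for illustration; they are not taken from any framework.

```python
class NaiveBranch:
    """Hypothetical TCC branch with no ordering or duplicate protection."""

    def __init__(self, balance):
        self.balance = balance
        self.frozen = {}  # gid -> amount reserved by Try

    def try_(self, gid, amount):
        # Try: reserve the amount for this global transaction.
        self.balance -= amount
        self.frozen[gid] = self.frozen.get(gid, 0) + amount

    def confirm(self, gid):
        # Confirm: the reserved amount leaves the account for good.
        self.frozen.pop(gid, 0)

    def cancel(self, gid):
        # Cancel: give back whatever Try reserved (nothing if Try never ran).
        self.balance += self.frozen.pop(gid, 0)


def replay(ops):
    """Apply a sequence of (operation, args) pairs to a fresh branch."""
    branch = NaiveBranch(balance=100)
    for name, args in ops:
        getattr(branch, name)(*args)
    return branch


# Expected rollback order: Try, then Cancel. The account is restored.
in_order = replay([("try_", ("g1", 30)), ("cancel", ("g1",))])
# Network delay: Cancel overtakes Try. The late Try reserves funds that
# nothing will ever release (suspension).
out_of_order = replay([("cancel", ("g1",)), ("try_", ("g1", 30))])
# Repeated request: Try is delivered twice, so the account is debited twice.
duplicated = replay(
    [("try_", ("g1", 30)), ("try_", ("g1", 30)), ("confirm", ("g1",))]
)

print(in_order.balance, in_order.frozen)          # 100 {}
print(out_of_order.balance, out_of_order.frozen)  # 70 {'g1': 30}
print(duplicated.balance, duplicated.frozen)      # 40 {}
```

Cancel in this sketch already tolerates a missing reservation, but because it records nothing, the late Try has no way to know that it should be skipped.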
Problems with existing solutions
Apart from the open source project https://github.com/yedf/dtm , cloud vendors and other open source projects mostly give implementation recommendations similar to the following:
- Null compensation: "To handle this problem, the service must be designed to allow null compensation: when the business primary key to be compensated is not found, return success for the compensation, record the original business primary key, and mark the business flow as successfully compensated."
- Anti-suspension: "Check whether the current business primary key already exists among the business primary keys recorded by null compensation. If it does, reject the service execution to avoid data inconsistency."
The above implementation runs correctly in most cases, but its "check first, then change" pattern easily falls into a trap under concurrency. Consider the following scenario:
- In the normal execution order, Try is executed. After it has checked that there is no null compensation record for the business primary key, but before its local transaction is committed, a process pause P occurs or a network request inside the transaction is congested, so the local transaction waits for a long time.
- The global transaction times out and Cancel is executed. Because the business primary key to be compensated is not found, Cancel judges this to be a null compensation and returns directly.
- The Try process resumes from its pause and finally commits its local transaction.
- The global transaction rollback has completed, but the business operation of the Try branch is never rolled back, which results in a suspension.
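The interleaving above can be replayed deterministically. The following Python sketch models Try as a generator so that it can be paused between its check and its write; the in-memory set and dict stand in for database tables, and all names (try_steps, cancel, order-1) are assumptions made for illustration.

```python
null_compensated = set()   # business keys recorded by null compensation
frozen = {}                # business key -> amount reserved by Try


def try_steps(key, amount):
    """Try written as 'check first, then change'. It is a generator so the
    caller can pause it between the check and the write."""
    if key in null_compensated:      # check: has this key been cancelled?
        return
    yield "checked"                  # process pause or slow network happens here
    frozen[key] = amount             # change: the local transaction commits


def cancel(key):
    """Cancel written as 'check first, then change'."""
    if key not in frozen:            # nothing to compensate: null compensation
        null_compensated.add(key)
        return "null-compensated"
    del frozen[key]
    return "compensated"


t = try_steps("order-1", 30)
next(t)                              # Try passes its check, then pauses
result = cancel("order-1")           # global timeout: Cancel runs meanwhile
for _ in t:                          # Try resumes and commits
    pass

print(result)   # null-compensated
print(frozen)   # {'order-1': 30}
```

Both operations passed their own checks, yet the reservation made by Try survives a global transaction that has already been rolled back.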
In fact, P and C in NPC, and combinations of the two, can lead to the above race condition in many different scenarios, which we will not enumerate one by one.
Although the probability of this happening is low, in the financial field, where account balances are involved, the impact can be huge.
PS: If idempotence control also uses "check first, then change", similar problems arise easily. The key to solving this class of problem is to use a unique index and to let the write itself replace the check ("change instead of check"), which avoids the race condition.
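As a minimal sketch of "change instead of check" for idempotence, the following Python example uses the standard-library sqlite3 module. The table names processed and account and the function debit_once are hypothetical; SQLite's INSERT OR IGNORE plays the role that insert ignore plays in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on request_id is the unique index that does the checking.
conn.execute("CREATE TABLE processed (request_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()


def debit_once(request_id, amount):
    """Apply the debit at most once per request_id.
    Returns True if the debit was applied, False for a repeated request."""
    with conn:  # one local transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed (request_id) VALUES (?)",
            (request_id,),
        )
        if cur.rowcount == 0:  # key already present: repeated request
            return False
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = 1", (amount,)
        )
        return True


first = debit_once("req-1", 30)
second = debit_once("req-1", 30)   # the retry is ignored
balance = conn.execute("SELECT balance FROM account").fetchone()[0]
print(first, second, balance)      # True False 70
```

There is no separate read before the write, so there is no window in which two requests can both pass the check; the unique index decides which one wins.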
Let's explain in detail how yedf/dtm solves this problem.
dtm pioneered the sub-transaction barrier technique, which solves the three problems of null compensation, suspension, and idempotence at the same time. For TCC transactions, it works as follows:
- Create the sub-transaction barrier table dtm_barrier.barrier in the local database, with a unique index on gid-branchid-branchop.
- For each Try, Confirm, or Cancel operation, within a local transaction, insert ignore a record gid-branchid-try|confirm|cancel. If the number of affected rows is 0 (repeated request or suspension), commit and return directly.
- For a Cancel operation, additionally insert ignore a record gid-branchid-try. If the number of affected rows is 1 (null compensation), commit and return directly.
- Execute the business logic, then commit and return. If a business error occurs, roll back.
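The steps above can be sketched in a few lines of Python with sqlite3. This is an illustration of the algorithm as described here, not dtm's actual SDK code: the function with_barrier, the simplified barrier table, and the account table are assumptions, and SQLite's INSERT OR IGNORE stands in for insert ignore.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE barrier (
    gid TEXT, branch_id TEXT, branch_op TEXT,
    UNIQUE (gid, branch_id, branch_op)
);
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, frozen INTEGER);
INSERT INTO account VALUES (1, 100, 0);
""")


def insert_ignore(gid, branch_id, op):
    """Insert a barrier record; return the number of affected rows (0 or 1)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO barrier VALUES (?, ?, ?)", (gid, branch_id, op)
    )
    return cur.rowcount


def with_barrier(gid, branch_id, op, business):
    """Run business(conn) in one local transaction guarded by the barrier.
    Returns True if the business logic ran, False if it was skipped."""
    with conn:  # commit on normal exit, roll back if business raises
        # 0 rows: repeated request, or a Try arriving after Cancel.
        if insert_ignore(gid, branch_id, op) == 0:
            return False
        # Cancel also claims the Try record; 1 row means Try never ran.
        if op == "cancel" and insert_ignore(gid, branch_id, "try") == 1:
            return False
        business(conn)
        return True


def try_freeze(c):
    c.execute("UPDATE account SET balance = balance - 30, frozen = frozen + 30")


def cancel_unfreeze(c):
    c.execute("UPDATE account SET balance = balance + 30, frozen = frozen - 30")


results = [
    with_barrier("g1", "b1", "cancel", cancel_unfreeze),  # null compensation
    with_barrier("g1", "b1", "try", try_freeze),          # suspension
    with_barrier("g2", "b1", "try", try_freeze),          # normal Try
    with_barrier("g2", "b1", "try", try_freeze),          # repeated request
    with_barrier("g2", "b1", "cancel", cancel_unfreeze),  # normal Cancel
]
print(results)  # [False, False, True, False, True]
print(conn.execute("SELECT balance, frozen FROM account").fetchone())  # (100, 0)
```

Because the barrier records and the business update share one local transaction, a business error rolls back the barrier records as well, so a later retry is not mistaken for a repeated request.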
If the execution times of Try and Cancel do not overlap, it is easy to verify that the above process solves null compensation and suspension. Let's see what happens when the execution times of Try and Cancel do overlap.
Suppose Try and Cancel are executed concurrently. Both will insert the same record gid-branchid-try. Because of the unique index conflict, only one of the two inserts can succeed; the other waits until the transaction holding the lock completes, and then returns.
- Case 1: Try fails to insert gid-branchid-try and Cancel inserts gid-branchid-try successfully. This is the typical null compensation and suspension scenario; according to the sub-transaction barrier algorithm, both Try and Cancel return directly.
- Case 2: Try inserts gid-branchid-try successfully and Cancel fails to insert gid-branchid-try. According to the sub-transaction barrier algorithm, the business logic executes normally, in the order Try before Cancel.
- Case 3: Try or Cancel encounters a crash or a similar failure during the overlapping period. At least Cancel will then be retried by dtm, and the execution eventually falls into case 1 or case 2.
To summarize the discussion of these cases, the sub-transaction barrier ensures the correctness of the final result under the various NP situations.
In fact, sub-transaction barriers have a number of advantages, including:
- Two insert checks solve the three problems of null compensation, suspension, and idempotence. Compared with other solutions, which handle the three cases separately, the logic is greatly simplified.
- dtm's sub-transaction barrier solves these three problems in the SDK layer, so business code does not need to care about them at all.
- High performance. For transactions that complete normally (failed transactions generally account for no more than 1%), the additional cost of the sub-transaction barrier is one SQL statement per branch operation, which is cheaper than other solutions.
The above theory and analysis process are also applicable to SAGA distributed transactions. The sub-transaction barrier in dtm supports both TCC and SAGA transaction modes.
DTM is a distributed transaction manager written in Go that solves the consistency problem of updating data across databases, services, and language stacks.
The following compares the main features of dtm and Alibaba's open source seata:
| Feature | dtm | seata | Remarks |
| --- | --- | --- | --- |
| Supported languages | Go, Java, Python, PHP, C#... | Java | dtm can easily be accessed from a new language |
| Exception handling | Handled automatically by the sub-transaction barrier | Manual handling | dtm solves idempotence, suspension, and null compensation |
| AT transactions | XA is recommended | ✓ | AT is similar to XA, with better performance, but with dirty rollback |
| SAGA transactions | Supports concurrency | State machine mode | |
| Transactional messages | ✓ | ✗ | dtm provides transactional messages similar to RocketMQ |
| Single service with multiple data sources | ✓ | ✗ | |
| Communication protocol | HTTP, gRPC | dubbo and other protocols | dtm is more cloud-native |
If your language stack includes languages other than Java, dtm is your first choice. If your language stack is Java, you can still connect to dtm and use the sub-transaction barrier technique to simplify your business code, so that a TCC distributed transaction in Java handles null compensation, suspension, and idempotence automatically.
If you want to learn about distributed transactions, dtm's documentation is well regarded. It combines theory with practice and lets readers get started quickly and deepen their understanding step by step.
Everyone is welcome to visit https://github.com/yedf/dtm and to open issues, submit PRs, and star the project.