In-depth analysis of Saga distributed transactions

Saga is one of the most important transaction modes in distributed transactions, and it is especially suited to long-running transactions such as travel booking. This article analyzes the design principles of saga transactions and the best practices for solving a ticket-booking problem with them.

The theoretical source of saga

The saga transaction model originates from the paper "Sagas" by Hector Garcia-Molina and Kenneth Salem (1987).

In this paper, the authors propose splitting a long transaction into multiple sub-transactions. Each sub-transaction has a forward operation Ti and a reverse compensation operation Ci.

If all the sub-transactions Ti complete successfully in order, the global transaction is complete.

If a sub-transaction Ti fails, then Ci, Ci-1, Ci-2, ... are called in turn to compensate.
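The forward and compensation logic described above can be sketched in a few lines of Go. This is only an illustration of the idea, not code from the paper or from any library: the `Step` type, `RunSaga`, and the step names are hypothetical. Following the description above, the failed step's own compensation Ci is called as well, so compensations are assumed to tolerate a forward operation that had no effect.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a forward operation Ti with its compensation Ci.
type Step struct {
	Name       string
	Forward    func() error
	Compensate func() error
}

// RunSaga runs the steps in order. If step i fails, it calls
// Ci, Ci-1, ..., C0 in reverse order and returns the original error.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Forward(); err != nil {
			for j := i; j >= 0; j-- {
				// A real system would retry a failing compensation until it succeeds.
				if cerr := steps[j].Compensate(); cerr != nil {
					return fmt.Errorf("compensating %s: %w", steps[j].Name, cerr)
				}
			}
			return fmt.Errorf("step %s failed, saga rolled back: %w", s.Name, err)
		}
	}
	return nil
}

// demoTrace runs a three-step saga whose last step fails and returns the call trace.
func demoTrace() []string {
	var trace []string
	mk := func(name string, fail bool) Step {
		return Step{
			Name: name,
			Forward: func() error {
				trace = append(trace, "T "+name)
				if fail {
					return errors.New("booking rejected")
				}
				return nil
			},
			Compensate: func() error {
				trace = append(trace, "C "+name)
				return nil
			},
		}
	}
	_ = RunSaga([]Step{mk("flight", false), mk("hotel", false), mk("return", true)})
	return trace
}

func main() {
	fmt.Println(demoTrace())
}
```

In the demo the third forward operation fails, so the trace shows three forward calls followed by three compensations in reverse order.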

After laying out this basic saga logic, the paper goes on to discuss how to handle the following scenarios.

Rollback and retry

When a saga transaction encounters a failure during execution, there are two options: roll back, or retry and continue.

The rollback mechanism is relatively simple: before proceeding to each step, record the next operation at a save point. Once a problem occurs, roll back from the save point by performing all compensation operations in reverse order.

Consider a long transaction that has been running for a day and is interrupted by a temporary failure such as a server restart. If rollback were the only option, that would be unacceptable to the business. The better strategy is to retry from the save point and let the transaction continue until it completes.

To support retrying forward, all sub-transactions of the global transaction must be arranged and saved in advance. After a failure, the unfinished progress is read back and execution continues with retries.
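A minimal sketch of retrying from a save point, under stated assumptions: the `SagaRecord` type, `Resume`, and the step names are hypothetical, and the record lives in memory here where a real system would persist it to durable storage. The save point advances only after a step succeeds, so each step is assumed to be idempotent.

```go
package main

import (
	"errors"
	"fmt"
)

// SagaRecord is what would be saved before execution starts: the full list
// of sub-transactions plus the index of the next one to run (the save point).
type SagaRecord struct {
	Steps []string // sub-transactions, arranged in advance
	Next  int      // save point: index of the first unfinished step
}

// Resume continues from the save point. Next advances only after a step
// succeeds, so after a failure the record still points at the step to retry.
func Resume(rec *SagaRecord, run func(step string) error) error {
	for rec.Next < len(rec.Steps) {
		if err := run(rec.Steps[rec.Next]); err != nil {
			return err // record unchanged; call Resume again later
		}
		rec.Next++ // a real system would persist the new save point here
	}
	return nil
}

// demo simulates a transient failure on the second step, followed by a retry.
func demo() (firstErr error, rec SagaRecord, calls []string) {
	rec = SagaRecord{Steps: []string{"reserve", "charge", "notify"}}
	failOnce := true
	run := func(step string) error {
		calls = append(calls, step)
		if step == "charge" && failOnce {
			failOnce = false
			return errors.New("server restarting")
		}
		return nil
	}
	firstErr = Resume(&rec, run) // stops at "charge"
	_ = Resume(&rec, run)        // the retry continues from the save point
	return firstErr, rec, calls
}

func main() {
	firstErr, rec, calls := demo()
	fmt.Println(firstErr, rec.Next, calls)
}
```

The second call to `Resume` does not repeat "reserve": it picks up at "charge", which is why the call list contains "charge" twice and every other step once.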

Concurrent execution

For long transactions, concurrent execution is also crucial. A long transaction that takes a day when run serially may finish in half a day with parallelism, which helps the business considerably.

In some scenarios, concurrent execution of sub-transactions is a hard business requirement. For example, when booking several tickets whose confirmation takes a long time, you should not wait for one ticket to be confirmed before booking the next; doing so significantly lowers the booking success rate.

Supporting rollback and retry when sub-transactions execute concurrently is more challenging and involves more complex save points.
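As a rough illustration of why concurrent rollback is harder, the sketch below starts all forward operations at once and, if any fails, compensates every branch, since any of them may already have taken effect. The `Branch` type, `RunConcurrent`, and the branch names are hypothetical, and the sketch keeps no save points and does no retrying.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Branch pairs a forward operation with its compensation.
type Branch struct {
	Name       string
	Forward    func() error
	Compensate func() error
}

// RunConcurrent starts every forward operation at once and waits for all of
// them. If any branch fails, every branch is compensated in reverse order;
// compensations must tolerate a forward operation that had no effect.
func RunConcurrent(branches []Branch) error {
	errs := make([]error, len(branches))
	var wg sync.WaitGroup
	for i, b := range branches {
		wg.Add(1)
		go func(i int, b Branch) {
			defer wg.Done()
			errs[i] = b.Forward() // each goroutine writes only its own slot
		}(i, b)
	}
	wg.Wait()

	var failed error
	for _, err := range errs {
		if err != nil {
			failed = err
			break
		}
	}
	if failed == nil {
		return nil
	}
	for i := len(branches) - 1; i >= 0; i-- {
		_ = branches[i].Compensate() // a real system would retry on error
	}
	return failed
}

// demo books three branches concurrently; the hotel branch fails.
func demo() (compensated []string, err error) {
	mk := func(name string, fail bool) Branch {
		return Branch{
			Name: name,
			Forward: func() error {
				if fail {
					return errors.New(name + ": booking rejected")
				}
				return nil
			},
			Compensate: func() error {
				compensated = append(compensated, name)
				return nil
			},
		}
	}
	err = RunConcurrent([]Branch{mk("outbound", false), mk("hotel", true), mk("return", false)})
	return compensated, err
}

func main() {
	fmt.Println(demo())
}
```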

Implementation classification of saga

Many saga implementations are currently available, and all of them provide the basic saga functions.

They can be roughly divided into two categories:

State machine implementation

A typical implementation of this type is Seata's saga, which introduces a state machine defined in a DSL and lets users do the following:

Decide what to do next after a sub-transaction ends, according to its result
Save the result of a sub-transaction to the state machine and use it as input to later sub-transactions
Execute sub-transactions that have no dependencies on each other concurrently

The advantage of this approach is:

It is powerful: transactions can be flexibly customized

The disadvantages are:

The state machine has a very high barrier to entry. You need to understand the DSL, which is hard to read and hard to debug when problems occur. The official example is a global transaction with two sub-transactions, and its state machine definition in JSON is about 95 lines long, which makes it hard to get started.
The interface is intrusive: only specific input and output parameter types can be used. In the cloud-native era this is unfriendly to strongly typed gRPC (with gRPC, the user-defined input and output pb files are not available to the TM, so it cannot parse fields out of the results).

Non-state machine implementation

This type of implementation includes Eventuate's saga and dtm's saga.

These implementations do not introduce a new DSL to build a state machine. Instead, they use a function-style interface to define each branch transaction under the global transaction.

Advantages:

Easy to use and easy to maintain

Disadvantages:

It is hard to achieve the flexible customization that a state machine allows

PS: The author of Eventuate also uses the name saga for a pattern based on collaboration through event subscription. Because of his influence, many articles mention that pattern when introducing saga. In fact, it is related neither to the original saga paper nor to the saga models implemented by various companies, so it is not discussed further here.

There are many other saga implementations, such as servicecomb-pack, Camel, and hmily. Because of limited time, they have not been studied one by one. This article will be updated after further research.

dtm's saga design

dtm supports both the TCC and saga modes. The two have different characteristics, suit different business scenarios, and complement each other.

[Table: comparison of the TCC and SAGA transaction modes]

TCC is positioned for short transactions with high consistency requirements. Transactions with high consistency requirements are generally short (if a transaction stays unfinished for a long time, users already perceive its consistency as poor, so a highly consistent design such as TCC is usually unnecessary). For this reason, TCC's branch orchestration is placed on the AP side (that is, in the application code), where the user can call branches flexibly and make decisions based on the result of each branch.

SAGA is positioned for long transactions and for short transactions with lower consistency requirements. A scenario like booking air tickets lasts a long time, anywhere from a few minutes to one or two days. The entire transaction orchestration must be saved on the server, so that the information is not lost when the application that initiated the global transaction is upgraded, fails, or stops for any other reason.

The flexibility of a state machine is unnecessary for TCC, which is orchestrated on the client side, but it is meaningful for saga, which is stored on the server side. When I first designed saga, I weighed the trade-offs carefully. The state machine approach is very hard to get started with, and users are easily discouraged. I did requirements research with some users, and the core requirements I summarized are:

Sub-transactions must execute concurrently to reduce latency. For example, in a travel booking business, confirming each leg of a round-trip air ticket may take a long time. If the return ticket is booked only after the outbound ticket is confirmed, booking the return ticket is likely to fail.
Some operations cannot be rolled back. They must be placed after the sub-transactions that can be rolled back, to ensure that once they execute, the transaction will eventually succeed.

Given these two core requirements, dtm's saga does not adopt a state machine, but it supports concurrent execution of sub-transactions and lets you specify ordering constraints between them.

Below we use a practical problem to explain how saga is used in dtm.

In ticket booking services, the result of a sub-transaction is not returned immediately. Typically the flight is booked first, and the third party notifies the result some time later. dtm's saga supports this situation well: a sub-transaction can return an "in progress" result, and the retry interval can be specified. The booking sub-transaction handles this in its own logic: if the order has not been placed yet, it places the order; if the order has already been placed, the request is a retry, so it queries the third party for the result. Either way, it finally returns success, failure, or in progress.

Problem solving examples

We use a real user case to explain the best practices for dtm's saga.

Problem scenario: a travel application receives a user's travel plan and needs to book an air ticket to Sanya, a hotel in Sanya, and a return air ticket.

Requirements:

1. The air tickets and the hotel are either all booked successfully or all rolled back (the hotel and the airlines provide the corresponding rollback interfaces)
2. The air tickets and the hotel are booked concurrently rather than serially, so that a late final confirmation of one reservation does not cause the others to miss their booking window
3. Confirmation of a booking result may take anywhere from 1 minute to 1 day

These requirements are exactly the problems that the saga transaction mode is meant to solve. Let's look at how dtm solves them (using Go as an example).

First, we create a saga transaction according to requirement 1. This saga contains three branches: booking the air ticket to Sanya, booking the hotel, and booking the return air ticket.

    saga := dtmcli.NewSaga(DtmServer, gid).
        Add(Busi+"/BookTicket", Busi+"/BookTicketRevert", bookTicketInfo1).
        Add(Busi+"/BookHotel", Busi+"/BookHotelRevert", bookHotelInfo2).
        Add(Busi+"/BookTicket", Busi+"/BookTicketRevert", bookTicketBackInfo3)

Then, according to requirement 2, we make the saga execute concurrently (the default is sequential execution).

    saga.EnableConcurrent()

Finally, we deal with requirement 3: the booking result is not confirmed immediately. Because the response is not immediate, the booking operation cannot block while waiting for the third party's result. Instead, after submitting the booking request, it returns the status ONGOING (in progress) right away. Since the branch transaction is not yet complete, dtm will retry the branch; we set the retry interval to 1 minute.

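    saga.SetOptions(&dtmcli.TransOptions{RetryInterval: 60}) // 60 means a 1-minute retry interval
    saga.Submit()
    // ........

    // bookTicket is the branch handler. loadOrder, submitTicketOrder, and the
    // order methods stand for the application's own code.
    func bookTicket() string {
        order := loadOrder()
        if order == nil { // no order placed yet: place one with the third party
            order = submitTicketOrder()
            order.save()
        }
        order.Query() // query the third party for the order status
        return order.Status // SUCCESS, FAILURE, or ONGOING (in progress)
    }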
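The handler logic above can be made concrete with an in-memory sketch. Everything here is an assumption made for illustration: the `ThirdParty` interface, the `TicketService` type, and a fake airline that confirms an order on the third query. It does not use the dtm SDK; it only shows that repeated calls with the same global transaction id place a single order and then poll its status.

```go
package main

import "fmt"

// Order is the locally stored booking record.
type Order struct {
	ID     string
	Status string
}

// ThirdParty is a stand-in for the airline's API.
type ThirdParty interface {
	Submit(gid string) string    // places an order and returns its id
	Query(orderID string) string // returns SUCCESS, FAILURE, or ONGOING
}

// TicketService stores orders keyed by global transaction id.
type TicketService struct {
	orders map[string]*Order
	api    ThirdParty
}

// BookTicket is safe to call repeatedly: the first call places the order,
// later calls (retries) only query the third party for the result.
func (s *TicketService) BookTicket(gid string) string {
	order, ok := s.orders[gid]
	if !ok {
		order = &Order{ID: s.api.Submit(gid)}
		s.orders[gid] = order
	}
	order.Status = s.api.Query(order.ID)
	return order.Status
}

// fakeAirline confirms an order on the third query.
type fakeAirline struct{ submits, queries int }

func (f *fakeAirline) Submit(gid string) string {
	f.submits++
	return "order-" + gid
}

func (f *fakeAirline) Query(orderID string) string {
	f.queries++
	if f.queries < 3 {
		return "ONGOING"
	}
	return "SUCCESS"
}

// demo calls the handler three times, as a retrying coordinator would.
func demo() (results []string, submits int) {
	air := &fakeAirline{}
	svc := &TicketService{orders: map[string]*Order{}, api: air}
	for i := 0; i < 3; i++ {
		results = append(results, svc.BookTicket("gid-1"))
	}
	return results, air.submits
}

func main() {
	fmt.Println(demo())
}
```

The first two calls return ONGOING and the third returns SUCCESS, while the order is submitted to the third party only once.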

Advanced usage

In practice, I have also encountered business scenarios that need some additional techniques.

Support retry and rollback

dtm requires the business to explicitly return one of the following values:

SUCCESS means the branch succeeded and the transaction can proceed to the next step
FAILURE means the branch failed; the global transaction has failed and must be rolled back
ONGOING means the branch is in progress; it will be retried at the normal interval
Any other value indicates a system problem; the branch will be retried with exponential backoff
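The four cases above can be written down as a small decision function. This is a sketch of the rule as described here, not dtm's actual scheduler: the `Action` type, `Decide`, the doubling factor, and the cap on the exponent are all assumptions chosen for illustration.

```go
package main

import "fmt"

// Action is what the coordinator does after a branch returns.
type Action int

const (
	Proceed  Action = iota // move on to the next step
	Rollback               // fail the global transaction and compensate
	Retry                  // call the same branch again after a delay
)

// Decide maps a branch result to the next action and the delay in seconds.
// intervalSec is the configured retry interval; attempt counts consecutive
// unexpected results, starting at 1.
func Decide(result string, intervalSec int64, attempt int) (Action, int64) {
	switch result {
	case "SUCCESS":
		return Proceed, 0
	case "FAILURE":
		return Rollback, 0
	case "ONGOING":
		return Retry, intervalSec // normal, fixed interval
	default:
		// Exponential backoff (assumed shape): interval, 2x, 4x, ...
		// The exponent is capped so the delay cannot grow without bound.
		if attempt < 1 {
			attempt = 1
		}
		if attempt > 10 {
			attempt = 10
		}
		return Retry, intervalSec << uint(attempt-1)
	}
}

func main() {
	for _, r := range []string{"SUCCESS", "FAILURE", "ONGOING", "timeout"} {
		action, delay := Decide(r, 60, 3)
		fmt.Println(r, action, delay)
	}
}
```

With a 60-second interval, an ONGOING result is always retried after 60 seconds, while a third consecutive unexpected result is retried after 240 seconds.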

Some third-party operations cannot be rolled back

For example, once a shipping instruction has been issued for an order, offline operations are involved, so it is hard to roll back directly. How should a saga handle this kind of situation?

We divide the operations in a transaction into those that can be rolled back and those that cannot. Put the operations that can be rolled back first and run the ones that cannot be rolled back afterwards, and this kind of problem is solved.

    saga := dtmcli.NewSaga(DtmServer, dtmcli.MustGenGid(DtmServer)).
        Add(Busi+"/CanRollback1", Busi+"/CanRollback1Revert", req).
        Add(Busi+"/CanRollback2", Busi+"/CanRollback2Revert", req).
        Add(Busi+"/UnRollback1", Busi+"/UnRollback1NoRevert", req).
        EnableConcurrent().
        AddBranchOrder(2, []int{0, 1}) // step 2 must run after steps 0 and 1 have completed
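To make the ordering constraint concrete, the sketch below computes the groups of branches that may run together, given the same kind of "branch 2 runs after branches 0 and 1" constraint. It illustrates the semantics only and is not how dtm schedules branches internally; `Waves` is a hypothetical helper, and a real scheduler would start each branch as soon as its own predecessors finish rather than waiting for a whole group.

```go
package main

import (
	"errors"
	"fmt"
)

// Waves groups n branches into rounds. order[b] lists the branches that must
// complete before branch b may start. Branches in the same round have no
// unfinished predecessors and can run concurrently.
func Waves(n int, order map[int][]int) ([][]int, error) {
	done := make([]bool, n)
	var waves [][]int
	for finished := 0; finished < n; {
		var wave []int
		for b := 0; b < n; b++ {
			if done[b] {
				continue
			}
			ready := true
			for _, pre := range order[b] {
				if pre < 0 || pre >= n || !done[pre] {
					ready = false
					break
				}
			}
			if ready {
				wave = append(wave, b)
			}
		}
		if len(wave) == 0 {
			return nil, errors.New("branch order contains a cycle or an unknown branch")
		}
		// Mark the round as done only after it is fully computed, so a branch
		// cannot satisfy a dependency of another branch in the same round.
		for _, b := range wave {
			done[b] = true
		}
		finished += len(wave)
		waves = append(waves, wave)
	}
	return waves, nil
}

func main() {
	// Branch 2 (cannot be rolled back) waits for branches 0 and 1.
	fmt.Println(Waves(3, map[int][]int{2: {0, 1}}))
}
```

For the three-branch saga above, branches 0 and 1 form the first round and branch 2 runs alone in the second, so the operation that cannot be rolled back starts only after everything that can be rolled back has succeeded.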

Timeout rollback

A saga is a long transaction whose duration can vary widely, from 100ms to 1 day, so saga has no default timeout.

dtm lets each saga transaction specify its own timeout. When the timeout is reached, the global transaction is rolled back.

    saga.SetOptions(&dtmcli.TransOptions{TimeoutToFail: 1800})

Be careful when setting a timeout on a saga transaction. A saga with a timeout must not contain branches that cannot be rolled back; otherwise, rolling back such a branch after the timeout will cause problems.

The results of other branches are used as input

A small number of real businesses need to know not only whether a transaction branch succeeded, but also the detailed result data it produced. For example, branch B may need the detailed data returned by the successful execution of branch A. How does dtm handle this?

dtm's recommended approach is to provide an additional interface in ServiceA so that B can fetch the data it needs. This is slightly less efficient, but it is easy to understand and maintain, and the development effort is modest.

PS: One small detail: try to make network requests outside of your local transaction, so that the transaction does not stay open longer and cause concurrency problems.
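A minimal sketch of this recommendation, with everything in one process for brevity: `ServiceA` records its detailed result keyed by the global transaction id and exposes a `Result` query, and branch B calls it. All names and the sample data are hypothetical; in a real deployment `Result` would be a remote HTTP or gRPC interface.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ServiceA keeps the detailed result of its branch, keyed by global transaction id.
type ServiceA struct {
	mu      sync.Mutex
	results map[string]string
}

func NewServiceA() *ServiceA {
	return &ServiceA{results: map[string]string{}}
}

// Execute is branch A's forward operation; it records its detailed result.
func (a *ServiceA) Execute(gid string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.results[gid] = "seat 12A" // sample data for the sketch
}

// Result is the additional query interface that other branches can call.
func (a *ServiceA) Result(gid string) (string, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	r, ok := a.results[gid]
	return r, ok
}

// ExecuteB is branch B's forward operation. It fetches A's data first,
// before it would open any local database transaction of its own.
func ExecuteB(a *ServiceA, gid string) (string, error) {
	data, ok := a.Result(gid)
	if !ok {
		return "", errors.New("result of branch A is not available yet")
	}
	return "B processed " + data, nil
}

func main() {
	a := NewServiceA()
	a.Execute("gid-1")
	fmt.Println(ExecuteB(a, "gid-1"))
}
```

Because B asks A directly, the transaction coordinator never has to understand or store the business payload.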

Summary

This article summarized the theory and design principles behind saga, and compared different saga implementations along with their advantages and disadvantages. Finally, it used a real problem to explain in detail how to use dtm's saga transactions.

dtm is a one-stop distributed transaction solution. It supports multiple transaction modes, including transaction messaging, SAGA, TCC, and XA, and provides SDKs for languages such as Go, Java, Python, PHP, C#, and Node.

The project documentation also explains in detail the fundamentals, design concepts, and latest theory of distributed transactions, and is good material for learning about the subject.

You are welcome to visit yedf/dtm and support us with issues, PRs, and stars.

叶东富