The architectural revolution of message eventual consistency



Updating data across services is a common task in application development. When key data has strict consistency requirements and the business does not need rollback, a local message table is the usual way to ensure eventual consistency. Many companies facing cross-service consistency problems start with local message tables and introduce further transaction modes as their business scenarios grow more complex.

The second-order message proposed in this article (a message submitted in two phases) is a new model and architecture that solves the eventual consistency of messages. Where a local message table or transactional message takes hundreds of lines of code, the same problem can be handled in about five or six lines, which simplifies the architecture and improves development efficiency.

Below, we use an inter-bank transfer as an example to explain the new architecture in detail. The business scenario is as follows:

We need to transfer 30 yuan from A to B across banks. We first perform the transfer-out operation TransOut, which may fail: it deducts 30 yuan from A. If the deduction fails because A's balance is insufficient, the transfer fails immediately and an error is returned. If the deduction succeeds, we proceed to the transfer-in operation TransIn. Because a transfer-in cannot fail for lack of balance, we can assume that it will succeed.
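To make the scenario concrete, here is a minimal in-memory sketch of the two operations. The Bank type, the Transfer helper, and the error value are hypothetical names used only for illustration. The sketch runs both steps in one process and deliberately ignores crashes between them, which is exactly the gap the rest of this article deals with.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientBalance is a hypothetical error used only in this sketch.
var ErrInsufficientBalance = errors.New("insufficient balance")

// Bank is an in-memory stand-in for one bank's account store.
type Bank struct {
	balances map[string]int
}

// TransOut deducts amount from uid and fails if the balance is too low.
func (b *Bank) TransOut(uid string, amount int) error {
	if b.balances[uid] < amount {
		return ErrInsufficientBalance
	}
	b.balances[uid] -= amount
	return nil
}

// TransIn credits amount to uid. It has no business reason to fail.
func (b *Bank) TransIn(uid string, amount int) {
	b.balances[uid] += amount
}

// Transfer runs TransOut first and only proceeds to TransIn on success.
func Transfer(from, to *Bank, fromUID, toUID string, amount int) error {
	if err := from.TransOut(fromUID, amount); err != nil {
		return err // the transfer fails directly, nothing needs undoing
	}
	to.TransIn(toUID, amount)
	return nil
}

func main() {
	bankA := &Bank{balances: map[string]int{"A": 100}}
	bankB := &Bank{balances: map[string]int{"B": 0}}
	fmt.Println(Transfer(bankA, bankB, "A", "B", 30), bankA.balances["A"], bankB.balances["B"])
	fmt.Println(Transfer(bankA, bankB, "A", "B", 500))
}
```

In a real deployment TransOut and TransIn live in different services, so a crash after the deduction could leave B without the money.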

Developing with the new architecture

The new architecture is based on the distributed transaction manager dtm-labs/dtm.

The core code to accomplish the above tasks is as follows:

// Create a msg global transaction and register TransIn as its downstream branch.
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
// Run the local deduction and submit the msg as one atomic step.
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPreparedB", db, func(tx *sql.Tx) error {
    return busi.SagaAdjustBalance(tx, busi.TransOutUID, -req.Amount, "SUCCESS")
})

The code above uses HTTP. gRPC access is basically the same and is not repeated here. Readers who need it can refer to the examples in dtm-labs/dtm-examples.

In this code:

  • First, create a DTM msg global transaction, passing the dtm server address and the global transaction id.
  • Add a branch to the msg. The branch here is the transfer-in operation TransIn, together with the data it needs: the amount of 30 yuan.
  • Then call DoAndSubmitDB on the msg. This function guarantees that the business execution and the submission of the msg global transaction either both succeed or both fail.

    1. The first parameter is the checkback URL; its meaning is explained in detail later.
    2. The second parameter is the sql.DB, the database object the business accesses.
    3. The third parameter is the business function. In our example it deducts 30 yuan from A's balance.
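The order of calls behind a function like DoAndSubmitDB can be sketched as follows. This is a simplified illustration and not dtm's implementation: the Coordinator interface, memCoordinator, doAndSubmit, and errBusiness are hypothetical names, and the sketch assumes the local transaction has already rolled back whenever the business function returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errBusiness is a hypothetical business failure used in this sketch.
var errBusiness = errors.New("insufficient balance")

// Coordinator is a hypothetical stand-in for the dtm server's msg endpoints.
type Coordinator interface {
	Prepare(gid string) error
	Submit(gid string) error
	Abort(gid string) error
}

// memCoordinator keeps the latest state of each gid in memory.
type memCoordinator struct {
	state map[string]string
}

func (m *memCoordinator) Prepare(gid string) error { m.state[gid] = "prepared"; return nil }
func (m *memCoordinator) Submit(gid string) error  { m.state[gid] = "submitted"; return nil }
func (m *memCoordinator) Abort(gid string) error   { m.state[gid] = "aborted"; return nil }

// doAndSubmit calls Prepare, runs the local business, then calls Submit on
// success or Abort on a business failure.
func doAndSubmit(c Coordinator, gid string, busi func() error) error {
	if err := c.Prepare(gid); err != nil {
		return err
	}
	if err := busi(); err != nil {
		// Assumption: the local transaction rolled back before we get here.
		if abortErr := c.Abort(gid); abortErr != nil {
			return abortErr
		}
		return err
	}
	// If the process dies before this call, the coordinator has to fall
	// back to the checkback URL to learn the outcome.
	return c.Submit(gid)
}

func main() {
	c := &memCoordinator{state: map[string]string{}}
	_ = doAndSubmit(c, "gid-ok", func() error { return nil })
	err := doAndSubmit(c, "gid-fail", func() error { return errBusiness })
	fmt.Println(c.state["gid-ok"], c.state["gid-fail"], err)
}
```

The interesting case is the one the sketch cannot show in-process: a crash between the business function and Submit, which is why the first parameter, the checkback URL, exists.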

Because the TransOut and TransIn operations no longer live in the same service, the process may crash after one operation has run, leaving the other unperformed. In that case dtm calls the checkback URL to find out whether the TransOut operation completed successfully. In dtm, the checkback only requires pasting the following code; the framework completes the checkback logic automatically:

app.GET(BusiAPI+"/QueryPreparedB", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    // Report whether the local transaction for this gid was committed.
    return MustBarrierFromGin(c).QueryPrepared(dbGet())
}))

At this point, a complete two-phase (second-order) message service is in place. Its integration complexity and code size are far lower than those of existing solutions such as local message tables, which makes it a strong choice for this type of problem. You can run a complete example as follows:

run dtm

 git clone && cd dtm
go run main.go

run the example

 git clone && cd dtm-examples 
go run main.go http_msg_doAndCommit

Success flow

How does DoAndSubmitDB ensure that the business execution and the msg submission are atomic? Please see the sequence diagram below:

Under normal circumstances, all five steps in the sequence diagram complete, the business proceeds as expected, and the global transaction finishes. One new point needs explaining: the msg is submitted in two phases. The first phase calls Prepare and the second calls Submit. After DTM receives the Prepare call, it does not call the branch transaction but waits for the subsequent Submit. Only after it receives the Submit does it start calling the branches and finally complete the global transaction.
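Seen from the coordinator's side, the two phases can be sketched as a small state machine. This is an in-memory illustration under our own assumptions, not dtm's code: the server and globalTx types and the status strings are hypothetical, and "calling a branch" is reduced to recording its URL.

```go
package main

import "fmt"

// globalTx is a hypothetical record of one msg global transaction.
type globalTx struct {
	status   string
	branches []string
}

// server is a hypothetical in-memory model of the coordinator.
type server struct {
	txs    map[string]*globalTx
	called []string // branch URLs that have been invoked, in order
}

// Prepare only stores the transaction; no branch is called yet.
func (s *server) Prepare(gid string, branches []string) {
	s.txs[gid] = &globalTx{status: "prepared", branches: branches}
}

// Submit starts the branch calls and completes the global transaction.
func (s *server) Submit(gid string, branches []string) {
	tx, ok := s.txs[gid]
	if !ok {
		// Submit without a prior Prepare behaves like an ordinary message.
		tx = &globalTx{branches: branches}
		s.txs[gid] = tx
	}
	tx.status = "submitted"
	for _, url := range tx.branches {
		s.called = append(s.called, url) // stands in for an HTTP call
	}
	tx.status = "succeed"
}

func main() {
	s := &server{txs: map[string]*globalTx{}}
	branches := []string{"/TransIn"}
	s.Prepare("gid-1", branches)
	fmt.Println(s.txs["gid-1"].status, len(s.called))
	s.Submit("gid-1", branches)
	fmt.Println(s.txs["gid-1"].status, len(s.called))
}
```

Keeping Prepare free of side effects on the downstream services is what lets the coordinator safely discard a transaction whose local business never committed.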

Crash after commit

In a distributed system, all kinds of downtime and network anomalies must be considered. Let's look at the problems that can occur.

The most important goal is that the business execution and the msg transaction form one atomic operation. So let's first see what happens if a crash occurs after the local transaction commits but before the Submit message is sent. How does the new architecture ensure atomicity?

Let's look at the sequence diagram for this case:

What if the process or the machine crashes after the local transaction is committed and before the Submit is sent? After a certain timeout, DTM picks up the msg transactions that were prepared but never submitted and calls the checkback service specified by each of them.

You do not need to write the checkback logic by hand. Just wire it up with the code given earlier, and it will query the table to see whether the local transaction has been committed:

  • Committed: returns success, and dtm calls the next sub-transaction
  • Rolled back: returns failure, and dtm terminates the global transaction without calling any more sub-transactions
  • In progress: the check waits for the final result and then proceeds as in the committed or rolled-back case

Crash before commit

Let's look at the sequence diagram for a local transaction that is rolled back:

If the AP fails and crashes after dtm receives the Prepare call but before the local transaction commits, the database detects that the AP's connection has dropped and automatically rolls back the local transaction.

Later, dtm's polling picks up the global transactions that have timed out with only a Prepare and no Submit, and performs a checkback. The checkback service finds that the local transaction has been rolled back and returns that result to dtm. On receiving it, dtm marks the global transaction as failed and ends it.
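Both crash scenarios reduce to the same recovery pass on the coordinator. The sketch below is a hypothetical model of that pass, not dtm's code: recoverPrepared, globalTx, checkResult, and the status strings are invented for illustration, and timestamps are plain seconds to keep the example small. The "in progress" case does not appear as a result because the checkback itself waits until the local transaction has ended.

```go
package main

import "fmt"

// checkResult is the outcome reported by the checkback service.
type checkResult int

const (
	committed checkResult = iota
	rolledBack
)

// globalTx is a hypothetical record of one msg global transaction.
type globalTx struct {
	gid        string
	status     string // "prepared", "submitted" or "failed"
	preparedAt int64  // seconds
}

// recoverPrepared scans for transactions that were prepared but never
// submitted within the timeout and resolves them through the checkback.
func recoverPrepared(txs []*globalTx, now, timeout int64, check func(gid string) checkResult) {
	for _, tx := range txs {
		if tx.status != "prepared" || now-tx.preparedAt < timeout {
			continue // already resolved, or still within the timeout
		}
		switch check(tx.gid) {
		case committed:
			tx.status = "submitted" // branches will now be called
		case rolledBack:
			tx.status = "failed" // the global transaction ends here
		}
	}
}

func main() {
	txs := []*globalTx{
		{gid: "crashed-after-commit", status: "prepared", preparedAt: 0},
		{gid: "crashed-before-commit", status: "prepared", preparedAt: 0},
	}
	check := func(gid string) checkResult {
		if gid == "crashed-after-commit" {
			return committed
		}
		return rolledBack
	}
	recoverPrepared(txs, 100, 60, check)
	fmt.Println(txs[0].status, txs[1].status)
}
```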

Ease of use

Adopting the new architecture to handle consistency requires only:

  • Define the local business logic and specify the service to call in the next step
  • Define the QueryPrepared handler by copying and pasting the example code

Next, we compare it with other solutions.

Second-Order Messages vs Local Message Tables

The local message table scheme can also solve the problem above (for details, see "the seven most classic solutions for distributed transactions") and ensure the eventual consistency of data. With a local message table, the required work includes:

  • Execute the local business logic in a local transaction, insert the message into the message table, and commit
  • Write a polling task that sends messages from the local message table to the message queue
  • Consume the messages and dispatch them to the corresponding processing service
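For comparison, the three pieces of work in a local message table setup can be sketched in memory as below. Everything here is a hypothetical illustration: the db type stands in for a database holding both the business data and the message table, and a buffered channel stands in for the message queue.

```go
package main

import "fmt"

// db is a hypothetical in-memory database with a balance and a message table.
type db struct {
	balance int
	outbox  []string // the local message table: messages not yet sent
}

// deductWithMessage models step 1: the business change and the message
// insert succeed or fail together, as if in one local transaction.
func (d *db) deductWithMessage(amount int, msg string) bool {
	if d.balance < amount {
		return false // nothing is written, as if the transaction rolled back
	}
	d.balance -= amount
	d.outbox = append(d.outbox, msg)
	return true
}

// pollOnce models step 2: move pending messages to the queue.
// It assumes the queue has room for every pending message.
func pollOnce(d *db, queue chan<- string) {
	for _, m := range d.outbox {
		queue <- m
	}
	d.outbox = nil
}

// consume models step 3: drain the queue and call the processing service.
func consume(queue <-chan string, handle func(string)) {
	for {
		select {
		case m := <-queue:
			handle(m)
		default:
			return
		}
	}
}

func main() {
	d := &db{balance: 100}
	queue := make(chan string, 8)
	d.deductWithMessage(30, "TransIn:30")
	pollOnce(d, queue)
	consume(queue, func(m string) { fmt.Println("handled", m) })
}
```

Each of the three functions is code the team has to own, on top of operating the queue itself.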

Comparing the two, the second-order message has the following advantages:

  • No need to learn or maintain any message queues
  • No need to handle polling tasks
  • No need to consume messages

Second-Order Messages vs Transactional Messages

RocketMQ's transactional message scheme can also solve the problem above (for details, see "the seven most classic solutions for distributed transactions") and ensure the eventual consistency of data. With transactional messages, the required work includes:

  • Open a local transaction, send a half message, commit the transaction, and send the commit message
  • Consume timed-out half messages: for each one received, query the local database and then perform Commit or Rollback
  • Consume the committed messages and dispatch them to the processing service

Comparing the two, the second-order message has the following advantages:

  • No need to learn or maintain any message queues
  • No need to hand-write the delicate coordination between the local transaction and message sending, where bugs creep in easily; second-order messages handle it fully automatically
  • No need to consume messages

The second-order message resembles RocketMQ's transactional message in its two-phase commit, and it is a new architecture inspired by it. It does not reuse the name for two reasons: the architecture has changed a great deal, and in the context of distributed transactions the name "transaction message" easily causes confusion.

More advantages

Compared with the queue-based schemes described earlier, second-order messages have several additional advantages:

  • The entire exposed interface is independent of queues and relates only to the actual business and service calls, which is friendlier to developers
  • There is no need to worry about message backlog or other message queue failures, because second-order messages depend only on dtm. Developers can treat dtm like any other ordinary stateless service in the system, one that depends only on the MySQL/Redis storage behind it
  • A message queue is asynchronous, while second-order messages support both asynchronous and synchronous modes. The default is asynchronous; set msg.WaitResult=true to wait synchronously for the downstream services to complete
  • Second-order messages also support specifying multiple downstream services at the same time

Future outlook for second-order messages

Second-order messages greatly lower the difficulty of eventually consistent messaging and can be widely used. In the future, dtm will consider adding an admin backend that allows downstream services to be specified dynamically, for greater flexibility. If you originally used message queues for service decoupling, this backend would let you specify multiple receiver functions for a message directly, without writing message consumers, giving a simpler and more intuitive development experience.

Analysis of the checkback principle

The checkback service appears in the sequence diagrams and the interface above. With second-order messages, copied-and-pasted code handles it automatically, whereas RocketMQ's transactional messages require handling it by hand. So how does the automatic handling work?

To support checkback, first create a separate table in the business database instance to store global transaction ids. When a business transaction is processed, its gid is written to this table.

When we check back with a gid and find it in the table, the local transaction has been committed, so we can tell dtm that it has been committed.

When we check back with a gid and do not find it in the table, the local transaction has not been committed. Two outcomes are then possible: the transaction is still in progress, or it has been rolled back. I checked many sources on RocketMQ and did not find a working solution for telling the two apart. Every solution I found does nothing when no result is found and waits for the next checkback; if checkbacks have found nothing for 2 minutes or longer, the local transaction is considered rolled back.

This approach has serious problems:

  • Failing to find the gid within two minutes does not prove that the local transaction has been rolled back. In extreme cases a database failure (such as a stuck process or disk) may last longer than 2 minutes, after which the data is finally committed. The data is then not eventually consistent, and manual intervention is required.
  • If the local transaction has already been rolled back, the checkback still keeps polling for two minutes at intervals of about 10s, which puts unnecessary pressure on the server.

The second-order message scheme of dtm solves both problems. Its working process is as follows:

  1. When the local transaction is processed, the gid is inserted into the dtm_barrier.barrier table with the reason committed. The table has a unique index whose main field is gid.
  2. On checkback, the second-order message does not directly check whether the gid exists. Instead it performs an insert-ignore of a row with the same gid and the reason rolledbacked. If a record with that gid already exists in the table, the new insert is ignored; otherwise the row is inserted.
  3. It then queries the table by gid. If the record's reason is committed, the local transaction has been committed; if the reason is rolledbacked, the local transaction has been rolled back.

So, compared with the common RocketMQ checkback solutions, how does the second-order message distinguish between in progress and rolled back? The trick lies in the row inserted during the checkback. If the database transaction is still in progress at that moment, the insert is blocked by it, because the insert has to wait for the lock held by the transaction. If the insert returns normally, the local transaction must have ended: it has either committed or rolled back.
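The three steps and the blocking behaviour can be modelled in memory as follows. This is only a sketch under simplifying assumptions, not dtm's barrier code: a single mutex stands in for the database lock on the unique gid index (a real database locks per key, not per table), and the barrier and localTx types are hypothetical. The rejection of a late local transaction in begin is our reading of what the unique index implies.

```go
package main

import (
	"fmt"
	"sync"
)

// barrier emulates the barrier table: one row per gid, guarded by a lock
// that stands in for the database lock on the unique gid index.
type barrier struct {
	lock sync.Mutex
	rows map[string]string // gid -> reason
}

// localTx is an open local transaction that has inserted its gid.
type localTx struct {
	b   *barrier
	gid string
}

// begin inserts gid with reason "committed" and keeps the lock until the
// transaction ends. It fails if a row for gid already exists.
func (b *barrier) begin(gid string) (*localTx, bool) {
	b.lock.Lock()
	if _, exists := b.rows[gid]; exists {
		b.lock.Unlock()
		return nil, false // the unique index rejects a second row
	}
	b.rows[gid] = "committed"
	return &localTx{b: b, gid: gid}, true
}

// commit makes the inserted row permanent and releases the lock.
func (tx *localTx) commit() { tx.b.lock.Unlock() }

// rollback discards the uncommitted row and releases the lock.
func (tx *localTx) rollback() {
	delete(tx.b.rows, tx.gid)
	tx.b.lock.Unlock()
}

// queryPrepared performs the insert-ignore and then reads the reason.
func (b *barrier) queryPrepared(gid string) string {
	b.lock.Lock() // waits while a local transaction is still in progress
	defer b.lock.Unlock()
	if _, exists := b.rows[gid]; !exists {
		b.rows[gid] = "rolledbacked"
	}
	return b.rows[gid]
}

func main() {
	b := &barrier{rows: map[string]string{}}
	tx, _ := b.begin("gid-1")
	result := make(chan string)
	go func() { result <- b.queryPrepared("gid-1") }() // cannot finish before the commit
	tx.commit()
	fmt.Println(<-result)
	fmt.Println(b.queryPrepared("gid-2")) // no local transaction ever ran
}
```

Because the checkback writes its own row, a gid is settled for good once the checkback has returned: no later timeout or retry can flip the answer.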

Here's a question for you: can step 3 be omitted, so that whether the transaction was rolled back is judged solely by whether the insert in step 2 succeeded? You are welcome to leave a comment to discuss.

Ordinary messages

The second-order message can replace not only the local message table scheme but also the ordinary message scheme. If you call Submit directly, it behaves like an ordinary message, with a more flexible and simpler interface.

Consider this scenario: the UI has a button for joining an event, and joining grants permanent access to two e-books. The server-side handler for the button can do this:

 msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/AuthBook", &Req{UID: 1, BookID: 5}).
    Add(busi.Busi+"/AuthBook", &Req{UID: 1, BookID: 6})
err := msg.Submit()

This approach also provides an asynchronous interface without relying on a message queue. In many microservice scenarios it can replace a legacy asynchronous messaging architecture.


The second-order message proposed in this article has a simple and elegant interface and a simpler architecture than the local message table and RocketMQ transactional messages. It can help you better solve data consistency problems that do not require rollback.

Project address

For more on the theory and practice of distributed transactions, visit the dtm project (dtm-labs/dtm). Stars are welcome.

