Author: Zhu Jinjun

Hello everyone, my name is Jun.

Today, let's talk about how the new version (1.5.1) of Alibaba Seata solves the problems of idempotency, suspension and empty rollback in TCC mode.

TCC Review

TCC mode is one of the most classic distributed transaction solutions. It splits a distributed transaction into two phases. In the try phase, each branch transaction reserves resources. If all branch transactions reserve resources successfully, the global transaction enters the commit phase and is committed. If any node fails to reserve resources, it enters the cancel phase and the global transaction is rolled back.

Take the familiar order, inventory, and account services as an example. In the try phase, each service tries to reserve resources: insert the order, deduct inventory, and deduct the amount. Each of the three services commits a local transaction, and the reserved resources can be moved to an intermediate table. In the commit phase, the resources reserved in the try phase are transferred to the final table. In the cancel phase, the resources reserved in the try phase are released; for example, the deducted amount is returned to the customer's account.

Note: Local transactions must be submitted in the try phase. For example, to deduct the order amount, the money must be deducted from the customer's account. If it is not deducted, the customer's account will not have enough money in the commit phase, and problems will occur.
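The reserve-then-settle idea above can be illustrated with a minimal in-memory sketch. The class TccAccountSketch and its method names are hypothetical and are not part of Seata; a map stands in for the intermediate table, and a real service would do this work inside local database transactions.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of a TCC account branch: try freezes funds, commit spends them, cancel returns them. */
class TccAccountSketch {
    private long available;                                    // money the customer can still spend
    private final Map<String, Long> frozen = new HashMap<>();  // xid -> amount reserved by try

    TccAccountSketch(long initialBalance) {
        this.available = initialBalance;
    }

    /** Try: move the amount out of the available balance into the frozen (intermediate) state. */
    boolean tryDeduct(String xid, long amount) {
        if (amount <= 0 || available < amount || frozen.containsKey(xid)) {
            return false; // the reservation cannot be made
        }
        available -= amount;
        frozen.put(xid, amount);
        return true;
    }

    /** Commit: the reserved amount is finally spent, so it simply leaves the frozen state. */
    boolean commit(String xid) {
        frozen.remove(xid);
        return true;
    }

    /** Cancel: release the reservation and return the money to the available balance. */
    boolean cancel(String xid) {
        Long amount = frozen.remove(xid);
        if (amount != null) {
            available += amount;
        }
        return true;
    }

    long available() {
        return available;
    }

    long frozenTotal() {
        return frozen.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        TccAccountSketch account = new TccAccountSketch(100);
        account.tryDeduct("xid-1", 30);
        account.commit("xid-1");   // 30 is spent for good
        account.tryDeduct("xid-2", 50);
        account.cancel("xid-2");   // 50 goes back to the customer
        System.out.println(account.available()); // prints 70
    }
}
```

Because the money already left the available balance in the try phase, the commit phase cannot fail for lack of funds, which is the point of the note above.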

try-commit

The try phase first reserves resources, and then deducts resources in the commit phase. As shown below:

[Figure: try-commit flow]

try-cancel

In the try phase, resources are reserved first. If a reservation fails (here, the inventory deduction), the global transaction is rolled back and the reserved resources are released in the cancel phase. As shown below:

[Figure: try-cancel flow]

Advantages of TCC

The biggest advantage of TCC mode is efficiency. The try phase does not hold locks in the real sense: it actually commits the local transaction and moves the resources into an intermediate (reserved) state, so nothing blocks while waiting for the second phase. This makes it more efficient than other modes.

At the same time, the TCC mode can also be optimized as follows:

Asynchronous commit

After the try phase is successful, it does not immediately enter the confirm/cancel phase, but considers that the global transaction has ended, and starts a scheduled task to execute the confirm/cancel asynchronously, deducting or releasing resources, which will greatly improve performance.
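A minimal sketch of this idea, assuming a plain in-memory queue and a JDK scheduler: the second-phase work is queued after try succeeds and executed later by a scheduled task. AsyncSecondPhaseSketch, submit and drain are hypothetical names, not Seata APIs, and the sketch leaves out retries and persistence.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: second-phase work is queued after try and executed later by a scheduled task. */
class AsyncSecondPhaseSketch {
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

    /** Called once every try has succeeded: record the confirm/cancel work and return at once. */
    void submit(Runnable confirmOrCancel) {
        pending.add(confirmOrCancel);
    }

    /** Body of the scheduled task: run every queued confirm/cancel and report how many ran. */
    int drain() {
        int executed = 0;
        Runnable task;
        while ((task = pending.poll()) != null) {
            task.run();
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncSecondPhaseSketch phaseTwo = new AsyncSecondPhaseSketch();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // The scheduler plays the role of the scheduled task mentioned in the text.
        scheduler.scheduleWithFixedDelay(phaseTwo::drain, 0, 100, TimeUnit.MILLISECONDS);
        phaseTwo.submit(() -> System.out.println("confirm xid-1"));
        Thread.sleep(300); // give the scheduled task time to pick the work up
        scheduler.shutdownNow();
    }
}
```

The caller of submit returns immediately, so the latency of the confirm/cancel call no longer sits on the request path.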

Same-database mode

There are three roles in TCC mode:

  • TM: Manage global transactions, including opening global transactions and committing/rolling back global transactions;
  • RM: Manage branch transactions;
  • TC: Manages the state of global transactions and branch transactions.

The following picture is from Seata's official website:

[Figure: interaction between TM, RM and TC, from Seata's official website]

When the TM opens a global transaction, the RM needs to send a registration message to the TC, and the TC saves the state of the branch transaction. When the TM requests a commit or rollback, the TC needs to send a commit or rollback message to the RM. In this distributed transaction consisting of two branch transactions, there are four RPCs between the TC and the RM.

The optimized process is as follows:

[Figure: optimized flow]

The TC holds the state of the global transaction. When the TM opens a global transaction, the RM no longer needs to send a registration message to the TC; it saves the branch transaction state locally instead. After the TM sends a commit or rollback message to the TC, an asynchronous thread in the RM first finds the uncommitted branch transactions saved locally, and then asks the TC for the status of the global transaction that each local branch belongs to, in order to decide whether to commit or roll back the local transaction.

After this optimization, the number of RPCs is reduced by 50%, and the performance is greatly improved.
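The optimized flow can be sketched as follows. This is an illustration under stated assumptions, not Seata code: SameDatabaseModeSketch is a hypothetical class, a map stands in for the RM's local branch-state table, and a Function stands in for the status query sent to the TC.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of the optimization: branch state lives in the RM's own store, not in the TC. */
class SameDatabaseModeSketch {
    enum GlobalStatus { RUNNING, COMMITTED, ROLLED_BACK }
    enum BranchState { TRIED, COMMITTED, ROLLED_BACK }

    // Stands in for a branch-state table in the RM's local database: xid -> state.
    private final Map<String, BranchState> localBranches = new LinkedHashMap<>();
    // Stands in for the RPC that asks the TC for the global transaction status.
    private final Function<String, GlobalStatus> askTc;

    SameDatabaseModeSketch(Function<String, GlobalStatus> askTc) {
        this.askTc = askTc;
    }

    /** Try phase: save the branch locally instead of registering it with the TC. */
    void recordTry(String xid) {
        localBranches.put(xid, BranchState.TRIED);
    }

    /** Body of the RM's asynchronous thread: resolve every branch that is still undecided. */
    void resolvePending() {
        for (Map.Entry<String, BranchState> branch : localBranches.entrySet()) {
            if (branch.getValue() != BranchState.TRIED) {
                continue; // second phase already done for this branch
            }
            GlobalStatus global = askTc.apply(branch.getKey());
            if (global == GlobalStatus.COMMITTED) {
                branch.setValue(BranchState.COMMITTED);   // the commit method would run here
            } else if (global == GlobalStatus.ROLLED_BACK) {
                branch.setValue(BranchState.ROLLED_BACK); // the cancel method would run here
            }
            // RUNNING: leave the branch as TRIED and check again on the next pass
        }
    }

    BranchState stateOf(String xid) {
        return localBranches.get(xid);
    }

    public static void main(String[] args) {
        SameDatabaseModeSketch rm = new SameDatabaseModeSketch(xid -> GlobalStatus.COMMITTED);
        rm.recordTry("xid-1");
        rm.resolvePending();
        System.out.println(rm.stateOf("xid-1")); // prints COMMITTED
    }
}
```

In this sketch each branch costs one status query instead of one registration plus one commit or rollback message, which corresponds to the halving of RPCs described above.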

RM code example

Taking the inventory service as an example, the RM inventory service interface code is as follows:

 @LocalTCC
public interface StorageService {

    /**
     * Deduct inventory (try phase).
     * @param xid global transaction xid
     * @param productId product id
     * @param count quantity
     * @return
     */
    @TwoPhaseBusinessAction(name = "storageApi", commitMethod = "commit", rollbackMethod = "rollback", useTCCFence = true)
    boolean decrease(String xid, Long productId, Integer count);

    /**
     * Commit the transaction.
     * @param actionContext
     * @return
     */
    boolean commit(BusinessActionContext actionContext);

    /**
     * Roll back the transaction.
     * @param actionContext
     * @return
     */
    boolean rollback(BusinessActionContext actionContext);
}

Through the @LocalTCC annotation, RM will register a branch transaction with TC when it is initialized. There is a @TwoPhaseBusinessAction annotation on the try phase method (decrease method), which defines the resourceId, commit method and cancel method of the branch transaction. The useTCCFence attribute will be discussed in the next section.

Problems with TCC

The three major problems in TCC mode are idempotency, suspension, and empty rollback. Seata 1.5.1 adds a transaction control table named tcc_fence_log to solve them. The useTCCFence attribute of the @TwoPhaseBusinessAction annotation mentioned in the previous section specifies whether this mechanism is enabled. Its default value is false.

The tcc_fence_log table creation statement is as follows (MySQL syntax):

 CREATE TABLE IF NOT EXISTS `tcc_fence_log`
(
    `xid`           VARCHAR(128)  NOT NULL COMMENT 'global id',
    `branch_id`     BIGINT        NOT NULL COMMENT 'branch id',
    `action_name`   VARCHAR(64)   NOT NULL COMMENT 'action name',
    `status`        TINYINT       NOT NULL COMMENT 'status(tried:1;committed:2;rollbacked:3;suspended:4)',
    `gmt_create`    DATETIME(3)   NOT NULL COMMENT 'create time',
    `gmt_modified`  DATETIME(3)   NOT NULL COMMENT 'update time',
    PRIMARY KEY (`xid`, `branch_id`),
    KEY `idx_gmt_modified` (`gmt_modified`),
    KEY `idx_status` (`status`)
) ENGINE = InnoDB
DEFAULT CHARSET = utf8mb4;

Idempotency

In the commit/cancel phase, if the TC does not receive a response from the branch transaction, it retries the call. The branch transaction therefore has to be idempotent.

Let's see how the new version solves it. The following code is in the TCCResourceManager class:

 @Override
public BranchStatus branchCommit(BranchType branchType, String xid, long branchId, String resourceId,
         String applicationData) throws TransactionException {
 TCCResource tccResource = (TCCResource)tccResourceCache.get(resourceId);
 // checks omitted
 Object targetTCCBean = tccResource.getTargetBean();
 Method commitMethod = tccResource.getCommitMethod();
 // checks omitted
 try {
  //BusinessActionContext
  BusinessActionContext businessActionContext = getBusinessActionContext(xid, branchId, resourceId,
   applicationData);
  Object[] args = this.getTwoPhaseCommitArgs(tccResource, businessActionContext);
  Object ret;
  boolean result;
  // is the useTCCFence attribute of the annotation set to true?
  if (Boolean.TRUE.equals(businessActionContext.getActionContext(Constants.USE_TCC_FENCE))) {
   try {
    result = TCCFenceHandler.commitFence(commitMethod, targetTCCBean, xid, branchId, args);
   } catch (SkipCallbackWrapperException | UndeclaredThrowableException e) {
    throw e.getCause();
   }
  } else {
   // logic omitted
  }
  LOGGER.info("TCC resource commit result : {}, xid: {}, branchId: {}, resourceId: {}", result, xid, branchId, resourceId);
  return result ? BranchStatus.PhaseTwo_Committed : BranchStatus.PhaseTwo_CommitFailed_Retryable;
 } catch (Throwable t) {
  // omitted
  return BranchStatus.PhaseTwo_CommitFailed_Retryable;
 }
}

As can be seen from the above code, when executing the branch transaction commit method, first determine whether the useTCCFence attribute is true, if it is true, then go to the commitFence logic in the TCCFenceHandler class, otherwise go to the normal commit logic.

The branchCommit method above calls the commitFence method of the TCCFenceHandler class. The code is as follows:

 public static boolean commitFence(Method commitMethod, Object targetTCCBean,
          String xid, Long branchId, Object[] args) {
 return transactionTemplate.execute(status -> {
  try {
   Connection conn = DataSourceUtils.getConnection(dataSource);
   TCCFenceDO tccFenceDO = TCC_FENCE_DAO.queryTCCFenceDO(conn, xid, branchId);
   if (tccFenceDO == null) {
    throw new TCCFenceException(String.format("TCC fence record not exists, commit fence method failed. xid= %s, branchId= %s", xid, branchId),
      FrameworkErrorCode.RecordAlreadyExists);
   }
   if (TCCFenceConstant.STATUS_COMMITTED == tccFenceDO.getStatus()) {
    LOGGER.info("Branch transaction has already committed before. idempotency rejected. xid: {}, branchId: {}, status: {}", xid, branchId, tccFenceDO.getStatus());
    return true;
   }
   if (TCCFenceConstant.STATUS_ROLLBACKED == tccFenceDO.getStatus() || TCCFenceConstant.STATUS_SUSPENDED == tccFenceDO.getStatus()) {
    if (LOGGER.isWarnEnabled()) {
     LOGGER.warn("Branch transaction status is unexpected. xid: {}, branchId: {}, status: {}", xid, branchId, tccFenceDO.getStatus());
    }
    return false;
   }
   return updateStatusAndInvokeTargetMethod(conn, commitMethod, targetTCCBean, xid, branchId, TCCFenceConstant.STATUS_COMMITTED, status, args);
  } catch (Throwable t) {
   status.setRollbackOnly();
   throw new SkipCallbackWrapperException(t);
  }
 });
}

As can be seen from the code, commitFence first queries the tcc_fence_log table for the record of this branch transaction. If the status is already STATUS_COMMITTED, it returns true without calling the commit method again, which ensures idempotency. If the status is STATUS_ROLLBACKED or STATUS_SUSPENDED, it logs a warning and returns false. If no record exists, it throws a TCCFenceException, because the record should have been inserted in the try phase. Otherwise it updates the status to STATUS_COMMITTED and invokes the commit method.

The Rollback logic is similar to commit. It is in the rollbackFence method of the TCCFenceHandler class.
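The status check can be reduced to a small in-memory sketch. CommitFenceSketch is a hypothetical class in which a HashMap stands in for the tcc_fence_log table and a Runnable stands in for the business commit method. It only mirrors the branching of commitFence and leaves out the database transaction.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of the idempotency check performed before a commit is invoked. */
class CommitFenceSketch {
    enum Status { TRIED, COMMITTED, ROLLBACKED, SUSPENDED }

    // Stands in for tcc_fence_log, keyed by (xid, branch_id).
    private final Map<String, Status> fenceLog = new HashMap<>();

    private static String key(String xid, long branchId) {
        return xid + ":" + branchId;
    }

    /** Try phase: the fence record is created with status TRIED. */
    void recordTry(String xid, long branchId) {
        fenceLog.put(key(xid, branchId), Status.TRIED);
    }

    /** Runs commitMethod at most once per branch, however often the TC retries. */
    boolean commitFence(String xid, long branchId, Runnable commitMethod) {
        Status current = fenceLog.get(key(xid, branchId));
        if (current == null) {
            throw new IllegalStateException("no fence record for " + key(xid, branchId));
        }
        if (current == Status.COMMITTED) {
            return true;  // already committed: report success, do not commit again
        }
        if (current == Status.ROLLBACKED || current == Status.SUSPENDED) {
            return false; // unexpected state for a commit
        }
        fenceLog.put(key(xid, branchId), Status.COMMITTED);
        commitMethod.run();
        return true;
    }

    public static void main(String[] args) {
        CommitFenceSketch fence = new CommitFenceSketch();
        int[] calls = {0};
        fence.recordTry("xid-1", 1L);
        fence.commitFence("xid-1", 1L, () -> calls[0]++);
        fence.commitFence("xid-1", 1L, () -> calls[0]++); // a retry from the TC
        System.out.println(calls[0]); // prints 1
    }
}
```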

Empty rollback

As shown in the figure below, the account service is a cluster of two nodes. In the try phase, node 1 of the account service fails. Leaving retries aside, the global transaction must still reach a final state, so a cancel operation has to be executed on the account service even though its try never ran. The rollback therefore runs in vain; this is an empty rollback.

[Figure: empty rollback scenario]

Seata's solution is to insert a record into the tcc_fence_log table in the try phase, with the status field set to STATUS_TRIED. The Rollback phase checks whether the record exists. If it does not exist, the rollback operation is not performed. The code is as follows:

 // TCCFenceHandler class
public static Object prepareFence(String xid, Long branchId, String actionName, Callback<Object> targetCallback) {
 return transactionTemplate.execute(status -> {
  try {
   Connection conn = DataSourceUtils.getConnection(dataSource);
   boolean result = insertTCCFenceLog(conn, xid, branchId, actionName, TCCFenceConstant.STATUS_TRIED);
   LOGGER.info("TCC fence prepare result: {}. xid: {}, branchId: {}", result, xid, branchId);
   if (result) {
    return targetCallback.execute();
   } else {
    throw new TCCFenceException(String.format("Insert tcc fence record error, prepare fence failed. xid= %s, branchId= %s", xid, branchId),
      FrameworkErrorCode.InsertRecordError);
   }
  } catch (TCCFenceException e) {
   // omitted
  } catch (Throwable t) {
   // omitted
  }
 });
}

The processing logic in the Rollback phase is as follows:

 // TCCFenceHandler class
public static boolean rollbackFence(Method rollbackMethod, Object targetTCCBean,
         String xid, Long branchId, Object[] args, String actionName) {
 return transactionTemplate.execute(status -> {
  try {
   Connection conn = DataSourceUtils.getConnection(dataSource);
   TCCFenceDO tccFenceDO = TCC_FENCE_DAO.queryTCCFenceDO(conn, xid, branchId);
   // non_rollback
   if (tccFenceDO == null) {
    // do not execute the rollback logic
    return true;
   } else {
    if (TCCFenceConstant.STATUS_ROLLBACKED == tccFenceDO.getStatus() || TCCFenceConstant.STATUS_SUSPENDED == tccFenceDO.getStatus()) {
     LOGGER.info("Branch transaction had already rollbacked before, idempotency rejected. xid: {}, branchId: {}, status: {}", xid, branchId, tccFenceDO.getStatus());
     return true;
    }
    if (TCCFenceConstant.STATUS_COMMITTED == tccFenceDO.getStatus()) {
     if (LOGGER.isWarnEnabled()) {
      LOGGER.warn("Branch transaction status is unexpected. xid: {}, branchId: {}, status: {}", xid, branchId, tccFenceDO.getStatus());
     }
     return false;
    }
   }
   return updateStatusAndInvokeTargetMethod(conn, rollbackMethod, targetTCCBean, xid, branchId, TCCFenceConstant.STATUS_ROLLBACKED, status, args);
  } catch (Throwable t) {
   status.setRollbackOnly();
   throw new SkipCallbackWrapperException(t);
  }
 });
}

The SQL executed by the updateStatusAndInvokeTargetMethod method is as follows:

 update tcc_fence_log set status = ?, gmt_modified = ?
    where xid = ? and  branch_id = ? and status = ? ;

It can be seen that the value of the status field recorded in the tcc_fence_log table is changed from STATUS_TRIED to STATUS_ROLLBACKED. If the update is successful, the rollback logic is executed.
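The effect of the status condition in the WHERE clause can be imitated in memory with ConcurrentMap.replace(key, oldValue, newValue), which only writes when the current value matches the expected one. GuardedRollbackSketch is a hypothetical class; the numeric status values are taken from the comment on the status column in the table definition. Unlike rollbackFence, which returns true for an empty rollback, this sketch returns whether the business rollback actually ran, purely to make the outcome visible.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Sketch: the rollback runs only if the fence record can be moved from TRIED to ROLLBACKED. */
class GuardedRollbackSketch {
    static final int TRIED = 1;
    static final int ROLLBACKED = 3;

    // Stands in for tcc_fence_log: "xid:branchId" -> status.
    private final ConcurrentMap<String, Integer> fenceLog = new ConcurrentHashMap<>();

    /** Try phase: insert the fence record with status TRIED. */
    void recordTry(String key) {
        fenceLog.put(key, TRIED);
    }

    /** Returns true only when the business rollback was actually executed. */
    boolean rollback(String key, Runnable businessRollback) {
        if (!fenceLog.containsKey(key)) {
            return false; // empty rollback: try never ran, so there is nothing to undo
        }
        // Same idea as: UPDATE ... SET status = ROLLBACKED WHERE ... AND status = TRIED
        boolean updated = fenceLog.replace(key, TRIED, ROLLBACKED);
        if (updated) {
            businessRollback.run();
        }
        return updated;
    }

    public static void main(String[] args) {
        GuardedRollbackSketch fence = new GuardedRollbackSketch();
        System.out.println(fence.rollback("xid-1:1", () -> { })); // prints false (empty rollback)
        fence.recordTry("xid-2:1");
        System.out.println(fence.rollback("xid-2:1", () -> { })); // prints true
        System.out.println(fence.rollback("xid-2:1", () -> { })); // prints false (already rolled back)
    }
}
```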

Suspension

Suspension means that, due to network problems, the RM does not receive the try request at first. The Rollback is executed, and only afterwards does the RM receive the try request and reserve resources successfully. By then the global transaction has ended, so the reserved resources can never be released. As shown below:

[Figure: suspension scenario]

Seata's solution is to check, when the Rollback method executes, whether tcc_fence_log contains a record for the current xid. If not, it inserts a record with status STATUS_SUSPENDED and does not perform the rollback operation. The code is as follows:

 public static boolean rollbackFence(Method rollbackMethod, Object targetTCCBean,
         String xid, Long branchId, Object[] args, String actionName) {
 return transactionTemplate.execute(status -> {
  try {
   Connection conn = DataSourceUtils.getConnection(dataSource);
   TCCFenceDO tccFenceDO = TCC_FENCE_DAO.queryTCCFenceDO(conn, xid, branchId);
   // non_rollback
   if (tccFenceDO == null) {
    // insert the anti-suspension record
    boolean result = insertTCCFenceLog(conn, xid, branchId, actionName, TCCFenceConstant.STATUS_SUSPENDED);
    // logic omitted
    return true;
   } else {
    // logic omitted
   }
   return updateStatusAndInvokeTargetMethod(conn, rollbackMethod, targetTCCBean, xid, branchId, TCCFenceConstant.STATUS_ROLLBACKED, status, args);
  } catch (Throwable t) {
   // logic omitted
  }
 });
}

When the try phase method is executed later, it tries to insert a record for the same xid into the tcc_fence_log table, which causes a primary key conflict. The code is as follows:

 // TCCFenceHandler class
public static Object prepareFence(String xid, Long branchId, String actionName, Callback<Object> targetCallback) {
 return transactionTemplate.execute(status -> {
  try {
   Connection conn = DataSourceUtils.getConnection(dataSource);
   boolean result = insertTCCFenceLog(conn, xid, branchId, actionName, TCCFenceConstant.STATUS_TRIED);
   // logic omitted
  } catch (TCCFenceException e) {
   if (e.getErrcode() == FrameworkErrorCode.DuplicateKeyException) {
    LOGGER.error("Branch transaction has already rollbacked before,prepare fence failed. xid= {},branchId = {}", xid, branchId);
    addToLogCleanQueue(xid, branchId);
   }
   status.setRollbackOnly();
   throw new SkipCallbackWrapperException(e);
  } catch (Throwable t) {
   // omitted
  }
 });
}
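The interplay of the two methods can be tried out with a small in-memory sketch, assuming Map.putIfAbsent as a stand-in for an INSERT that fails on a duplicate primary key. AntiSuspensionSketch and its simplified method signatures are hypothetical and only imitate the ordering logic of prepareFence and rollbackFence.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a rollback that arrives before try leaves a marker that blocks the late try. */
class AntiSuspensionSketch {
    enum Status { TRIED, ROLLBACKED, SUSPENDED }

    // Stands in for tcc_fence_log; putIfAbsent imitates the primary key constraint.
    private final Map<String, Status> fenceLog = new HashMap<>();

    /** Try phase: returns false, without reserving anything, if a record already exists. */
    boolean prepareFence(String key, Runnable tryMethod) {
        if (fenceLog.putIfAbsent(key, Status.TRIED) != null) {
            return false; // duplicate key: the branch was already rolled back or suspended
        }
        tryMethod.run();
        return true;
    }

    /** Rollback phase: with no record, insert the SUSPENDED marker and skip the rollback. */
    boolean rollbackFence(String key, Runnable rollbackMethod) {
        Status current = fenceLog.putIfAbsent(key, Status.SUSPENDED);
        if (current == null) {
            return true; // try has not run: nothing to undo, marker is now in place
        }
        if (current == Status.TRIED) {
            fenceLog.put(key, Status.ROLLBACKED);
            rollbackMethod.run();
        }
        return true; // ROLLBACKED or SUSPENDED: already handled
    }

    public static void main(String[] args) {
        AntiSuspensionSketch fence = new AntiSuspensionSketch();
        fence.rollbackFence("xid-1:1", () -> System.out.println("rollback"));   // prints nothing
        boolean reserved = fence.prepareFence("xid-1:1", () -> System.out.println("try"));
        System.out.println(reserved); // prints false: the late try is rejected
    }
}
```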

Note: The SQL in the queryTCCFenceDO method uses SELECT ... FOR UPDATE, so there is no need to worry that the Rollback method fails to read the tcc_fence_log record and therefore cannot determine the result of the local transaction in the try phase.

Summary

TCC is one of the most widely used modes for distributed transactions, but idempotency, suspension, and empty rollback have always been issues that TCC mode has to handle. Seata 1.5.1 solves these problems at the framework level.

Operations on the tcc_fence_log table must also be transactional. Seata uses a proxy data source so that the operation on the tcc_fence_log table and the RM business operation run in the same local transaction, which guarantees that the two succeed or fail together.

