一、问题现象

在客户的实例上,从库经常出现如下错误:

2021-04-22T16:51:50.263800+08:00 4607 [ERROR] Slave SQL for channel '': Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'asharms_log'. Query: 'COMMIT', Error_code: 1399
2021-04-22T16:51:50.263818+08:00 4607 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399

上述错误的意思是在xa事务里面不能有普通的事务语句,看错误中有一条commit语句,这就非常奇怪了,于是分析下binlog:

# at 124591860
#210422 16:51:50 server id 1052825841  end_log_pos 124591925 CRC32 0xbede87b3   GTID    last_committed=351856   sequence_number=351857  rbr_only=no
SET @@SESSION.GTID_NEXT= '71ae57d2-579b-11ea-bdd6-fa163e79967b:16996347'/*!*/;
# at 124591925
#210422 16:51:50 server id 1052825841  end_log_pos 124592142 CRC32 0x667b3dcc   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
XA START X'31302e3230342e32392e36332e746d313631393437303532353830363030313936',X'31302e3230342e32392e36332e746d323030',1096044365
/*!*/;
# at 124592142
#210422 16:51:50 server id 1052825841  end_log_pos 124592300 CRC32 0x44794e3c   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_register_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592300
#210422 16:51:50 server id 1052825841  end_log_pos 124592455 CRC32 0xcfce4445   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_login_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592455
#210422 16:51:50 server id 1052825841  end_log_pos 124592608 CRC32 0x06344d11   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_pay_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592608
#210422 16:51:50 server id 1052825841  end_log_pos 124592757 CRC32 0xa64923b3   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_currency_stats where dateNumber = '2021-04-26'
/*!*/;
# at 124592757
#210422 16:51:50 server id 1052825841  end_log_pos 124592837 CRC32 0xde7c6010   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
COMMIT
/*!*/;

发现这个XA事务最后是以COMMIT结束的,后面没有对应的xa end、xa prepare、xa commit语句,这就非常奇怪了。

这里笔者查询了大量资料,发现有一种情况可以造成上述现象。具体看下一届复现步骤

二、复现步骤

在MySQL XA事务中提供了一个语句,用于将两阶段事务一起提交:

xa start '111'
xa end '111'
xa commit '111' ONE PHASE

然后怎么复现呢,这里其实复现很简单,只需要用上述流程,然后执行具体的sql的时候删除一条不存在的记录即可,如下所示:

注意这里需要设置binlog的格式为statement,原因后面章节会详细介绍

mysql> select * from test1;
+------+-------+
| id   | name  |
+------+-------+
|    1 | zbdba |
|    3 | zbdba |
|    5 | zbdba |
|    6 | zbdba |
|   15 | zbdba |
|   18 | zbdba |
|    8 | zbdba |
+------+-------+
7 rows in set (0.00 sec)


mysql> xa start '111';
Query OK, 0 rows affected (0.00 sec)


mysql> delete from test1 where id = 2;
Query OK, 0 rows affected (0.00 sec)


mysql> xa end '111';
Query OK, 0 rows affected (0.00 sec)


mysql> xa commit '111' ONE PHASE;
Query OK, 0 rows affected (0.00 sec)

binlog中内容如下所示:

# at 190
#210426 17:42:10 server id 789  end_log_pos 251         GTID    last_committed=0        sequence_number=1       rbr_only=no
SET @@SESSION.GTID_NEXT= '5096b7b7-557e-11eb-9b26-00505631ad78:74'/*!*/;
# at 251
#210426 17:41:55 server id 789  end_log_pos 351         Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1619430115/*!*/;
SET @@session.pseudo_thread_id=3/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=83,@@session.collation_connection=83,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
XA START X'313131',X'',1
/*!*/;
# at 351
#210426 17:41:55 server id 789  end_log_pos 457         Query   thread_id=3     exec_time=0     error_code=0
use `zbdba15`/*!*/;
SET TIMESTAMP=1619430115/*!*/;
delete from test1 where id = 2
/*!*/;
# at 457
#210426 17:42:10 server id 789  end_log_pos 529         Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1619430130/*!*/;
COMMIT
/*!*/;

到这里我们已经成功复现。

查看客户实例的binlog中,也有多条ONE PHASE语句,猜测可能是上述流程导致的,日志如下:

XA COMMIT X'31302e3230342e32392e36332e746d313631383830373732333632323030313337',X'31302e3230342e32392e36332e746d313339',1096044365 ONE PHASE

这里不太理解客户这么使用的原因,既然使用两阶段事务,为什么最后还要一起提交呢。

三、问题原因

上面这个奇怪的现象无疑是MySQL的BUG了,在MySQL 5.7.21 版本已经修复,下面我们看看这个BUG是怎么产生的。

在代码中有如下判断:

    if (!trn_ctx->no_2pc(trx_scope) && (trn_ctx->rw_ha_count(trx_scope) > 1))
      error= tc_log->prepare(thd, all);
  }
  /*
    The state of XA transaction is changed to Prepared, intermediately.
    It's going to change to the regular NOTR at the end.
    The fact of the Prepared state is of interest to binary logger.
  */
  if (!error && all && xid_state->has_state(XID_STATE::XA_IDLE))
  {
    DBUG_ASSERT(thd->lex->sql_command == SQLCOM_XA_COMMIT &&
                static_cast<Sql_cmd_xa_commit*>(thd->lex->m_sql_cmd)->
                get_xa_opt() == XA_ONE_PHASE);


    xid_state->set_state(XID_STATE::XA_PREPARED);
  }
  if (error || (error= tc_log->commit(thd, all)))
  {
    ha_rollback_trans(thd, all);
    error= 1;
    goto end;
  }

意思是两个引擎都有改变才写入prepare语句,这里的两个引擎指的是binlog和InnoDB存储引擎,如果我们删除不存在的记录,这里的值会是1,因为InnoDB存储引擎没有改动。那么就不会写入prepare语句,在prepare方法中主要写入xa end和xa prepare语句到binlog文件中。

不过在commit阶段(对应上述tc_log->commit(thd, all)方法)也有有条件判断是否写入xa prepare语句

  /*
    We commit the transaction if:
     - We are not in a transaction and committing a statement, or
     - We are in a transaction and a full transaction is committed.
    Otherwise, we accumulate the changes.
  */
  // 如果binlog cache中为空,并且是事务提交
  if (!cache_mngr->trx_cache.is_binlog_empty() &&
      ending_trans(thd, all))
  {
    const bool real_trans=
      (all || !trn_ctx->is_active(Transaction_ctx::SESSION));


    /*
      We are committing an XA transaction if it is a "real" transaction
      and has an XID assigned (because some handlerton registered). A
      transaction is "real" if either 'all' is true or the 'all.ha_list'
      is empty.


      Note: This is kind of strange since registering the binlog
      handlerton will then make the transaction XA, which is not really
      true. This occurs for example if a MyISAM statement is executed
      with row-based replication on.
    */
    if (is_loggable_xa_prepare(thd))
    {
      /* The prepare phase of XA transaction two phase logging. */
      int err= 0;
      bool one_phase= get_xa_opt(thd) == XA_ONE_PHASE;


      DBUG_ASSERT(thd->lex->sql_command != SQLCOM_XA_COMMIT || one_phase);


      XID_STATE *xs= thd->get_transaction()->xid_state();
      XA_prepare_log_event end_evt(thd, xs->get_xid(), one_phase);


      DBUG_ASSERT(skip_commit);

      // 在这里会写入xa end 语句
      err= cache_mngr->trx_cache.finalize(thd, &end_evt, xs);
      if (err ||
          (DBUG_EVALUATE_IF("simulate_xa_prepare_failure_in_cache_finalize",
                            true, false)))
      {
        DBUG_RETURN(RESULT_ABORTED);
      }
    }
    // 这个条件跟之前判断基本一致,就是看两个引擎是否都有修改,如果不是都修改则不满足条件
    else if (real_trans && xid && trn_ctx->rw_ha_count(trx_scope) > 1 &&
             !trn_ctx->no_2pc(trx_scope))
    {
      Xid_log_event end_evt(thd, xid);
      if (cache_mngr->trx_cache.finalize(thd, &end_evt))
        DBUG_RETURN(RESULT_ABORTED);
    }
    else
    {
      // 最终会进入到这个分支,直接写入COMMIT到binlog文件中,也就是为什么xa start后,后面写入了commit的原因。
      Query_log_event end_evt(thd, STRING_WITH_LEN("COMMIT"),
                              true, FALSE, TRUE, 0, TRUE);
      if (cache_mngr->trx_cache.finalize(thd, &end_evt))
        DBUG_RETURN(RESULT_ABORTED);
    }
    stuff_logged= true;
  }

通过上面代码分析,发现不能进入到写入xa end 的分支里面,最终进入到了写入commit分支。

在MySQL 5.7.21 修复了这个问题,它其实在上述逻辑的上面加入了如下代码:

if (thd->lex->sql_command == SQLCOM_XA_COMMIT)
  {
    /* The Commit phase of the XA two phase logging. */


    bool one_phase= get_xa_opt(thd) == XA_ONE_PHASE;
    DBUG_ASSERT(all);
    DBUG_ASSERT(!skip_commit || one_phase);


    int err= 0;
    XID_STATE *xs= thd->get_transaction()->xid_state();
    /*
      XA COMMIT ONE PHASE statement which has not gone through the binary log
      prepare phase, has to end the active XA transaction with appropriate XA
      END followed by XA COMMIT ONE PHASE.


      The state of XA transaction is changed to PREPARED after the prepare
      phase, intermediately in ha_commit_trans code for the interest of
      binlogger. Hence check that the XA COMMIT ONE PHASE is set to 'PREPARE'
      and it has not already been written to binary log. For such transaction
      write the appropriate XA END statement.
    */
    if (!(is_loggable_xa_prepare(thd))
        && one_phase
        && !(xs->is_binlogged())
        && !cache_mngr->trx_cache.is_binlog_empty())
    {
      XA_prepare_log_event end_evt(thd, xs->get_xid(), one_phase);
      err= cache_mngr->trx_cache.finalize(thd, &end_evt, xs);
      if (err)
      {
        DBUG_RETURN(RESULT_ABORTED);
      }
      trx_stuff_logged= true;
      thd->get_transaction()->xid_state()->set_binlogged();
    }
    if (DBUG_EVALUATE_IF("simulate_xa_commit_log_failure", true,
                         do_binlog_xa_commit_rollback(thd, xs->get_xid(),
                                                      true)))
      DBUG_RETURN(RESULT_ABORTED);
  }

上述代码只要是xa commit 就会写入xa end语句。而不是之前的逻辑需要两个引擎都有写入才将xa end 写入到binlog文件中。

那在5.7.21 如果在xa事务中执行一条删除不存在的记录,并且通过ONE PHASE提交,binlog如下所示:

XA START X'313131',X'',1
/*!*/;
# at 351
#210426 21:32:25 server id 789  end_log_pos 457         Query   thread_id=2     exec_time=0     error_code=0
use `zbdba15`/*!*/;
SET TIMESTAMP=1619443945/*!*/;
delete from test1 where id = 2
/*!*/;
# at 457
#210426 21:32:28 server id 789  end_log_pos 545         Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1619443948/*!*/;
XA END X'313131',X'',1
/*!*/;
# at 545
#210426 21:32:28 server id 789  end_log_pos 580         XA PREPARE X'313131',X'',1
XA COMMIT X'313131',X'',1 ONE PHASE
/*!*/;

这样就不会存在COMMIT语句了,就不会导致从库报错。

四、解决方案

上述我们已经知道了原因,下面有两种解决方案:

1、升级版本到5.7.21版本后
2、修改binlog格式为row格式
这里为什么修改为row格式可以,在MYSQL_BIN_LOG::commit方法中有一个判断逻辑:

  /*
    No cache manager means nothing to log, but we still have to commit
    the transaction.
   */
  if (cache_mngr == NULL)
  {
    if (!skip_commit && ha_commit_low(thd, all))
      DBUG_RETURN(RESULT_ABORTED);
    DBUG_RETURN(RESULT_SUCCESS);
  }

上述意思是,如果binlog没有内容那么binlog cache manager就是空的,我们删除了一条不存在的数据,对于statement格式来说还是有一条删除的sql语句,但是对于row格式来说是没有binlog数据,所以这里为空。然后直接进行InnoDB层的提交返回即可,就不会进入到后面写binlog的逻辑,所以这个事务实际上不会写入到binlog中。

这是一个比较trick的方法,不过对主从应该都没有影响,可以临时使用。长期来看还是建议升级版本。

五、相关链接

https://bugs.mysql.com/bug.php?id=83295

六、问题

  • 为什么XA start 后有COMMIT,这是怎么造成的?
    目前来看可以由 xa commit one phase 语句触发
  • COMMIT 是由xid event 触发

相似的bug:

https://bugs.mysql.com/bug.php?id=83295

https://bugs.mysql.com/bug.php?id=91633

XA COMMIT X'7a62646261',X'',1 ONE PHASE

2021-04-22T16:51:50.263800+08:00 4607 [ERROR] Slave SQL for channel '': Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'asharms_log'. Query: 'COMMIT', Error_code: 1399
2021-04-22T16:51:50.263818+08:00 4607 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399




mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.116.37.36
                  Master_User: rdsrepladmin
                  Master_Port: 3306
                Connect_Retry: 3
              Master_Log_File: binlog.000051
          Read_Master_Log_Pos: 124948048
               Relay_Log_File: relay.000002
                Relay_Log_Pos: 11227019
        Relay_Master_Log_File: binlog.000051
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1399
                   Last_Error: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'asharms_log'. Query: 'COMMIT'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 124591860
              Relay_Log_Space: 11583404
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1399
               Last_SQL_Error: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'asharms_log'. Query: 'COMMIT'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1052825841
                  Master_UUID: 71ae57d2-579b-11ea-bdd6-fa163e79967b
             Master_Info_File: /rds/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 210422 16:51:50
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 71ae57d2-579b-11ea-bdd6-fa163e79967b:16964554-16997296
            Executed_Gtid_Set: 71ae57d2-579b-11ea-bdd6-fa163e79967b:16964554-16996346,
9694fe28-a2bd-11eb-b44d-fa163e217427:1
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)


# at 124591466
#210422 16:51:50 server id 1052825841  end_log_pos 124591531 CRC32 0x626295f4   GTID    last_committed=351855   sequence_number=351856  rbr_only=no
SET @@SESSION.GTID_NEXT= '71ae57d2-579b-11ea-bdd6-fa163e79967b:16996346'/*!*/;
# at 124591531
#210422 16:51:50 server id 1052825841  end_log_pos 124591632 CRC32 0xdde08dbd   Query   thread_id=83808472      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
BEGIN
/*!*/;
# at 124591632
# at 124591664
#210422 16:51:50 server id 1052825841  end_log_pos 124591664 CRC32 0xe213b1b8   Intvar
SET INSERT_ID=170/*!*/;
#210422 16:51:50 server id 1052825841  end_log_pos 124591829 CRC32 0x590708d5   Query   thread_id=83808472      exec_time=0     error_code=0
use `asharms_log`/*!*/;
SET TIMESTAMP=1619081510/*!*/;
insert into tb_log_batch_state(dateNumber,state) VALUES (CURDATE(),0)
/*!*/;
# at 124591829
#210422 16:51:50 server id 1052825841  end_log_pos 124591860 CRC32 0x475b7cc0   Xid = 954916214
COMMIT/*!*/;
# at 124591860
#210422 16:51:50 server id 1052825841  end_log_pos 124591925 CRC32 0xbede87b3   GTID    last_committed=351856   sequence_number=351857  rbr_only=no
SET @@SESSION.GTID_NEXT= '71ae57d2-579b-11ea-bdd6-fa163e79967b:16996347'/*!*/;
# at 124591925
#210422 16:51:50 server id 1052825841  end_log_pos 124592142 CRC32 0x667b3dcc   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
XA START X'31302e3230342e32392e36332e746d313631393437303532353830363030313936',X'31302e3230342e32392e36332e746d323030',1096044365
/*!*/;
# at 124592142
#210422 16:51:50 server id 1052825841  end_log_pos 124592300 CRC32 0x44794e3c   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_register_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592300
#210422 16:51:50 server id 1052825841  end_log_pos 124592455 CRC32 0xcfce4445   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_login_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592455
#210422 16:51:50 server id 1052825841  end_log_pos 124592608 CRC32 0x06344d11   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_log_es_pay_2021_04 where dateNumber = '2021-04-26'
/*!*/;
# at 124592608
#210422 16:51:50 server id 1052825841  end_log_pos 124592757 CRC32 0xa64923b3   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
delete from tb_currency_stats where dateNumber = '2021-04-26'
/*!*/;
# at 124592757
#210422 16:51:50 server id 1052825841  end_log_pos 124592837 CRC32 0xde7c6010   Query   thread_id=83808424      exec_time=0     error_code=0
SET TIMESTAMP=1619081510/*!*/;
COMMIT
/*!*/;
这里不应该是COMMIT结尾,应该是XA COMMIT

zbdba
66 声望9 粉丝

数据库工程师一枚,网名zbdba,2017DTCC讲师,CRUG用户组人员,热爱开源DB内部原理探究,关注数据库架构、运维领域。