mysql - Failure Analysis | The impact of Sql_slave_skip_counter usage on replication - 个人文章

Author: Tang Longzhe
The Shanghai DBA of the Acsen delivery service team is responsible for the fault handling and daily maintenance of the MySQL database. Good at troubleshooting
Source of this article: original contribution
*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.

1. Background description

Recently, I encountered a master-slave failure problem caused by the irregular use of sql_slave_skip_counter in the customer environment.

We know that the master-slave replication of mysql is a logical replication, so the slave often encounters errors in the process of applying the relay log, and the parameter sql_slave_skip_counter can set how many events to skip, so that the slave database sql continues to play back. However, if sql_slave_skip_counter is used improperly, it will also cause inconsistent master-slave data.

The following is a screenshot of some of the captured error logs:

It can be seen that in order to skip the SQL that reported the error, SQL_SLAVE_SKIP_COUNTER=100000 has been manually set on site, which means that 100,000 events are skipped, and the number of binlog event events is related to transaction table/non-transaction table and binlog log mode.

Common error description:

write_rows: Duplicate entry (1062 error) primary key conflict, mainly manifested as repeated insertion of existing records;

update_rows : Can't find record (1032 error), unable to find the row record that needs to be updated.

sql_slave_skip_counter parameter description:

It is known from the official explanation that sql_slave_skip_counter is skipped in units of events, and will not stop until the event group where the Nth event is located is skipped. For transactional tables, one event group corresponds to one transaction; for non-transactional tables, one event group corresponds to one SQL statement. And an event group can contain multiple event events.

For a detailed explanation, you can refer to the official documentation:

https://dev.mysql.com/doc/refman/5.6/en/set-global-sql-slave-skip-counter.html .

https://dev.mysql.com/doc/refman/8.0/en/replication-administration-skip.html .

2. Test the impact of sql_slave_skip_counter on replication

Use the open transaction mode (begin/commit) to simulate delete/update and encounter Can't find record (1032) errors. Duplicate entry (1062) errors in insert can be analyzed using similar methods. MySQL version 8.0.23 is used here, and an asynchronous replication architecture is built based on Row log mode + Position mode.

2.1 Prepare data

Prerequisites (the master-slave library applies the following configuration):

 -- 关闭GTID
mysql> show global variables like '%gtid%';
+----------------------------------------+----------+
| Variable_name               | Value     |
+----------------------------------------+----------+
| binlog_gtid_simple_recovery    | OFF    |
| enforce_gtid_consistency        | OFF    |
| gtid_executed                |         |
| gtid_mode                    | OFF      |
| gtid_owned                    |          |
| gtid_purged                    |          |
| simplified_binlog_gtid_recovery    | OFF      |
+----------------------------------------+----------+

-- 禁用read_only
mysql> show global variables like 'read_only';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| read_only    | OFF  |
+------------------+-------+

-- 设置row模式
mysql> show global variables like 'binlog_format';
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| binlog_format | ROW |
+---------------+----------+

Create two tables in the main database, one is a transaction table and the other is a non-transaction table, and then write records with id=1 into each table.

 -- 主库创建测试表
create table replica_innodb (id int, msg char(32)) engine=innodb;
create table replica_myisam (id int, msg char(32)) engine=myisam;

-- 往测试表中添加数据 
insert into replica_innodb values(1,'innodb_001');
insert into replica_myisam values(1,'myisam_001');

2.2 Simulate sql_slave_skip_counter = N to skip replication failure scenarios

Description: This scenario will simulate the operation that the master database updates the records of the replica_myisam non-transactional table, causing the update failure of the slave database.

(1) Execute the delete statement from the library to delete the record of the non-transaction table id=1

 set sql_log_bin=0;
delete from replica_myisam where id=1;
set sql_log_bin=1;

(2) The main database updates the records of the replica_myisam non-transactional table id=1, and adds data to the two tables

 -- 编写insert存储过程
delimiter $$
create procedure p_insert()
begin
declare i int;
set i = 2;
while i < 6 do
insert into replica_innodb values (i, concat('innodb_', repeat(0, 2), i));
insert into replica_myisam values (i, concat('myisam_', repeat(0, 2), i));
set i = i + 1;
end while;
end$$
delimiter ;

-- 更新数据
begin;
update replica_myisam set msg='my01' where id=1;
call p_insert;
commit;

(3) Check the replication status

 Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Update_rows event on table testdb.replica_myisam; Can't find record in 'replica_myisam', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 1715

The above operation first deletes the non-transactional table data id=1 from the slave database, then the master database updates the data, and then transfers the records of the master database id=1 to the slave database through replication, which causes the slave database to fail during playback by the SQL thread. Update the non-existing record (1032 error is reported), causing the SQL thread to fail (the SQL thread has stopped at this time).

(4) Try to use sql_slave_skip_counter to skip errors and restart replication

 -- 从库跳过"10000"个event，并重启SQL线程
set global sql_slave_skip_counter=10000;
start slave sql_thread;

-- 查看数据，可以发现replica_myisam以及replica_innodb表id>=2的记录并未同步到从库
mysql> select * from replica_myisam;
Empty set (0.00 sec)

mysql> select * from replica_innodb\G
*************************** 1. row ***************************
 id: 1
msg: innodb_001
1 row in set (0.01 sec)

-- 查看复制状态为Yes，表示已经跳过冲突的event事件
mysql> show slave status\G
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Skip_Counter: 9955
Exec_Master_Log_Pos: 4493
Relay_Log_Space: 4854

According to the observed Skip_Counter and query results, all conflicting sql statements and events generated by normal sql statements are skipped. The skipped operations and the corresponding event types can be confirmed by the two values of Relay_Log_File and Relay_Log_Pos recorded in the error log when the SQL thread recovers.

 show relaylog events in "mysql-relay.000002" from 1533 limit 50;

It can be seen that based on the ROW log mode, transaction tables and non-transaction tables will generate at least 4 corresponding event events:

rows_query (only recorded in binlog log when binlog_rows_query_log_events=ON), table_map, update_rows/write_rows and query(begin)/xid(commit). In fact, all operations on transactional tables are in the same explicit transaction, and for all operations on non-transactional tables, each SQL statement corresponds to a separate transaction.

According to the commit order recorded in the binlog, it can be seen that the non-transaction table will be submitted first, and the innodb transaction table will be submitted after the commit is executed, so 1032 reports an error operation related to the replica_myisam table.

The role of each event event:

Event	effect
Query	Record transactions in the form of text, including DDL operations (ROW format), begin, etc.
Rows_query	Log DML operations (insert, delete, update)
Table_map	Library table information for logging events
Write_rows	For the insert operation, contains the data to be inserted
Update_rows	For the update operation, including the data before and after the modification
Xid	Assign a Xid (transaction id, unique) at the end of the transaction to mark the end of the transaction

(5) Simulate the main database to append data

 -- 主库追加数据
begin;
insert into replica_innodb values(6,'innodb_006');
commit;
select * from replica_innodb where id=6\G
*************************** 1. row ***************************
 id: 6
msg: innodb_006

-- 从库观察到IO及SQL线程状态为yes，复制正常且Skip_Counter值递减。
show slave status\G
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Skip_Counter: 9949
Exec_Master_Log_Pos: 4849
Relay_Log_Space: 5210

-- 从库检查事务表数据，未发现id=6的记录 
select * from replica_innodb\G
*************************** 1. row ***************************
 id: 1
msg: innodb_001

Note: The above operation does not apply the transaction submitted by the master library to the slave library.

Phenomenon description:

(1) Through the above operations, sql_slave_skip_counter=N(N>0) After restarting the slave replication thread, the event is skipped from the library. Every time an event is skipped, N is subtracted by 1 until N becomes 0. During this period, all conflicting sql statements and events generated by normal sql statements are recorded in this variable, and the current statement is executed from the library only after N becomes 0.
And when the Skip_Counter value is equal to 1, when operating on the transaction table, it is not only skipping an event, but skipping a complete transaction. If there are multiple sqls in the transaction, then when an error occurs, the remaining sql operations will be performed. Skip all together until the xid event is encountered. For example: the error occurs in the first sql statement, if the second sql statement affects 1000 rows of data, then the master and slave have 1000 rows of inconsistent data.
(2) Observe that the replication from the library is normal, the IO and SQL thread status is yes, the Skip_Counter value is decremented, and no new log is generated in the binlog, and no error message is reported in the error log.

2.3 sql_slave_skip_counter clear

In order to prevent the impact of sql_slave_skip_counter on replication, it is necessary to recover in time after error skipping and set the parameter to zero.

Method 1: Restart replication

Note: After MySQL 5.7 and 8.0 restart replication, the Skip_Counter counter is cleared to zero, and this method is invalid in MySQL 5.6.

 stop slave;
start slave;
show global variables like 'sql_slave_skip_counter'\G
*************************** 1. row ***************************
Variable_name: sql_slave_skip_counter
        Value: 0

Method 2: Manually set to zero, it will take effect immediately

 stop slave;
set global sql_slave_skip_counter=0;
start slave;
show variables like 'sql_slave_skip_counter'\G
*************************** 1. row ***************************
Variable_name: sql_slave_skip_counter
        Value: 0

Method 3: Restart the slave library

3. Summary

The specification uses sql_slave_skip_counter. If the operation is improper, it is very likely to skip other events and cause inconsistent master-slave data.

If the amount of database data is small, about 60G, it is recommended to redo the replication. The amount of data exceeds 60G. Before skipping sql_slave_skip_counter, you can see what operations are performed by the current binlog event group. For 1032 and 1062 errors, try to patch the data and let the replication process apply the changes from the library.

It is recommended to enable GTID (globally unique transaction ID) for MySQL replication. This method can quickly determine the consistency of master-slave data according to changes in the current transaction, and also strengthens fault recovery and fault tolerance. Note: This method does not support online GTID activation in 5.6, you need to stop the library first and then configure GTID related parameters.

Failure Analysis | The impact of Sql_slave_skip_counter usage on replication

1. Background description

The following is a screenshot of some of the captured error logs:

sql_slave_skip_counter parameter description:

2. Test the impact of sql_slave_skip_counter on replication

2.1 Prepare data

2.2 Simulate sql_slave_skip_counter = N to skip replication failure scenarios

2.3 sql_slave_skip_counter clear

3. Summary

爱可生开源社区

引用和评论

如何巧妙解决 Too many connections 报错？

Open WebUI：开源AI交互平台的全面解析

被 Manus 带火的 MCP 是什么｜一文看懂

MySQL × 向量数据库：大模型时代的黄金组合实战指南

百万级群聊的设计实践

分布式数据库解析

MySQL Binlog 深度解析