Author: Fu Xiang

Now living in Zhuhai, mainly responsible for Oracle, MySQL, mongoDB and Redis maintenance.

Source of this article: original contribution

*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.


1. Environmental description

1.1. MySQL version

root@3306 (none)> SELECT @@VERSION;
+------------+
| @@VERSION |
+------------+
| 5.7.34-log |
+------------+
1 row in set (0.01 sec)

1.2. binlog related parameters

root@3306 (none)> select @@log_bin,@@log_slave_updates;
+-----------+---------------------+
| @@log_bin | @@log_slave_updates |
+-----------+---------------------+
| 1 | 1 |
+-----------+---------------------+
1 row in set (0.00 sec)
root@3306 (none)>

1.3. GTID related parameters

root@3306 (none)> select
@@binlog_gtid_simple_recovery,@@enforce_gtid_consistency,@@gtid_mode;
+-------------------------------+----------------------------+-------------+
| @@binlog_gtid_simple_recovery | @@enforce_gtid_consistency | @@gtid_mode |
+-------------------------------+----------------------------+-------------+
| 1 | ON | ON |
+-------------------------------+----------------------------+-------------+
1 row in set (0.01 sec)
root@3306 (none)>

1.4. Semi-synchronous related parameters

root@3306 (none)> show global variables like '%semi%';
+-------------------------------------------+------------+
| Variable_name | Value |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled | ON |
| rpl_semi_sync_master_timeout | 1000 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
+-------------------------------------------+------------+
8 rows in set (0.00 sec)
root@3306 (none)>

1.5. Parameters related to multi-thread synchronization

root@3306 (none)> select
@@binlog_transaction_dependency_tracking,@@slave_parallel_type,@@slave_parallel_
workers;
+------------------------------------------+-----------------------+---------
-----------------+
| @@binlog_transaction_dependency_tracking | @@slave_parallel_type |
@@slave_parallel_workers |
+------------------------------------------+-----------------------+---------
-----------------+
| COMMIT_ORDER | LOGICAL_CLOCK |
2 |
+------------------------------------------+-----------------------+---------
-----------------+
1 row in set (0.00 sec)
root@3306 (none)>

2. Failure phenomenon

After the host where the MySQL slave library is located fails to restart, the sql_thread thread reports an error:

root@3306 (none)> show slave status\G
-- 摘取有用信息如下:
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were
error(s) in the worker(s). The most recent failure being: Worker 1 failed
executing transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' at master
log MySql-bin.000685, end_log_pos 1004756557. See error log and/or
performance_schema.replication_applier_status_by_worker table for more details
about this failure or others, if any.
......省略..................
Retrieved_Gtid_Set: 471c2974-f9bb-11eb-afb1-
52540010fb89:88313207-88341912
Executed_Gtid_Set: 471c2974-f9bb-11eb-afb1-52540010fb89:1-
88313206,
d4c228df-f9c6-11eb-a2d8-5254006f63b6:1-5
Auto_Position: 1
root@3306 (none)> select * from
performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 471c2974-f9bb-11eb-afb1-52540010fb89:88313207
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '471c2974-f9bb-
11eb-afb1-52540010fb89:88313207' at master log MySql-bin.000685, end_log_pos
1004756557; Could not execute Write_rows event on table kefumobile.i_sms_proxy;
Duplicate entry '14765130' for key 'PRIMARY', Error_code: 1062; handler error
HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1004756557
LAST_ERROR_TIMESTAMP: 2022-01-24 23:05:02
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)

It can be seen from the error message that when the worker thread plays back transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' , an error is reported due to the primary key conflict of the record to be inserted.

3. Failure analysis

Before the host restarts, the master-slave synchronization is normal. After the host restarts, the master-slave synchronization reports an error due to the primary key conflict. Compare the row record of the conflicting primary key.
Recorded in the master-slave library is consistent, preliminary analysis transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' in the host
It has already been played back from the library before the barrier, so why is the transaction repeated?

When gtid mode is enabled, if master_auto_position=1 is specified, when starting slave, the slave library will
The union of Retrieved_Gtid_Set and Executed_Gtid_Set is sent to the main library, and the main library will receive the union and its own
The gtid_executed comparison sends all the transactions missing from the gtid set of the slave library to the slave library.

After the host restarts, the transaction is replayed repeatedly, indicating that there are GTID transactions in the union of Retrieved_Gtid_Set and Executed_Gtid_Set
Lost, causing duplicate fetch transaction execution to throw a primary key violation error. Both Retrieved_Gtid_Set and Executed_Gtid_Set are memory variables.
After MySQL restarts, Retrieved_Gtid_Set is initialized to a null value, so it is inferred that Executed_Gtid_Set has GTID transaction loss.
lose.

Executed_Gtid_Set comes from the gtid_executed variable, and the gtid_executed variable persistence medium has
mysql.gtid_executed table and binlog, of which mysql.gtid_executed table was introduced after MySQL 5.7. In MySQL 5.6, to use GTID from the library, you must first set log_bin=on, log_slave_updates=on, because the GTID executed from the library is only Keep in binlog.

  1. When log_bin=on and log_slave_updates=off, the update of the gtid_executed variable is persisted to the mysql.gtid_executed table in real time, and the value of the gtid_executed variable is read from the mysql.gtid_executed table after MySQL restarts.
  2. When log_bin=on, log_slave_updates=on, the mysql.gtid_executed table will be updated only when the binlog is switched, and the GTID set that has been executed until the last binlog is saved. After mysql restarts, when the default parameter binlog_gtid_simple_recovery=1, the gtid_executed variable value is calculated from the last binlog file.

The value of the gtid_executed variable is outdated, and it is inferred that binlog is not persistent in real time. Let's take a look at the parameter sync_binlog:

root@3306 (none)> show variables like 'sync_binlog';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog | 600 |
+---------------+-------+
1 row in set (0.00 sec)

Through the above analysis, the ins and outs of this fault are clear:

Worker thread reports 1062 primary key conflict error --> gtid_executed information is outdated --> binlog is not persisted in real time

4. Test verification

Build a master-slave test environment, simulate the concurrent insertion of the master library through sysbench, and reproduce the fault after the slave library host is violently shut down:

root@mysql.sock][(none)]> select * from
performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4a0ad3da-b89e-11eb-9d0b-000c299b4d6c:452362
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4a0ad3da-b89e-
11eb-9d0b-000c299b4d6c:452362' at master log MySql-bin.000012, end_log_pos
1011339749; Could not execute Write_rows event on table sbtest.sbtest5;
Duplicate entry '111305' for key 'PRIMARY', Error_code: 1062; handler error
HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1011339749
LAST_ERROR_TIMESTAMP: 2022-01-26 09:56:38
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)
[root@mysql.sock][(none)]>

5. Troubleshooting

Since the cause of the error is the repeated execution of the transaction, it is good to skip the error. There are two ways as follows, choose one of the ways to execute according to your needs:

5.1. Replacing the error-reporting transaction execution with an empty transaction

set gtid_next='xxxxxx';
begin;
commit;
set gtid_next=AUTOMATIC;
start slave sql_thread

If there are too many GTIDs lost in the latest binglog, manual execution is cumbersome and requires constant trial and error. You can write a stored procedure for batch execution:

set sql_log_bin=0;
DELIMITER $$
create procedure p_fx_gtid_next(i_master_Executed_Gtid_max varchar(100))
begin
declare v_gtid_next varchar(100);
declare master_Executed_Gtid_max varchar(100);
declare slave_Executed_Gtid_max varchar(100);
-- 主库当前执行了的gtid最大值,做为退出循环条件 show master status
set master_Executed_Gtid_max=i_master_Executed_Gtid_max;
loop_name:loop
SELECT ifnull(min(LAST_SEEN_TRANSACTION),'empty') FROM
performance_schema.replication_applier_status_by_worker WHERE
LAST_ERROR_NUMBER=1062
into v_gtid_next;
if v_gtid_next <> 'empty' then
set gtid_next = v_gtid_next;
start transaction;
commit;
set gtid_next =AUTOMATIC;
start slave sql_thread;
end if;
select max(LAST_SEEN_TRANSACTION) from
performance_schema.replication_applier_status_by_worker into
slave_Executed_Gtid_max;
if slave_Executed_Gtid_max = master_Executed_Gtid_max then
leave loop_name;
end if;
select SLEEP(1);
end loop;
end $$
DELIMITER ;
set sql_log_bin=1;
call p_fx_gtid_next('XXXXX:XXX');

5.2. Restart MySQL with parameter slave_skip_errors=1062

After the master-slave synchronization is normal, cancel the parameter slave_skip_errors setting and restart MySQL.


爱可生开源社区
426 声望207 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。