Author: Hu Chengqing

A member of the DBA team of Aikesheng, good at failure analysis and performance optimization, personal blog: https://www.jianshu.com/u/a95ec11f67a8 , welcome to discuss.

Source of this article: original contribution

*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.


Replication exception

During the replication process, if the main library fails or the network is interrupted, the slave io thread will be interrupted, and it is possible that the slave library only copies half of the transaction. For example, the transactions performed by the main library are as follows:

 begin;
insert 1;
insert 2;
commit;

The binlog received from the library may contain only part of the transaction, such as:

  • Case 1: Only contains begin;
  • Case 2: Contains only begin;insert 1;
  • Case 3: Contains only begin;insert 1;insert 2;

After the slave sql thread of the slave library plays back this part of the binlog, it will wait for the slave io thread to read the remaining binlog from the main library. Before that, the sql thread plays back the half transaction, just like we manually execute the half transaction, not Will commit and will not roll back.

How should we deal with this exception?

  • When the slave io thread resumes, what should be done?
  • What should I do when the slave io thread cannot be resumed?

experiment procedure

testing method:

 ##1. 在从库上用 tc 模拟网络延迟,意在使读取 binlog 的速度变慢
tc qdisc add dev eth0 root netem delay 3000ms 3000ms 100%

##2. 在主库执行一个多语句事务
begin;
update t2 set pad='4' where id < 40;
update t2 set pad='5' where id < 50;
update t2 set pad='6' where id < 60;
commit;

##3. 在主库执行 commit 成功后,立刻用 iptables 切断主从之间的网络
iptables -A OUTPUT -d 172.16.21.4 -j DROP
iptables -A INPUT -s 172.16.21.4 -j DROP

In this way, the phenomenon we can observe from the library is:

  • One of the worker thread status is Waiting for an event from Coordinator , this status indicates that the worker thread has finished its work and is waiting for the Coordinator (coordination thread) to allocate a new relay log event, but it also shows that it is executing update t2 set pad='5' where id < 50 , which is contradiction 1:

  • show slave status the output, Retrieved_Gtid_Set --- is equal to Executed_Gtid_Set (meaning that the sql thread has finished playing back all the relay logs), but the worker thread in the above figure is playing back SQL again, This is paradox 2:

Finally, through the real hammer of the relay log, we can see that the relay log of this transaction is not complete, and it ends at update t2 set pad='5' where id < 50 ; this Rows_query event :

When the slave io thread cannot be resumed

If the slave io thread cannot be recovered for a long time, the sql thread will not be able to commit or roll back because it cannot wait for the remaining binlog, and will always hold the lock of this transaction:

If the slave io thread is abnormal due to the failure of the main library, it is likely to perform a master-slave switch. After the slave library is promoted to master, the transaction lock held by the SQL thread may block business requests.

At this point, stop slave should stop the sql thread and let the transaction roll back to release the lock. It should be noted that: in this case, the stop slave will wait for 60 seconds (wait for the slave io thread to receive the remaining binlog of the transaction), and the sql thread will be stopped after the 60-second timeout:

When the slave io thread resumes

After the slave io thread is interrupted abnormally, the sql thread works normally, the sql thread executes part of the transaction, and will wait for the io thread to send a new binlog. After the slave io thread thread is restored, if it is based on GTID replication, it will re-acquire the complete binlog from the current GTID transaction. The slave library will roll back the current transaction first, and then replay the newly received binlog.


爱可生开源社区
426 声望209 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。