Author: Chen Wei
Database engineer at Axon, responsible for daily MySQL maintenance and troubleshooting.
Source of this article: original contribution
*This content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and indicate the source.
1. Fault description
Some time ago, a customer reported replication error 1236. According to the error message, the slave had tried to read a binlog that does not exist on the master, which interrupted replication.
2. Failure analysis
Note: Considering the sensitivity of customer information, the following analysis is simulated in a test environment. The database version is 5.7.31 and GTID is enabled.
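For reference, a quick way to confirm that a test instance matches this setup (version and GTID mode) is shown below; connection options are omitted and the exact values depend on your environment:

```bash
# Confirm server version and GTID settings on the test instance
mysql -e "SELECT @@version, @@gtid_mode, @@enforce_gtid_consistency;"
```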
2.1. Log in to the slave and check the error message
We find that the slave's IO thread has stopped with the error: Got fatal error 1236 from master when reading data from binary log: 'Could not open log file'. Taken literally, the slave cannot obtain the binlog from the master. The retrieved GTID set stops at transaction 828415, and the slave's error log records the corresponding error as well.
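A rough sketch of how this state shows up on the slave (the field names are standard SHOW SLAVE STATUS output; connection options omitted, exact values depend on the environment):

```bash
# On the slave: the IO thread is stopped, Last_IO_Errno is 1236,
# and Retrieved_Gtid_Set stops at transaction 828415
mysql -e "SHOW SLAVE STATUS\G" | \
  grep -E "Slave_IO_Running|Last_IO_Errno|Last_IO_Error|Retrieved_Gtid_Set"

# Locate the slave's error log, which records the same failure
mysql -e "SELECT @@log_error;"
```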
Based on the error reported by the slave, the next step is to check the master's binlog files and error log to see whether a binlog is missing on the master.
2.2. Check the master's binlog information and error log
Logging in to the master to check the binlog file list and the index file, we find that the master's binlogs are numbered continuously and look fine. However, the mysql-bin.index file contains two entries for mysql-bin.000006, while only one mysql-bin.000006 file actually exists on disk. Keep this odd phenomenon in mind.
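Roughly how this check looks on the master (the datadir path is an assumption; adjust it to your environment):

```bash
# List the binlogs the server knows about
mysql -e "SHOW BINARY LOGS;"

# Inspect the index file directly -- in the failure case mysql-bin.000006
# appears twice here even though only one such file exists on disk
cat /data/mysql/mysql-bin.index
ls -l /data/mysql/mysql-bin.00000*
```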
Now look at the master's error log: it points to the same binlog as the slave's error log.
At this point we can confirm that the problem lies in mysql-bin.000006. Since the listing above shows that the master does have a mysql-bin.000006 file, let's parse the current mysql-bin.000006 and the previous log, mysql-bin.000005.
Parsing mysql-bin.000005 shows that the GTID of its last transaction is 'c582a82e-b985-11ec-adf5-02000aba3e89:828415', which is exactly where the slave's retrieved GTID set stopped.
Parsing mysql-bin.000006 shows that the GTID of its first transaction is 'c582a82e-b985-11ec-adf5-02000aba3e89:855706'.
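The GTID boundaries can be pulled out with mysqlbinlog, for example (file paths are assumptions):

```bash
# Last GTID written to mysql-bin.000005 -- ends at ...:828415
mysqlbinlog /data/mysql/mysql-bin.000005 | grep "SET @@SESSION.GTID_NEXT" | tail -n 1

# First GTID in the current mysql-bin.000006 -- starts at ...:855706
mysqlbinlog /data/mysql/mysql-bin.000006 | grep "SET @@SESSION.GTID_NEXT" | head -n 1
```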
The first GTID in mysql-bin.000006 is 855706, more than 20,000 transactions away from where mysql-bin.000005 ends, so the two files are not contiguous and transactions have indeed been lost. The remaining questions are why mysql-bin.index records two identical mysql-bin.000006 entries, and how the customer's failure was triggered.
3. Failure reproduction
3.1. Reproduce the duplicate binlog file name in the index file
Based on the results of the analysis, we first reproduce the scenario where a binlog file name appears twice in the binlog index file.
Step 1: log in to the master host and, while replication is healthy, manually rm the binlog file the master is currently writing to.
Step 2: connect to the master and run flush logs to rotate the log. The master regenerates the binlog file that was just deleted. The master's binlog sequence now looks continuous, but the index file contains two entries for the manually deleted file, i.e. the index records two identical file names. This matches the customer's situation.
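The two steps look roughly like this (destructive, test environment only; the file name and path are assumptions):

```bash
# Step 1: on the master host, delete the binlog currently being written
rm /data/mysql/mysql-bin.000006

# Step 2: rotate the logs; the same file name is generated again
mysql -e "FLUSH LOGS;"

# The file name now appears twice in the index file
grep -c "mysql-bin.000006" /data/mysql/mysql-bin.index   # expect 2
mysql -e "SHOW BINARY LOGS;"
```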
So why does the binlog index file end up with two binlog.000006 entries? The test above shows that manually rm-ing the binlog file in use does not affect the binlog file counter. When the binlog is rotated (server restart, flush logs, the file reaching its maximum size, etc.), the counter increments by 1 from the largest existing binlog file. In the scenario above, the first binlog.000006 was recorded normally in binlog.index. After binlog.000006 was rm-ed, the largest binlog file on disk was binlog.000005, so flush logs on the master generated a second binlog.000006 from it, and binlog.index ended up with two binlog.000006 entries. In reality, though, the master has lost one binlog.000006: the GTIDs in the surviving binlog.000006 are not contiguous with binlog.000005, even though the file names are still sequential.
3.2. Reproduce the customer error: Got fatal error 1236 from master when reading data from binary log: 'Could not open log file'
While testing 3.1, we found that the customer's failure does not always occur and that other phenomena can appear. Let's first reproduce the customer's scenario.
First, there must be some replication lag between master and slave: the slave is still reading the master's binlog.000006.
At this point, manually rm binlog.000007, which the master is currently using. Although binlog.000007 has been deleted, the master still holds it open in the background. A new binlog.000007 has not been generated yet, but the index file still contains a binlog.000007 entry.
When the slave finishes reading binlog.000006 and tries to fetch the binlog.000007 recorded in the binlog index file, the file has been deleted and the master has not generated a new binlog.000007, so the slave reports the error.
After replication breaks, the master keeps writing to the deleted binlog file until flush logs is triggered, at which point the mysql-bin.index situation from 3.1 appears. At this point the customer's replication failure has been fully reproduced.
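The whole sequence, roughly, assuming the master is currently writing mysql-bin.000007 and the slave is still lagging on mysql-bin.000006 (paths and connection options omitted):

```bash
# On the master: confirm the current binlog, then delete it while it is still open
mysql -e "SHOW MASTER STATUS;"        # shows mysql-bin.000007 as the current file
rm /data/mysql/mysql-bin.000007       # file is gone from disk, but still held open

# On the slave: once it finishes mysql-bin.000006 and requests mysql-bin.000007
# from the index, the IO thread stops with error 1236
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Last_IO_Error"
```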
3.3. Other scenarios
During reproduction we found other scenarios as well. For brevity they are only described briefly here; interested readers can test them on their own.
Scenario one:
The master never reports an error, but the slave loses GTIDs. Like the customer failure above, this scenario requires some replication lag. The difference is that by the time the slave has finished reading binlog.xxx and goes to fetch the binlog.yyy recorded in the index file (the file deleted with rm), the master has already triggered flush logs and generated a new binlog.yyy, and the slave's IO thread reads this new binlog.yyy. In this case the slave's GTID set has gaps, and master and slave are inconsistent.
Scenario two:
When there is little or no replication lag, the slave has already read up to the binlog.yyy that the master is currently using. If the in-use binlog.yyy is deleted at this point, it is still held open in the background, so replication still looks healthy on the slave. Once the master rotates its binlog (flush logs, the file reaching its maximum size, etc.) and generates a new binlog.yyy, the slave's GTID set stops advancing, and the slave soon reports error 1236.
The reason error 1236 appears after the new binlog.yyy is generated is that the new file's events start at position 4, while the slave has already applied up to a position in the old binlog.yyy far beyond 4. The slave therefore cannot read from the new binlog.yyy at that position and reports the error.
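One way to observe the position mismatch described above (connection options are placeholders; the interpretation in the comments follows the explanation in this section):

```bash
# On the slave: the file and position it wants to continue reading from
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Master_Log_File|Read_Master_Log_Pos"

# On the master: the regenerated binlog.yyy is far smaller than that position,
# so the slave cannot resume from it and reports error 1236
mysql -e "SHOW BINARY LOGS;"
```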
4. Recommendations
The tests above show that manually rm-ing a binlog that is in use almost always leads to replication errors or master-slave inconsistency, and once that happens there is generally no better fix than rebuilding the slave, which makes the database harder to maintain. We therefore recommend:
- Avoid compressing or deleting binlog files directly; control cleanup by adjusting the binlog expiration parameters instead.
- When binlogs must be removed manually, connect to MySQL and use the purge command instead of rm (see the sketch below).
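A minimal sketch of both approaches on MySQL 5.7 (the retention values and file name are examples only; connection options omitted):

```bash
# Let the server expire old binlogs automatically (MySQL 5.7: expire_logs_days);
# persist the setting in my.cnf as well so it survives a restart
mysql -e "SET GLOBAL expire_logs_days = 7;"

# Manual cleanup: PURGE keeps the index file consistent, unlike rm
mysql -e "PURGE BINARY LOGS TO 'mysql-bin.000010';"
mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY);"
```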