Author: Yu Zhenxing

A member of the Aikesheng DBA team, familiar with Oracle, MySQL, MongoDB, Redis, and recently working on TiDB. He is good at architecture design, fault diagnosis, data migration, disaster recovery construction, etc. Responsible for the daily operation and maintenance of the customer's MySQL and our self-developed DMP database management platform. Passionate about technology sharing and writing technical documents.

Source of this article: original submission

* Produced by the Aikesheng open source community, original content is not allowed to be used without authorization, please contact the editor and indicate the source for reprinting.
Event background


Event background

Customer feedback after the evening data run batch, when querying the data of the relevant table, it was found that some data of the table did not exist in the database. From the log of the application run batch, no error was reported in the run batch, and the log can be found to be clearly displayed. At that time, that batch of data has been inserted into the database, and we need to help analyze the reason for the loss of this batch of data.

Remarks: Considering the sensitivity of information, the following analysis scenarios test environment simulation, and the relevant data is explained below

  • The library table involved is demo.t_dba_info table
  • The missing data is the record insert into t_dba_info(name,age) values('zhenxing',20);

Failure analysis

1. Log in to the database to confirm whether the record exists

Obviously, the data does not exist in the database as the customer said

2. Confirm the time interval when the data is lost and analyze the binlog

Here I am a simulation environment, directly parsed in the main library, and the production environment is recommended to be parsed from the library to avoid affecting the main library
BINLOG_LIST='mysql-bin.000002  mysql-bin.000003  mysql-bin.000004  mysql-bin.000005  mysql-bin.000006  mysql-bin.000007'

for binlog in ${BINLOG_LIST}
do 
    echo "====== ${binlog}"
    mysqlbinlog -vv ${binlog}|grep -iEw "zhenxing"
done

You can see that we parsed and searched for the zhenxing , and indeed found that the data was inserted into the database, so from the general idea, we only need to continue to parse the binlog to find out if there is a DELETE or UPDATE operation for this record. Can

3. Analyze the binlog to view the modification operations on this table

Filter out which DELETE did 061cd6e1091757 or UPDATE
## 这里我通过已知的故障时间区间将涉及的binlog列出来做循环解析
BINLOG_LIST='mysql-bin.000002  mysql-bin.000003  mysql-bin.000004  mysql-bin.000005  mysql-bin.000006  mysql-bin.000007'

for binlog in ${BINLOG_LIST}
do 
    echo "====== ${binlog}"
    mysqlbinlog --no-defaults --base64-output=decode-rows -vv ${binlog} | awk '/###/ {if($0~/UPDATE|INSERT|DELETE/)count[$2" "$NF]++}END{for(i in count) print i,"\t",count[i]}' | column -t | sort -k2nr|grep -i t_dba_info 
done

By parsing the binlog, you can see that the only operation on the table is mysql-bin.000006. The binlog file has 2 UPDATE operations, and the others are INSERT. Next, we only need to continue to parse the mysql-bin.000006 file and search to see if Whether to modify the record of zhenxing

4. Analyze and locate the binlog

## 用最简单的命令直接解析并搜索对demo.t_dba_info表插入的zhenxing这条记录
[root@10-186-61-100 binlog]# mysqlbinlog -vv mysql-bin.000006|less

Through analysis, it is found that the UPDATE operation of the demo.t_dba_info table in this binlog file is not for the zhenxing record. analyzed here and found that it is more confusing. The data is clearly inserted, and it disappears without modification. Could it be done? Some special operations

5. Eliminate the possibility of some special operations

  1. When inserting this data, the main library binlog clearly has a record, is it possible to do set session sql_log_bin=off when deleting this record and not record the binlog

    • This can be preliminarily excluded by querying whether the record exists under the slave database. The customer's production environment is a one-master multi-slave architecture, and the slave database does not have this record, and the possibility of
  2. Is it possible that this table has DDL behavior in addition to DML behavior, such as reconstruction, but the batch of data is not reinserted into the table after reconstruction, so try to parse the binlog to see if the DDL operation behavior of the table exists

    BINLOG_LIST='mysql-bin.000002  mysql-bin.000003  mysql-bin.000004  mysql-bin.000005  mysql-bin.000006  mysql-bin.000007'
    
    for binlog in ${BINLOG_LIST}
    do 
     echo "====== ${binlog}"
     mysqlbinlog ${binlog}|egrep -iEw "truncate|create|drop"
    done


    I found some clues. There are 2 truncate operations on the table in mysql-bin.000004, and so on. It seems that something has been found. The missing data is also in this mysql-bin.000004 file. Sort out the logic. Could it be that the record was between 2 truncate, so I analyzed this binlog in detail and got the following information

    #211211 14:52:47 truncate table t_dba_info
    #211211 14:53:00 insert into t_dba_info(name,age) values('zhenxing',20)
    #211211 14:53:18 truncate table t_dba_info

    basically understand why this record is weird. I lost . I confirmed the logic of running batch and filling data with the customer. I learned that it will truncate the table. However, due to a misoperation , it was triggered again after the batch started. A round of truncate behavior caused part of the data that has been inserted into the table to be cleaned up again, which caused some records to be lost when parsing the binlog, but no deletion was observed, but was cleaned up by truncate.

6. Fault summary

This article is a practical case of binlog analysis. Binlog records a lot of information. Binlog can be analyzed in different dimensions. At the same time, there are also some skills in the configuration and use of binlog online environment. For example, in this case, the online environment is Standardized deployment, reasonable parameter settings, the binlog parsing time will not be too long due to a single binlog file being too large, and the parameter such as binlog_rows_query_log_events is turned on, so that specific SQL statements can be clearly recorded in row mode.
max_binlog_size = 250M
binlog_rows_query_log_events = 1

Remarks: Although the binlog records enough information, when the cause of the fault is located, because it does not record the IP and user , if you do not open the audit, you can only know that the behavior has occurred, but you cannot be specific. Locate the "person" that triggered the behavior.

7. Binlog parsing skills

  1. Try to parse from the library as much as possible to avoid affecting the main library
  2. First, roughly locate the binlog involving related library table operations, and then analyze the data in the corresponding binlog separately
  3. No need to add -v to output detailed information when parsing DDL (speed up parsing)
  4. If the binlog_rows_query_log_events parameter is turned on, you need to use the -vv parameter to display the specific SQL statement

爱可生开源社区
426 声望209 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。