Author: Wang Shun

A member of the DBA team of Acsen, who handles database problems in the company's project, likes to learn technology and study technical problems.

Source of this article: original contribution

*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.


Problem background:

On the first night, the MySQL database was upgraded from version 5.6.22 to 5.7.32. The upgrade process was smooth and business verification was normal.

Two days later, the business staff reported that the database was stuck, and the database did not respond when sql was executed.

Phenomenon:

Connect to the database and see what you find...

 mysql> show processlist;
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id    | User        | Host                | db    | Command     | Time   | State                                                         | Info                                                                                                 |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
|     1 | system user |                     | NULL  | Connect     | 138205 | Waiting for master to send event                              | NULL                                                                                                 |
|     2 | system user |                     | NULL  | Connect     | 137998 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                 |
|  6891 | user05      | %:37029             | db01  | Query       | 110410 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                           |
| 11528 | user05      | %:44043             | db01  | Query       |  91406 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                           |
| 12126 | user02      | %:44966             | db02  | Query       |  89331 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY TRUNCATE PARTITION P20210810                        |
| 20896 | repl        | 172.24.64.88:54437  | NULL  | Binlog Dump |  53695 | Master has sent all binlog to slave; waiting for more updates   | NULL                                                                                               |
| 26520 | user03      | 172.24.64.197:45044 | db03  | Sleep       |     80 |                                                               | NULL                                                                                               |
| 26886 | user04      | 172.24.64.73:39607  | db04  | Execute     |  29567 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                 |
| 26887 | user04      | 172.24.64.73:39608  | db04  | Execute     |  29565 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                               |
......                                                                                                                                                          |
| 28069 | user02      | %:41400             | db02  | Query       |  24872 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY ADD PARTITION (PARTITION P20210811 VALUES IN ('202108 |
......                                                                                                                                     |
| 33666 | user05      | 10.1.16.10:47068    | db04  | Sleep       |   2275 |                                                               | NULL                                                                                                 |
| 33788 | user04      | 172.24.64.73:50227  | db04  | Execute     |   1833 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                                 |
| 33791 | user04      | 172.24.64.73:50236  | db04  | Execute     |   1826 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                   |
......                                                                                                                                      |
| 34095 | user05      | 10.1.16.10:54601    | db04  | Query       |    710 | Waiting for table metadata lock                               | SELECT * FROM `tb_par_user_accid_day` LIMIT 0, 1000                                                  |
......                                                     | show processlist                                                                                     |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
70 rows in set (0.00 sec)

From the output information, you can see that Waiting for table metadata lock appears in multiple sessions, and the two longest-executing sq are the two insert statements with id 6891 and 11528 respectively.

The schema involved in insert is db01, and the Waiting for table metadata lock is db02/db04.

It's very strange, the execution statement comes from a different schema, who holds the MDL lock to block the following statement? Can't figure it out...

Waiting for table metadata lock If you see the MDL lock waiting information, you need to check the reason.

 mysql> select * from performance_schema.metadata_locks;
Empty set (0.00 sec)

Return empty, no MDL lock related information was found, check the environment and find that MDL related instruments are not enabled.

Remark:

Provisional

Modify the performance_schema.setup_instruments table, but after the instance restarts, it will return to the default value.

UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';

permanent

set in config file

[mysqld]

performance-schema-instrument='wait/lock/metadata/sql/mdl=ON'

Keep checking the big business

After multiple executions of select * from INNODB_TRX to view the running transactions, you can see that the two executing insert statements have been executed and uncommitted.

 mysql> select * from INNODB_TRX\G
*************************** 1. row ***************************
                    trx_id: 328588354171408
                 trx_state: RUNNING
               trx_started: 2021-08-11 09:06:34
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 1755570
       trx_mysql_thread_id: 11528
                 trx_query: INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                    
      ……忽略字段信息          
       ,copartner_code          
       trx_operation_state: NULL
         trx_tables_in_use: 15
         trx_tables_locked: 3442
          trx_lock_structs: 1755570
     trx_lock_memory_bytes: 159899856
           trx_rows_locked: 91807791
         trx_rows_modified: 0
   trx_concurrency_tickets: 3903
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
                    trx_id: 328588354177792
                 trx_state: RUNNING
               trx_started: 2021-08-11 03:49:50
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 1856514
       trx_mysql_thread_id: 6891
                 trx_query: INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (
        deal_date                   
      ……忽略字段信息             
       ,copartner_code       
       trx_operation_state: fetching rows
         trx_tables_in_use: 15
         trx_tables_locked: 3597
          trx_lock_structs: 1856514
     trx_lock_memory_bytes: 169681104
           trx_rows_locked: 101871982
         trx_rows_modified: 0
   trx_concurrency_tickets: 2338
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
2 rows in set, 2 warnings (0.06 sec)

See what the complete sql statement is?

Execute show full processlist; see the complete execution statement. The previous check only executed the show processlist and did not see the completed statement.

 mysql> show full processlist;

| Id    | User        | Host                | db                 | Command     | Time   | State                                                         | Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
|     1 | system user |                     | NULL               | Connect     | 141084 | Waiting for master to send event                              | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|     2 | system user |                     | NULL               | Connect     | 140877 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|  6891 | user05      | %:37029             | db01               | Query       | 113289 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (                   
       ……忽略字段信息                                                                   
     )
    SELECT                      
       ……忽略字段信息                                                                
   FROM
   (
   SELECT DISTINCT
        t1.cust_no           as cust_no              
      ……忽略字段信息  ID                      
       ,t2.PRODPRCname             as PRODPRCname                     
   FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
            on t1.user_no=t2.user_no
        left join db04.TB_PAR_USER_ACCID_DAY t3
            on t1.user_no=t3.id_no
        left join db04.TB_PAR_USER_EXTEND_INFO t4
            on t1.user_no=t4.user_no
        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
            on t4.USER_ATTR2=t5.net_id
        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
            on t1.user_no=t6.id_no
        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
            on t1.user_no=t7.user_no
        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
            on t1.user_no=t8.id_no
        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
            on t8.contract_no=t9.contract_no
        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
            on t1.USER_STAT_CODE=t10.run_code
        left join db04.TB_PAR_USER_DEVELOP t11
            on t11.user_no=t1.user_no
        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
            on t11.develop_dep=t12.group_id
        left join db04.TB_RES_LOGIN_NO t13
            on t11.develop_name=t13.system_user_code
        left join db04.TB_RES_STAFF_INFO t14
            on t13.system_user_id=t14.STAFF_ID
        /*group by 
        ……忽略字段信息                             
  ) t
   /*GROUP BY
       ……忽略字段信息
       
| 11528 | user05        | %:44043             | db01               | Query       |  94285 | Sending data                                                  | INSERT 
        INTO db01.tb_mrt_par_user_order_info_d 
    (                    
       ……忽略字段信息                                                                            
     )
    SELECT                  
       ……忽略字段信息                                                                       
   FROM
   (
   SELECT DISTINCT
                t1.cust_no           as cust_no                  
      ……忽略字段信息              
             ,t2.PRODPRCname             as PRODPRCname                    
   FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
            on t1.user_no=t2.user_no
        left join db04.TB_PAR_USER_ACCID_DAY t3
            on t1.user_no=t3.id_no
        left join db04.TB_PAR_USER_EXTEND_INFO t4
            on t1.user_no=t4.user_no
        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
            on t4.USER_ATTR2=t5.net_id
        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
            on t1.user_no=t6.id_no
        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
            on t1.user_no=t7.user_no
        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
            on t1.user_no=t8.id_no
        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
            on t8.contract_no=t9.contract_no
        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
            on t1.USER_STAT_CODE=t10.run_code
        left join db04.TB_PAR_USER_DEVELOP t11
            on t11.user_no=t1.user_no
        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
            on t11.develop_dep=t12.group_id
        left join db04.TB_RES_LOGIN_NO t13
            on t11.develop_name=t13.system_user_code
        left join db04.TB_RES_STAFF_INFO t14
            on t13.system_user_id=t14.STAFF_ID

        /*group by 
        ……忽略字段信息                     
  ) t
   /*GROUP BY
        ……忽略字段信息

Through the complete sql, we can see that there is a multi-table related subquery in the inster statement, and the databases involved in the query are db02 and db04.

I learned from the business personnel that these two inserts are batch tasks. The query statements involved can run data more than 20 minutes before the upgrade, but no data is generated more than a day after the upgrade. What is the reason?

Get back to business

It has been confirmed that the two batch running tasks can be terminated first and then run again, and the termination will not affect online business.

Kill two inserts, the kill process is very fast, the query statement of the two statements does not run out of data, and the insert has not started to execute. After kill is executed successfully, the MDL lock waiting phenomenon disappears quickly.

Cause Analysis

According to the description of the previous business personnel, data can be generated by directly running the query statement in the insert statement for 20 minutes before the upgrade. After the database is upgraded, the query cannot run out of data. The two running batches were not verified after the test environment was upgraded.

I want to compare the difference between the execution plans of the same sql in the 5.6 and 5.7 environments, because the 5.6 environment has been upgraded and cannot be compared.

I have encountered the same 5.6 before and after upgrading to 5.7, sql runs slowly because of the derived table problem.

5.7 By default derived_merge = on, the derived tables will be merged, and then the where conditional filter will be performed, but the phenomenon that this will cause is that multiple unfiltered tables are directly merged, and the execution speed of sql is slowed down if the amount of data is too large.

Official documentation: https://dev.mysql.com/doc/refman/5.7/en/derived-table-optimization.html

solution

  1. is to turn off this optimizer configuration in the database
  2. is to rewrite the business sql

Due to the need to resume business use as soon as possible and negotiate with business personnel to close the derived_merge item, verify again that the running batch is running normally.


爱可生开源社区
426 声望209 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。