Author: Wang Shun
A member of the DBA team of Acsen, who handles database problems in the company's project, likes to learn technology and study technical problems.
Source of this article: original contribution
*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.
Problem background:
On the first night, the MySQL database was upgraded from version 5.6.22 to 5.7.32. The upgrade process was smooth and business verification was normal.
Two days later, the business staff reported that the database was stuck, and the database did not respond when sql was executed.
Phenomenon:
Connect to the database and see what you find...
mysql> show processlist;
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| 1 | system user | | NULL | Connect | 138205 | Waiting for master to send event | NULL |
| 2 | system user | | NULL | Connect | 137998 | Slave has read all relay log; waiting for more updates | NULL |
| 6891 | user05 | %:37029 | db01 | Query | 110410 | Sending data | INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
deal_date |
| 11528 | user05 | %:44043 | db01 | Query | 91406 | Sending data | INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
deal_date |
| 12126 | user02 | %:44966 | db02 | Query | 89331 | Waiting for table metadata lock | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY TRUNCATE PARTITION P20210810 |
| 20896 | repl | 172.24.64.88:54437 | NULL | Binlog Dump | 53695 | Master has sent all binlog to slave; waiting for more updates | NULL |
| 26520 | user03 | 172.24.64.197:45044 | db03 | Sleep | 80 | | NULL |
| 26886 | user04 | 172.24.64.73:39607 | db04 | Execute | 29567 | Waiting for table metadata lock | TRUNCATE TABLE tb_par_user_develop |
| 26887 | user04 | 172.24.64.73:39608 | db04 | Execute | 29565 | Waiting for table metadata lock | TRUNCATE TABLE tb_par_user_accid_day |
...... |
| 28069 | user02 | %:41400 | db02 | Query | 24872 | Waiting for table metadata lock | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY ADD PARTITION (PARTITION P20210811 VALUES IN ('202108 |
...... |
| 33666 | user05 | 10.1.16.10:47068 | db04 | Sleep | 2275 | | NULL |
| 33788 | user04 | 172.24.64.73:50227 | db04 | Execute | 1833 | Waiting for table metadata lock | TRUNCATE TABLE tb_par_user_accid_day |
| 33791 | user04 | 172.24.64.73:50236 | db04 | Execute | 1826 | Waiting for table metadata lock | TRUNCATE TABLE tb_par_user_develop |
...... |
| 34095 | user05 | 10.1.16.10:54601 | db04 | Query | 710 | Waiting for table metadata lock | SELECT * FROM `tb_par_user_accid_day` LIMIT 0, 1000 |
...... | show processlist |
+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
70 rows in set (0.00 sec)
From the output information, you can see that Waiting for table metadata lock appears in multiple sessions, and the two longest-executing sq are the two insert statements with id 6891 and 11528 respectively.
The schema involved in insert is db01, and the Waiting for table metadata lock is db02/db04.
It's very strange, the execution statement comes from a different schema, who holds the MDL lock to block the following statement? Can't figure it out...
Waiting for table metadata lock If you see the MDL lock waiting information, you need to check the reason.
mysql> select * from performance_schema.metadata_locks;
Empty set (0.00 sec)
Return empty, no MDL lock related information was found, check the environment and find that MDL related instruments are not enabled.
Remark:
Provisional
Modify the performance_schema.setup_instruments table, but after the instance restarts, it will return to the default value.
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';
permanent
set in config file
[mysqld]
performance-schema-instrument='wait/lock/metadata/sql/mdl=ON'
Keep checking the big business
After multiple executions of select * from INNODB_TRX to view the running transactions, you can see that the two executing insert statements have been executed and uncommitted.
mysql> select * from INNODB_TRX\G
*************************** 1. row ***************************
trx_id: 328588354171408
trx_state: RUNNING
trx_started: 2021-08-11 09:06:34
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 1755570
trx_mysql_thread_id: 11528
trx_query: INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
deal_date
……忽略字段信息
,copartner_code
trx_operation_state: NULL
trx_tables_in_use: 15
trx_tables_locked: 3442
trx_lock_structs: 1755570
trx_lock_memory_bytes: 159899856
trx_rows_locked: 91807791
trx_rows_modified: 0
trx_concurrency_tickets: 3903
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
trx_id: 328588354177792
trx_state: RUNNING
trx_started: 2021-08-11 03:49:50
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 1856514
trx_mysql_thread_id: 6891
trx_query: INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
deal_date
……忽略字段信息
,copartner_code
trx_operation_state: fetching rows
trx_tables_in_use: 15
trx_tables_locked: 3597
trx_lock_structs: 1856514
trx_lock_memory_bytes: 169681104
trx_rows_locked: 101871982
trx_rows_modified: 0
trx_concurrency_tickets: 2338
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
2 rows in set, 2 warnings (0.06 sec)
See what the complete sql statement is?
Execute show full processlist; see the complete execution statement. The previous check only executed the show processlist and did not see the completed statement.
mysql> show full processlist;
| Id | User | Host | db | Command | Time | State | Info
| 1 | system user | | NULL | Connect | 141084 | Waiting for master to send event | NULL |
| 2 | system user | | NULL | Connect | 140877 | Slave has read all relay log; waiting for more updates | NULL |
| 6891 | user05 | %:37029 | db01 | Query | 113289 | Sending data | INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
……忽略字段信息
)
SELECT
……忽略字段信息
FROM
(
SELECT DISTINCT
t1.cust_no as cust_no
……忽略字段信息 ID
,t2.PRODPRCname as PRODPRCname
FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
on t1.user_no=t2.user_no
left join db04.TB_PAR_USER_ACCID_DAY t3
on t1.user_no=t3.id_no
left join db04.TB_PAR_USER_EXTEND_INFO t4
on t1.user_no=t4.user_no
left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
on t4.USER_ATTR2=t5.net_id
left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
on t1.user_no=t6.id_no
left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
on t1.user_no=t7.user_no
left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
on t1.user_no=t8.id_no
left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
on t8.contract_no=t9.contract_no
left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
on t1.USER_STAT_CODE=t10.run_code
left join db04.TB_PAR_USER_DEVELOP t11
on t11.user_no=t1.user_no
left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
on t11.develop_dep=t12.group_id
left join db04.TB_RES_LOGIN_NO t13
on t11.develop_name=t13.system_user_code
left join db04.TB_RES_STAFF_INFO t14
on t13.system_user_id=t14.STAFF_ID
/*group by
……忽略字段信息
) t
/*GROUP BY
……忽略字段信息
| 11528 | user05 | %:44043 | db01 | Query | 94285 | Sending data | INSERT
INTO db01.tb_mrt_par_user_order_info_d
(
……忽略字段信息
)
SELECT
……忽略字段信息
FROM
(
SELECT DISTINCT
t1.cust_no as cust_no
……忽略字段信息
,t2.PRODPRCname as PRODPRCname
FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1
left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2
on t1.user_no=t2.user_no
left join db04.TB_PAR_USER_ACCID_DAY t3
on t1.user_no=t3.id_no
left join db04.TB_PAR_USER_EXTEND_INFO t4
on t1.user_no=t4.user_no
left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5
on t4.USER_ATTR2=t5.net_id
left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6
on t1.user_no=t6.id_no
left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7
on t1.user_no=t7.user_no
left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8
on t1.user_no=t8.id_no
left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9
on t8.contract_no=t9.contract_no
left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10
on t1.USER_STAT_CODE=t10.run_code
left join db04.TB_PAR_USER_DEVELOP t11
on t11.user_no=t1.user_no
left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12
on t11.develop_dep=t12.group_id
left join db04.TB_RES_LOGIN_NO t13
on t11.develop_name=t13.system_user_code
left join db04.TB_RES_STAFF_INFO t14
on t13.system_user_id=t14.STAFF_ID
/*group by
……忽略字段信息
) t
/*GROUP BY
……忽略字段信息
Through the complete sql, we can see that there is a multi-table related subquery in the inster statement, and the databases involved in the query are db02 and db04.
I learned from the business personnel that these two inserts are batch tasks. The query statements involved can run data more than 20 minutes before the upgrade, but no data is generated more than a day after the upgrade. What is the reason?
Get back to business
It has been confirmed that the two batch running tasks can be terminated first and then run again, and the termination will not affect online business.
Kill two inserts, the kill process is very fast, the query statement of the two statements does not run out of data, and the insert has not started to execute. After kill is executed successfully, the MDL lock waiting phenomenon disappears quickly.
Cause Analysis
According to the description of the previous business personnel, data can be generated by directly running the query statement in the insert statement for 20 minutes before the upgrade. After the database is upgraded, the query cannot run out of data. The two running batches were not verified after the test environment was upgraded.
I want to compare the difference between the execution plans of the same sql in the 5.6 and 5.7 environments, because the 5.6 environment has been upgraded and cannot be compared.
I have encountered the same 5.6 before and after upgrading to 5.7, sql runs slowly because of the derived table problem.
5.7 By default derived_merge = on, the derived tables will be merged, and then the where conditional filter will be performed, but the phenomenon that this will cause is that multiple unfiltered tables are directly merged, and the execution speed of sql is slowed down if the amount of data is too large.
Official documentation: https://dev.mysql.com/doc/refman/5.7/en/derived-table-optimization.html
solution
- is to turn off this optimizer configuration in the database
- is to rewrite the business sql
Due to the need to resume business use as soon as possible and negotiate with business personnel to close the derived_merge item, verify again that the running batch is running normally.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。