Author: Liu Kaiyang

The Beijing DBA of the Aikesheng delivery service team has a strong interest in learning about databases and peripheral technologies, and likes to read books and pursue technology.

Source of this article: original contribution

*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.


In the investigation of many difficult problems, the editor recently encountered a problem that the execution of a select statement will cause MySQL to crash, and I would like to share it with you.

Look at the error first

Generally speaking, as long as the database crashes, the error log will definitely leave clues. Let's first look at the specific error report:

 06:08:23 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x7f55ac0008c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f56f4074d80 thread_stack 0x46000
/usr/local/mysql/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x1f1b71e]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x323) [0xfcfac3]
/lib64/libpthread.so.0(+0xf630) [0x7f5c28c85630]
/usr/local/mysql/bin/mysqld(actual_key_parts(KEY const*)+0xa) [0xef55ca]
/usr/local/mysql/bin/mysqld(calculate_key_len(TABLE*, unsigned int, unsigned long)+0x28) [0x10da428]
/usr/local/mysql/bin/mysqld(handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+0x261) [0x10dac51]
/usr/local/mysql/bin/mysqld(check_unique_constraint(TABLE*)+0xa3) [0xe620e3]
/usr/local/mysql/bin/mysqld(do_sj_dups_weedout(THD*, SJ_TMP_TABLE*)+0x111) [0xe62361]
/usr/local/mysql/bin/mysqld(WeedoutIterator::Read()+0xa9) [0x1084cd9]
/usr/local/mysql/bin/mysqld(MaterializeIterator::MaterializeQueryBlock(MaterializeIterator::QueryBlock const&, unsigned long long*)+0x17c) [0x10898bc]
/usr/local/mysql/bin/mysqld(MaterializeIterator::Init()+0x1e1) [0x108a021]
/usr/local/mysql/bin/mysqld(SELECT_LEX_UNIT::ExecuteIteratorQuery(THD*)+0x251) [0xf5d241]
/usr/local/mysql/bin/mysqld(SELECT_LEX_UNIT::execute(THD*)+0xf9) [0xf5f3f9]
/usr/local/mysql/bin/mysqld(Sql_cmd_dml::execute_inner(THD*)+0x20b) [0xeedf8b]
/usr/local/mysql/bin/mysqld(Sql_cmd_dml::execute(THD*)+0x3e8) [0xef7418]
/usr/local/mysql/bin/mysqld(mysql_execute_command(THD*, bool)+0x39c9) [0xeab3a9]
/usr/local/mysql/bin/mysqld(mysql_parse(THD*, Parser_state*)+0x31c) [0xead0cc]
/usr/local/mysql/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x156b) [0xeaeb6b]
/usr/local/mysql/bin/mysqld(do_command(THD*)+0x174) [0xeb0104]
/usr/local/mysql/bin/mysqld() [0xfc1a08]
/usr/local/mysql/bin/mysqld() [0x23ffdec]
/lib64/libpthread.so.0(+0x7ea5) [0x7f5c28c7dea5]
/lib64/libc.so.6(clone+0x6d) [0x7f5c26db9b0d]
  
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f55ac0ca298): SELECT DISTINCT T.CUST_NO FROM testDB.TABLE_TRANSACTION T  WHERE EXISTS (SELECT 1  FROM testDB.Table1 T1 WHERE T.CUST_NO = T1.CUST_NO ) AND T.AGENT_CERT_NO IS NOT NULL
Connection ID (thread ID): 65
Status: NOT_KILLED

Several obvious information can be found from the output of the above error log:

1. The SQL statement that caused the crash is: SELECT DISTINCT T.CUST_NO FROM testDB.TABLE_TRANSACTION T WHERE EXISTS (SELECT 1 FROM testDB.Table1 T1 WHERE T.CUST_NO = T1.CUST_NO ) AND T.AGENT_CERT_NO IS NOT NULL

2. The signal sent by the database is signal 11, which means that MySQL has accessed a wrong memory address.

Analysis process

1. View OS logs and system resource usage:

The output of the OS log has no effect on the troubleshooting direction, and there is no MySQL OOM phenomenon.

View the monitoring without any abnormal output during the MySQL crash period, and you can execute select in the environment to trigger the database crash at any time.

2. Obtain complete SQL and table structure information from the business side.

 # 完整的SQL语句:
SELECT 'testPA' AS INDIC_KEY, A.CUST_NO AS OBJ_KEY,
  CASE WHEN B.CUST_NO IS NULL THEN 1 ELSE  END AS INDICVAL1,'2222-06-06' AS GRADING_DATE
FROM testDB.Table1 A
LEFT JOIN (
  SELECT DISTINCT T.CUST_NO
  FROM testDB.TABLE_TRANSACTION T
  WHERE
    EXISTS (SELECT 1 FROM testDB.Table1 T1 WHERE T.CUST_NO = T1.CUST_NO)
  AND T.AGENT_CERT_NO IS NOT NULL
) B ON A.CUST_NO = B.CUST_NO;
  
# 表结构
CREATE TABLE `TABLE_TRANSACTION` (
  `cert_key` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `cust_no` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
 ···
CREATE TABLE `Table1` (
  `CUST_NO` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
 ···

3. View the execution plan of the select statement

4. Stack Analysis

Through the stack, you can see that the optimizer converts the EXISTS subquery into the semi-join operation. Since the optimizer chooses the DuplicateWeedout execution strategy by default, the outer query will be implemented by creating a temporary table. The record is deduplicated.

The execution process can be verified by the execution plan: the Extra column of the execution plan will display the Start temporary prompt, and the driven table will display the End temporary prompt.

5. Stack problem point output:

(/usr/local/mysql/bin/mysqld(actual_key_parts(KEY const*)+0xa) [0xef55ca]

The stack crashed at the memory address 0xef55ca , which can be analyzed by gdb to get the corresponding code bits:

#4 0x0000000000ef55ce in actual_key_parts (key_info=0x7fd5241641b0) at ../../mysql-8.0.19/sql/sql_class.h:1487

sql_class.h:1487 source address is:

Through inline know that optimizer_switch_flag function is the inline call of actual_key_parts d8c2cbaf196c3da6d9dfa7f566245108---, find the location of actual_key_parts function: ---b358894061b1d7fbf248---

6. Debug with gdb:

6.1. Use gdb's frame to push down the 4-layer stack to actual_key_parts function:

6.2. Print actual_key_parts the memory address corresponding to the pointer returned by the function

6.3. It is found that when using in_use , the return value is empty, and 0x0 appears, indicating that the table memory address is wrong.

6.4. Since the return of in_use is empty, when calling the code behind in_use optimizer_switch_flag , an illegal address will appear, resulting in a database crash.

get conclusion

After analysis, it can be determined that the problem is caused by the disorder of the memory address of the MySQL function. After polling the high-level code, it is found that the 8.0.24 version has been revised to the above code:

The following is the relevant description of the bug: https://github.com/mysql/mysql-server/commit/7fde9072e1f62b1b3cf857757a3be41cec5c8e48

solution

In the above investigation and analysis, we got this bug because the use of the semi-join DuplicateWeedout execution strategy of ---72539919375728c134e9cd77564aea46--- led to the problem. Try to avoid this problem from happening.

On the one hand, the business side must avoid the execution of the SQL. From the perspective of the DBA, the consideration is how the SQL can be executed normally. After verification:

The following three solutions can solve the database crash caused by the current select query.

1. Set a reasonable and unified character set (utf8mb4) and collation rules for business tables to avoid the use of exist DuplicateWeedout strategy in semi-joins to speed up SQL execution efficiency;

2. Turn off the database-level DuplicateWeedout optimization strategy:

SET [GLOBAL|SESSION] optimizer_switch='duplicateweedout=off';

3. Upgrade MySQL version to 8.0.24;

Other solutions

1. Retrieve the relevant stack code directly on Google and other platforms, find bugs similar to MySQL, and then verify the relevant SQL in the repaired version, confirm that the bug has been fixed in the corresponding one, and complete the troubleshooting.

2. Open coredump in the database to complete the auxiliary verification of the stack.

Special thanks: Mr. Huang Yan, CTO of Aikesheng


爱可生开源社区
426 声望211 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。