bug - Failure Analysis | MySQL: slave_compressed_protocol causes crash - 个人文章

Author: Hu Chengqing
DBA team member, good at failure analysis, performance optimization, personal blog: 161d2c92bc929a https://www.jianshu.com/u/a95ec11f67a8, welcome to discuss.
Source of this article: original submission
* Produced by the Aikesheng open source community, original content is not allowed to be used without authorization, please contact the editor and indicate the source for reprinting.

Phenomenon

MySQL version: 8.0.18

create.sql: zabbix initialization script, including table building and data insertion statements, 10M+ size

A new customer deployed a set of our company's database management platform and took over the incoming one-master and two-slave instances. One master and one slave were in the Wuxi computer room for semi-synchronous replication, and the other slave was in the Beijing computer room for asynchronous replication. When source create.sql on the main library, it will crash. But before taking over and entering the platform, there will be no crash.

Note: The management platform mentioned here will not affect the understanding of the entire article.

Investigation process

1. Reproduce in the test environment

In order to facilitate the investigation, it needs to be reproduced in a controlled environment:

The same my.cnf as the customer
Same MySQL version
Same replication architecture
Execute the same create.sql

The crash can indeed be reproduced stably, and the error log is as follows:

2020-04-28T17:51:47.441886+08:00 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we're sending the information to the error-log instead: MY-001158 - Got an error reading communication packets
09:51:47 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x46000
2020-04-28T17:51:47.447907+08:00 218 [ERROR] [MY-011161] [Server] Semi-sync master failed on net_flush() before waiting for slave reply.
/opt/mysql/base/8.0.18/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x1ed6cce]
/opt/mysql/base/8.0.18/bin/mysqld(handle_fatal_signal+0x323) [0xfb2d23]
/lib64/libpthread.so.0(+0xf5f0) [0x7f3d781a75f0]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

2. Eliminate the impact of the management platform

Crash occurs only after taking over to the management platform. The biggest operation of the management platform on the database comes from the highly available components:

Delay detection (write operation: write a time stamp every 500ms)
Status query (read operation)

So the next step is to disable high availability and delay detection for testing. The results are as follows:

Preliminary conclusion: After the delayed detection is turned off, it will not crash.

3. What is the impact of delayed detection, causing a crash?

During the test, a phenomenon associated with crash was found:

Do not disable the delay detection, it will crash, but the efficiency of executing SQL is higher (millisecond level):

mysql> source /tmp/insert.sql
Query OK, 1 row affected (0.21 sec)
Query OK, 1 row affected (0.50 sec)
Query OK, 1 row affected (0.03 sec)
Query OK, 1 row affected (0.47 sec)
Query OK, 1 row affected (0.51 sec)
Query OK, 1 row affected (0.02 sec)

And disable high availability detection, it will not crash, each SQL execution time is a little more than 1s:

mysql> source /tmp/insert.sql                            
Query OK, 1 row affected (0.01 sec)
Query OK, 1 row affected (1.01 sec)
Query OK, 1 row affected (1.01 sec)
Query OK, 1 row affected (1.00 sec)
Query OK, 1 row affected (1.01 sec)
Query OK, 1 row affected (1.01 sec)

This seems to be caused by the group submission mechanism, but the group submission parameters are not configured:

mysql> show global variables like '%group_commit%';
+-----------------------------------------+-------+
| Variable_name                           | Value |
+-----------------------------------------+-------+
| binlog_group_commit_sync_delay          | 0     |
| binlog_group_commit_sync_no_delay_count | 0     |
+-----------------------------------------+-------+
2 rows in set (0.01 sec)

This phenomenon will also disappear after turning off semi-synchronous replication. Guess: It is a problem triggered by the combination of semi-synchronization and group submission.

4. Longblob large object

In the previous test, every time the crash occurs, the binlog analysis and the last transaction have one thing in common: they are all inserting data into the same table:

### INSERT INTO `zabbix`.`images`
### SET
###   @1=108 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2=1 /* INT meta=0 nullable=0 is_null=0 */
###   @3='Rackmountable_3U_server_3D_(64)' /* VARSTRING(256) meta=256 nullable=0 is_null=0 */
###   @4=

Check the table structure and find that there is a large lonngblob object. The inserted image is stored in binary format:

CREATE TABLE `images` (
  `imageid` bigint(20) unsigned NOT NULL,
  `imagetype` int(11) NOT NULL DEFAULT '0',
  `name` varchar(64) NOT NULL DEFAULT '0',
  `image` longblob NOT NULL,
  PRIMARY KEY (`imageid`),
  UNIQUE KEY `images_1` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Intercept all the insert statements of the images table in create.sql, and then only execute these insert statements, the crash problem can also be reproduced:

sed -n '2031,2217p' create.sql > insert.sql

So the second condition of crash is: insert longblob large object

5. slave_compressed_protocol

The previous analysis has found 2 conditions that trigger a crash:

When inserting data, there is a longblob large object
Semi-synchronous replication, and when inserting longblob large objects is accompanied by other external write traffic

But in fact, the crash problem cannot be reproduced with the MySQL environment of the same version that comes with the standard installation of the database management platform. The difference is that my.cnf is different, so there must be some parameter as the trigger condition.

After continuous testing, modify a batch of parameters each time (note that it has been positioned to be related to semi-synchronous replication, so you must modify the parameters of the master and slave libraries at the same time) , keep narrowing the scope, and finally locate the slave_compressed_protocol setting slave_compressed_protocol =on's influence.

When the slave library slave_compressed_protocol=ON, it will also cause the slave library slave io thread to keep disconnecting from the main library, and continue to reconnect, and the error log of the slave library is reported as follows:

2020-04-29T10:34:42.361584+08:00 1998 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013)
2020-04-29T10:34:42.361668+08:00 1998 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

The main library error log reports the following error:

2020-04-29T10:23:29.480529+08:00 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we're sending the information to the error-log instead: MY-001158 - Got an error reading communication packets
2020-04-29T10:23:30.330242+08:00 1950 [ERROR] [MY-011161] [Server] Semi-sync master failed on net_flush() before waiting for slave reply.

Correspondingly, because the slave io thread of the slave library is constantly reconnecting, it can be observed that the binlog dump thread of the main library will restart continuously, and sometimes two can be observed:

show processlist;select sleep (1);show processlist;
....
| 2131 | admin | 172.16.21.3:37926 | NULL | Binlog Dump GTID | 3 | Waiting to finalize termination | NULL |
| 2132 | admin | 172.16.21.3:37932 | NULL | Binlog Dump GTID | 1 | Sending binlog event to slave | NULL |
....
+-----------+
| sleep (1) |
+-----------+
| 0         |
+-----------+
...
| 2132 | admin | 172.16.21.3:37932 | NULL | Binlog Dump GTID | 2 | Sending binlog event to slave | NULL |
...

in conclusion

There are 3 trigger conditions for this crash:

Insert longblob large object;
Semi-synchronous replication, and other external write traffic is accompanied by insert longblob large objects;
slave_compressed_protocol=on .

Why didn't a crash occur before taking over to the platform?

Because this library is not online yet, there is no other write traffic when create.sql is executed (equivalent to the effect of turning off delay detection).

solution

Just set slave_compressed_protocol=OFF. In addition, the official document also states that this parameter should not be configured. This parameter will be deleted in subsequent versions:
As of MySQL 8.0.18, this system variable is deprecated. It will be removed in a future MySQL version.

Follow-up

Our company submitted a bug to the official after R&D: https://bugs.mysql.com/bug.php?id=99607.

Because of some security issues (more specifically, I can’t disclose it), this bug was officially set as a private bug. The screenshot is as follows, but as of today (2021.12.28), the official has not fixed it:

Failure Analysis | MySQL: slave_compressed_protocol causes crash