Technology Sharing | What do I need to do for MySQL database inspection?

Author: Chen Juncong
Member of the database team of China Mobile Information Basic Platform Department, mainly responsible for the maintenance of MySQL, TiDB, Redis, clickhouse and other open source databases.
Source of this article: original contribution
*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.

I have worked with MySQL for 7 years, 6 of them as a full-time MySQL DBA. Over those 6 years I have worked at three companies and carried out many database inspections. At first, I downloaded an inspection template from the Internet just to get the job done. After muddling through for a while, I moved on to the company's dedicated database inspection template. As my experience grew, I gradually formed my own routine, or methodology. I am writing this article to summarize my experience and ideas, and to show that even for something as small as an inspection, you can distill a best practice if you want to.

What is the point of a best practice? Not everyone is familiar with MySQL. Writing best practices down as documents effectively prevents mistakes and minimizes the drop in inspection quality caused by staff turnover.

I think inspections fall into several categories, each with a different focus. The following is my classification.

Divided by inspection method

In terms of method, inspections are divided into manual inspections, scripted inspections, and platform-based inspections.

Manual inspection. This works when the number of database servers under management is small and the inspection frequency is low. You log in to the server, run commands such as df -h and mysql -u to check the state of the server and the database, and analyze the monitoring data (zabbix, Prometheus, etc.). Besides being inefficient, this method has another disadvantage: it depends heavily on the skill level of the inspector. Different engineers may reach different conclusions from the same inspection.

Scripted inspection. At this stage, the inspection commands are packaged into a script. Engineers log in to each server and run the script one by one. If the company allows it, batch operation and maintenance tools such as ansible can be used to run the script on many servers at once. During the inspection, the script generates data as an html report or in csv format for you to analyze and summarize. In our experience, with this method one person can easily inspect 2000 instances and complete the inspection reports in one day. Compared with manual inspection, this is a qualitative leap in efficiency. It still has shortcomings: it still consumes manpower, and it still partly relies on the skill level of the inspector to analyze the reports and summarize them.

Platform-based inspection. Our database management platform not only addresses the three core demands of tenants (high availability management, backup and recovery management, monitoring and alarm management), but also provides platform-based inspection. It triggers inspections on a schedule, generates inspection reports, and scores each instance with a health score model that incorporates our operation and maintenance experience. Instances that score below 60 points need immediate attention: the platform automatically notifies the database administrator and automatically analyzes the existing problems. With it, database operations staff log in to the production environment far less often, and inspections no longer depend on the skill level of the inspector, so the inspection is fully standardized.

Divided by time dimension

Divided by the time dimension, database inspections are divided into daily inspections and pre-holiday inspections.

Daily inspections are inspections that are done every day. The easiest way to do this inspection is to look at monitoring, and mainly pay attention to alarms of warning level or above. Monitoring can basically cover more than 95% of daily inspection needs.

Pre-holiday inspections are inspections before special days, and they go deeper than daily inspections. In a narrow sense, they are the inspections before national statutory holidays (such as Labor Day, Dragon Boat Festival, Qingming Festival, Mid-Autumn Festival, National Day, Chinese New Year, etc.). In a broad sense, they also include safeguard-level inspections before major events such as Double Eleven. For my database operations team, the main concern is that some businesses may see traffic peaks during the holiday while staffing is reduced, so we need to inspect in advance, find upcoming problems, and resolve them ahead of time. For operations staff the main concern is database availability, so the core items of the pre-holiday inspection are as follows:

System level
- CPU
- RAM
- disk space
Application level (MySQL instance)
- instance status
- high availability status
- replication status
- monitoring status
- VIP status

Let me explain each item.

CPU: use monitoring to find instances with high CPU usage, and compare each one with its usual monitoring curve to determine whether anything is abnormal and whether optimization is needed.

RAM: use monitoring to find instances whose memory usage exceeds 80%, and check whether memory is insufficient; expanding capacity in advance prevents OOM. Also check whether any instance is using swap. If memory is sufficient and swap is still used, it is very likely that NUMA or vm.swappiness is not configured correctly.

Disk space: this is the key point of the pre-holiday inspection. The alarm threshold for disk usage is usually 80%. During the inspection, we should list the instances with disk usage above 70% and expand them in advance, so that the disks do not hit the alarm threshold during the 7-day Spring Festival holiday. Nobody is happy about working overtime after going home for the New Year.
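The system-level checks above can be sketched as a small filter over monitoring data. This is only an illustration: the HostMetrics structure and its field names are hypothetical, and in practice the numbers would come from your monitoring system. The 80% RAM and 70% disk thresholds are the ones mentioned in the text.

```python
from dataclasses import dataclass

# Thresholds from the text: RAM above 80%, disk above 70% (alarm line is 80%).
RAM_THRESHOLD = 0.80
DISK_THRESHOLD = 0.70


@dataclass
class HostMetrics:
    """Hypothetical per-host snapshot pulled from monitoring."""
    host: str
    ram_used_ratio: float
    disk_used_ratio: float
    swap_used_bytes: int


def pre_holiday_findings(metrics):
    """Return (host, message) pairs for hosts that need attention."""
    findings = []
    for m in metrics:
        if m.ram_used_ratio > RAM_THRESHOLD:
            findings.append((m.host, "RAM usage above 80%: check for OOM risk"))
        # Swap in use while memory is not under pressure hints at a
        # NUMA or vm.swappiness misconfiguration.
        if m.swap_used_bytes > 0 and m.ram_used_ratio <= RAM_THRESHOLD:
            findings.append((m.host, "swap used despite free memory: check NUMA and vm.swappiness"))
        if m.disk_used_ratio > DISK_THRESHOLD:
            findings.append((m.host, "disk usage above 70%: expand before the holiday"))
    return findings
```

A host that is below every threshold and uses no swap produces no findings, so the output is already the to-do list for the holiday.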

Instance status: generally this means checking that mysqld is alive and, where possible, analyzing whether it is healthy. (How to judge whether it is healthy is not covered here.)

High availability status: the inspection should prove that the database is in a "switchable" state. For example, with an MHA architecture, check the results of the following three scripts.

masterha_check_ssh --conf=/etc/masterha/app.cnf
masterha_check_repl --conf=/etc/masterha/app.cnf
masterha_check_status --conf=/etc/masterha/app.cnf

Replication status: the high availability check above generally covers replication status as well. But some replication links are outside the scope of high availability, so they need to be mentioned separately. Check whether asynchronous replication, semi-synchronous replication, delayed replication, bidirectional replication, and cascading replication are all healthy, whether the DTS replication used for disaster recovery is healthy, whether the high availability of DTS itself is healthy, and so on.

VIP status: some of our instances have dual redundant network links, and these instances have two VIPs. If the VIP on the redundant link is lost, the business will not notice. Therefore, we need an inspection mechanism for it, for example a detection task that checks the VIPs regularly and raises an alarm.

Divided by inspection depth

By depth, inspections are divided into ordinary inspections and in-depth inspections. Although the pre-holiday inspections described above go deeper than daily inspections, they are still ordinary inspections and do not count as in-depth inspections. So what kind of inspection is an in-depth inspection?

In the pre-holiday inspection, our focus is on operations and database availability. An in-depth inspection therefore has to extend outward and also focus on user experience and performance. That is what makes it an in-depth inspection.

So what problem does in-depth inspection solve? The purpose of in-depth inspection is to supplement the daily inspection and pre-holiday inspection, so that the database will not only be available in the future, but also be reliable and run faster.

In my opinion, in-depth inspection = availability inspection + reliability inspection + performance inspection + analysis and recommendations

Availability inspection

The pre-holiday inspection above already checks a lot about database availability, but from the operations perspective, at the service and instance level. From the application and business perspective, this availability check can be extended. For example, in the in-depth inspection we check the auto-increment key usage of every tenant table. The common auto-increment key types are int signed and int unsigned. The former is a signed int with the range (-2147483648, 2147483647); the latter is an unsigned int with the range (0, 4294967295). When creating tables, developers usually just write int without specifying signed or unsigned, in which case the default is signed. It is better to use int unsigned, which doubles the range available to the auto-increment key. In addition, auto-increment values are not necessarily continuous. This depends on how you insert and on parameter settings (innodb_autoinc_lock_mode, auto_increment_increment). The auto-increment value is allocated before the insert, so if the insert fails and the transaction rolls back, the allocated id is wasted. So although the upper limit of int signed looks high at about 2.1 billion, for the reasons above the auto-increment key may well be exhausted when the table holds only 1 billion rows. Once the key is exhausted, the table can no longer be written to, which means it is unavailable at the business level.
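The auto-increment usage check can be sketched as a ratio between the next auto-increment value and the maximum of the column type. This is a minimal sketch, not our platform's implementation. It assumes the two inputs have already been read from information_schema (COLUMN_TYPE from COLUMNS and AUTO_INCREMENT from TABLES); the function name is made up for illustration.

```python
# Maximum values of MySQL integer types, keyed by (type, is_unsigned).
INT_MAX = {
    ("tinyint", False): 127, ("tinyint", True): 255,
    ("smallint", False): 32767, ("smallint", True): 65535,
    ("mediumint", False): 8388607, ("mediumint", True): 16777215,
    ("int", False): 2147483647, ("int", True): 4294967295,
    ("bigint", False): 9223372036854775807,
    ("bigint", True): 18446744073709551615,
}


def auto_increment_usage(column_type, next_auto_increment):
    """Return the fraction of the key space already consumed.

    column_type is a COLUMN_TYPE string such as 'int(11) unsigned';
    next_auto_increment is the AUTO_INCREMENT value of the table.
    """
    text = column_type.lower()
    base = text.split("(")[0].split()[0]   # 'int(11) unsigned' -> 'int'
    unsigned = "unsigned" in text
    return (next_auto_increment - 1) / INT_MAX[(base, unsigned)]
```

The same counter value consumes about half of a signed int but only about a quarter of an unsigned int, which is the doubling described above. An inspection report could list every table whose ratio exceeds a threshold of your choosing.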

We ran into exactly this in production. One business analyzes transaction bill data for a mall. Its bill log table typically receives 5 million records per day, and more than 9 million at peak times. The table used an int signed auto-increment id as the primary key, and the key was used up less than 9 months after the business went online. The fix was to change the key type from int signed to bigint unsigned. Changing the type of a primary key column in MySQL locks the table so that it can be read but not written, so the business was impacted at the time, and the DDL took 6 hours.
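A rough piece of arithmetic shows why this timeline is plausible. The sketch below assumes a key whose usable maximum is 2147483647 (a signed int) and a constant daily id consumption, which is a simplification; the function name is hypothetical.

```python
INT_SIGNED_MAX = 2_147_483_647


def days_until_exhausted(max_value, current_value, ids_per_day):
    """Whole days left before the auto-increment key runs out,
    assuming a constant number of ids consumed per day."""
    return (max_value - current_value) // ids_per_day


# At the typical rate of 5 million ids per day, starting from zero.
typical = days_until_exhausted(INT_SIGNED_MAX, 0, 5_000_000)  # 429 days
# At the peak rate of 9 million ids per day.
peak = days_until_exhausted(INT_SIGNED_MAX, 0, 9_000_000)     # 238 days
```

So the key lasts somewhere between roughly 8 and 14 months depending on the mix of typical and peak days, and ids wasted by rollbacks bring the date forward further.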

Therefore, in-depth inspections need to expand the availability inspections for these situations. Readers can supplement more availability inspections.

Reliability inspection

Before talking about performance inspection, I want to add reliability inspection. The pre-holiday inspection above contains many availability checks, but is availability the same as reliability? Many people confuse the two; they are not the same. Availability is generally the problem that high availability solves, while reliability generally means that the data in the database is intact, nothing is lost, and the copies of the data are consistent.

The pre-holiday inspection has included many database reliability checks, such as the "switchable" check in the high availability check and the replication status check. But this is not foolproof. Here I propose that in-depth inspection needs to do "core parameter inspection".
The "core parameter inspection" here covers three aspects:

- Check whether the parameters in the database match the core parameter list required by our handover specification
- Check whether the parameters of the primary and standby databases are consistent
- Check whether the running parameters of the database are consistent with the configuration file (my.cnf)

Check whether the parameters in the database match the core parameter list required by our handover specification. This is really a legacy problem: most of our databases were not deployed and delivered by us, but were handed over to us by various business departments. For these newly handed-over instances, be sure to check the core parameters to ensure that no data is lost and that the master and slave data are consistent. The relevant parameters include but are not limited to the following:

binlog_format = row
binlog_row_image = full
gtid_mode = on
enforce-gtid-consistency = on
innodb_doublewrite = on
innodb_flush_log_at_trx_commit = 1
log_bin = mysql-bin
master_info_repository = table
sync_binlog = 1
...

In practice, we check as many as 80 parameters.
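The first aspect can be sketched as a comparison between the running variables and an expected list. This is only an illustration using the parameters shown above, not our full list of 80. The running dictionary is assumed to come from SHOW GLOBAL VARIABLES; log_bin is expected as ON here because the running variable reports ON or OFF rather than the file base name. The function names are hypothetical.

```python
# Expected values for the core parameters listed in the text.
CORE_PARAMETERS = {
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
    "gtid_mode": "ON",
    "enforce_gtid_consistency": "ON",
    "innodb_doublewrite": "ON",
    "innodb_flush_log_at_trx_commit": "1",
    "log_bin": "ON",
    "master_info_repository": "TABLE",
    "sync_binlog": "1",
}


def normalize_name(name):
    """Treat enforce-gtid-consistency and enforce_gtid_consistency alike."""
    return name.strip().lower().replace("-", "_")


def check_core_parameters(running, expected=CORE_PARAMETERS):
    """Return {name: (expected, actual)} for every violated parameter.

    A parameter missing from `running` is reported with actual=None.
    """
    actual = {normalize_name(k): str(v).strip().upper() for k, v in running.items()}
    violations = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != want.upper():
            violations[name] = (want, got)
    return violations
```

An empty result means the instance meets the list; anything else goes straight into the inspection report.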

Check whether the parameters of the primary and standby databases are consistent. This is mainly to avoid behavior changing after a primary/standby switchover. Parameters that must differ between the two, such as server_id, are also checked here. In short, there is only one purpose: to check the parameters of the primary and the standby and make sure they are as they should be.
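A minimal sketch of this comparison, assuming both sets of variables have been collected into dictionaries. The MUST_DIFFER set is an assumption for illustration (server_id comes from the text; server_uuid is added because it is also unique per instance), and you would adjust it to your own environment.

```python
# Parameters that are expected to differ between primary and standby.
MUST_DIFFER = {"server_id", "server_uuid"}


def compare_primary_standby(primary, standby, must_differ=MUST_DIFFER):
    """Return a list of human-readable problems found between two instances."""
    problems = []
    for name in sorted(set(primary) | set(standby)):
        p, s = primary.get(name), standby.get(name)
        if name in must_differ:
            # Identical values here are the error.
            if p == s:
                problems.append(f"{name} must differ but both are {p}")
        elif p != s:
            problems.append(f"{name} differs: primary={p} standby={s}")
    return problems
```

Parameters that legitimately differ in your setup can be added to must_differ or filtered out before the call.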

Check whether the running parameters of the database are consistent with the configuration file (my.cnf). Many people assume that the persisted configuration file always matches the running parameters, so there is no need to check. This is wrong. In MySQL 5.7 and earlier, there is no way to change a parameter and persist it to the configuration file in one step, so changing a parameter is usually done in two steps: first run set global parameter = value in the database, and then log in to the server and edit my.cnf. Because this is not an atomic operation, operations staff may make mistakes. Don't trust people; people always make mistakes. We have had several failures caused by running parameters that did not match the persisted configuration file. In one case, MySQL's innodb_buffer_pool_size was dynamically changed to 128G, and nobody remembered to persist the change to the configuration file. Later the database crashed and the mysqld instance was restarted by the high-availability component, after which performance was very poor. After a long investigation, it turned out that innodb_buffer_pool_size had fallen back to the default value of 128M!

There is another case. In the MySQL 5.6 era, when hardware performance was poor and there was no good parallel replication, the slave was prone to replication delay caused by an IO bottleneck. The temporary workaround was to set sync_binlog = 0 and innodb_flush_log_at_trx_commit = 0 so the slave could catch up, and to change them back to "double 1" once the delay was gone. It is easy for the DBA to forget to change them back to "double 1". If an instance-level failure then causes a master-slave switchover, there is a risk of data loss.

In addition, some tenants have the SUPER privilege and can modify database parameters, but they do not have access to the server. If the parameters they modify are among what we consider core parameters, the running values and the configuration file (my.cnf) values will diverge. This plants a landmine that can lead to reliability problems, and even availability problems.
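The cases above all come down to detecting drift between my.cnf and the running instance. Here is a minimal sketch under stated assumptions: the my.cnf text has already been read from the server, the running values come from SHOW GLOBAL VARIABLES, only the [mysqld] section and plain key = value lines are handled, and include directives and inline comments are ignored. All function names are hypothetical.

```python
import re

# Size suffixes accepted in option files, converted to bytes for comparison.
SIZE_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def normalize_value(value):
    """Make '128G' comparable with the byte count reported by the server."""
    text = str(value).strip().strip("\"'")
    match = re.fullmatch(r"(\d+)([KMG])", text, flags=re.IGNORECASE)
    if match:
        return str(int(match.group(1)) * SIZE_UNITS[match.group(2).upper()])
    return text.upper()


def parse_mysqld_section(cnf_text):
    """Extract key = value options from the [mysqld] section."""
    options, in_mysqld = {}, False
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "!")):
            continue
        if line.startswith("["):
            in_mysqld = line.lower() == "[mysqld]"
            continue
        if in_mysqld and "=" in line:
            key, value = line.split("=", 1)
            options[key.strip().lower().replace("-", "_")] = normalize_value(value)
    return options


def config_drift(cnf_text, running):
    """Return {name: (persisted, running)} for parameters that do not match.

    A parameter absent from my.cnf is reported with persisted=None, which
    covers the 'forgot to persist' case: after a restart it would fall
    back to the default.
    """
    persisted = parse_mysqld_section(cnf_text)
    drift = {}
    for name, value in running.items():
        key = name.lower().replace("-", "_")
        live = normalize_value(value)
        if persisted.get(key) != live:
            drift[key] = (persisted.get(key), live)
    return drift
```

Pass only the core parameters in running, otherwise every variable that is simply left at its default will be reported as missing from my.cnf.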

"Core parameter inspection" is an example of reliability inspection. Readers can add more reliability inspections.

Performance inspection

In the performance inspection, there are many small items. I will introduce some common ones here.

1. Are there tables without a primary key?

The MySQL way is for every table to have a primary key, preferably an int unsigned auto-increment primary key that has nothing to do with the business. For the reasons, please see the "Development Specification". As for how it affects performance, there are plenty of articles on the Internet, so I will not go into detail here.

2. SQL performance optimization

First, the inspection report can list the top 10 slow queries so that tenants can optimize their SQL.
Secondly, the report can capture the top 30 SQL statements that perform full table scans and provide them to tenants, because the execution plans of these statements are problematic. They may run fast today, but the growth of the data has not been taken into account. As the table grows and passes some threshold, the problem may erupt online as 100% CPU usage and turn into a killer slow query.

Thirdly, for indexes, pay attention to redundant indexes, unused indexes, and index selectivity. A redundant index is one that duplicates another index in the database. A composite index idx_a_b_c on columns a, b, and c effectively serves as three indexes at once: (a, b, c), (a, b), and (a). If you then create idx_a or idx_a_b, they are redundant and should be dropped. Indexes occupy storage space, and more indexes is not better: maintaining them is expensive and slows down DML statements, so indexes that are not needed should be dropped. Similarly, unused indexes are indexes that have never been used. The inspection report should list them, and developers should evaluate whether they will be needed in the future. If not, they should be dropped. Index selectivity measures whether the values of a column are sufficiently spread out. The more distinct values, the more suitable the column is for indexing. A gender column with only two values, male and female, is not suitable for indexing. The closer the selectivity is to 1, the better; below 0.1 it is poor. In that case the developer should re-evaluate the fields used in the SQL statement and build the index on several fields with high selectivity.
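The leftmost-prefix rule and the selectivity ratio can be sketched as follows. This is an illustration only: the input is assumed to be a mapping from index name to its ordered column list for one table, the function names are hypothetical, and unique indexes are not treated specially even though a unique index enforces a constraint and may need to stay.

```python
def redundant_indexes(indexes):
    """Return (redundant_index, covering_index) pairs for one table.

    An index is redundant when its columns are a leftmost prefix of another
    index. For exact duplicates, only the later name is reported.
    """
    redundant = []
    for name, cols in indexes.items():
        cols = tuple(cols)
        for other, other_cols in indexes.items():
            if name == other:
                continue
            other_cols = tuple(other_cols)
            is_prefix = other_cols[:len(cols)] == cols
            if is_prefix and (len(cols) < len(other_cols) or name > other):
                redundant.append((name, other))
                break
    return redundant


def selectivity(distinct_values, total_rows):
    """Distinct values divided by row count; near 1 is good, under 0.1 is poor."""
    return distinct_values / total_rows if total_rows else 0.0
```

With idx_a_b_c, idx_a, and idx_a_b on the same table, the function reports idx_a and idx_a_b as covered by idx_a_b_c, while an index on (c) alone is kept because c is not a leftmost prefix. A gender column with 2 distinct values in a million rows has a selectivity far below 0.1.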

3. Is there a MyISAM storage engine table?

MyISAM is basically obsolete. I previously wrote an article on this topic, "Interaction Specification Explaining Series - Why We Ban the Use of MyISAM Storage Engine". In fact, we should not even need to check for MyISAM tables: in 99% of scenarios the engine should simply be disabled. I suggest preventing the creation of MyISAM tables through a parameter, as I do. There are also many other storage engines that I recommend against. InnoDB is the forever king. The reference parameter is as follows:
disabled_storage_engines=ARCHIVE,BLACKHOLE,EXAMPLE,FEDERATED,MEMORY,MERGE,NDB,MyISAM

Because MySQL 5.7 still has 10 metadata tables that use the MyISAM storage engine, this parameter may interfere with database upgrades. If it is used on version 5.7, operations staff should remove it before performing an upgrade change and add it back afterwards; see https://mp.weixin.qq.com/s/O9UtGskB3IydkEEMscEo1Q . Version 8.0 does not have this problem.

Our goal is to convert the MyISAM tables found by the inspection as far as possible, and then add the parameter above.

4. Top 10 largest tables

Large tables are expensive to scan in full and even more disastrous when it comes to DDL. Tables larger than 100G should be evaluated. Why is the table so large? Why is it on MySQL, and could it be moved to TiDB? Can it be split into smaller tables, horizontally or vertically? Can it be archived, or have its hot and cold data separated?

Recommendation: The storage space of a single MySQL instance should be controlled within 500G, the number of rows in a single table should be controlled within 10 million rows, the size should be within 30G, the number of fields in a single table should be within 50, and the number of indexes in a single table should be within 5.

This is older advice and is for reference only. As hardware has improved, my current view is that a MySQL instance within 2T and a single table within 100G are acceptable. Of course, this is from the operations perspective. From the performance perspective, it mainly depends on what the business can accept.
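Whatever limits you settle on, checking them is mechanical. The sketch below uses the per-table numbers from the recommendation above (10 million rows, 30G, 50 fields, 5 indexes) as defaults; the table_stats keys are hypothetical names for values you would gather from information_schema.

```python
# Per-table limits from the recommendation; adjust to your own standards.
LIMITS = {"rows": 10_000_000, "size_gb": 30, "columns": 50, "indexes": 5}


def over_limits(table_stats, limits=LIMITS):
    """Return the subset of table_stats that exceeds its limit."""
    return {
        key: table_stats[key]
        for key, limit in limits.items()
        if table_stats.get(key, 0) > limit
    }
```

Passing a different limits dictionary, for example with size_gb set to 100, applies the more relaxed view without changing the code.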

Analysis and Recommendations

The purpose of performance inspection is to provide tenants with as much data as possible so that they can do performance analysis themselves. Some of this data is SQL-related and some is not. As for processing and analyzing the data, our report mainly provides written descriptions and suggestions for the non-SQL-related items. For the SQL-related items we are not the experts; there are many techniques involved, so this part is handed over to senior tenant developers to analyze and optimize.

The above is my personal summary of what needs to be done for the MySQL database inspection. Corrections are welcome.
