The Practice of TiDB Slow Logs at Banyu
Author profile: Liu Jiang, TUG 2020 MOA, responsible for Banyu's database operation and maintenance, big data operation and maintenance, and database platform construction.
This article comes from the second session of the "Energy Titanium" event, where Liu Jiang shared the practice of TiDB slow logs at Banyu. It covers three aspects:
- The first part covers the background and requirements: why Banyu needed a slow log system, and what that system should look like given this background;
- The second part explains how the slow log system is built;
- The third part walks through several online cases to show how the slow log system helps locate production problems.
Background and requirements
In the first half of this year, a cloud vendor released a new generation of its DAS (Database Autonomous Service), and the launch pointed out that more than 90% of database problems come from abnormal requests to the database. This matches our day-to-day experience: most database problems are caused by abnormal SQL requests, while failures caused by database bugs or machine issues are relatively rare. For locating abnormal SQL, analyzing the database's slow logs is a particularly effective method.
So what are the pain points when we use slow logs to analyze problems? Before the slow log system was built, a cluster's slow logs were scattered across multiple machines. Whenever the database had a problem, we had to log in to each machine to analyze the logs, which made troubleshooting very inefficient. When the cluster is large, it is basically impossible to locate the problem quickly.
Of course, TiDB 4.0 supports Dashboard, and we can view slow log information for the whole cluster through it, for example the slow logs of the last 15 or 30 minutes. But when the system really runs into trouble, a huge number of slow logs are produced, and Dashboard comes under pressure just computing and loading them. At the same time, Dashboard does not support retrieval or statistical analysis, which is not conducive to quickly locating the abnormal SQL.
TiDB's system schema also provides a table (INFORMATION_SCHEMA.SLOW_QUERY) that exposes the slow logs in real time, and we can use it to locate abnormal SQL as well. However, it is a relational table without indexes: when the log volume is particularly large, multi-dimensional retrieval and analysis on it is very slow, and relational tables are not well suited to that kind of multi-dimensional analysis in the first place.
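For reference, an ad-hoc query against this built-in table looks roughly like the sketch below; the column names follow the SLOW_QUERY schema, while the 30-minute window and row limit are arbitrary choices for illustration.

```sql
-- Look up the heaviest recent statements on the current TiDB instance.
-- SLOW_QUERY is backed by the local slow log file, so it is per-instance
-- and becomes slow to scan once the log volume is large.
SELECT Time, DB, Query_time, Total_keys, Process_keys, Query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Time > NOW() - INTERVAL 30 MINUTE
ORDER BY Query_time DESC
LIMIT 20;
```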
Based on the above pain points, Banyu's slow log system needs to meet the following requirements:
First, centralized collection of slow logs: gather the slow logs of all online clusters, even dozens of them, into one place for centralized analysis, so that there is a single entry point.
Second, the collected slow logs must be near real-time. If collection lags too far behind, it is of no help in handling and analyzing online problems.
Then, the slow logs must be searchable and support statistical analysis. When a problem occurs there are far too many slow logs to read; if we can retrieve and aggregate them at that moment, we can locate the abnormal SQL quickly.
Finally, the slow log system needs to support monitoring and alerting.
System details
Based on the above background and requirements, let's take a look at how Banyu's slow log system is implemented.
System architecture
The overall architecture of the slow log system is shown in the figure below. A Filebeat component is deployed when each TiDB Server machine is initialized; it collects the slow logs, tags them with the machine's IP, and writes them to Kafka. Logstash then parses out the fields we care about and stores them in ES. ES itself is a search engine, so data analysis and statistics on it are very fast. On top of that, we use Kibana to view the slow log data in ES and to do visual statistics and retrieval.
Of course, there is an alternative architecture for a slow log system, shown in the figure below. ClickHouse is an analytical database that has become popular in recent years, and some companies send monitoring data to ClickHouse for real-time monitoring and alerting analysis. In that architecture, Flink replaces the Logstash component and writes the slow log data to ClickHouse after some simple computation.
Because Banyu's slow log system was built earlier, it uses the ELK architecture.
First, Filebeat is lightweight enough. We ran parsing tests on hundreds of megabytes of files online and concluded that it has essentially no impact on database performance.
Second, when a problem occurs online, the instantaneous log volume is particularly large. Writing slow logs directly to Logstash would put pressure on the Logstash machines, so Kafka sits in between to absorb the peak.
When Logstash parses slow logs, it should use as few fuzzy-matching rules as possible, because fuzzy matching drives the CPU usage of the parsing machines very high.
Then, the ES index is schema-free and built on an inverted index, which makes it very suitable for statistical analysis scenarios.
At the same time, Kibana is used for visual retrieval and statistical analysis.
Finally, we read the slow log data back out of ES to drive monitoring and alerting.
Log collection
Next, let's look at the details of each component. On the left of the figure below is the Filebeat collection configuration. When we deploy Filebeat, we pass in the IP of the machine, so that later statistics can tell which machine a slow log came from. On the right is the Kafka configuration: after the data is collected, it is sent to the Kafka cluster.
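As a minimal sketch of what such a filebeat.yml can look like (the log path, Kafka brokers, topic name and IP value are illustrative assumptions, not Banyu's real settings):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /data/deploy/log/tidb_slow_query.log   # assumed slow log location
    # A slow log entry spans several lines and starts with "# Time:";
    # merge the following lines into one event so Kafka receives one message per entry.
    multiline.pattern: '^# Time:'
    multiline.negate: true
    multiline.match: after
    fields:
      ip: "10.0.0.1"            # injected per machine at deploy time
    fields_under_root: true

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: "tidb_slow_log"
  compression: gzip
```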
The following is an example of a TiDB slow log; the format looks like this.
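The slide shows a real entry; a simplified, hand-written sample in the same format (all values are made up) looks roughly like this:

```
# Time: 2021-06-01T10:21:30.185+08:00
# Query_time: 1.91
# Process_time: 1.64 Request_count: 2 Total_keys: 1286112 Process_keys: 1286110
# DB: advertise
# Index_names: [ad_plan:idx_status]
# Is_internal: false
# Succ: true
select * from ad_plan where status = 1;
```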
A slow log entry starts with Time, followed by fields such as Query_time, Total_keys, Process_keys and DB, and ends with the SQL text. After Filebeat collects it, the whole entry becomes a single line of text, as shown in the figure below. If this line of text were stored in ES as-is, there would be no way to retrieve or statistically analyze it.
Field filtering
A raw line of text cannot be used for statistical analysis or multi-dimensional retrieval; to do that, we have to parse the fields we care about out of it. So which fields do we focus on? First look at the slow log of MySQL 5.7, shown in the figure below. When we handle MySQL failures, we first look at a statement's query time; if the query time is long, we treat it as a possible cause of the online problem.
But when the online request volume is large, a long query time does not necessarily point to the root cause; it has to be analyzed together with other keywords. For example, a particularly important keyword in the slow log is Rows_examined, which represents the number of rows the data logic scanned. Usually we can locate the problem SQL by analyzing Query_time and Rows_examined together.
Next, let's look at TiDB's slow log, starting with a TiDB 3.0.13 slow log that uses the KV interface, as shown in the figure below. It contains some important keywords, such as Query_time, DB, SQL and Prewrite_time, which are very helpful for locating online problems.
The following is a slow log of TiDB 3.0.13 in another format, which uses the DistSQL interface, as shown in the figure below.
In addition to Query_time and Total_keys, it also records Index_names, which indicates whether the statement used an index; the Index_names field also contains the table name and other information.
After the TiDB 3.0.13 slow logs, let's look at the slow log of TiDB 4.0.13. Compared with 3.0.13, it adds some fields, such as KV_total, PD_total and Backoff_total.
From the slow logs above we can see that they contain a lot of key information; we can even see exactly where inside the database a request was slow. If we collect the slow logs and are able to retrieve and even aggregate them by various conditions, it helps a great deal in finding problems.
At Banyu, the TiDB slow log fields we care about are mainly the following (a sketch of the resulting ES document follows the list):
- TiDB IP: when deploying Filebeat we pass in the machine's IP. With this IP we know which machine each log came from and can compute statistics per IP;
- DB: the database used when executing the statement. We can compute statistics per DB and, through our internal system, map the DB to a specific database cluster;
- TABLE: for some slow logs the table name can be parsed out, so statistics can also be computed per table;
- IDX_NAME: except for Insert statements and statements that use the KV interface, the slow log records which index, if any, the statement used;
- TOTAL_KEYS: the number of keys scanned by the Coprocessor;
- PROCESS_KEYS: the number of keys processed by the Coprocessor;
- QUERY_TIME: the execution time of the statement;
- SQL: the specific SQL statement.
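After parsing, each slow log entry becomes a structured document in ES. A hypothetical example of such a document is shown below; the field names and values are illustrative only.

```json
{
  "@timestamp": "2021-06-01T10:21:30.185+08:00",
  "ip": "10.0.0.1",
  "db": "advertise",
  "table": "ad_plan",
  "idx_name": "",
  "total_keys": 1286112,
  "process_keys": 1286110,
  "query_time": 1.91,
  "sql": "select * from ad_plan where status = 1"
}
```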
Field parsing
The required fields are parsed out of each slow log entry with Logstash's Grok syntax, as shown in the figure below.
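A minimal Logstash pipeline sketch in the same spirit is shown below; the patterns, field names and output settings are illustrative assumptions. Each pattern anchors on a literal keyword from the slow log rather than relying on fuzzy matching, in line with the earlier note about CPU cost.

```conf
filter {
  grok {
    # Each pattern matches one header line of the slow log entry.
    match => {
      "message" => [
        "# Query_time: %{NUMBER:query_time:float}",
        "# DB: %{WORD:db}",
        "Total_keys: %{NUMBER:total_keys:int}",
        "Process_keys: %{NUMBER:process_keys:int}",
        "# Index_names: \[%{DATA:idx_name}\]"
      ]
    }
    break_on_match => false   # apply every pattern, not only the first that matches
  }
}

output {
  elasticsearch {
    hosts => ["http://es1:9200"]
    index => "tidb-slow-log-%{+YYYY.MM.dd}"
  }
}
```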
Statistical Analysis
The following figure shows the slow logs of all our clusters over the last 30 minutes. Through Kibana we can see the total number of slow logs, retrieve them by DB, Query_time and Total_keys, and compute statistics along dimensions such as DB, Table and IP. This greatly improves the efficiency of locating problem SQL.
Besides visual retrieval and aggregation through Kibana, our internal platform also aggregates and sorts the slow logs along various dimensions. The aggregation dimensions include cluster, database, table, IP and operation type, and the results can be sorted by total time, average time, maximum time or number of log entries. We also send slow log reports to the R&D teams periodically.
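As a hedged illustration of what such an aggregation looks like at the ES level (the index name, field names and 30-minute window are assumptions), the query below, run against the slow-log index, ranks databases by total slow-query time and also returns the average and maximum time per database:

```json
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-30m" } } },
  "aggs": {
    "by_db": {
      "terms": { "field": "db", "size": 10, "order": { "total_time": "desc" } },
      "aggs": {
        "total_time": { "sum": { "field": "query_time" } },
        "avg_time":   { "avg": { "field": "query_time" } },
        "max_time":   { "max": { "field": "query_time" } }
      }
    }
  }
}
```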
Monitoring and alerting
In addition to visual retrieval and statistical analysis, we also monitor and alert on the slow logs, as shown in the figure below. We count the number of slow SQL entries per database, and each database gets two alerting rules. For the advertising database, for example, the general rule is: execution time exceeds 200 milliseconds, and an alert fires once the count reaches 5.
However, we found cases where a statement runs very rarely but takes extremely long, which the general rule cannot cover. So a second rule was added: execution time exceeds 500 milliseconds, and an alert fires once the count reaches 2. With these two rules, online slow SQL is basically fully covered.
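As a hedged sketch, independent of whichever alerting tool evaluates the rules, the first rule can be reduced to an ES count query like the one below; the index name, field names, database name and one-minute evaluation window are all assumptions. Sent to the _count API of the slow-log index, the returned count is then compared against the threshold of 5.

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "db": "advertise" } },
        { "range": { "query_time": { "gte": 0.2 } } },
        { "range": { "@timestamp": { "gte": "now-1m" } } }
      ]
    }
  }
}
```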
An example of the alert message is shown in the figure below. Looking at the picture on the left: because we collect the DB field of the slow log and correlate the DB with its database cluster through our internal system, the alert shows which cluster and which database the slow logs occurred in, when they were generated, how many slow logs exceeded the threshold, and the total and average SQL execution time. From such an alert we can judge within a short time how big an impact the problem SQL is having on production.
Case Studies
Having covered how the slow log system is built, let's look at how we use it to find online problems.
The first case is shown in the figure below. One day we found that the cluster's Coprocessor CPU went up, and the latency on the right went up with it. The problematic SQL was found quickly through the slow log system: its Total_keys and Process_keys were very high, and its Index_names was empty, indicating that it did not hit an index. After adding an index for this SQL, the performance problem was quickly resolved.
The second case is similar. We noticed the Coprocessor CPU metric rising, searched through the slow log system, and found a SQL statement that was not using an index. Locating the problem was very fast, because aggregation and retrieval in ES are extremely fast.
Summary
The above is some of Banyu's experience in using the slow log system to detect performance problems and prevent database performance risks. I hope it is helpful. Going forward, we will keep mining the information in the slow logs, enrich the system's functionality, and safeguard Banyu's databases.