About 30% of the IT failures that enterprises encounter are related to databases. When such failures also involve application systems, network environments, and hardware devices, recovery can take several hours, damaging business continuity and hurting user experience and even revenue. In complex distributed systems, how to improve database observability, help operations staff diagnose problems quickly, and streamline the troubleshooting process has long been a major pain point for enterprises.

A performance troubleshooting experience in a massive data scenario

Customer troubleshooting case without continuous profiling

  • 19:15 A new node comes online
  • 19:15-19:32 The newly added node restarts repeatedly due to OOM, causing snapshot files to accumulate on other nodes; node status starts to become abnormal
  • 19:32 A business alert for excessive response time is received
  • 19:56 The customer contacts PingCAP technical support and reports the situation as follows:

    • Cluster response latency is very high. After a new TiKV node joined the cluster, QPS dropped and the node was then removed, but the other TiKV nodes began reporting Disconnect Store and a large amount of Leader scheduling occurred at the same time; cluster response latency remained high and the service hung.
  • 20:00 PingCAP technical support begins troubleshooting online
  • 20:04-21:08 Technical support investigates various metrics. The iotop metrics show that the read I/O of the raftstore thread is very high, and monitoring shows a large accumulation of RocksDB snapshots, suspected to be caused by region snapshot generation. Technical support recommends removing the pending peers on the previously failed TiKV node and restarting the cluster.
  • 20:10-20:30 Technical support also checks the profiling data and captures a flame graph, but because the problematic function does not happen to run during the capture window, no useful information is obtained.

    How to read a flame graph (see https://www.brendangregg.com/flamegraphs.html):

The y-axis represents the call stack: each layer is a function. The deeper the call stack, the taller the flame; the top is the function currently executing, and everything below it is its callers.

The x-axis represents the number of samples. The wider a function appears on the x-axis, the more often it was sampled, that is, the longer its execution time. Note that the x-axis does not represent the passage of time; all call stacks are merged and sorted alphabetically. When reading a flame graph, look for the widest functions on the top layer: any "flat top" (plateau) indicates a function that may have a performance problem. The colors carry no special meaning; warm colors are conventionally used because the graph shows how busy the CPU is.

Reading the captured flame graph this way, there is no large "flat top", and no function occupies a particularly large width (long execution time). Unfortunately, the performance bottleneck cannot be found directly from this flame graph, and by this point the customer is already anxious to restore the business. (A minimal sketch of how such a CPU profile can be captured from a Go service follows this timeline.)

  • 21:10 After restarting a TiKV node by deleting its pod, the I/O still does not drop.
  • 21:08-21:53 The customer continues trying to restart TiKV nodes by deleting their pods.
  • 21:50 A flame graph is captured again; the raftstore::store::snap::calc_checksum_and_size function is found to occupy a large amount of CPU, confirming the root cause.

    This time the captured flame graph shows a clear "big flat top": the raftstore::store::snap::calc_checksum_and_size function occupies a large amount of CPU execution time, so the overall performance bottleneck can be attributed to this function and its related code path. With the root cause determined, a recovery plan can be derived from it.
  • 22:04 Action taken: stop the TiKV pod and delete all gen files in the snap folder of the TiKV node with high traffic. The cluster begins to recover gradually.
  • 22:25 Business traffic ramps back up and QPS returns to its original level, confirming that the operation was effective.
  • 22:30 The cluster is fully recovered.

Cluster recovery time: 19:56-22:30, a total of 2 hours and 34 minutes (154 minutes).
Time to confirm the root cause and propose an effective operation: 19:56-22:04, a total of 2 hours and 8 minutes (128 minutes).
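For readers who want to reproduce this style of analysis, the sketch below shows how a CPU profile that renders as a flame graph can be captured from a Go service through the standard net/http/pprof package; TiDB components, being written in Go, expose equivalent pprof endpoints on their status ports. The listen address and the busyLoop workload are illustrative assumptions, not details from the case above.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"time"
)

// busyLoop simulates a hot function; when the CPU profile is rendered as a
// flame graph, it shows up as a wide "flat top" (plateau).
func busyLoop() {
	for {
		sum := 0
		for i := 0; i < 10_000_000; i++ {
			sum += i
		}
		_ = sum
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	go busyLoop()

	// Expose the pprof endpoints. A 30-second CPU profile can then be captured
	// and viewed as an interactive flame graph with, for example:
	//   go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/profile?seconds=30"
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```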

In this case, if cluster performance could have been profiled continuously before, during, and after the failure, the flame graph at the time of the failure could have been compared directly with one from before the failure, quickly revealing the functions that consumed more CPU execution time and greatly shortening the time needed to find the root cause. Replaying the same case with the continuous profiling feature enabled:

  • 19:15 A new node comes online
  • 19:15-19:32 The newly added node restarts repeatedly due to OOM, causing snapshot files to accumulate on other nodes; node status starts to become abnormal
  • 19:32 A business alert for excessive response time is received
  • 19:56 The customer contacts PingCAP technical support and reports the situation as follows:

    • Cluster response latency is very high. After a new TiKV node joined the cluster, QPS dropped and the node was then removed, but the other TiKV nodes began reporting Disconnect Store and a large amount of Leader scheduling occurred at the same time; cluster response latency remained high and the service hung
  • 20:00 PingCAP technical support begins troubleshooting online
  • 20:04-21:40 Technical support investigates various metrics. The iotop metrics show that the read I/O of the raftstore thread is very high, and monitoring shows a large accumulation of RocksDB snapshots, suspected to be caused by region snapshot generation
  • 20:10-20:40 While troubleshooting, technical support also reviews the continuous profiling results and compares the flame graphs from the time of the failure with the normal flame graphs from before the failure; the raftstore::store::snap::calc_checksum_and_size function is found to occupy a large amount of CPU, confirming the root cause
  • 20:55 Action taken: stop the TiKV pod and delete all gen files in the snap folder of the TiKV node with high traffic. The cluster begins to recover gradually
  • 21:16 Business traffic ramps back up and QPS returns to its original level, confirming that the operation was effective
  • 21:21 The cluster is fully recovered

Expected cluster recovery time: 19:56-21:21, a total of 1 hour and 25 minutes (85 minutes), a 44.8% reduction.
Expected time to confirm the root cause and propose an effective operation: 19:56-20:55, a total of 59 minutes, a 53.9% reduction.

This feature can therefore greatly shorten the time needed to determine the root cause and help customers minimize the business losses caused by performance failures.

Detailed explanation of the "Continuous Profiling" feature

In the newly released TiDB 5.3, PingCAP takes the lead in the database field by launching the "Continuous Profiling" feature (currently experimental), bridging the observability gap in distributed systems and bringing users performance insights at the database source-code level to thoroughly answer every database performance question.

"Continuous performance analysis" is a way to interpret resource overhead from the system call level. After the introduction of this method, TiDB provides performance insights at the level of database source code, and helps R&D and operation and maintenance personnel locate the root cause of performance problems in the form of flame graphs, which can be traced back to the past and present.

With a performance overhead of less than 0.5%, continuous profiling takes continuous snapshots of the database's internal operating state (similar to a CT scan) and interprets the resource overhead at the system call level in the form of flame graphs, turning the database from a black box into a white box. After enabling continuous profiling with one click in TiDB Dashboard, operations staff can quickly and easily locate the root cause of performance problems. (A minimal sketch of such a periodic profile collector follows the figure below.)


Flame graph example
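As an illustration of the idea behind continuous profiling, rather than the actual TiDB Dashboard implementation, the sketch below periodically pulls CPU profiles from a pprof-compatible status endpoint and stores them with timestamps, so that a snapshot from the time of a failure can later be compared with one from normal operation. The target address, collection interval, and output directory are assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// collectOnce fetches one CPU profile snapshot from a pprof-compatible endpoint
// and stores it with a timestamp, so profiles from before, during, and after a
// failure can be compared later.
func collectOnce(target string, seconds int, dir string) error {
	url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=%d", target, seconds)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	name := fmt.Sprintf("%s/cpu-%s.pb.gz", dir, time.Now().Format("20060102-150405"))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Illustrative values: a hypothetical pprof-enabled status address and a
	// one-minute collection interval (tuned to keep profiling overhead low).
	target, dir := "127.0.0.1:10080", "./profiles"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}
	for {
		if err := collectOnce(target, 30, dir); err != nil {
			log.Printf("profile collection failed: %v", err)
		}
		time.Sleep(time.Minute)
	}
}
```

Two stored snapshots can then be diffed, for example with `go tool pprof -http=:8081 -diff_base=<normal-profile> <failure-profile>`, which mirrors the before/after comparison workflow described in the case above.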

Main application scenarios

  • When the database encounters an unexpected failure, diagnosis time can be reduced by at least 50%

In one case from the Internet industry, when an alert fired in a customer cluster and the business was affected, the lack of continuous profiling data for the database made it difficult for the operations staff to find the root cause of the failure, and it took 3 hours to locate the problem and restore the cluster. With TiDB's continuous profiling feature, the operations staff could have compared the profiling results from normal operation with those from the time of the failure and restored the business in only 20 minutes, greatly reducing the loss.

  • Provides cluster inspection and performance analysis services to ensure continuous and stable cluster operation

Continuous profiling is a key part of the TiDB cluster inspection service, which provides commercial customers with cluster inspections and reports of the inspection results. Customers can discover and locate potential risks on their own, apply the optimization suggestions, and keep each cluster running continuously and stably.

  • Provides more efficient matching of database to business during database selection

When selecting a database, companies often need to complete functional and performance verification in a short time. The continuous profiling feature helps companies find performance bottlenecks more intuitively, run multiple rounds of optimization quickly, ensure that the database matches the company's business characteristics, and improve the efficiency of database selection.

Learn more about and try out Continuous Profiling: https://docs.pingcap.com/en/tidb/stable/continuous-profiling

