With the release of TiDB 6.0, the PingCAP Clinic service makes its debut as a Tech Preview that users can try. Clinic originated in TiDB Cloud, where it improves the TiDB Cloud SLA through intelligent diagnosis and reduces TiDB Cloud costs through AIOps. Clinic will also bring the diagnostic experience and operations best practices accumulated in the cloud to locally deployed clusters in the form of diagnostic services, so that off-cloud users can benefit as well.
The Tech Preview released this time gives on-premises users rapid collection of diagnostic data and online reproduction of the diagnostic environment. When a TiDB cluster runs into a problem and you invite PingCAP technical support to help troubleshoot remotely, or ask a question in the AskTUG community, collecting and uploading diagnostic data through the Clinic service greatly speeds up problem location.
Application of Clinic in TiDB Cloud
Xiao Wu is a technical engineer for TiDB Cloud. When assisting TiDB Cloud users with a POC, he needs to watch the health of the customer's cluster and its monitoring metrics in real time, and recommend the optimal cluster topology and database parameter configuration based on the customer's workload metrics. When a user cluster behaves abnormally, he must analyze and solve the problem promptly to protect the cluster SLA.
Clinic diagnosis scenarios
When Xiao Wu logs in to the Clinic diagnosis service, he can quickly query the diagnosis data of each time period of the user's cluster. Clinic exports and securely stores diagnostic data such as logs and monitoring indicators on the TiDB Cloud platform in real time, and provides a visual display.
In addition to basic viewing, Clinic also provides intelligent analysis.
Intelligent analysis of latency issues
A user complains that latency has suddenly increased over a certain period. Xiao Wu's instinct is to find out which part of the system, on which instance, is taking more time. That does not sound difficult, but it means going back and forth between Grafana panels to find the bottleneck and singling out one problem node among many, which is time-consuming, laborious, and a test of patience. Clinic's intelligent analysis does this work for Xiao Wu and presents the results directly, as follows:
Identifying the current load type is another important step in diagnosing increased latency. The overall read/write ratio alone may look fine, but what about the read/write imbalance between instances? Some user clusters now run dozens or even hundreds of TiKV instances, and analyzing read/write imbalance across that many instances by eye is practically impossible. For a machine, however, the calculation is easy. Clinic takes the problem time period (for example, when latency increased) and reports how the imbalance during that period differs from the usual level (the baseline data). The following figure is an example of the intelligent analysis output:
The analysis in the figure above shows that, during the problem period, the number of read requests and Coprocessor read requests increased compared with the baseline, and Xiao Wu can continue troubleshooting from this clue.
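The baseline comparison described above can be illustrated with a small sketch. This is not Clinic's actual model, which the article does not describe. It assumes imbalance is measured as the coefficient of variation of per-instance request counts, and the instance names and counts are made up for illustration.

```python
from statistics import mean, pstdev

def imbalance(counts):
    """Coefficient of variation of per-instance request counts (0.0 = perfectly even)."""
    values = list(counts.values())
    if not values:
        return 0.0
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0

def compare_to_baseline(baseline, problem):
    """Return the imbalance in each window and the change between them."""
    base, prob = imbalance(baseline), imbalance(problem)
    return {"baseline": base, "problem": prob, "delta": prob - base}

# Hypothetical read-request counts per TiKV instance (made-up numbers).
baseline_reads = {"tikv-0": 1000, "tikv-1": 1000, "tikv-2": 1000}
problem_reads = {"tikv-0": 4000, "tikv-1": 1000, "tikv-2": 1000}

report = compare_to_baseline(baseline_reads, problem_reads)
# The instance with the most reads in the problem window is the first suspect.
hottest = max(problem_reads, key=problem_reads.get)
print(report, hottest)
```

A positive `delta` means the load was spread less evenly during the problem window than usual, and `hottest` names the instance to inspect first.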
Intelligent log cluster analysis
In addition to analyzing metrics, Xiao Wu also needs to inspect the cluster logs. The volume of logs in a TiDB cluster is considerable, and the workload multiplies when every component is crossed with every instance. Clinic provides intelligent log clustering to help Xiao Wu quickly find problems in this mass of logs.
Log clustering visualizes the trends of different log types in each time period. Xiao Wu can see at a glance which log type or which instance shows a sudden change in log volume, and go straight to the main issue. In the figure below, the two log types in the red box account for most of what the TiDB cluster was processing in the current time period, and the types with the largest share can be investigated first.
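As a rough illustration of the idea, the sketch below groups log lines by template and counts them per time bucket and instance. It is not Clinic's clustering algorithm, which the article does not describe. It assumes a simple approach that masks numbers so similar messages share one template, and the log lines are made up for illustration.

```python
import re
from collections import Counter

def template(message):
    """Collapse variable parts of a log message so similar lines share one template."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)  # mask hex values first
    message = re.sub(r"\d+", "<NUM>", message)             # then mask decimal numbers
    return message

def cluster_counts(logs):
    """Count log lines per (time bucket, instance, template)."""
    counts = Counter()
    for bucket, instance, message in logs:
        counts[(bucket, instance, template(message))] += 1
    return counts

# Hypothetical log lines: (time bucket, instance, message).
logs = [
    ("10:00", "tidb-0", "slow query took 120 ms"),
    ("10:00", "tidb-0", "slow query took 340 ms"),
    ("10:00", "tidb-1", "connection 17 closed"),
    ("10:05", "tidb-0", "slow query took 95 ms"),
]

counts = cluster_counts(logs)
# The largest cluster is the first candidate for investigation.
top_key, top_count = counts.most_common(1)[0]
print(top_key, top_count)
```

Comparing the counts of one template across time buckets, or across instances, shows where the log volume changed suddenly.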
TiDB Cloud Scenario Summary
Clinic's intelligent diagnosis is still in its infancy. In the near future, more analysis models will be launched and applied to TiDB Cloud diagnosis. The models will be trained continuously on real-world cases to produce high-accuracy problem detection rules that catch cluster risks in advance, speed up problem resolution, and keep improving the SLA of TiDB Cloud services.
<div class="is-flex is-flex-direction-row is-justify-content-center">
<div class="is-flex is-flex-direction-column">
<a target="_blank" class="button is-link mx-5"
href="https://tidbcloud.com/signup?utm_source=website-zh&utm_medium=referral&utm_campaign=blog-pingcap-clinic-service-tech-preview"
referrerpolicy="no-referrer-when-downgrade" style="background-color: #3a40e1;">
Try TiDB Cloud
</a>
<div style="font-size:12px; text-align:center">For Chinese companies and developers expanding overseas</div>
</div>
<div class="is-flex is-flex-direction-column">
<a target="_blank" class="button is-link mx-5"
href="https://pingcap.com/zh/product-community/"
style="background-color: #4fc172;">
Download TiDB Community Edition
</a>
</div>
<div class="is-flex is-flex-direction-column">
<a target="_blank" class="button is-link mx-5"
href="https://pingcap.com/zh/contact#submit-form"
style="background-color: #3a40e1;">
Inquire about TiDB Enterprise Edition
</a>
</div>
</div>
Clinic helps diagnose problems in on-premises clusters
The Clinic diagnosis service has been a great help to Xiao Wu on TiDB Cloud. We are also making Clinic available to locally deployed clusters, so that off-cloud clusters can use it to diagnose problems and users can resolve them much faster.
In the Tech Preview stage, Clinic's data export and diagnostic environment reproduction functions are open to locally deployed clusters. When a locally deployed TiDB cluster runs into a problem and you invite PingCAP technical support to help troubleshoot remotely, or ask a question in the AskTUG community, collecting and uploading diagnostic data through the Clinic service greatly speeds up problem location.
Notice:
Clinic's intelligent analysis functions are not yet open to locally deployed clusters in the Tech Preview stage. More model training on TiDB Cloud data is needed first. Once the accuracy and computing cost of the analysis models reach a certain standard, these functions will be opened to off-cloud clusters.
Data collection and upload
Xiaoyu is the DBA of a TiDB cluster. A new upper-layer service was recently connected to the cluster, and the cluster began to show performance problems. Xiaoyu reported the problem to PingCAP technical support, hoping to get optimization suggestions as soon as possible.
In similar situations in the past, PingCAP technical support would ask Xiaoyu to upload various kinds of diagnostic information after he reported a problem. Xiaoyu had to run multiple complex commands on the cluster by hand, including fetching the log files of each node and using the Metrics Tool to save dashboard data one by one. A full round of collecting, communicating, and transferring data often took half a day. This time, PingCAP technical support suggested that Xiaoyu use Clinic's collection tool: a single command quickly completes data collection, and the data can then be uploaded directly and shared with technical support.
Xiaoyu runs a simple command to collect the logs, metrics, configuration items, and hardware parameters of each cluster node in the last 2 hours:
tiup diag collect ${cluster-name}
After the collection is completed, upload directly to the Clinic service:
tiup diag upload ${filepath}
Reproducing the diagnostic environment in Clinic
After the data is uploaded, Xiaoyu can log in to Clinic to view his diagnostic data visually. After sharing the data link with PingCAP technical support, PingCAP technical experts can also view comprehensive diagnostic data immediately, speeding up problem location.
View Metrics
Metrics can be viewed online, and multiple Grafana dashboard templates are provided for easy viewing.
View logs
Online log viewing is supported, and logs can be viewed efficiently through various filter conditions.
View slow queries
Slow query information can be viewed online, and it is consistent with what TiDB Dashboard shows inside the cluster.
The future of the Clinic
The release of the Clinic service means that PingCAP will continue to invest in keeping databases running healthily. Clinic's ultimate vision is to use the technology accumulated in TiDB Cloud to improve the overall stability of TiDB clusters on and off the cloud, reduce operation and maintenance costs, and make databases more stable and easier to operate and maintain.
The follow-up development direction of Clinic services mainly focuses on the following points:
- Both on and off the cloud: Clinic keeps building its technical expertise in the cloud and delivers that experience to off-cloud clusters through diagnostic and operations services, so that clusters of every deployment type can benefit.
- AI for DB: Clinic uses the latest AI technology. Database experts and AI experts work closely together to build and train models and make the most of AI for problem prediction, problem diagnosis, and root cause analysis.
- Database autonomous service: Clinic will gradually enable database self-prediction, self-optimization, and self-repair, replacing manual operation and maintenance with autonomous handling. This helps users avoid the complexity of database management and the service failures caused by manual operations, and analyzes and solves problems promptly to keep the cluster running stably.