Exploring the implementation of TiDB's observability solution | Interview with team "We Are So Bad the Judges Won't Be Angry"

At TiDB Hackathon 2021, veteran participant Wang Penghan, who has not missed a single event, won an award again. His team is also the second couple to compete side by side, after the Skating Egg team.

Wang Penghan currently works at AppDynamics, a Cisco company focused on application performance management, where he mainly develops log search engines and works on observability. Chen Siyu is an R&D engineer on PingCAP's Chaos Mesh team.

Their project, Collie Diagnosing Platform, is a platform for fault diagnosis, analysis, and resolution. It integrates collection of fault-scene information, online observation and analysis through a UI, and machine-learning-assisted diagnosis. Drawing on scenarios from their own work, the two explored how DBAs and operations staff might work over the next 3-5 years. The judges rated the project highly and had high expectations for it, and it won third prize in this Hackathon.

The project is very meaningful: DBAs no longer need to feel overwhelmed by so many metrics when analyzing problems.

Comment from Feng Guangpu, head of the database team at Dmall

Database autonomy is an important direction in the field. The team has done solid work in both theory and engineering practice. In the future, the project could be further improved with reference to Alibaba Cloud's DAS product, which would fill a gap among open source projects in this field.

Comment from Li Kai, head of the Meituan Database R&D Center

About the team
Q: What is the story behind the team name?

Chen Siyu: The team name comes from a Dota voice pack. When a teammate does something silly in the game, you play this voice line for effect.

You can experience it through this video: https://www.bilibili.com/video/BV1N34y1m7wa

Q: You are both Hackathon veterans. What attracts you to Hackathon?

Wang Penghan: For me, Hackathon sets a deadline that forces you to learn a lot in a short time. I had long wanted to learn how to use machine learning for root cause analysis, but after thinking about it for a year I had actually read only a little. At Hackathon, you can learn it and put it to use in a very short time. The deadline forces you to study quickly, and learning efficiency is very high. Even when you sleep, your mind is full of thoughts about how to optimize this part and how to handle that part. Of course, you can also pick up a little prize money along the way.

Chen Siyu: Much the same for me. In addition, you get to see many different ideas at Hackathon.

Project inspiration
Q: Why did you decide to do this project in the first place? Can you share what inspired it?

Wang Penghan: In our first two years of competing, we basically worked on FDW (foreign data wrappers), that is, connecting general external data sources to TiDB. This year's first-prize project is, in a sense, also a kind of external data source. After doing that, I felt a bit worn out. For old-timers like us, neither our minds nor our stamina can keep up with hardcore coding features anymore, so this year we had to find another way and go for something more creative.

A core source of inspiration for my projects is to look at real situations around me, try to extract a general problem from them, and then think about how to solve it efficiently with tools or methodology. Last year's idea, how to write elegant documentation, and this year's idea, how to find and diagnose faults quickly, are both closely related to my work.

My current work is in the field of observability. Most work that combines machine learning with this field is still at the research-paper stage, and relatively few people have tried to apply it in real environments. TiDB has built the foundations very well, so I wanted to use the Hackathon as an opportunity to try something.

Q: How did you divide the work within the team for this competition?

Wang Penghan: Siyu was responsible for simulating faults in TiDB, which happened to use Chaos Mesh, the product of the team Siyu works on. I then used tools in place of the human brain to check whether something went wrong during a given time period, and used machine learning in place of people to make some simple judgments. Picture operations staff in front of a large screen with dozens of graphs, each containing many lines. Normally, when a system is running smoothly, the lines are basically flat, but when a fault occurs they fluctuate sharply.

Today, DBAs watch these graphs with their own eyes, and fault judgments rest on human experience and habits of thought. But TiDB now has thousands of such metrics, and there will be more in the future, so relying on people to watch them will become ever more complicated and slow. We can use a machine to filter quickly: for example, it finds that 10 graphs out of 1,000 show this kind of fault, and then you only need to look at those 10, which saves a lot of time.
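The filtering step described here can be illustrated with a minimal sketch. It is not the team's implementation: it assumes each metric is a plain list of samples, treats the first `baseline_len` samples as normal behavior, and uses a simple z-score test in place of the machine learning methods mentioned in the interview. The function name, parameters, and metric names are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalous_metrics(series_by_metric, baseline_len, z_threshold=3.0):
    """Return names of metrics whose recent samples deviate from their baseline.

    series_by_metric: dict mapping a metric name to a list of float samples.
    baseline_len: number of leading samples treated as normal behavior.
    z_threshold: how many baseline standard deviations count as a large swing.
    """
    flagged = []
    for name, samples in series_by_metric.items():
        baseline, recent = samples[:baseline_len], samples[baseline_len:]
        if len(baseline) < 2 or not recent:
            continue  # not enough data to judge this metric
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline (sigma == 0).
        scale = sigma if sigma > 0 else 1e-9
        peak_z = max(abs(x - mu) / scale for x in recent)
        if peak_z > z_threshold:
            flagged.append((name, peak_z))
    # Largest deviation first, so the most suspicious graphs are listed first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in flagged]


# Hypothetical metrics: CPU usage jumps after the baseline, QPS stays flat.
metrics = {
    "cpu_usage": [0.30, 0.31, 0.29, 0.30, 0.32, 0.95, 0.97],
    "qps": [1000, 1010, 990, 1005, 995, 1002, 998],
}
print(flag_anomalous_metrics(metrics, baseline_len=5))  # prints ['cpu_usage']
```

Only the flagged graphs would then be shown to a person, which is the time saving described above.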

In an ideal environment, accuracy can reach 70-80%. In a real environment, however, some fluctuations in a metric are not faults at all, so the signal is much noisier.

Technical difficulties & solutions
Q: What major technical difficulties did you run into during the competition?

Wang Penghan: The main problem was dataset quality. In AI today, the algorithm may not be the most critical part; the dataset is. If your dataset is good enough, the right algorithm will give you a good answer. If your dataset is bad, your answer will always be wrong. So we spent a lot of time on simulating the dataset.

The other part was thinking, from a DBA's perspective on operating the system, about how to design a more efficient and reasonable way to do this. Whether for TiDB or any other system, there is a common methodology: you can observe resources, such as CPU or memory, or you can observe transactions (a single HTTP request, a single database query), and from these you learn whether the system is operating as expected. If it is not, our application can raise an alert and tell you what caused the deviation.

This is so-called root cause analysis. We hope to use machine learning to tell you whether a failure was caused by the database not having enough CPU or by other tasks on the machine grabbing CPU resources, so that you know, for instance, that you should add more CPU resources.
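To make the resource-versus-transaction methodology concrete, here is a small sketch under stated assumptions. It uses a hand-written rule as a stand-in for the learned model the team describes, so it shows only the shape of the decision, not their method. The `Window` fields, the thresholds, and the returned messages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Averages over one observation window (all field names are hypothetical)."""
    p99_latency_ms: float  # transaction view: 99th percentile query latency
    host_cpu: float        # resource view: total host CPU utilization, 0..1
    db_cpu: float          # part of host CPU used by the database process, 0..1


def diagnose(w, latency_slo_ms=100.0, cpu_saturation=0.9):
    """Hand-written stand-in for a learned root-cause classifier."""
    if w.p99_latency_ms <= latency_slo_ms:
        return "ok"  # transactions behave as expected, so no alert
    if w.host_cpu < cpu_saturation:
        return "slow, but CPU is not saturated: look at other resources"
    other_cpu = w.host_cpu - w.db_cpu
    if other_cpu > w.db_cpu:
        return "CPU contention: other tasks on the host are taking the CPU"
    return "CPU shortage: the database itself needs more CPU"


print(diagnose(Window(p99_latency_ms=40.0, host_cpu=0.50, db_cpu=0.20)))
print(diagnose(Window(p99_latency_ms=250.0, host_cpu=0.97, db_cpu=0.30)))
print(diagnose(Window(p99_latency_ms=250.0, host_cpu=0.96, db_cpu=0.85)))
```

The transaction metric decides whether to alert at all, and the resource metrics are consulted only afterwards to suggest a cause. A learned model would replace the fixed thresholds with boundaries derived from labeled fault data.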

Chen Siyu: In fact, this problem is not only a concern for DBAs or operations staff. At AppDynamics, where Wang Penghan works, everyone takes on-call shifts, so he thinks about how to resolve an on-call issue and how to optimize the on-call process. The project we built this time also optimizes the overall on-call process.

DBAs are more specialized. What we built this time is aimed at people who are not DBAs, because a DBA can reach a clear judgment by looking at more specialized dashboards such as Grafana. But if you are just getting started, for example just beginning to learn TiDB, you can use our product to get an initial sense of direction, such as whether the failure is caused by network latency or something else.

Wang Penghan: Another problem is that existing machine learning and deep learning are still a long way from so-called AIOps, which is very hard. To end up with a finished product, we relied mainly on two papers: "DBSherlock: A Performance Diagnostic Tool for Transactional Databases" (SIGMOD 2016) and "Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases" (VLDB 2020).

What these two papers do very well is restrict the scenario: within a small domain and a narrow scenario, they achieve high accuracy. For example, perhaps 10% of operations work involves dealing with such problems; this project can then automate that 10% of the workload. After that, it is like building with blocks: put together the block for this problem today and the block for another problem next time, and gradually assemble a fully automated whole.

Regrets & expectations
Q: Time at this Hackathon was limited. Do you have any regrets about the competition?

Wang Penghan: I am satisfied with the experience. I met my personal goals: I picked up machine learning in a short time and built a small product. AIOps can reduce human workload in certain areas, but it is still a long way from broadly replacing operations staff.

Chen Siyu: The 8-minute presentation slot was too short. We kept trimming the slides, adjusting which key points to highlight, and making sure that the many judges could grasp them in such a short time.

Q: Your project won third prize this time. What are your hopes and expectations for its future?

Wang Penghan: First of all, the implementation of this project is still very rudimentary, and many details need to be worked out through discussion with others. We also do not expect to turn this project into a product. It is more an exploration of direction, of how operations staff and DBAs will work in the next 3-5 years.

As technology develops rapidly, operations work also becomes harder. What operations staff manage has shifted from single machines to the cloud and Kubernetes. There are more and more nodes, and more and more information emerges. On the development side, more information is exposed for unified management and tracing. But how can this information be put to better use? This field, so-called observability, has still received little attention in China.

Many foreign companies, including ours (AppDynamics), are major players in this field and are actively exploring this direction. PingCAP has done very well here and has made the whole database highly observable. I hope more companies and individuals in China will think about how to use existing tools to make their systems more observable, thereby reducing operational pressure and cost.


PingCAP
