Original article from Yuantiandi (WeChat public account ID: cxytiandi). Sharing is welcome; please credit the source when reprinting.
Monitoring is a topic that never goes out of date. I have written before about monitoring and about quick ways to set it up for day-to-day needs: alarms based on logs, alarms based on a global exception handler, and alarms based on Cat, Prometheus, Sentry, and so on.
Whether a company is a small startup or a large, established one, it needs monitoring. Large companies in particular have more comprehensive monitoring and pay closer attention to it. Small companies tend to be more relaxed: the business may not be stable yet and there are few users, so when a failure happens it is simply a failure, and it gets fixed.
Pain points of monitoring
Having monitoring is better than having none, but more monitoring is not necessarily a good thing. "More" here has two meanings.
Meaning 1: There are many monitoring systems or monitoring frameworks
In many companies, several kinds of monitoring run side by side: Sentry alerting on exceptions, alerts on error logs, plus Cat and SkyWalking. It is enough to make you question your life choices: you do not know which one to use, and the same exception information is reported over and over.
The only upside is that whenever something goes wrong, the flood of exception alerts is so alarming that you rush to investigate immediately, which makes for excellent self-motivation.
Meaning 2: There are many monitoring alarms
More monitoring frameworks naturally mean many more alarms; there is no doubt about that. The more important point is that the alarms are not classified into levels. Everything is reported indiscriminately, so the alarm group is never quiet. It is like the boy who cried wolf: after a while you stop looking, because there are too many.
How to solve the pain points
Unified monitoring system
First, sort out the monitoring systems and adopt one unified monitoring framework. In many cases, though, a single framework cannot meet every requirement, so several end up being used together, and without good control you are back where you started.
It is enough for one system to cover most scenarios at a given monitoring level. Where it falls short, you may need to build your own extensions on top of the existing open-source monitoring system.
Alarm rating
Once the monitoring system is unified, the biggest remaining problem is exception alarms. Does every exception need an alert? Can alarms be classified and graded?
There are two types of exceptions. One is runtime exceptions, such as an NPE (NullPointerException). The other, very common, type is business exceptions, such as insufficient inventory or a product that has been taken off the shelves.
Runtime exceptions must get top priority: they are bugs and need immediate attention and handling. There are usually not many of them. If there are, the code quality is poor.
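The grading idea can be sketched in a few lines. This is only an illustration in Python, not the author's implementation: the names `AlarmLevel`, `BusinessException`, and `alarm_level` are hypothetical, and the sketch assumes that every expected business failure is thrown as a dedicated exception type, so that anything else can be treated as a bug.

```python
from enum import Enum


class AlarmLevel(Enum):
    P1 = "handle immediately"  # runtime exceptions, i.e. bugs
    P2 = "keep an eye on it"   # expected business failures


class BusinessException(Exception):
    """Hypothetical base class for expected business failures."""


def alarm_level(exc: Exception) -> AlarmLevel:
    """Grade an exception: business exceptions are lower priority than bugs."""
    if isinstance(exc, BusinessException):
        return AlarmLevel.P2
    # Anything that was not declared as a business exception is treated as a bug.
    return AlarmLevel.P1
```

In a Java service the same split would typically fall between a custom business exception class and everything else caught by the global exception handler.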
Alarm classification refinement
For business exceptions, the alarm level can be lowered. These exceptions do not indicate a system problem, but they do reflect the state of the business, so they still deserve some attention. For example, if the core order-placement API of an e-commerce business fails 100 times within 1 minute, can we ignore it? No, that must be looked at.
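A rule such as "100 failures within 1 minute" can be checked with a sliding-window counter. The sketch below is one possible way to do it, not something taken from a particular monitoring product; the class name `FailureWindow` and its parameters are assumptions made for illustration.

```python
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window (illustrative helper)."""

    def __init__(self, threshold: int = 100, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now` (in seconds).

        Returns True when the number of failures still inside the
        window has reached the threshold.
        """
        self._timestamps.append(now)
        # Drop failures that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

In a real service, `now` would usually come from `time.monotonic()`, and one window would be kept per API or per error code.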
Besides being downgraded, business exceptions must also be classified. That requires declaring a code whenever a business exception is thrown, so the alarm can carry that code and you can tell at a glance what the problem is. Take the frequent order failures above: if the alarm only says how many orders failed, you will panic, because you have no idea why.
You would have to dig through logs to find the real cause. If the alarm already carries code 1001, you know at a glance that stock is insufficient and some product is probably being snapped up in a rush. If it carries code 1002, risk-control verification timed out, and you contact the risk-control colleagues right away. Alarms like this are up to standard; anything less makes handling them exhausting.
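As a rough sketch of what "bringing the code along" could look like: the exception carries a code, and the alarm text is built from it. The class `BusinessException`, the table `CODE_HINTS`, and the function `format_alarm` are hypothetical names; only the two codes 1001 and 1002 come from the examples above.

```python
class BusinessException(Exception):
    """Hypothetical business exception that always carries an error code."""

    def __init__(self, code: int, message: str):
        super().__init__(message)
        self.code = code
        self.message = message


# The two codes are the examples from the text; the table itself is illustrative.
CODE_HINTS = {
    1001: "insufficient stock",
    1002: "risk-control verification timed out",
}


def format_alarm(api: str, exc: BusinessException) -> str:
    """Build an alarm line that shows the code and its meaning up front."""
    hint = CODE_HINTS.get(exc.code, "unknown code")
    return f"[{api}] code={exc.code} ({hint}): {exc.message}"
```

With the code in the alarm text, whoever is on call can tell a flash sale from a risk-control outage without opening the logs first.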
Most important of all is to preserve the on-site data, namely the request parameters, the response, and the traceId; without them there is little hope of resolving the problem behind the alarm.
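One simple way to keep that data together is to attach it to the alarm itself. The following is a minimal sketch under that assumption; `AlarmContext` and `build_alarm_payload` are made-up names, and a real system would also mask sensitive fields before sending.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AlarmContext:
    """On-site data captured when the exception happened (illustrative)."""
    trace_id: str
    request: dict
    response: dict


def build_alarm_payload(title: str, ctx: AlarmContext) -> str:
    """Serialize the alarm title together with its context as JSON."""
    payload = {"title": title, **asdict(ctx)}
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)
```

The traceId in the payload is what lets you jump from the alarm straight to the matching trace and logs.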
Summary
After this overhaul, only runtime exceptions, or a large number of errors within a given minute, trigger an urgent SMS or phone alert, which cuts down on the harassment. Other business exceptions go directly to DingTalk or Feishu alarm groups. The alarm groups can be refined further: split codes that need attention from codes that do not, and send them to separate groups, so the information reaches the right people and is actually read.
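The routing described here can be written down as a small decision function. This is a sketch, not a real integration: the channel names, the `ATTENTION_CODES` set, and `route_alarm` are assumptions, and sending to SMS or a chat group is left out.

```python
URGENT = "sms_or_phone"
GROUP_ATTENTION = "chat_group_attention"
GROUP_INFO = "chat_group_info"

# Hypothetical set of business codes that someone should actively watch.
ATTENTION_CODES = {1002}


def route_alarm(is_runtime_exception: bool, code=None, burst: bool = False) -> str:
    """Pick a channel for an alarm.

    `burst` means the error count in the current minute crossed the threshold.
    """
    if is_runtime_exception or burst:
        return URGENT
    if code in ATTENTION_CODES:
        return GROUP_ATTENTION
    return GROUP_INFO
```

Keeping the attention list as data makes it easy to move a code between groups as the team learns which alarms matter.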
This article is mainly about exception alarms inside a project. Other alarms, such as those from middleware and databases, do not need to be divided this finely. Whether it is a CPU spike or a memory spike, any such alarm is a serious problem, and every one of them needs attention and a plan for handling it.
About the author: Yin Jihuan, a simple technology enthusiast, author of "Spring Cloud Microservices-Full Stack Technology and Case Analysis", "Introduction to Spring Cloud initiator of 160cdb2212d020.