
1. Background

Detection and alarming are inseparable parts of a monitoring system: detection finds anomalies, and alarming delivers the problem to the right person. In the vivo monitoring system 1.0 era, each monitoring system maintained its own logic for computation, storage, detection, and alarm convergence. This architecture made it hard to fuse the underlying data and prevented the monitoring systems from being applied to a wider range of scenarios, so the overall architecture needed to be re-planned. Against this background, the goal of unified monitoring was established.

In the past, monitoring was divided into several major systems: basic monitoring, general monitoring, call chain, log monitoring, dial testing, and so on. The goal of unified monitoring is to perform unified computation, storage, detection, alarming, and display for the metric data of every monitoring system. We will not go into the details here; a separate article on the evolution of the vivo monitoring system will be published later.

The above is the general background of unified monitoring. Previously, each monitoring system performed its own alarm convergence and message assembly. To reduce this redundancy, convergence and related tasks should be handled by a single service. At the same time, the alarm center platform had reached the stage where it needed to be updated and iterated, so it was necessary to build a unified alarm platform that provides alarm convergence, message assembly, and alarm delivery for all internal businesses. With this idea, we set out to sink the alarm convergence and alarm delivery capabilities of each system into one place, making the unified alarm service a common solution for every monitoring service, and even for general services.

2. Current situation analysis

In the 1.0-era monitoring system, as shown in Figure 1, each monitoring system first performed its own alarm convergence and then connected separately to the old alarm center to send alarm messages. Each system had to maintain its own set of rules, and a lot of functionality was built repeatedly. In fact, these functions are highly generic: it is entirely possible to build a reasonable model that handles the exceptions produced by the anomaly detection services in a unified way, generates problems from them, performs unified message assembly, and finally sends the alarm messages.

(Figure 1 Alarm flow chart of the old monitoring system)

In a monitoring system, there are several important concepts on the path from detecting an anomaly to finally sending an alarm:

Anomaly

Within a detection window (the window size is configurable), if one or more metric values reach the abnormal threshold defined by the detection rule, an anomaly is generated. As shown in Figure 2, the detection rule defines a detection window of 6 points and reports an anomaly when 3 data points in the window exceed the threshold; we call this rule "6-3" for short. In the first detection window (the blue dashed box), only the values at points 6 and 7 exceed the threshold (95), which does not meet the 6-3 condition, so the first window contains no anomaly. In the second detection window (the green dashed box), the values at points 6, 7, and 8 exceed the threshold (95), so the second window is an anomaly.
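A minimal sketch of this "6-3"-style detection rule, assuming the metric arrives as a plain list of samples; the function and variable names are illustrative and not the actual detection service code:

```python
def count_violations(window, threshold):
    """Count how many points in the window exceed the threshold."""
    return sum(1 for value in window if value > threshold)

def detect_anomalies(series, window_size=6, min_violations=3, threshold=95):
    """Slide a fixed-size window over the series and return the start indices
    of windows that satisfy the rule (>= min_violations of window_size points over threshold)."""
    anomalous = []
    for start in range(len(series) - window_size + 1):
        window = series[start:start + window_size]
        if count_violations(window, threshold) >= min_violations:
            anomalous.append(start)
    return anomalous

# Two points over the threshold do not trigger an anomaly,
# but once a third point exceeds it, the window is flagged.
samples = [80, 81, 79, 82, 80, 96, 97, 83, 98, 81]
print(detect_anomalies(samples))  # only the windows containing all three >95 points are flagged
```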

Problem

The collection of all similar anomalies generated over a continuous period is called a problem. As shown in Figure 2, the second detection window is an anomaly, and this anomaly corresponds to a problem A. If the third detection window is also an anomaly, it also corresponds to problem A, so the relationship between a problem and its anomalies is one-to-many.

Alarm

When a problem is notified to the user by means of SMS, phone, email, etc. through the alarm system, we call it an alarm.

Recovery

When the anomalies corresponding to a problem no longer meet the abnormal conditions defined by the detection rule, all the anomalies are considered recovered, the problem is considered recovered as well, and a corresponding recovery notification is sent.

(Figure 2 Schematic diagram of time series data anomaly detection)

3. Measurement indicators

How do we measure the quality of a system, improve it, and manage it? As the management guru Peter Drucker said, "If you can't measure it, you can't manage it." To comprehensively manage and improve a system, you first need to measure its performance indicators, know where its weaknesses are, and find the symptoms before you can prescribe the right medicine.

(Figure 3 Time node relationship diagram of operation and maintenance indicators)

Figure 3 shows the relationship between the monitoring system's operation indicators and the corresponding time nodes. It mainly reflects how indicators such as MTTD, MTTA, MTTF, MTTR, and MTBF map onto the timeline. These indicators are highly valuable references for improving system performance and helping the operations team find problems early, and many cloud alerting platforms in the industry also track them. Below we focus on the two indicators most closely related to the alarm platform: MTTA and MTTR.

MTTA (Mean time to acknowledge, average response time):

(Figure 4 MTTA calculation formula; a reconstructed form is given after the variable definitions below)

  • t[i] - the time taken by the operations or R&D team to acknowledge the problems of the i-th service during the operation of the monitoring system;
  • r[i] - the total number of problems of the i-th service during the operation of the monitoring system.
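The formula itself only exists as an image (Figure 4); based on the variable definitions above, a plausible reconstruction of the standard form (an assumption, not copied from the figure) is:

```latex
\mathrm{MTTA} = \frac{\sum_{i=1}^{n} t_i}{\sum_{i=1}^{n} r_i}
```

That is, the total acknowledgement time accumulated across all n services divided by the total number of problems.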

The mean time to acknowledge is the average time it takes the operations or R&D team to respond to all problems. MTTA measures the responsiveness of the team and the efficiency of the alarm system. By tracking and minimizing MTTA, the team can optimize its processes, improve problem-solving efficiency, ensure service availability, and improve user satisfaction [1].

MTTR (Mean Time To Repair):

(Figure 5 MTTR calculation formula [2]; a reconstructed form is given after the variable definitions below)

  • t[ri] - the total time for the i-th service to return to normal after its alarms during the operation of the monitoring system;
  • r[i] - the total number of alarms of the i-th service during the operation of the monitoring system.
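As with MTTA, the formula in Figure 5 is only available as an image; a plausible reconstruction from the definitions above (again an assumption) is:

```latex
\mathrm{MTTR} = \frac{\sum_{i=1}^{n} t_{r_i}}{\sum_{i=1}^{n} r_i}
```

That is, the total recovery time across all n services divided by the total number of alarms.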

Mean Time To Repair (MTTR) is the average time it takes to repair the system and restore it to normal operation. The clock starts when operations or R&D personnel begin handling the exception and keeps running until the interrupted service is fully restored (including any required testing time). In the IT service management industry, the R in MTTR does not always mean repair; it can also mean recovery, respond, or resolve. Although these metrics are all written MTTR, they each have their own meaning, so figuring out which MTTR is being used helps us analyze and understand the problem better. Let us briefly look at each of them:

1) Mean time to recovery is the average time required to recover from a system alarm. It covers the whole process from the alarm triggered by the service exception to the return to normal. This MTTR measures the speed of the entire recovery process.

2) Mean time to respond is the average time from the first alarm until the system recovers from the failure to its normal state, excluding any delay introduced by the alarm system itself. This MTTR is usually used in network security to measure how effectively a team mitigates attacks on the system.

3) Mean time to resolve is the average time it takes to completely resolve a system failure, including the time required to detect the failure, diagnose the problem, and ensure that it does not happen again. This MTTR is mainly used to measure the resolution of unforeseen incidents, not service requests.

The core of improving MTTA is finding the right person, and finding them quickly [3]. Only by finding the person who can handle the problem in the shortest time can MTTR be effectively improved. In production practice we often encounter the problem of "alarm flooding": when a large number of alarms appear, operations or development engineers are required to handle them. Engineers who are sensitive to this pressure easily develop a "crying wolf" mentality and become anxious as soon as any alarm arrives. At the same time, when a flood of alarm messages constantly harasses the operations staff, it causes alarm fatigue: too many unimportant events, too few alarms about the root problems, frequent handling of routine events, and the truly important information drowned in the noise [4].

(Figure 6: Alarm flooding problem diagram [5])

4. Functional design

Based on the analysis of the two important indicators above, we concluded that we should use means such as alarm convergence and alarm escalation to reduce the number of alarms sent and improve alarm accuracy, ultimately improving the efficiency of problem resolution and shortening recovery time. Below we explain, from the system and functional levels, how to reduce the alarm volume and deliver only truly valuable alarm information to users. This article will also focus on the convergence of alarm messages.

As Figure 1 shows, the monitoring systems contain many repeated functional modules, so we can extract these modules and, as shown in Figure 7, build the capabilities of alarm convergence, alarm masking, and alarm escalation into the unified alarm service. Under this architecture, the unified alarm service is completely decoupled from the detection-related services, and its capabilities are fairly generic. For example, other business teams that need alarm or message convergence may want to access the unified alarm service, so it must support both converged message sending and direct message sending. The unified alarm service therefore provides a flexible, configurable message-sending mechanism and simple but versatile functions to meet these various needs.

(Figure 7 Structure diagram of unified alarm system)

4.1 Alarm convergence

The alarm platform generates tens of thousands of alarms every day, and these alarms need to be analyzed, prioritized, and handled by operations or development engineers. If tens of thousands of alarms were sent without convergence for every exception, the workload on operations staff would inevitably increase, and of course not every alarm actually needs to be sent to someone for processing. Therefore, we need to converge alarms through various means. Below we introduce alarm convergence from four aspects.

First alarm wait

When an exception occurs, we do not send an alarm immediately; instead, we wait for a period of time before sending. This wait time can usually be customized. If the value is too large it increases alarm latency; if it is too small the merging effect is lost. For example, if the first-alarm wait is 5s and node 1 of a service has an anomaly on metric A, and node 2 also has an anomaly on metric A within 5s, then the anomalies of node 1 and node 2 will be merged into a single alarm notification.

Alarm interval

Before the problem is recovered, the system will send an alarm message at regular intervals according to the configuration of the alarm interval. The alarm interval is used to control the frequency of alarm sending.

Anomaly convergence dimension

The anomaly convergence dimension is used to merge anomalies along the same dimension. For example, under the same node path A, anomalies produced by the same detection rule are merged together according to the configured convergence dimension when the alarm is sent.

Message merge dimension

When multiple anomalies converge into one problem, message merging is involved when the alarm is sent. The message merge dimension specifies which dimensions can be merged. This may sound a little abstract, so let us look at the conversion from anomalies to a message in Figure 8.

Suppose an anomaly has two dimensions, name and gender. When two such anomalies are alarmed together, we merge them according to the configured convergence strategy. In the figure, gender is defined as an anomaly convergence dimension; a convergence dimension is normally chosen so that two or more anomalies share the same value for that attribute, and after the messages are merged only that single shared value is kept. In the example, the ${sex} placeholder is therefore replaced with "male". Name is defined as a message merge dimension, which means the names from all the anomalies must appear in the message text, so during message merging the values corresponding to the ${name} placeholder are concatenated one by one into the message text.

(Figure 8 Schematic diagram of message text replacement)
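A minimal sketch of the merge logic in Figure 8, assuming each anomaly is a simple dict and the message template uses ${...} placeholders; the field names, template text, and helper function are illustrative, not the actual service code:

```python
from string import Template

def merge_message(anomalies, template, converge_dims, merge_dims, sep=", "):
    """Build one alarm text from several anomalies: a convergence dimension keeps the
    single shared value, while a merge dimension concatenates the values one by one."""
    values = {}
    for dim in converge_dims:
        # anomalies merged together are expected to share the same value on a convergence dimension
        values[dim] = anomalies[0][dim]
    for dim in merge_dims:
        values[dim] = sep.join(str(a[dim]) for a in anomalies)
    return Template(template).substitute(values)

anomalies = [
    {"name": "Zhang San", "sex": "male"},
    {"name": "Li Si", "sex": "male"},
]
text = merge_message(
    anomalies,
    template="${name} (${sex}) triggered the same alarm rule",
    converge_dims=["sex"],
    merge_dims=["name"],
)
print(text)  # Zhang San, Li Si (male) triggered the same alarm rule
```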

4.2 Alarm claim

When an alarm occurs and someone claims it, subsequent alarms for the same problem are sent only to the person who claimed it. The main purpose of claiming is to stop sending alarms to other personnel once someone is already following up, and it also reduces duplicated handling of the same alarm to a certain extent. A claim can be cancelled.

4.3 Alarm masking

For a given problem, an alarm mask can be set so that subsequent alarms for that problem are not sent. Masking reduces the alarms generated while a fault is being located and fixed, or during service release changes, and effectively reduces the disturbance that invalid alarms cause to operations staff. A mask can be set to be periodic or to cover a certain period of time, and of course it can also be cancelled.

4.4 Alarm callback

When a callback is configured for an alarm rule, the callback interface is invoked when an alarm is generated, in order to restore the service or business to normal. The purpose of the alarm callback is that, when an alarm is generated for a service, the system can bring the service back to a normal state through some automated action, shortening failure recovery time and allowing the service to be restored quickly in an emergency.
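A minimal sketch of invoking such a callback, assuming the configured callback is an ordinary HTTP endpoint; the payload fields are illustrative, since the real interface contract is defined by the accessing business:

```python
import requests

def invoke_callback(callback_url, problem, timeout=5):
    """Call the callback interface configured on the alarm rule, passing basic problem info."""
    payload = {
        "problemId": problem["id"],      # illustrative fields
        "ruleId": problem["rule_id"],
        "status": problem["status"],
    }
    try:
        resp = requests.post(callback_url, json=payload, timeout=timeout)
        resp.raise_for_status()
        return True
    except requests.RequestException:
        # a callback failure should not block the normal alarm-sending flow
        return False
```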

4.5 False alarm marking

For a problem, the user can mark whether the anomaly was a false alarm. The main purpose of false-alarm marking is to let the system developers know, through these marks, which points in the anomaly detection process need optimization, so as to improve alarm accuracy and provide users with genuinely useful alarms.

4.6 Alarm escalation

When an alarm has not recovered after a certain period of time, the system automatically performs alarm escalation according to the configuration and sends the escalation information to the corresponding personnel. To some extent, alarm escalation exists to shorten MTTA: when an alarm has gone unrecovered for a long time, it can be assumed that the fault has not been responded to in time, and higher-level personnel need to intervene.
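A minimal sketch of the escalation check, assuming escalation is configured as an unrecovered-duration threshold plus a higher-level receiver list; the field names and the notify hook are illustrative assumptions:

```python
import time

def check_escalation(problem, escalate_after_seconds, escalated_receivers, notify):
    """Escalate a problem that has stayed unrecovered longer than the configured duration."""
    unrecovered = problem["status"] != "recovered"
    elapsed = time.time() - problem["first_alarm_time"]
    if unrecovered and not problem.get("escalated") and elapsed >= escalate_after_seconds:
        notify(escalated_receivers,
               f"Alarm for problem {problem['id']} is still unrecovered, escalating")
        problem["escalated"] = True  # escalate only once per problem
```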

As shown in Figure 9, the alarm system sends a large number of alarms every day, which are of course delivered to the alarm recipients of different services. More alarms are not better; the goal is to accurately reflect service anomalies as early as possible, so improving the proportion of effective alarms, improving alarm accuracy, and reducing the alarm volume are all very important. The system and functional designs described above can effectively reduce the sending of repeated alarms.

(Figure 9 Diagram of the number of host monitoring alarms)

5. Architecture design

Above, we explained from the system and functional levels how to solve the various problems of the old architecture; the next question is what architecture to use to realize this idea.

Let's look at how to design this architecture. As the last link in the whole monitoring pipeline, the unified alarm service must satisfy not only the alarm-sending requirements but also the needs of business services to send notifications, so its capabilities must be generic. The unified alarm service must be loosely coupled with other services, especially decoupled from the existing monitoring systems, so that its general capabilities can truly be exposed. The service should be able to adapt to different business logic in different scenarios; for example, some businesses need alarm convergence and some do not, so the service needs to provide flexible access methods to meet business needs.

As shown in Figure 10, the core logic of the unified alarm service is implemented by the convergence service. The convergence service can consume exceptions from Kafka or receive exceptions pushed through a RESTful interface. An exception first goes through exception handling to generate a problem; the problem and the exception are then stored in MySQL. After the alarm convergence module processes the problem, it is pushed into a Redis delay queue, which controls when the message is dequeued. After a message is taken out of the queue, text assembly and other operations are performed, and finally the message is sent out according to the configuration.

(Figure 10 Unified alarm architecture diagram)

The configuration management service is used to manage configuration information such as applications, events, and alarms, and the metadata synchronization service is used to synchronize metadata required for alarm convergence from other services.
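A minimal sketch of how a problem might be pushed into the Redis delay queue described above, using a sorted set whose score is the time the message becomes due (here, the first-alarm wait); the key name and fields are illustrative assumptions, not the actual service code:

```python
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

DELAY_QUEUE_KEY = "alarm:delay-queue"  # illustrative key name

def enqueue_problem(problem_id, first_alarm_wait_seconds):
    """Put a problem into the delay queue; it becomes eligible for sending after the
    first-alarm wait, so anomalies arriving in the meantime can still be merged into it."""
    due_at = time.time() + first_alarm_wait_seconds
    member = json.dumps({"problemId": problem_id})
    r.zadd(DELAY_QUEUE_KEY, {member: due_at})  # score = time the message is due
```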

6. Core implementation

The core of the unified alarm service is alarm convergence. The purpose of convergence is to reduce the sending of repeated alarm messages and prevent the alarm recipients from becoming numb to alarms because of their sheer volume.

As mentioned above, a delay queue is used for alarm convergence. Delay queues are widely used in e-commerce and payment systems; for example, an order is automatically cancelled if it is not paid within 10 minutes of being placed. The main purpose of using a delay queue in the alarm system is to merge, within a certain period of time, as many exceptions belonging to the same problem as possible, so as to reduce the number of alarms sent. For example, if service A has three nodes and an anomaly occurs, each node's anomaly would normally generate and send its own alarm; with alarm convergence, the alarms of the three nodes can be merged into a single notification.

There are many ways to implement a delay queue; here we chose Redis. The main reasons for choosing a Redis delay queue are that it supports high-performance sorting by score, and Redis's persistence features guarantee that messages are stored and not lost before they are consumed.

As shown in Figure 11, a problem is placed in the Redis delay queue after a series of checks and deduplication. The problem with the smallest expiry time is ranked first in the queue, and a monitoring task constantly checks whether any item in the queue has expired. If an expired task is found, it is taken out; the dequeued message is then assembled into the final message text and sent out through different channels according to the configuration.

(Figure 11 Schematic diagram of delayed task execution [6])
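A minimal sketch of the polling task in Figure 11, assuming the sorted-set layout from the earlier enqueue sketch; `assemble_and_send` stands in for the message assembly and channel dispatch steps and is not a real function in the service:

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
DELAY_QUEUE_KEY = "alarm:delay-queue"  # same illustrative key as in the enqueue sketch

def poll_due_messages(assemble_and_send, interval=1):
    """Continuously take expired members out of the delay queue and hand them
    to the message assembly / sending step."""
    while True:
        now = time.time()
        # members whose score (due time) has already passed
        due = r.zrangebyscore(DELAY_QUEUE_KEY, 0, now)
        for member in due:
            # zrem returns 1 only for the worker that actually removed the member,
            # so the same message is not processed twice
            if r.zrem(DELAY_QUEUE_KEY, member):
                assemble_and_send(member.decode("utf-8"))
        time.sleep(interval)
```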

7. Future prospects

Given the positioning of the unified alarm service, it should be able to tell operations or development engineers simply, efficiently, and accurately where a fault needs to be handled. Therefore, in the follow-up construction of the service, we should consider how to further reduce manual configuration, strengthen intelligent alarm convergence, and enhance root-cause localization; AI support can address such problems well. At present, major vendors are exploring AIOps and some products have been put into use, but large-scale adoption of AIOps will still take some time. Compared with adopting AI, the more urgent task is to connect the upstream and downstream services through the unified alarm service, opening up the data and paving the way for data flow, enhancing the automation of the services, supporting alarm sending from a higher dimension, and providing more accurate information for the discovery and resolution of failures.

8. Reference materials

[1] What are MTTR, MTBF, MTTF, and MTTA? A guide to Incident Management metrics

[2] Mean Repair Time

[3] 4 key indicators not to be missed in operation and maintenance!

[4] PIGOSS TOC Smart Service Center makes alarm management smarter

[5] Large-scale intelligent alarm convergence and alarm root cause technology practice

[6] Do you know that Redis can implement delay queues?

Author: vivo Internet server team - Chen Ningning
