Facing the alarm storm: how to build a highly collaborative and accurate alarm system?

Author｜Jiubian

No system in the world is 100% perfect. To ensure availability, the technical team must understand the state of its services, detect problems as soon as they occur, and quickly locate their cause. Both goals depend on a complete monitoring and alarm system that tracks the running status of the services. But the technical team cannot stare at dashboards and watch every aspect all the time, so alarms have become the most important means for the team to monitor service quality and availability.

However, in practice, the technical team usually receives not too few alarms but too many. Consider a day in the work of an SRE for a cross-border e-commerce system. Perhaps every engineer will find it familiar:

Open the IM tool: the operation and maintenance group shows 99+, or even 999+, unread alarm messages;
Open the group to read the messages: the screen is full of alarm titles, levels, and assignees, but with so much information it is impossible to quickly filter out and identify the high-priority alarms;
Open the messages one by one, read each alarm's content, and evaluate its actual priority; the alarms include, but are not limited to, service timeouts, network retransmissions, and slow database responses;
Find an alarm with level "P1". Its content shows a service timeout in the trading system, and it is assigned to a trading system developer. The developer checks, finds nothing abnormal in the trading system at the moment, and suspects a database problem, so the SRE returns to the group and goes through the remaining alarms one by one;
On arriving at the office, open the alarm center system, sort by alarm level, click through the list entries, and hold meetings with business developers, network equipment maintainers, and database DBAs. The joint analysis finds that the "trading system service timeout" alarm was caused by "network retransmission", which had led to "slow database response".

As enterprise digitization deepens, the division and heterogeneity of IT systems make the enterprise technical architecture more and more complicated. To better ensure system stability and avoid missing faults, the technical team usually sets a large number of monitoring metrics and alarm rules in the monitoring system for infrastructure, platforms, and applications: from the network to the machine, from the instance to the module, and up to the business layer. Although this greatly improves the ability to detect faults, a single abnormality or fault can easily trigger a large number of alarms and cause an alarm storm. For example, when a machine fails, the alarm rule that monitors machine health fires; the alarm rules that monitor the instances running on that machine also fire; and the upstream application modules of those instances are affected and start to alarm as well. Similarly, when an instance in an application module generates an alarm, the upstream application module also generates one, and when the module has many instances, hundreds of alarm messages result. Worse still, when the network, machines, domain names, application modules, and business all produce multi-level abnormal alarms at the same time, tens of thousands of alarm messages are generated.

Meanwhile, when an abnormality occurs, a traditional alarm system alerts the relevant personnel by email, text message, phone call, and so on. A flood of alarm messages does not help them quickly find the root cause and decide on a loss-stopping plan; instead it drowns out the information that really matters. In addition, problem handling often requires collaboration between different teams and timely synchronization of progress. Point-to-point notification is not conducive to describing the problem or following up on its handling, and repeatedly describing the situation and communicating with owners across teams greatly prolongs the MTTR.

Many small and medium-sized Internet companies have relatively complete monitoring and alarm systems, and their alarm quality and emergency response efficiency are much higher than those of large and very large enterprises. This is because the monitoring system is developed and maintained within a single operation and maintenance team, and the business structure, product capabilities, and usage are relatively simple and uniform. The main users of the monitoring system are the product operation and maintenance engineers, so the configured monitoring and alarms are of high quality. However, as these enterprises grow, they face the same problems as large enterprises:

There are more and more monitoring systems, and their operation methods and product capabilities cannot be aligned;
Most monitoring systems have a poorly designed user experience and a high learning cost. The technical team does not know which monitoring and alarm rules to configure, so risk points are not fully covered or a large number of invalid alarms are generated;
More and more people are responsible for the different monitoring systems, and when the organizational structure changes, the subscription relationships in each monitoring system cannot be updated in time.

The end result is that the number of alarms keeps growing, and so does the number of invalid ones. The technical team gives up on monitoring and alarms, and a vicious circle begins. Looking closely at these phenomena, we found that the problems are mainly concentrated in the following points:

Lack of "standardized alarm processing flow system"

Alarm source data lacks uniform standards and labels with uniform dimensions

The operation and maintenance system of each domain in the enterprise is built independently, without a unified standard, and most alarm data contains only a title, a level, and basic content. Operation and maintenance personnel spend a lot of time reading alarms one by one and analyzing their source and ultimate cause, a process that relies heavily on the SRE's past experience. The underlying reason is that alarm data comes from many domains with inconsistent alarm policy configuration logic and with missing or inconsistently defined labels, so the SRE has to pick out the useful information from a mass of alarms, analyze the correlation between them, and find the root cause. To standardize and enrich alarm information, a traditional IT operation and maintenance system defines a unified alarm data standard at the enterprise level, and the alarm system of each domain must integrate according to it. Mandatory standardization of this kind inevitably runs into the following problems in practice: 1) transforming the different operation and maintenance domains is costly, and the project is hard to push forward; 2) the data is poorly extensible, since a single change to the data standard affects every operation and maintenance domain.

Alarm data processing and enrichment without a global perspective

The IT operation and maintenance system integrates and processes alarms from different domains in order to gather more information and make more accurate judgments. However, if alarms are only passively received and dispatched, the alarm operation and maintenance system delivers no value as an operation and maintenance information center, and neither efficiency nor experience improves. Instead, operation and maintenance personnel should actively "digest", "absorb", and "enrich" the content of these alarms, turning noisy information into something clear and regular. To do so, the alarm operation and maintenance system needs tools that can decompose, extract, and enhance alarm content.

It is difficult for organizations to handle alarms collaboratively

How to handle alarms flexibly through organizational collaboration?

In an organization, responsibility for the stability of each service usually falls to one or more teams as part of their daily work. Alarm handling requires collaboration within and between teams. When an alarm is triggered, the primary on-call engineer is notified according to the current shift schedule; if the alarm is not handled within a certain period, the backup on-call engineer is notified; and if neither handles it in time, the alarm is escalated to the administrator. When the on-call engineer finds that the alarm needs to be handled by an upstream or downstream team, or that its priority needs to be adjusted, they must be able to modify the alarm level and quickly transfer the alarm to other people, who then obtain the permission to handle it.
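The escalation and transfer behavior described here can be sketched as follows. This is only an illustration: the names EscalationStep, roles_to_notify, and transfer, as well as the 10- and 30-minute thresholds, are assumptions made for the example and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class EscalationStep:
    role: str           # who is notified at this step
    after_minutes: int  # minutes the alarm has gone unhandled before this step applies


# Hypothetical policy; the 10- and 30-minute thresholds are illustrative only.
POLICY: List[EscalationStep] = [
    EscalationStep("primary on-call", 0),
    EscalationStep("backup on-call", 10),
    EscalationStep("administrator", 30),
]


def roles_to_notify(minutes_unhandled: int,
                    policy: List[EscalationStep] = POLICY) -> List[str]:
    """Return every role that should have been notified by now."""
    return [step.role for step in policy if minutes_unhandled >= step.after_minutes]


@dataclass
class Alarm:
    title: str
    level: str
    handlers: Set[str] = field(default_factory=set)  # people allowed to handle the alarm


def transfer(alarm: Alarm, to_person: str, new_level: Optional[str] = None) -> None:
    """Hand the alarm to another person: grant permission, optionally change the level."""
    alarm.handlers.add(to_person)
    if new_level is not None:
        alarm.level = new_level


alarm = Alarm("trading system service timeout", "P2", {"sre-on-call"})
transfer(alarm, "dba-on-call", new_level="P1")
print(roles_to_notify(15))  # ['primary on-call', 'backup on-call']
print(alarm.level, sorted(alarm.handlers))
```

Keeping the policy as plain data means the shift schedule and thresholds can change without touching the notification logic.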

How to avoid the complexity of organizational isolation and handle alarms flexibly?

Normally, a technical team neither wants to see other teams' alarms nor wants its own alarms, which may involve sensitive information such as faults, to be seen by other teams. However, when an alarm needs to be handled across teams, it must be possible to quickly transfer the alarm to other people and authorize them at the same time. How can these flexible permission management requirements be met on the cloud? The traditional way to grant permissions on the cloud is to create a sub-account for each member and authorize it. This is clearly unsuitable for alarm handling: when the online business is already being damaged, should handling the alarm still have to wait for an administrator's authorization? Faced with these problems, companies of different sizes have adopted different solutions:

Small-scale enterprises: Configure the people in the organization as alarm contacts on the cloud platform, and after an alarm is triggered, notify some of them according to its content.

Advantages: When the team is small, alarm distribution can be set up through simple configuration.
Disadvantages: The relationship between the organizational structure and the alarm contacts must be kept in sync continuously, for example whenever a new employee joins or an existing employee leaves.

Large-scale enterprises: Send alarms to an internal alarm platform through a unified webhook for secondary distribution and processing.

Advantages: A self-built system can be connected to the enterprise's internal organizational structure and permission system, so it can handle both the complexity of organizational isolation and the flexibility of alarm distribution.
Disadvantages: Building an alarm platform in-house requires a large investment and is costly.

In view of the above two major problems, we need a more complete approach. After a lot of practice, we offer the following ideas for your reference:

"Standardized Alarm Event Processing Flow"

Given the pain points in the operation and maintenance case above and the difficulties of alarm standardization, we no longer force each operation and maintenance domain to adapt before integration. Instead, development and operation personnel use the "standardized alarm event processing flow" function provided by the operation and maintenance center to orchestrate and maintain processing flows for different scenarios, using the following methods to standardize and enrich the content of alarms from different sources.

Rely on the flexible orchestration and combination capabilities of the alarm platform and rich processing actions to quickly handle diverse alarm scenarios

From the perspective of the alarm operation and maintenance center, alarm data from different sources or scenarios requires different processing. The platform provides a rich set of processing flow actions, such as data processing, data recognition, and logic control. To meet a standardization or scenario-specific requirement, the SRE uses conditions to filter the alarms of current interest and then selects and orchestrates actions into a processing flow. Once the flow has been tested and enabled, alarm data is stored in the alarm system in the expected standard form and is then dispatched and notified. The SRE's alarm operation and maintenance experience is thereby captured for subsequent automated processing.
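A processing flow of this kind, a condition that selects alarms plus an ordered list of actions, can be sketched in a few lines. Everything here is hypothetical: the ProcessingFlow class, the rename_field and map_level actions, and the "network-monitor" source are invented for illustration and are not the platform's actual API.

```python
from typing import Callable, Dict, List

Alarm = Dict[str, str]
Action = Callable[[Alarm], Alarm]


def rename_field(old: str, new: str) -> Action:
    """Build an action that renames a field, e.g. a source-specific name to the standard one."""
    def action(alarm: Alarm) -> Alarm:
        out = dict(alarm)
        if old in out:
            out[new] = out.pop(old)
        return out
    return action


def map_level(mapping: Dict[str, str]) -> Action:
    """Build an action that translates source-specific levels to the standard levels."""
    def action(alarm: Alarm) -> Alarm:
        out = dict(alarm)
        level = out.get("level", "")
        out["level"] = mapping.get(level, level)
        return out
    return action


class ProcessingFlow:
    """A condition that selects alarms plus an ordered list of actions applied to them."""

    def __init__(self, condition: Callable[[Alarm], bool], actions: List[Action]):
        self.condition = condition
        self.actions = actions

    def apply(self, alarm: Alarm) -> Alarm:
        if not self.condition(alarm):
            return alarm  # alarms outside the flow's scope pass through unchanged
        for action in self.actions:
            alarm = action(alarm)
        return alarm


# Hypothetical flow for one source that reports "severity" instead of "level".
network_flow = ProcessingFlow(
    condition=lambda a: a.get("source") == "network-monitor",
    actions=[rename_field("severity", "level"),
             map_level({"critical": "P1", "warning": "P3"})],
)

raw = {"source": "network-monitor", "severity": "critical", "title": "network retransmission"}
print(network_flow.apply(raw))
```

Because each source gets its own flow, the sources themselves do not have to be changed to match a mandatory schema; the adaptation lives in the alarm platform.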

Content CMDB enrichment, breaking information silos

In enterprise IT operation and maintenance, breaking down the "information silos" between alarms from different sources is an important and challenging task, and enterprise CMDB data is the best raw material for it. With CMDB data integrated either as statically maintained data or through API interfaces, the alarm event processing flow can enrich alarms with CMDB information so that alarms from different domains can be correlated along common dimensions. During alarm handling, alarms on related IT resources can then be connected, which makes it easier to analyze and locate the root cause quickly.
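As a rough illustration of CMDB enrichment, the sketch below joins alarms with a statically maintained CMDB table and then groups them by a shared dimension. The resource names, the CMDB fields ("business", "owner_team"), and the enrich and group_by helpers are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical, statically maintained CMDB: resource name -> attributes.
CMDB: Dict[str, Dict[str, str]] = {
    "trade-api-01": {"business": "trading", "owner_team": "trading-dev"},
    "switch-07": {"business": "trading", "owner_team": "network"},
    "mysql-03": {"business": "trading", "owner_team": "dba"},
    "cms-web-02": {"business": "content", "owner_team": "content-dev"},
}


def enrich(alarm: Dict[str, str], cmdb: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return a copy of the alarm with CMDB attributes of its resource added as labels."""
    enriched = dict(alarm)
    for key, value in cmdb.get(alarm.get("resource", ""), {}).items():
        enriched.setdefault(key, value)  # never overwrite what the alarm already carries
    return enriched


def group_by(alarms: List[Dict[str, str]], dimension: str) -> Dict[str, List[str]]:
    """Group alarm titles by a shared dimension so related alarms appear together."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alarm in alarms:
        groups[alarm.get(dimension, "unknown")].append(alarm["title"])
    return dict(groups)


alarms = [
    {"resource": "trade-api-01", "title": "service timeout"},
    {"resource": "switch-07", "title": "network retransmission"},
    {"resource": "mysql-03", "title": "slow database response"},
    {"resource": "cms-web-02", "title": "high CPU"},
]
enriched = [enrich(a, CMDB) for a in alarms]
print(group_by(enriched, "business"))
```

Before enrichment the three trading-related alarms share no field; afterwards they fall into the same "business" group, which is the correlation the text describes.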

Quickly understand the distribution of alarms through AI content recognition

With the help of AI content recognition, alarm content can be analyzed and classified. Operation and maintenance personnel can see the distribution of system alarms from global statistics, and the development and operation personnel concerned can clearly identify the object type and error category of a specific alarm, shortening the path from symptom to root cause. In post-incident reviews, the intelligent classification information can also serve as reference data for decisions on optimizing and improving the IT system.

"Alarm-oriented Organizational Collaboration"

Beyond standardization, alarm handling requires organizational collaboration that is sufficiently flexible. Alarms should no longer be handled with the "organization" at the center; instead, the organization should form around the "alarm". When an alarm occurs, the required upstream and downstream personnel are brought together into a temporary organization that handles the alarm, and the members of this temporary organization hold the permission to handle it. When the alarm is resolved, the temporary organization can be quickly disbanded, which avoids frequent interruptions by alarms and unnecessary dissemination of failure information.

Contacts self-register to the alarm system

An agile operation and maintenance team should avoid manually maintaining, in the alarm system, the contact information of the members who need to handle alarms. Manual maintenance of contacts does not suit organizations that change frequently. In a good alarm system, each member maintains their own contact information and notification settings. This removes the need for administrators to update contact information promptly whenever the organizational structure changes, and it also accommodates different people's preferences for notification methods.

Reuse the existing account system to avoid using multiple account systems at work

Companies usually use an office collaboration IM tool such as DingTalk, Feishu, or Enterprise WeChat, and the alarm handling platform should avoid introducing an independent account system. If an enterprise uses DingTalk for daily office work and the alarm system supports handling alarms in DingTalk, the alarm system can easily be added to the enterprise's production tool chain. Conversely, if the company uses DingTalk but the alarm system requires a separate account to log in, two sets of accounts must be maintained, and problems such as poor communication and delayed information processing easily arise.

Flexible permission distribution method

Alarm permissions should be allocated in whatever way resolves the alarm fastest. When an alarm is generated and the on-call engineer cannot resolve it alone, they should bring in the required teams and resources to resolve it as soon as possible. Once the alarm has been handled, the permissions of the temporarily involved members can be revoked to ensure business security and avoid information leakage. Among the ways of coordinating alarm handling commonly used at work, group chat is undoubtedly the most suitable. When an alarm occurs, the on-call engineer temporarily pulls people into a group to view and handle the alarm. The group then becomes a natural authorization carrier: joining the group grants permission to view and handle the alarm, and leaving the group means no longer being disturbed by it.
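The idea of the group as an authorization carrier can be modeled very simply: permission is nothing more than current group membership. The AlarmGroup class below is a hypothetical sketch of that model, not the interface of DingTalk or of any alarm product.

```python
from typing import Set


class AlarmGroup:
    """A temporary chat group whose membership doubles as the alarm's access list."""

    def __init__(self, alarm_id: str, on_call: str):
        self.alarm_id = alarm_id
        self.members: Set[str] = {on_call}
        self.resolved = False

    def join(self, person: str) -> None:
        """Pulling someone into the group grants view and handle permission."""
        if not self.resolved:
            self.members.add(person)

    def leave(self, person: str) -> None:
        """Leaving the group revokes permission and stops further notifications."""
        self.members.discard(person)

    def can_handle(self, person: str) -> bool:
        return not self.resolved and person in self.members

    def recipients(self) -> Set[str]:
        """Only current members are notified about this alarm."""
        return set(self.members)

    def resolve(self) -> None:
        """Resolving the alarm disbands the group and revokes everyone's permission."""
        self.resolved = True
        self.members.clear()


group = AlarmGroup("alarm-001", on_call="sre-on-call")
group.join("dba-on-call")
print(group.can_handle("dba-on-call"))  # True
group.resolve()
print(group.can_handle("dba-on-call"))  # False
```

No administrator has to grant or revoke anything explicitly: the same action that brings a person into the conversation also scopes what they can see and do.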

Rich extensibility

Team collaboration may involve many tools at the same time. For example, the handling of important alarms needs to be reviewed, and after the review some follow-up work may be assigned to resolve the cause of the alarm at its root. This process may involve other tools, such as collaborative document tools and project management tools. The alarm system needs to be able to interface with these systems conveniently and to integrate fully into the enterprise office tool chain.

Combining the above ideas, Alibaba Cloud has turned them into a product and deeply integrated it with ARMS monitoring to provide customers with a more complete alarm and monitoring system.

ARMS alarm operation and maintenance center core advantages

Connect with 10+ monitoring data sources

ARMS itself already provides data sources such as application monitoring, user experience monitoring, and Prometheus. It also connects seamlessly to data sources commonly used on the cloud, such as log services and cloud monitoring. Users can connect most of their alarms with one click.

Powerful alarm correlation capability

Based on the capabilities of ARMS APM, it can quickly correlate common alarm problems and automatically output the corresponding fault analysis report.

ChatOps capabilities based on DingTalk

There is no need to import the organizational structure or to create cloud accounts. Alarm events can be dispatched and claimed in a DingTalk group, which greatly improves operation and maintenance efficiency.

Based on Alibaba's fault management experience, it provides in-depth analysis of alarm data and continuously improves the availability of alarms.

Core scenarios

Core scenario 1: Multi-monitoring system integration

ARMS has integrated most of the monitoring systems on the cloud and works out of the box. It also supports user-defined data sources.

[Image 1]

Core scenario 2: Alarm compression

ARMS comes with 20+ rules based on common alarm phenomena to help users quickly compress alarm events, and it also supports user-defined event compression.

[Image 2]
[Image 3]
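One common way to compress alarm events is to merge events that share a key and arrive within a time window of each other. The sketch below shows that general technique only; it is not one of the ARMS built-in rules, and the Event and CompressedAlarm types, the (service, title) key, and the 300-second window are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    timestamp: int  # seconds since some reference point
    service: str
    title: str


@dataclass
class CompressedAlarm:
    key: Tuple[str, str]  # (service, title)
    first_seen: int
    last_seen: int
    count: int = 1


def compress(events: List[Event], window_seconds: int = 300) -> List[CompressedAlarm]:
    """Merge events with the same key that arrive within the window of the previous one."""
    latest: Dict[Tuple[str, str], CompressedAlarm] = {}
    result: List[CompressedAlarm] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.service, event.title)
        current = latest.get(key)
        if current is not None and event.timestamp - current.last_seen <= window_seconds:
            current.last_seen = event.timestamp
            current.count += 1
        else:
            # First occurrence, or the previous alarm for this key has gone quiet.
            current = CompressedAlarm(key, event.timestamp, event.timestamp)
            latest[key] = current
            result.append(current)
    return result


events = [
    Event(0, "trading", "service timeout"),
    Event(30, "database", "slow response"),
    Event(60, "trading", "service timeout"),
    Event(120, "trading", "service timeout"),
    Event(1000, "trading", "service timeout"),
]
for alarm in compress(events):
    print(alarm.key, alarm.count)
```

Five raw events become three alarms here: the first three trading timeouts collapse into one entry with a count of 3, while the one at second 1000 starts a new entry because the window has expired.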

Core scenario 3: Multiple notification channel configuration

Supports processing and distributing alarms in DingTalk groups.

[Image 4]
[Image 5]

Core scenario 4: Alarm data analysis dashboard

[Image 6]

Core scenario 5: Out-of-the-box intelligent noise reduction

Automatically identifies alarms with low information entropy.

[Image 7]
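The article does not describe how ARMS measures information entropy, so the following is only one simple interpretation. It scores each alarm title by its self-information, -log2(p), where p is the title's share of all recent alarms: a title that fires constantly has high p and therefore carries little information. The function names and the 1-bit threshold are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def self_information(titles: List[str]) -> Dict[str, float]:
    """Bits of information per title: -log2(share of all alarms with that title)."""
    counts = Counter(titles)
    total = len(titles)
    return {title: -math.log2(count / total) for title, count in counts.items()}


def low_information_titles(titles: List[str], threshold_bits: float = 1.0) -> List[str]:
    """Titles so frequent that one more occurrence tells the reader almost nothing."""
    scores = self_information(titles)
    return sorted(title for title, bits in scores.items() if bits < threshold_bits)


recent = ["disk usage above 80%"] * 6 + ["service timeout", "slow database response"]
print(self_information(recent))
print(low_information_titles(recent))  # ['disk usage above 80%']
```

In this sample the disk alarm makes up 6 of 8 messages and scores about 0.42 bits, while each of the rare alarms scores 3 bits, so only the disk alarm is flagged as noise.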

Search for the group number (32246773) in DingTalk, or scan the code, to join the community and keep up with the latest product developments of the "ARMS Alarm Operation and Maintenance Center".

[QR code image]

Want to experience a better alarm center?
Come and use the ARMS Application Real-Time Monitoring Service!
Click the link below to try it!
https://www.aliyun.com/product/arms?spm=5176.19720258.J_8058803260.179.c9a82c4aAnljzB