
AntMonitor Overview

AntMonitor is Ant Group's intelligent monitoring system. By building real-time, stable data links for observability data, it provides real-time, stable, reliable, and rich observation data and alerting services to the technical-risk brain and related systems.

AntMonitor serves 100+ business domains across Ant Group every day, with a peak data cleaning volume of 20 TB per minute, a data aggregation volume of 1 TB, and a data storage volume of 150 million records. These figures roughly double during major promotion periods. How does such a large and complex system guarantee its own stability? This article explores that question from the perspectives of thinking, strategy, and implementation.

System Architecture

In terms of system architecture, AntMonitor can be divided into four subsystems: product, alarm, computing, and storage. Each subsystem can provide services independently, and together they act as the data foundation for Ant's technical risk management.

Product System

The product system directly provides users with various visualization services and includes two components: monitormeta and monitorprod.

  • monitormeta is responsible for managing and synchronizing metadata, such as computation configurations, alarm rules, and various operation and maintenance metadata;
  • monitorprod is the core service of the product center. It provides data services externally; internally it supports data model definition, abstraction, and conversion.

Computing System

The computing system provides integrated data collection, cleaning, aggregation, and data lifecycle management services. It contains many components, which can be grouped into a service layer, a computing layer, and a collection layer.
1. Service layer

  • tableapi provides a standard data service interface to the outside;
  • dimservice is a columnar metadata cache service.

2. Computing layer

  • global-scheduler(gs)
    The global task-scheduling component. It manages the lifecycle of the registered compute-space computing clusters, generates task topologies from the monitoring configurations, and delivers them to the computing clusters.
  • compute-space(cspace)
    An abstract pool of computing resources. It parses and executes the task topologies delivered by gs, produces the data, writes it to the storage system, and reports task status back to gs. A cspace is not tied to any specific data-computing resource pool; the underlying implementation can be any computation engine. Currently, a cspace node is implemented as a Spark cluster.

3. Collection layer

  • gaea-registry
    The management and control center of the entire collection side. It distributes collection configurations to agents and vessels, and also maintains the health status and routing information of all agents and vessels.
  • agent
    The collection service process deployed on each physical machine. It is responsible for reading raw data such as logs, system metrics, and events.
  • vessel
    The data cleaning cluster. It pulls the raw data collected by the agents, performs structured cleaning on it, and returns the results to cspace via RPC.

Alarm System
The alarm system inspects the computed metric data according to user-configured alarm rules, generates alarm events, and pushes them to subscribers. Its components are broadly similar to those of the computing system:

  • alarm-global-scheduler (alarm-gs) is the alarm scheduling component;
  • alarm-compute-space (alarm-cspace) is the alarm computing resource pool;
  • alarm-gateway is the alarm push gateway.

Comparing the alarm system with the computing system, you can think of it abstractly like this: the alarm system's input is not collected data but the data produced by the computing system, and its output is not the storage system but alarm-gateway.

Storage System
The storage system (ceresdb) provides AntMonitor with time series data read and write services.

  • ceresdbconsole is a visual operation and maintenance control component;
  • ceresmeta is the metadata management component, responsible for coordinating primary/standby switchover and scale-out/scale-in of the storage cluster;
  • ceresproxy and ceresbase are two companion processes deployed in the same container. ceresproxy is the routing layer: it aggregates the data being queried or written, returns results to the client, and also provides tenant verification, flow control, blacklisting, and other capabilities. ceresbase is the data storage node, responsible for storing time-series data and maintaining indexes.

Stability Building

The monitoring system plays a special role in Ant's overall architecture: while carrying the observability and alerting capabilities of all business systems, it also provides data services for other technical-risk sub-domains such as capacity, self-healing, and failure emergency response.

Therefore, monitoring has more stringent requirements on its own stability. Even during large-scale downtime, machine-room network outages, or more extreme conditions, monitoring must still be able to operate stably.

We approach the stability construction of AntMonitor itself mainly from two aspects: stability architecture design and stability guarantees at runtime. The overall picture is shown in the figure below.

Stability Architecture

The stability architecture is the most important part of stability construction. A carefully designed stability architecture lets us handle all kinds of stability issues later as gracefully and calmly as possible, instead of endlessly playing whack-a-mole.

Specifically, when we begin designing a stability architecture, we should first accept that neither the system's runtime environment nor its input will be stable.

The instability of the runtime environment mainly manifests as machine failures and downtime, network jitter, or more extreme objective factors such as fiber cuts in a machine room or urban natural disasters. Dealing with such problems usually starts from two aspects:

First, improve the disaster tolerance level of the system as much as possible.

For example, single-point, machine-room-level, and city-level disaster recovery. At the most basic single-point level, we need to ensure that all scheduling or synchronization nodes (such as monitormeta and gs) use an active-standby architecture, all service nodes (such as monitorprod and tableapi) are stateless, all sharded nodes (such as gaea-registry and ceresbase) have redundant replicas, and all worker nodes (such as cspace and alarm-cspace) can recover on their own after downtime.

Second, all data processing flows should be designed for failure.
When a failure occurs, some tasks may fail in certain cycles, and the system must be able to drive those tasks to retry. For example, cspace needs to tolerate and retry a computing task whose collection failed, and alarm-gs needs to reschedule an alarm task whose execution failed.
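
To make the design-for-failure idea concrete, below is a minimal retry-with-backoff sketch in Java. It is an illustration only, not AntMonitor's code; the class name and parameters are hypothetical.

```java
import java.util.concurrent.Callable;

/** Minimal design-for-failure sketch: retry a task with exponential backoff. */
public final class RetryRunner {

    /**
     * Runs the given task, retrying up to maxAttempts times with exponential backoff.
     * Rethrows the last exception if all attempts fail, so a scheduler (e.g. gs or
     * alarm-gs in the article's terms) can decide to reschedule the whole task.
     */
    public static <T> T runWithRetry(Callable<T> task, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: retry a flaky "collection" call up to 3 times.
        String result = runWithRetry(() -> {
            if (Math.random() < 0.5) {
                throw new RuntimeException("simulated collection failure");
            }
            return "collected";
        }, 3, 200);
        System.out.println(result);
    }
}
```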

Regarding the uncertainty of system input, we also deal with two situations:

The first is disorderly ingress data, such as dirty configurations, dirty metadata, and illegal data types. Erroneous data flowing into the system may cause unexpected behavior. For such problems, we usually need to validate data at the entry point and reject anything unexpected before it enters the system.

The second is controlling the volume of ingress data. Any system's performance is tied to its capacity, and its design assumes a certain balance between performance and capacity. The input of the monitoring system is usually business logs, and monitoring configurations are defined directly by users; a carelessly defined configuration can easily cause a huge volume of service-call detail logs to flow into the monitoring system and crash the cluster, so restricting and controlling traffic is very important. In AntMonitor, every key entry point, such as collection, computation, storage, and data query, has strict flow-control and validation rules to ensure that the monitoring system operates stably within its expected capacity and is not overwhelmed by unexpected traffic.
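
To make the flow-control idea concrete, here is a minimal token-bucket sketch in Java for limiting traffic at an entry point. It is an illustration under assumed limits, not AntMonitor's real flow-control implementation.

```java
/** Minimal token-bucket sketch for limiting ingress traffic at an entry point. */
public final class TokenBucket {
    private final long capacity;       // maximum burst size
    private final double refillPerMs;  // tokens added per millisecond
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = tokensPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillMs = System.currentTimeMillis();
    }

    /** Returns true if the request is allowed, false if it should be rejected. */
    public synchronized boolean tryAcquire(int permits) {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) * refillPerMs);
        lastRefillMs = now;
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical entry point: allow roughly 1000 data points per second, burst 2000.
        TokenBucket storageEntry = new TokenBucket(2000, 1000);
        System.out.println(storageEntry.tryAcquire(500));   // true
        System.out.println(storageEntry.tryAcquire(5000));  // false: over capacity, rejected
    }
}
```

In practice each entry point (collection, computation, storage, query) would hold its own limiter sized to that entry's expected capacity.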

In summary, we focus on the design ideas of AntMonitor's stability architecture from two aspects: the disaster recovery architecture and architecture unitization.

Disaster Recovery Architecture

The previous section briefly described how we address single-point problems in the architecture, which is enough to cover small-scale failure scenarios that may occur daily, such as node downtime and network jitter. When a truly devastating disaster strikes, however, a higher level of disaster recovery is required.

Currently, based on different tenant protection levels and trade-offs over objective factors such as resource quotas, AntMonitor implements two levels of disaster recovery: machine-room-level disaster recovery for tenants in regular business domains and city-level disaster recovery for tenants in high-assurance business domains.

Machine-Room-Level Disaster Recovery

For regular business-domain tenants, AntMonitor provides machine-room-level disaster recovery. Each subsystem implements it as follows.

Product System

  • monitorprod is a stateless component deployed across three machine rooms in the same city. Each machine room's service mounts a VIP to handle single-point disaster recovery within the room, and the VIPs of the machine rooms are bound to a single domain name, relying on DNS to handle machine-room-level disaster recovery.
  • monitormeta uses an active-standby architecture deployed across three machine rooms in the same city; it elects the master node based on distributed locks to provide synchronization capabilities.

Computing System

The computing system has many components and is directly related to the quality of the monitoring data, so its stability work is also more complicated.

  • Both dimservice and tableapi are stateless nodes deployed across three machine rooms, relying on VIP + DNS to achieve machine-room-level disaster recovery;
  • gs uses an active-standby architecture deployed across three machine rooms in the same city. It preempts distributed locks backed by oceanbase, Ant's self-developed relational database, and the elected master node coordinates the cluster (a minimal sketch of this lock-based election appears after this list);
  • The underlying execution engine of a cspace node is actually a Spark cluster. Internally, a cspace node relies on Spark's resource scheduling capabilities (such as YARN) to solve single-point problems. Multiple cspaces register with gs, which hosts their life cycles, so they can be regarded as stateless computing services;
  • gaea-registry uses an active-standby architecture based on preempting oceanbase distributed locks. The master node splits the collection configuration into shards; each shard has two replicas, which are assigned to different slave nodes. When a slave node fails, the master node reassigns the affected shards to healthy slave nodes;
  • Both the agent and the vessel are deployed alongside the business systems. An agent container is deployed on every physical machine and collects data from the business containers on that machine; a vessel cluster is deployed in every business machine room and cleans the raw data collected by all agents in that room. The reason is that the volume of raw data (such as logs) is very large, and cross-machine-room transmission would introduce significant network latency and put pressure on network equipment, so in-room deployment is the most efficient choice. For the same reason, the disaster tolerance of agent and vessel is somewhat special: the agent cannot avoid being a single point because it is deployed on a single machine, and vessel is an in-room disaster recovery component; when a machine room fails, the business in that room is itself unavailable and must rely on its own disaster recovery plan to switch to a healthy machine room.
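
As referenced above, here is a minimal sketch of active-standby leader election by preempting a lease row in a relational database via plain JDBC. The table, columns, and SQL are hypothetical; oceanbase's actual lock facilities and AntMonitor's real implementation are not described in the article and may differ.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Minimal active-standby sketch: each candidate periodically tries to grab or renew
 * a lease row; whoever holds the unexpired lease acts as the master.
 * Assumed (hypothetical) table, inserted once at bootstrap:
 *   CREATE TABLE leader_lock (lock_name VARCHAR(64) PRIMARY KEY,
 *                             holder VARCHAR(64), expire_at BIGINT);
 */
public final class DbLeaderElection {
    private final String nodeId;
    private final long leaseMs;

    public DbLeaderElection(String nodeId, long leaseMs) {
        this.nodeId = nodeId;
        this.leaseMs = leaseMs;
    }

    /** Tries to acquire or renew the named lock; returns true if this node is now the master. */
    public boolean tryAcquire(Connection conn, String lockName) throws SQLException {
        long now = System.currentTimeMillis();
        // Take over the lock only if it is expired or already held by this node.
        String sql = "UPDATE leader_lock SET holder = ?, expire_at = ? "
                   + "WHERE lock_name = ? AND (expire_at < ? OR holder = ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, nodeId);
            ps.setLong(2, now + leaseMs);
            ps.setString(3, lockName);
            ps.setLong(4, now);
            ps.setString(5, nodeId);
            return ps.executeUpdate() == 1; // exactly one row updated => we hold the lease
        }
    }
}
```

A real implementation would also renew the lease on a timer and step down cleanly when renewal fails.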

In such cases we provide another indicator, data completeness, to reflect the accuracy of the monitoring data. For example, suppose a monitoring configuration needs to collect from 100 business containers. If the physical machine hosting one agent goes down and the collection for 5 business containers fails, the computing layer will report a completeness of 95%; if a machine room fails, its vessel cluster becomes unavailable and the data of the corresponding 30 business containers is lost, and the computing layer will report a completeness of 70%.
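
For concreteness, the completeness figure in the examples above is just a ratio; the tiny sketch below is illustrative only and the names are hypothetical.

```java
/** Minimal sketch: data completeness = collected containers / expected containers. */
public final class Completeness {
    public static double ratio(int expectedContainers, int collectedContainers) {
        if (expectedContainers <= 0) {
            return 0.0;
        }
        return 100.0 * collectedContainers / expectedContainers;
    }

    public static void main(String[] args) {
        System.out.println(ratio(100, 95)); // 95.0: one agent host down, 5 containers lost
        System.out.println(ratio(100, 70)); // 70.0: one machine room's vessel cluster down
    }
}
```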

Alarm System

Because the alarm system's architecture is similar to that of the computing system, its stability architecture is similar as well.

  • alarm-gs elects the master node based on oceanbase distributed locks, and coordinates and maintains the registered alarm-cspaces;
  • alarm-cspace can likewise be regarded as a stateless alarm computing service;
  • alarm-gateway is a stateless component deployed across three machine rooms in the same city, achieving machine-room-level disaster recovery based on VIP + DNS.

Storage System

  • ceresdbconsole is a stateless node that achieves machine-room-level disaster recovery based on VIP + DNS.
  • ceresmeta is based on the Raft distributed consensus algorithm and provides highly reliable metadata management and cluster scheduling for the ceresdb cluster. It is deployed across three machine rooms in a 2-2-1 pattern. In addition, each machine room adds several learner roles that do not participate in Raft elections; a learner node can act as a cold standby for the data and can also be switched to the follower role when the majority of the cluster is unavailable;
  • ceresproxy and ceresbase are two companion processes in the same container, and the liveness of each process is guaranteed by a supervisor. ceresdb splits the data into shards stored on ceresbase; each shard has a primary and a backup replica, placed on different ceresbase nodes. When a container hosting ceresbase goes down, ceresmeta rebalances the affected shards onto the ceresbase nodes that hold the healthy replicas (a minimal sketch of this failover appears after this list).
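
As referenced above, here is a minimal sketch of the primary/backup shard failover idea, with hypothetical data structures far simpler than ceresmeta's real scheduling.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of shard failover: each shard has a primary and a backup replica on
 *  different nodes; when a node dies, its primaries are promoted onto the backups.
 *  (Only primary failure is handled here; a real scheduler also rebuilds lost backups.) */
public final class ShardFailover {

    /** placement: shardId -> [primaryNode, backupNode] */
    public static Map<String, List<String>> failover(Map<String, List<String>> placement,
                                                     String deadNode) {
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> e : placement.entrySet()) {
            String primary = e.getValue().get(0);
            String backup = e.getValue().get(1);
            if (primary.equals(deadNode)) {
                // Promote the backup; a new backup would be rebuilt asynchronously.
                result.put(e.getKey(), List.of(backup, "<to-be-rebuilt>"));
            } else {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> placement = Map.of(
                "shard-1", List.of("ceresbase-a", "ceresbase-b"),
                "shard-2", List.of("ceresbase-b", "ceresbase-c"));
        System.out.println(failover(placement, "ceresbase-a"));
        // shard-1 is promoted onto ceresbase-b; shard-2 is unchanged.
    }
}
```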

City-Level Disaster Recovery

For high-assurance business-domain tenants (such as transaction tenants), AntMonitor provides city-level disaster recovery capabilities.

The specific plan is a dual-link deployment across regions. For high-assurance tenants, in addition to the Shanghai link described above, we also deployed a complete monitoring link in Heyuan. That is, at the same time and with the same configuration, every task is executed twice, two copies of data and alarm events are produced, and two domain names independently expose services externally. When one link is unavailable, we can manually switch to the other.

Why choose dual links in different regions rather than remote active-active or remote hot standby?

There are several reasons:

  • First, we genuinely need two completely independent environments to prevent certain global factors from affecting monitoring. Global components include monitoring configurations and other operations metadata, as well as the small number of self-disaster-tolerant components that monitoring relies on, such as oceanbase. With a redundant, independent link, these global components are deployed in two copies, which greatly reduces the risk of them affecting monitoring data globally;
  • Second, we need two copies of data for cross-checking, to further protect the data accuracy of high-assurance tenants;
  • Third, the backup link can to some extent serve as a grayscale environment for the primary link. When monitoring itself needs to change, especially for high-assurance tenant clusters, we can release to the backup link first, reducing the risk of false alarms or missed alarms caused by releasing directly to the primary link;
  • Fourth, resource quotas are a constraint. AntMonitor is one of the largest resource consumers across Ant, and objective conditions do not allow us to deploy a completely independent active-active or hot-standby cluster in another region for all tenants. Instead, we choose to keep cross-city redundant data only for high-assurance tenants, whose traffic accounts for less than 1% of the total.

The cross-city dual-link deployment enables AntMonitor as a whole to survive even the harshest conditions.

Architecture Unitization

Architecture unitization can be understood as cluster management within AntMonitor.
In the initial top-level design, we split AntMonitor's product, computing, collection, alarm, and storage modules horizontally so that different clusters serve different business-domain tenants.
Concretely, AntMonitor defines resource-pool labels. A label is attached both to a cluster of a component in the system and to specific monitoring configurations. A monitoring configuration, that is, a monitoring task, is always scheduled to the monitoring clusters with the same label for execution. Because labels are attached at the granularity of a component's cluster, we can flexibly combine labels to provide monitoring services for different tenants.
For example, suppose there are three tenants A, B, and C. Tenant A writes far more storage data than ordinary tenants, so a dedicated storage cluster can be built for it; tenants B and C have higher real-time requirements for alarms, so dedicated alarm clusters can be built for them.
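
The label-based routing idea can be sketched as follows. This is a minimal illustration assuming a simple label-to-cluster map; the label and cluster names are made up, not AntMonitor's real configuration.

```java
import java.util.List;
import java.util.Map;

/** Minimal sketch of label-based routing: a monitoring config is always scheduled
 *  to the component clusters that carry the same resource-pool label. */
public final class LabelRouter {

    // label -> clusters of each component type registered under that label
    private final Map<String, Map<String, List<String>>> clustersByLabel;

    public LabelRouter(Map<String, Map<String, List<String>>> clustersByLabel) {
        this.clustersByLabel = clustersByLabel;
    }

    /** Picks the clusters serving a given config label, falling back to the default label. */
    public Map<String, List<String>> route(String configLabel) {
        return clustersByLabel.getOrDefault(configLabel, clustersByLabel.get("default"));
    }

    public static void main(String[] args) {
        LabelRouter router = new LabelRouter(Map.of(
                "default", Map.of("storage", List.of("ceresdb-shared"),
                                  "alarm", List.of("alarm-shared")),
                "tenant-A", Map.of("storage", List.of("ceresdb-tenant-a"),
                                   "alarm", List.of("alarm-shared")),
                "tenant-B", Map.of("storage", List.of("ceresdb-shared"),
                                   "alarm", List.of("alarm-realtime"))));
        System.out.println(router.route("tenant-A")); // dedicated storage, shared alarm
        System.out.println(router.route("tenant-B")); // shared storage, dedicated alarm
    }
}
```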

The unitized design brings many benefits to the stable architecture of AntMonitor.

  • First, we can use it to differentiate resource and task management globally. For example, for high-assurance tenants such as transactions, we build dedicated clusters and provision extra resources so they run at a lower capacity watermark, ensuring higher real-time performance, accuracy, and stability of the data;
  • Second, different tenants are isolated to some degree, or even completely physically isolated, so when a fault occurs the blast radius can be contained quickly and the scope of impact reduced;
  • At the same time, according to each tenant's protection level, we can set more reasonable release frequencies and grayscale orders, reducing the chance that release changes affect high-assurance tenants;
  • Finally, the unitized architecture also helps us solve classic problems in the monitoring field such as change isolation, self-monitoring, and circular dependencies on infrastructure, which are discussed later.

Runtime Guarantees

Designing and implementing a good stability architecture is like planting a sapling with good genes and a straight trunk: to make it grow lush, it still needs regular care, watering, and pruning. That is the daily work of runtime stability guarantees.
At this level, our main ideas are self-monitoring, digital operation, and configuration control.

Self-Monitoring

When it comes to monitoring, some seemingly paradoxical questions always come up. For example, how does a monitoring system monitor itself?
Or, is there a circular dependency in which the infrastructure depends on monitoring to guarantee its own stability, while monitoring in turn depends on certain infrastructure?
AntMonitor's core idea for dealing with such problems is isolation.
For the self-monitoring problem, similar to the dual-link approach, AntMonitor runs a set of kernel-mode monitoring clusters. These clusters are deployed independently and run a stable version. On the one hand, this prevents an anomaly of AntMonitor in the production environment from going unnoticed by the monitoring developers; on the other hand, it avoids the risk that a change to AntMonitor in production causes an anomaly whose root cause cannot be located in time.

For the circular dependency on infrastructure, on the one hand, AntMonitor was designed from the beginning as one of Ant's lowest-level pieces of infrastructure, relying only on IaaS and oceanbase and no other middleware. On the other hand, for the IaaS and oceanbase that AntMonitor does rely on, we also built independent, stable-version monitoring clusters, ensuring that monitoring changes in the production environment do not make the monitoring services that AntMonitor's upstream dependencies rely on unavailable.

Digital Operation

Digital operation, as the name suggests, means measuring the monitoring system itself comprehensively with numbers.

SLA

Externally, we expose monitoring capabilities as services and use SLO indicators to drive stability guarantees. For example, we set targets for the availability of data query services, for data computation delay and breakpoint rate, for alarm delay and accuracy, and for storage read/write availability and latency. The SLO quantifies the service goals; we also need to commit to the consequences of missing those goals, which is the SLA. For example, if the breakpoint rate of the transaction tenant's data exceeds 50% for more than 60 minutes, it is recorded as a failure.
In this way, the user-facing service quality of the entire monitoring system becomes transparent and tangible, and these indicators also reflect the long-term trend of monitoring stability.
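
As an illustration only, the sketch below evaluates the example SLA rule above (breakpoint rate above 50% sustained for more than 60 minutes); the class and method names are hypothetical, not part of AntMonitor.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: check whether a breakpoint-rate SLA was breached, e.g.
 *  "breakpoint rate above 50% sustained for more than 60 minutes is a failure". */
public final class SlaChecker {

    /**
     * @param breakpointRatePerMinute one sample per minute, values in [0, 100]
     * @return true if the rate stayed above the threshold for more than windowMinutes
     */
    public static boolean isBreached(List<Double> breakpointRatePerMinute,
                                     double thresholdPercent, int windowMinutes) {
        int consecutive = 0;
        for (double rate : breakpointRatePerMinute) {
            consecutive = rate > thresholdPercent ? consecutive + 1 : 0;
            if (consecutive > windowMinutes) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 61 consecutive minutes above 50% -> recorded as a failure.
        List<Double> samples = new ArrayList<>();
        for (int i = 0; i < 61; i++) samples.add(80.0);
        System.out.println(isBreached(samples, 50.0, 60)); // true
    }
}
```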

Cost

Internally, we quantify the cost of monitoring. Specifically, we define and collect cost metrics at the key nodes of the entire link.
Based on an analysis of which nodes on the current monitoring link consume more resources and whose cost can be measured relatively stably, we define three cost metrics: data cleaning volume (MB/minute), data aggregation volume (MB/minute), and data storage volume (data points/minute).
For example, suppose a monitoring configuration has a data collection and cleaning volume of X, a data aggregation volume of Y, and a data storage volume of Z. Given AntMonitor's current performance profile, each 4C8G container allows the cleaning cluster to clean N of data, the computing cluster to aggregate M of data, and the storage cluster to store L of data. The cost of this monitoring configuration is then 4 × (X/N + Y/M + Z/L) CPU cores plus 8 × (X/N + Y/M + Z/L) GB of memory. With this quantified cost information, we can clearly measure the effect of performance optimization on a specific configuration, and we can also make more accurate capacity plans and configuration degradation decisions to keep the monitoring system stable.
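
The cost formula can be restated directly in code. The sketch below only mirrors the arithmetic above; the concrete numbers in the example are illustrative, not real AntMonitor figures.

```java
/** Minimal sketch of the per-configuration cost formula:
 *  cost = 4 * (X/N + Y/M + Z/L) CPU cores + 8 * (X/N + Y/M + Z/L) GB memory,
 *  where each 4C8G container cleans N, aggregates M, and stores L per minute. */
public final class CostModel {

    public static double[] cost(double cleanX, double aggY, double storeZ,
                                double cleanPerContainerN, double aggPerContainerM,
                                double storePerContainerL) {
        double containers = cleanX / cleanPerContainerN
                          + aggY / aggPerContainerM
                          + storeZ / storePerContainerL;
        return new double[] { 4 * containers, 8 * containers }; // {CPU cores, GB of memory}
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a config cleaning 200 MB/min, aggregating 50 MB/min,
        // storing 1,000,000 points/min, on containers rated N=400, M=200, L=2,000,000.
        double[] c = cost(200, 50, 1_000_000, 400, 200, 2_000_000);
        System.out.printf("%.2f cores, %.2f GB%n", c[0], c[1]);
        // 200/400 + 50/200 + 1e6/2e6 = 0.5 + 0.25 + 0.5 = 1.25 containers
        // -> 5.00 cores, 10.00 GB
    }
}
```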

Configuration Control

"Sandbox Intercept"

Experience tells us that changes are often the easiest to introduce stability issues into the system.
Besides the monitoring system's own iterative releases, changes also include a large number of user configuration changes (for example, a large computing task caused by an unreasonable user configuration).
For AntMonitor's own changes, the unitized architecture introduced above minimizes the possibility of a global impact.
For user configuration changes, we introduced a sandbox interception mechanism. Specifically, using the unitization capability, we build a sandbox cluster that is independent of the production cluster. When a configuration changes, the new configuration is intercepted by the product system and delivered to the sandbox cluster. After the new configuration runs for a period of time, it produces cost data, and an inspection component judges whether the change is legitimate based on that cost. Only legitimate changes are applied to the production cluster; otherwise the change goes into an approval process. Sandbox interception is again based on the idea of isolation: the risk of configuration changes is confined to the sandbox cluster, avoiding any possibility of affecting the monitoring production environment.
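
Below is a minimal sketch of the interception decision, assuming a hypothetical per-tenant cost budget; the article does not detail the real inspection component's rules.

```java
/** Minimal sketch of sandbox interception: a new config runs in the sandbox cluster
 *  first, and only changes whose measured cost stays within the tenant's budget are
 *  promoted to production; the rest go to a manual approval flow. */
public final class SandboxGate {

    enum Decision { APPLY_TO_PRODUCTION, SEND_TO_APPROVAL }

    /** Hypothetical rule: promote only if the measured sandbox cost is within budget. */
    public static Decision inspect(double measuredCpuCores, double budgetCpuCores) {
        return measuredCpuCores <= budgetCpuCores
                ? Decision.APPLY_TO_PRODUCTION
                : Decision.SEND_TO_APPROVAL;
    }

    public static void main(String[] args) {
        System.out.println(inspect(5.0, 8.0));   // APPLY_TO_PRODUCTION
        System.out.println(inspect(40.0, 8.0));  // SEND_TO_APPROVAL
    }
}
```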

Summary

The stability construction of a system is a long-term investment and a process of continuous improvement. If we intuitively split "stability" into "steadiness" and "performance", then "steadiness" is the bottom-line guarantee the system provides at the current business volume and capacity level.
But the business keeps growing. While securing the steadiness at hand, we still need to think about how to improve system performance so that limited resources can stably support a larger business volume; that is the "performance" part.
For individuals, participating in system stability construction gives you a more intuitive grasp and a deeper understanding of the system's overall architecture, performance and capacity, and evolution trends; that is, you come to know what the system looks like, how much it can handle, and how to make it do more and do it better. It is work worth doing.

