Author: Song Ao (Fan Xing)

Background and current situation

Introduction to System Architecture


The figure above shows the system architecture actually used inside Alibaba Cloud. The system's main purpose is the computation and storage of real-time data streams. It uses Alibaba Cloud Container Service for Kubernetes (ACK) as its base, so containerized deployment, release, and control all follow the K8s standard. A self-developed Gateway service serves as the traffic entry point and is deployed behind a Service of type LoadBalancer.

This architecture is also a common pattern when building systems on K8s. The CCM component provided by ACK automatically binds the Service to an underlying SLB instance, which accepts external traffic. After the Gateway receives the reported data stream, it writes the data into Kafka and buffers it in a topic. When the Consumer detects that new data has been reported, it reads the data from the corresponding Kafka topic for further computation and processing, and writes the final result to a storage medium.
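
For readers unfamiliar with this pattern, here is a minimal sketch of the Gateway-to-Kafka handoff in Go. The broker address, topic name, HTTP path, and the kafka-go client library are illustrative assumptions, not the system's actual implementation:

```go
package main

import (
	"context"
	"io"
	"log"
	"net/http"

	"github.com/segmentio/kafka-go"
)

func main() {
	// The Gateway buffers reported data in a Kafka topic; the Consumer
	// component later reads from this topic, computes results, and writes
	// them to storage.
	writer := &kafka.Writer{
		Addr:     kafka.TCP("kafka-broker:9092"), // illustrative broker address
		Topic:    "reported-data",                // illustrative topic name
		Balancer: &kafka.LeastBytes{},
	}
	defer writer.Close()

	// Reported data arrives over HTTP through the SLB-backed LoadBalancer Service.
	http.HandleFunc("/report", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Hand the payload to Kafka; the topic acts as the buffer between
		// the Gateway and the Consumer.
		if err := writer.WriteMessages(context.Background(), kafka.Message{Value: body}); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```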

Two storage media are used: Alibaba Cloud block storage (ESSD), which is fast but relatively expensive, and file storage (NAS), which mainly stores data with lower performance requirements. Metadata is managed by ACM, and the self-developed Consumer and Center components are responsible for querying the computed results from storage and returning them to users.

System Status


The system is available globally and deployed in nearly 20 regions worldwide. As the figure suggests, the read and write links are long and pass through multiple components, so the objects to be monitored are correspondingly complex: infrastructure (scheduling, storage, network, etc.), middleware (Kafka), and business components (Gateway, Center, Consumer).

Work content

  • Collection of observable data


Observability mainly includes three types of data: Metrics, Tracing and Logging. The three types of data perform their respective functions and complement each other.

Metrics answer the question of whether there is a problem with the system, and they are the entry point of the system's self-monitoring. Starting an investigation from Metrics when an alarm fires makes it possible to quickly determine whether something is wrong and whether human intervention is required.

Tracing answers where the problem occurs. It provides details such as the calling relationships between components, the path of each request, and between which two components a call went wrong.

After the faulty component is found, the root cause is located through the most detailed data, the logs, namely Logging.

  • Data Visualization & Troubleshooting


After the three types of data are collected, the next step is to present them visually on dashboards, so that problems can be identified at a glance and troubleshooting becomes more efficient.

  • Alarm & Emergency Handling


Once a problem is identified, it is resolved through the alarm and emergency handling process.

The emergency handling process mainly includes the following points:

First, a graded alarm notification and escalation strategy is required, so that the right people can be found quickly to handle an alarm. If the assigned person cannot handle it in time for some reason, the alarm must be escalated to someone else promptly.

Second, the procedures for identifying and handling problems need to be standardized.

Third, post-incident statistics and review are needed. After a failure is resolved, post-mortem statistics and review allow the team to learn from the failure, avoid the same problems in the future, and make the system more stable.

Fourth, operations should be tool-based and GUI-driven ("white-screen" operations): manual command input is minimized, and the full set of standard handling actions is codified in tools.

Practical experience sharing

Observable data collection & access

  • Metrics data


The first step in monitoring is to collect observable data, which can be divided into three levels in the system.

The top level is the application layer, which mainly cares about the health of the core business interfaces, measured by the three RED golden signals (Rate, Error, Duration). Rate is the QPS or TPS of an interface, Error is the error rate or error count, and Duration is how long the interface takes to return. SLOs can be defined on top of the golden signals and assigned an Error Budget. If the Error Budget is consumed too quickly, the SLO should be adjusted in time, and only raised again once the system has been sufficiently optimized. The Apdex Score can also be used to measure service health; its calculation formula is given below. In addition, the application layer also cares about indicators strongly related to the business, such as revenue, number of users, UV, PV, and so on.
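
The Apdex formula is the standard one: requests completing within a target threshold $T$ count as satisfied, those within $4T$ as tolerating, and anything slower as frustrated:

$$
\mathrm{Apdex}_T = \frac{\text{Satisfied count} + \tfrac{1}{2}\,\text{Tolerating count}}{\text{Total samples}}
$$

The Error Budget is simply one minus the SLO target; for example, a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of non-compliance.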


The middle level is middleware and storage. For the Kafka clients widely used in the system, the main concerns are consumer offset commit status, producer buffer occupancy (whether the buffer fills up prematurely so that new messages can no longer be accepted and consumed), consumption latency, average message size, and so on. On the Kafka broker side, the concerns are water levels, read and write traffic, disk usage, etc., and for the ESSD cloud disks, mount success rate, IOPS, free disk space, etc.


The bottom level is the infrastructure layer, where the indicators of interest are more varied: ECS (K8s node) CPU and memory water levels, restart counts, and scheduled O&M events; indicators of K8s core components such as the API server, etcd, and the scheduler; the Pending status of business Pods and whether there are enough resources to schedule them, OOMKilled events, Error events, and so on; and VPC/SLB egress bandwidth, dropped connections, etc.


ECS node monitoring is deployed as a node-exporter DaemonSet. The cloud-native K8s core components expose metrics through their Metrics endpoints for Prometheus to scrape. Since Kafka and the storage components are Alibaba Cloud products, they come with basic monitoring indicators and can be integrated directly. ACK also exposes very important indicators for the CSI storage plug-in, such as mount success rate, IOPS, and space usage, which need to be integrated as well. At the application layer there are Java and Go applications: Java applications are instrumented with Micrometer or Spring Boot Actuator, and Go applications are instrumented directly with the official Prometheus SDK.


For a system based on ACK and K8s, Prometheus is the natural choice of Metrics protocol. Setting up open-source self-hosted Prometheus is about as easy as using the cloud-hosted service, but in terms of reliability and operation and maintenance cost, cloud-hosted Prometheus is the better option. For example, with ARMS you can install a very lightweight probe directly in the cluster, store the data in a fully managed storage component, and build monitoring, alerting, and visualization around it. The whole process is very convenient and requires no self-built open-source components. Cloud hosting also has advantages in collection and storage capacity.


ARMS Prometheus provides one-click integration for ECS nodes, K8s core components, Kafka, and the storage components.


Basic K8s monitoring and node status are integrated mainly through the ACK component management center; simply installing the relevant components is enough.


Kafka is integrated mainly through a Prometheus cloud service instance, which already connects to Alibaba Cloud's fairly complete set of PaaS-layer products. When integrating Kafka, the fields to fill in are common ones, so there is no extra learning cost.


ESSD cloud disk monitoring is integrated mainly through the CSI storage plug-in. The csi-provisioner component in ACK provides useful observability: with the information it exposes, you can see which disks failed to mount or which mounted disks fall short of their IOPS requirements, configure alarms on these conditions, and discover abnormal cloud disk states in time.

Through ARMS Prometheus, preset collection jobs make it easy to monitor cluster-level and node-level PV status.


At the application layer there is no convenient one-click integration. The application must be instrumented, and its Metrics endpoint exposed via service discovery so that the ARMS Prometheus probe can scrape it.
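
As a minimal sketch of what exposing such an endpoint looks like in Go (the port is an arbitrary example), the application serves the default Prometheus registry over HTTP:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose every metric registered with the default Prometheus registry
	// on /metrics; this is the endpoint the probe discovers and scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```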


Java applications are instrumented with Micrometer or Spring Boot Actuator.

Taking the code above as an example, Micrometer can expose a great deal of JVM-related information with just a few lines, and other helpful information, such as process-level memory, threads, and system information, can be added just as easily. After the instrumentation is set up, you start an embedded server, expose the metrics over an HTTP endpoint, and bind it to the chosen port; at that point the instrumentation is done.

In addition, if there are business metrics, just register your own metrics with the global Prometheus Registry.


To have the ARMS probe scrape the exposed endpoint, you only need to add a ServiceMonitor. This can be done directly in the console GUI by writing a few lines of YAML, which completes the whole monitoring pipeline of collection, storage, and visualization.


Go applications are similar to Java ones, but the instrumentation method differs: they mainly use the official Prometheus SDK.

Take the picture above as an example. The system has a query component that cares about the time distribution of each query. You can use Prometheus histogram-type metrics to record the latency distribution of each request, specifying the commonly used buckets when instrumenting. A ServiceMonitor is then written for the Go application's endpoint, and the integration of the whole application is completed on the console.
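
A sketch of this kind of instrumentation with the official Prometheus Go SDK is shown below; the metric name, label, and bucket boundaries are illustrative rather than the ones used in the actual system:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// queryDuration records the latency distribution of each query request.
// The buckets bracket the latency range we care about.
var queryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "center_query_duration_seconds",
	Help:    "Time distribution of query requests.",
	Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
}, []string{"region"})

func handleQuery(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		// Observe the elapsed time of this request into the histogram.
		queryDuration.WithLabelValues("cn-hangzhou").Observe(time.Since(start).Seconds())
	}()
	// ... actual query logic ...
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/query", handleQuery)
	// The /metrics endpoint is then referenced by a ServiceMonitor,
	// exactly as in the exposure sketch shown earlier.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```

The bucket boundaries determine the resolution of the latency distribution shown on the dashboard, so they should be chosen to bracket the latency thresholds the SLO cares about.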


ARMS also provides a non-intrusive observability solution based on eBPF, mainly for scenarios where you do not want to modify the code. The RED metrics of system interfaces are monitored non-intrusively: the ARMS Cmonitor probe, combined with eBPF filters, captures and stores the information.


Using this approach requires additionally installing the Cmonitor App in the cluster. After installation, the Cmonitor Agent appears as a DaemonSet, and each node runs one agent. It registers into the system kernel through eBPF, captures all network traffic on the machine, and then filters out the desired information such as the service network topology. As shown above, it can monitor the QPS of the whole system, the response time distribution, and other information.

  • Trace and Log data


Traces are also implemented with Cmonitor. For logs, the logs of system components, K8s control plane logs, JVM GC logs, and so on are shipped through arms-Promtail (a probe for collecting application logs) to Loki for long-term storage.

K8s system events are monitored mainly through the ARMS event center, which tracks key K8s events such as Pod scheduling problems, OOM, and Error events.

  • Read and write link monitoring and problem location


The picture above is a screenshot of part of the system.

For example, in the trace console you can view standard information such as the components each request passes through, the receive time, processing time, and response time.

  • Component operation log collection and storage


For log collection, configuring Loki as a data source in Grafana lets you quickly retrieve the logs of a Pod, or of all Pods behind a service, by keyword or Pod label, which greatly helps troubleshooting.

  • K8s running event collection and monitoring


The picture above is a screenshot of the Event Center workbench provided by ARMS. The workbench lists the key events, and you can subscribe to the higher-severity ones. Subscribing takes only a few simple steps: fill in the basic rule, select the pattern the event must match, and choose the alarm recipients; you can then monitor key events and respond to them promptly.

Data visualization and troubleshooting

  • Grafana dashboard configuration practice


After data collection is complete, the next step is to create efficient and usable data visualization and troubleshooting tools.

Prometheus and Grafana are a classic pairing, so we also chose Grafana as the data visualization tool. The figure above lists the key points of the dashboard configuration process:

  • When a dashboard loads, control the number of time series queried per panel; displaying too many series puts heavy rendering pressure on the browser, and for troubleshooting, showing that many series at once does not help anyway.
  • Equip the dashboard with flexible Variables, so that data sources and query conditions can be switched at will on a single dashboard.
  • Use Transform so that Table panels can display statistics flexibly.
  • Distinguish between Range and Instant queries, avoiding unnecessary range queries that slow down dashboard rendering.


K8s cluster overview

The picture above shows the dashboard for node information and K8s cluster Pod information. The overall layout is clear at a glance, key information is highlighted with different colors, and numbers change their presentation through Grafana's dynamic threshold feature, improving troubleshooting efficiency.


Node water level

The picture above shows the node water level dashboard, with important information such as disk IOPS, read/write latency, network traffic, memory usage, and CPU usage.


Global SLO

The picture above shows the global SLO dashboard. It is a custom dashboard configured through the hosted Grafana service, a recent ARMS feature: with a Grafana instance hosted on the cloud, you can log in to the Grafana UI directly with your cloud account, and it includes functions customized by Alibaba Cloud.

The dashboard includes global latency, QPS, success rate, error code distribution, QPS trends, and finer-grained information for individual shards, load balancers, gateway components, center components, and so on. During a release, you can compare the new and previous versions by filtering on the version number; the data source can be exposed as a variable in the panel, and you can switch between regions around the world to view each one.


Kafka client and server monitoring

The cluster relies on the Kafka client and server, whose monitoring comes from the cloud monitoring integration.


Internal components depend heavily on Kafka. Metrics are used to monitor the Kafka client and its connections to the brokers, average message size, offset commit rate, consumption traffic, and so on. On the producer side there is information such as buffer occupancy and the number of active producers.


Java application health monitoring

For components written in Java, JVM GC behavior also needs monitoring. The dashboard shows JVM memory usage, GC activity, CPU usage, thread count, thread states, and so on. In addition, after a release or when classes are loaded dynamically, you can check whether the loaded class count keeps rising.


Error-type review statistics in table form

If you want spreadsheet-style statistics, for example on installation status or key-customer status within the cluster, Grafana's Transform feature can turn a dashboard into a spreadsheet-like experience: Transform maps fields to table columns, and after enabling the filter you can narrow the query results by filtering on each field.

  • Troubleshooting case sharing


Displaying log information means querying Loki data from Grafana. For example, the Center component produces a query log containing a lot of raw information, such as the query time and UID. In post-processing, we first filter the desired lines, then extract fields through pattern matching, then apply PromQL-style comparisons on some of the extracted fields, and finally do a second round of processing on the log format.

Alerting and Graded Response Mechanisms


The figure above shows the flow of the alarm and graded response mechanism: alarm configuration, on-call scheduling, alarm triggering, emergency handling, post-incident review, and mechanism optimization, in that order. The mechanism optimization then feeds back into the alarm configuration, forming a complete closed loop.


The process above used a self-built alarm system: the self-built system periodically ran tasks to check indicators and then called DingTalk webhooks, or the webhooks of other O&M systems, to send alarms. This approach has several shortcomings:

  1. You are responsible for the stability of the self-built system. If the alarm system is less stable than the system it monitors, its existence is pointless. Moreover, as more and more regions are opened, the configuration becomes increasingly complex, and it is hard to maintain it yourself and ensure it takes effect globally.


  2. On-call scheduling is manual, so it is easy to miss a shift or forget to assign a backup.


  3. In the alarm triggering stage, the trigger conditions are very simple, and it is hard to add extra business logic to the triggering path, such as dynamic thresholds or dynamic labeling.


  4. In the emergency handling stage, notifications are sent very frequently and cannot be actively claimed or closed. When the system has a problem, similar alarms flood the chat group at high density, and the inability to actively suppress them is another defect.


  5. Post-incident review and optimization have no data support, so the process cannot be improved based on alarm statistics.

Therefore, we chose to build the alarm system on ARMS. ARMS's alert and graded response capabilities bring us many conveniences:


  1. Globally effective alarm templates: an alarm rule only needs to be configured once to apply to different clusters. Without templates, alarms would have to be configured separately for the indicators of every cluster; with the alert rule template, the PromQL or AlertRule of an alert can conveniently be applied to the region clusters around the world.


  2. Alarm schedules and dynamic notification: shift handover is handled dynamically by the system, which is more reliable than manual scheduling.


  3. Event processing flows and alarm enrichment: through the alarm center's event processing flow and the alarm enrichment feature, alarms can be dynamically labeled and graded after they are triggered. As shown in the figure above, an alarm can be tagged with a priority label, higher-priority alarms can be upgraded to P1, and the alarm recipients can be modified dynamically.

To implement this, a data source is needed to provide the basis for labeling. The alarm O&M center console has a data source feature: when an alarm is triggered, the data source can be called via an HTTP or RPC request, and the labeling result is obtained from the HTTP URL. The interface is implemented mainly by writing code online with the IFC lightweight tool. The code reads configuration items from the ACM configuration center and exposes an HTTP interface that the alarm O&M center calls dynamically (a minimal sketch of such an interface follows after the next paragraph).

After completing these steps, you configure the event processing flow to pass the required information to the interface above in match-and-update mode, return the priority, and finally apply it to the alarm.
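
As an illustration only (the endpoint path, request fields, and priority mapping below are hypothetical, and the real implementation reads its mapping from ACM rather than hard-coding it), such an enrichment data source can be sketched as a small HTTP service in Go:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// priorityByService is a stand-in for configuration that would normally be
// loaded from the ACM configuration center.
var priorityByService = map[string]string{
	"gateway": "P1",
	"center":  "P2",
}

type enrichRequest struct {
	Service string `json:"service"`
}

type enrichResponse struct {
	Priority string `json:"priority"`
}

// handleEnrich receives the fields forwarded by the event processing flow
// and returns the priority label to attach to the alarm.
func handleEnrich(w http.ResponseWriter, r *http.Request) {
	var req enrichRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	prio, ok := priorityByService[req.Service]
	if !ok {
		prio = "P3" // default priority for services without an explicit mapping
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(enrichResponse{Priority: prio})
}

func main() {
	http.HandleFunc("/alert/priority", handleEnrich)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```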


  4. Claiming, closing, and suppressing alarms: ARMS provides practical functions such as claiming, closing, following, and suppressing alarms, which significantly improves alarm quality.


  5. Statistics on alarm take-over rates: for the review, you need to know how many alarms each person handled, the handling time, the mean time to recovery, and so on. Once claiming, closing, recovering, and suppressing are in place, the ARMS alarm center records these event logs in the background, and analyzing them yields useful review information.


After receiving an alarm, users want to handle the problem in a GUI ("white-screen") interface, so we introduced a white-screen O&M toolchain based on Grafana. The idea is to include dynamic information when configuring the dashboard and splice it into the dashboard in the form of links.

We have various internal systems; without link splicing, we would have to assemble URLs by hand or search manually, which is very inefficient. By embedding links in Grafana, O&M actions become jumps between links, which greatly improves efficiency. All the work can be completed within one set of Grafana tools, and O&M actions are standardized and codified, reducing the chance of manual error.

Summary and future work


First, improve alarm accuracy and take-over rate. At present there is no good way to make efficient use of the alarm review information. In the future we will try to use the accuracy and take-over statistics to adjust unreasonable alarm thresholds in time, and we may also try multi-threshold alarms, for example one alarm level for values in the range A to B and a higher level above B.

Second, linkage between different types of data. When troubleshooting, besides Metrics, Trace, and Log, there are also profiler data and CPU flame graphs, but at present these are poorly linked with the other observable data. Improving this linkage would make troubleshooting more efficient.

Third, control instrumentation cost. For external customers, the cost of instrumentation is directly tied to the cost of using Alibaba Cloud. We will regularly review self-monitoring metrics for high-cardinality dimensions, clean up useless dimensions, and keep instrumentation cost low.
