About the Authors:
Zhao Jun | Cloud-native engineer in the Infrastructure Department of Nanjing Aifu Road Automotive Technology Co., Ltd. He previously worked on Java architecture and R&D. He is now mainly responsible for the company's cloud-native adoption, including the full cloud migration and cloud-native transformation of F6's infrastructure and core business applications.
Xu Hang | Cloud-native engineer in the Infrastructure Department of Nanjing Aifu Road Automotive Technology Co., Ltd. He was previously responsible for database high availability and the related operations and tuning. He is now mainly responsible for DevOps for R&D efficiency and the cloud-native observability transformation of business systems.
As distributed architectures have gradually become the mainstream of system design, the term Observability has been mentioned more and more frequently.
After the 2017 Distributed Tracing Summit, Peter Bourgon wrote the summary article "Metrics, Tracing, and Logging", which systematically explains the definitions and characteristics of the three, as well as the relationships and differences between them. The article maps the observability problem to how to handle three types of data: metrics, tracing, and logging.
Later, in the book "Distributed Systems Observability", Cindy Sridharan went further and described metrics, tracing, and logging as the three pillars of observability.
In 2018, the CNCF Landscape took the lead in listing Observability as a category, bringing the concept from cybernetics into the IT field. In cybernetics, observability refers to the degree to which a system's internal state can be inferred from its external outputs. The more observable a system is, the more control we have over it.
What problems can observability solve? Chapter 12 of the Google SRE Book has a neat answer: Quick Troubleshooting.
There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are:
- Building observability—with both white-box metrics and structured logs—into each component from the ground up
- Designing systems with well-understood and observable interfaces between components.
Google SRE Book, Chapter 12
In the cloud-native era, distributed systems are becoming increasingly complex and change very frequently, and every change can introduce new types of failures. Without effective monitoring after an application goes live, we may not even know that something is wrong and have to rely on user feedback to learn that the application has a problem.
This article mainly describes how to set up monitoring of application business metrics and how to achieve accurate alerting. Metrics (also translated as measures or indicators) are regular, aggregated numerical statistics of key information that can be plotted as trend charts; through them we can observe the state and trends of a system.
Technology stack selection
Our applications are all Spring Boot applications and use Spring Boot Actuator for application health checks. Starting from Spring Boot 2.0, Actuator switched its underlying implementation to Micrometer, which provides stronger and more flexible monitoring capabilities. Micrometer can integrate with a variety of monitoring systems, including Prometheus.
Therefore, we chose Micrometer to collect business metrics, Prometheus to store and query them, Grafana to display them, and Alibaba Cloud's alarm operation and maintenance center to deliver accurate alerts.
Metrics collection
The R&D department as a whole should focus on the core metrics that reflect the company's business status in real time. For example, Amazon and eBay track sales volume, while Google and Facebook track real-time metrics directly related to revenue, such as ad impressions.
Prometheus uses a metrics protocol called OpenMetrics by default. OpenMetrics is a text-based format. Below is an example of metrics expressed in the OpenMetrics format.
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
The data of a metric consists of a metric name (metric_name), a set of key/value labels (label_name=label_value), a numeric value (value), and an optional timestamp:
metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
Meter
Micrometer provides a variety of metric primitives (Meters); a Meter is the interface used to collect metric data in an application. The concrete Meter types in Micrometer are: Timer, Counter, Gauge, DistributionSummary, LongTaskTimer, FunctionCounter, FunctionTimer, and TimeGauge.
- Counter describes a monotonically increasing value, such as the number of calls to a method or the total number of cache hits/accesses. Used through the @Counted annotation, it supports the recordFailuresOnly option, which records only failed method calls, and its metric data carries four labels by default: class, method, exception, and result.
- Timer records three kinds of data at once: total count, total time, and max time, and carries a default label: exception.
- Gauge describes a value that fluctuates within a range and is usually used for variable measurements, such as the number of messages in a queue or the size of a thread pool's task queue.
- DistributionSummary is used to track the distribution of recorded values; a minimal sketch of these Meter types in code follows below.
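For reference, the following sketch registers each of these Meter types programmatically against a SimpleMeterRegistry. The metric names and tags are illustrative only, not the ones used in our applications.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MeterDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Counter: a monotonically increasing value
        Counter hits = Counter.builder("demo.cache.hits").tag("cache", "template").register(registry);
        hits.increment();

        // Timer: records count, total time, and max time of an operation
        Timer timer = Timer.builder("demo.latency").register(registry);
        timer.record(120, TimeUnit.MILLISECONDS);

        // Gauge: samples a value that can go up and down
        AtomicInteger queueSize = new AtomicInteger(0);
        Gauge.builder("demo.queue.size", queueSize, AtomicInteger::get).register(registry);

        // DistributionSummary: distribution of arbitrary (non-time) values
        DistributionSummary payload = DistributionSummary.builder("demo.payload.bytes").register(registry);
        payload.record(512);
    }
}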
Application integration process
To make it easy for microservice applications to integrate, we wrapped the setup above into a micrometer-spring-boot-starter. The starter is implemented as follows.
- Introduce Spring Boot Actuator dependencies
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>${micrometer.version}</version>
</dependency>
- Do initial configuration
Actuator enables the collection of some metrics by default, such as system, JVM, and HTTP metrics, and these can be turned off by configuration. In our case we do need to turn them off, because we already collect them through the JMX exporter.
management.metrics.enable.jvm=false
management.metrics.enable.process=false
management.metrics.enable.system=false
If you don't want the web application's Actuator management port to be the same as the application port, you can use management.server.port to set a separate port. This is good practice: attacks against exposed Actuator endpoints are common, and a separate port that is not exposed to the public network causes far fewer problems.
management.server.port=xxxx
- Configure Spring beans
Passing Tags.empty() to TimedAspect is intentional: it prevents tags derived from long class names from putting pressure on Prometheus.
@PropertySource(value = {"classpath:/micrometer.properties"})
@Configuration
public class MetricsConfig {

    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry, (pjp) -> Tags.empty());
    }

    @Bean
    public CountedAspect countedAspect(MeterRegistry registry) {
        return new CountedAspect(registry);
    }

    @Bean
    public PrometheusMetricScrapeEndpoint prometheusMetricScrapeEndpoint(CollectorRegistry collectorRegistry) {
        return new PrometheusMetricScrapeEndpoint(collectorRegistry);
    }

    @Bean
    public PrometheusMetricScrapeMvcEndpoint prometheusMvcEndpoint(PrometheusMetricScrapeEndpoint delegate) {
        return new PrometheusMetricScrapeMvcEndpoint(delegate);
    }
}
When an application integrates, it only needs to introduce the micrometer-spring-boot-starter dependency:
<dependency>
<groupId>xxx</groupId>
<artifactId>micrometer-spring-boot-starter</artifactId>
</dependency>
You can now view the data recorded by Micrometer by visiting http://ip:port/actuator/prometheus .
Custom business metrics
Micrometer has built-in @Counted and @Timed annotations. By adding @Timed or @Counted to a method, you can collect the number of calls to the method, how long they take, and whether an exception occurred.
@Timed
If you want to record the number of calls to the print method and how long they take, annotate the print method with @Timed and give the metric a name.
@Timed(value = "biz.print", percentiles = {0.95, 0.99}, description = "metrics of print")
public String print(PrintData printData) {
}
After the @Timed annotation is added to the print method, Micrometer records the number of calls to print (count), the maximum call time (max), and the total call time (sum). percentiles = {0.95, 0.99} means that the p95 and p99 request times are also computed. The recorded metric data is as follows.
biz_print_seconds_count{exception="none"} 4.0
biz_print_seconds_sum{exception="none"} 7.783213927
biz_print_seconds_max{exception="none"} 6.14639717
biz_print_seconds{exception="NullPointerException"} 0.318767104
biz_print_seconds{exception="none",quantile="0.95",} 0.58720256
biz_print_seconds{exception="none",quantile="0.99",} 6.157238272
The @Timed annotation supports the following attributes:
- value: required, the metric name
- extraTags: extra tags for the metric; multiple tags are supported, in the format {"key", "value", "key", "value"}
- percentiles: values less than or equal to 1; the percentiles of the call time to compute, such as p95 and p99
- histogram: whether to record the method's execution time as a histogram metric
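As an illustration, several of these attributes can be combined on one method. The metric name biz.render, the type TemplateData, and the tag values below are hypothetical, used only to show the syntax.
@Timed(value = "biz.render",
       extraTags = {"module", "print", "version", "v2"},
       percentiles = {0.95, 0.99},
       histogram = true,
       description = "latency of template rendering")
public String render(TemplateData data) {
    // business logic ...
    return "ok";
}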
@Timed also records exceptions thrown by the method, and different exceptions are recorded as separate series. The logic is to catch the exception thrown by the method, record the exception class name, and then re-throw the original exception:
try {
    return pjp.proceed();
} catch (Exception ex) {
    exceptionClass = ex.getClass().getSimpleName();
    throw ex;
} finally {
    try {
        sample.stop(Timer.builder(metricName)
                .description(timed.description().isEmpty() ? null : timed.description())
                .tags(timed.extraTags())
                .tags(EXCEPTION_TAG, exceptionClass)
                .tags(tagsBasedOnJoinPoint.apply(pjp))
                .publishPercentileHistogram(timed.histogram())
                .publishPercentiles(timed.percentiles().length == 0 ? null : timed.percentiles())
                .register(registry));
    } catch (Exception e) {
        // ignoring on purpose
    }
}
@Counted
If you don't care about a method's execution time but only about the number of calls, or even only about the number of calls that threw an exception, the @Counted annotation is the better choice. recordFailuresOnly = true means that only calls that threw an exception are counted.
@Counted(value = "biz.print", recordFailuresOnly = true, description = "metrics of print")
public String print(PrintData printData) {
}
The recorded metric data is as follows.
biz_print_failure_total{class="com.xxx.print.service.impl.PrintServiceImpl",exception="NullPointerException",method="print",result="failure",} 4.0
The counter is a monotonically increasing value; it is incremented by 1 on each method call.
private void record(ProceedingJoinPoint pjp, Counted counted, String exception, String result) {
    counter(pjp, counted)
            .tag(EXCEPTION_TAG, exception)
            .tag(RESULT_TAG, result)
            .register(meterRegistry)
            .increment();
}

private Counter.Builder counter(ProceedingJoinPoint pjp, Counted counted) {
    Counter.Builder builder = Counter.builder(counted.value()).tags(tagsBasedOnJoinPoint.apply(pjp));
    String description = counted.description();
    if (!description.isEmpty()) {
        builder.description(description);
    }
    return builder;
}
Gauge
Gauge is used to describe a value that fluctuates within a range. It is usually used for variable measurements, such as the workId of the snowflake algorithm, the id of the template being printed, or the size of a thread pool's task queue.
- Inject PrometheusMeterRegistry
- Construct the Gauge, name the metric, and bind its value.
@Autowired
private PrometheusMeterRegistry meterRegistry;

public void buildGauge(Long workId) {
    Gauge.builder("biz.alphard.snowFlakeIdGenerator.workId", workId, Long::longValue)
            .description("alphard snowFlakeIdGenerator workId")
            .tag("workId", workId.toString())
            .register(meterRegistry).measure();
}
The recorded metric data is as follows.
biz_alphard_snowFlakeIdGenerator_workId{workId="2"} 2
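As another illustration, a Gauge can sample a live value such as a thread pool's queue size. The metric name biz.executor.queue.size is hypothetical, and meterRegistry is the injected PrometheusMeterRegistry from above. Note that Micrometer holds only a weak reference to the gauged object, so the executor must remain strongly referenced elsewhere (for example, as a long-lived bean).
public void monitorQueue(ThreadPoolExecutor executor) {
    // The value function is re-evaluated each time Prometheus scrapes the registry.
    Gauge.builder("biz.executor.queue.size", executor, e -> e.getQueue().size())
            .description("task queue size of the print executor")
            .register(meterRegistry);
}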
Configure SLA metrics
If you want to record the SLA distribution of a timing metric, Micrometer provides the corresponding configuration:
management.metrics.distribution.sla[biz.print]=300ms,400ms,500ms,1s,10s
The recorded metric data is as follows.
biz_print_seconds_bucket{exception="none",le="0.3",} 1.0
biz_print_seconds_bucket{exception="none",le="0.4",} 3.0
biz_print_seconds_bucket{exception="none",le="0.5",} 10.0
biz_print_seconds_bucket{exception="none",le="0.6",} 11.0
biz_print_seconds_bucket{exception="none",le="1.0",} 11.0
biz_print_seconds_bucket{exception="none",le="10.0",} 12.0
biz_print_seconds_bucket{exception="none",le="+Inf",} 12.0
Storage and query
We use Prometheus to store and query metric data. Prometheus uses pull-based metrics collection: pull means that Prometheus actively scrapes metrics from the target system, whereas push means that the target system actively pushes metrics to the collector. The Prometheus documentation explains why pull was chosen.
Pulling over HTTP offers a number of advantages:
- You can run your monitoring on your laptop when developing changes.
- You can more easily tell if a target is down.
- You can manually go to a target and inspect its health with a web browser.
Overall, we believe that pulling is slightly better than pushing, but it should not be considered a major point when considering a monitoring system.
Prometheus also supports push-based collection, via the Pushgateway.
For cases where you must push, we offer the Pushgateway.
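We do not use the Pushgateway ourselves, but for illustration, pushing metrics with the official Prometheus Java client (simpleclient and simpleclient_pushgateway) looks roughly like the sketch below; the gateway address and job name are placeholders.
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class PushDemo {
    public static void main(String[] args) throws Exception {
        CollectorRegistry registry = new CollectorRegistry();
        Gauge lastSuccess = Gauge.build()
                .name("batch_job_last_success_unixtime")
                .help("Last time the batch job succeeded, in unixtime.")
                .register(registry);
        lastSuccess.setToCurrentTime();

        // "pushgateway.example.com:9091" is a placeholder address.
        PushGateway pushGateway = new PushGateway("pushgateway.example.com:9091");
        pushGateway.pushAdd(registry, "demo_batch_job");
    }
}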
In order for Prometheus to collect application metrics data, we need to do two things:
- Expose the application's actuator port through a Service and add the label monitor/metrics
apiVersion: v1
kind: Service
metadata:
  name: print-svc
  labels:
    monitor/metrics: ""
spec:
  ports:
  - name: custom-metrics
    port: xxxx
    targetPort: xxxx
    protocol: TCP
  type: ClusterIP
  selector:
    app: print-test
- Add ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics
  labels:
    app: metric-monitor
spec:
  namespaceSelector:
    any: true
  endpoints:
  - interval: 15s
    port: custom-metrics
    path: "/manage/prometheusMetric"
  selector:
    matchLabels:
      monitor/metrics: ""
Prometheus will periodically scrape the endpoints of the Service ( http://podip:port/manage/prometheusMetric ), pull the application's metrics, and save them in its own time series database.
The data stored in Prometheus is in text form. Prometheus does ship with a Graph view, but it is limited and not very polished, so a visualization tool is still needed to display the data and provide a standard, easy-to-read dashboard of the system's current state. The most common choice is Grafana. Prometheus has a powerful built-in time series database and provides the PromQL query language, which supports rich queries, aggregations, and logical operations on time series data. By configuring the Prometheus data source and PromQL queries in Grafana, Grafana can query Prometheus's metric data and display it as charts.
- Configure the Prometheus data source in Grafana
- Add panels and configure the data source, query statements, and chart styles
- Multiple panels can be combined into one dashboard to form a monitoring board.
Precise alerting
No system is perfect. When anomalies and failures occur, it is especially important to discover the problem immediately and locate its cause quickly. However, data collection alone is not enough; a sound monitoring and alerting system is needed to respond quickly and send out alerts.
Our initial plan was to create alert rules with the Prometheus Operator's PrometheusRule, have the Prometheus servers send alerts to Alertmanager, and have Alertmanager forward them to a DingTalk robot. After running this way for a while, however, we found some problems with this approach. The leaders of the SRE team and the R&D teams received far too many alerts: everything was sent to a single group, and opening the group chat meant a screen full of alert titles, severities, and values. System alerts that operations should handle were mixed with application alerts that R&D should handle; there was too much information, high-priority alerts were hard to filter out quickly, and it was hard to assign each alert to the right handler quickly. We therefore wanted application alerts to be delivered precisely to the R&D team that owns the application.
After a period of research, we finally chose Alibaba Cloud's ARMS Alarm Operation and Maintenance Center to manage alerts. It supports Prometheus as a data source and supports adding DingTalk robots as contacts.
- Collect the webhook addresses of each R&D team's DingTalk robot and create a robot contact for each team.
- Configure a notification policy for each R&D team, filter on the team field in the alert information, and bind the policy to the corresponding DingTalk group robot contact.
In this way, an application's alerts are sent directly to the corresponding R&D team, which saves the time spent filtering information and re-assigning alerts, and improves the efficiency of alert handling.
The effect is as follows:
The ARMS Alarm Operation and Maintenance Center supports Grafana, Zabbix, ARMS, and other data sources, and provides alert dispatching and claiming, alert aggregation and deduplication, and escalation notifications that repeatedly remind handlers of alerts left unhandled for too long, or escalate them to leaders, ensuring that alerts are resolved in a timely manner.