Author: Chen Chen
Application of Logs in Observability Scenarios
As IT architectures evolve and cloud-native practices mature, the perspectives of development and business teams have converged, and operations teams now have broader and more proactive observability capabilities than traditional monitoring provided. As one of the three pillars of observability (Tracing, Metrics, and Logs), logs help the operations team track the running state of a program, locate the root cause of a fault, and reconstruct the fault scene. The ways logs are used for fault discovery and fault location fall roughly into two categories: log search and log analysis:
- Log search:
- Search logs by log keywords;
- Search logs by thread name and class name;
- Combined with trace context information, search logs by TraceID, spanName, parentSpanName, serviceName, and parentServiceName.
- Log analysis:
- View and analyze trends in the number of specified logs;
- Generate metrics from log content (for example, if a log line is printed for every successful transaction, a metric for the transaction amount can be derived; a minimal sketch follows this list);
- Automatically identify log patterns (for example, view how the number and proportion of logs in different patterns change over time).
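To make the metric-from-logs idea above concrete, here is a minimal sketch; the log format, the regular expression, and the use of Micrometer's SimpleMeterRegistry are illustrative assumptions, not the ARMS implementation:

```java
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: derives a transaction-amount metric from successful-transaction log lines,
// assuming a hypothetical log format like "transaction success, amount=123.45".
public class LogToMetricSketch {
    private static final Pattern AMOUNT =
            Pattern.compile("transaction success, amount=(\\d+(?:\\.\\d+)?)");
    private static final SimpleMeterRegistry REGISTRY = new SimpleMeterRegistry();

    public static void onLogLine(String line) {
        Matcher m = AMOUNT.matcher(line);
        if (m.find()) {
            // Each matching log line adds its amount to a running counter metric.
            REGISTRY.counter("transaction.amount.total").increment(Double.parseDouble(m.group(1)));
        }
    }

    public static void main(String[] args) {
        onLogLine("2022-07-01 10:00:00 INFO transaction success, amount=42.50");
        System.out.println(REGISTRY.counter("transaction.amount.total").count()); // 42.5
    }
}
```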
In production, by flexibly combining the methods above, the operations team can filter out distracting factors during daily observation and fault location, and narrow down or even pinpoint the root cause of a problem more quickly.
Shortcomings of Common Open Source Logging Solutions
Most common logging solutions install a log collection agent on the host and configure log collection paths so that logs are shipped to a third-party system for storage, query, display, and analysis. The mature ELK (Elasticsearch, Logstash, Kibana) open-source stack has attracted many users with its active community, simple installation, and ease of use.
However, the ELK solution also has shortcomings:
- High operation and maintenance costs: Building a complete ELK system requires deploying ES clusters, Kafka clusters, Logstash components, and more. As the log volume grows, issues such as multi-cluster splitting, multi-cluster upgrades, and stability demand ever more manpower.
- High resource overhead: The resource consumption of almost every component in the ELK architecture grows linearly with log volume, driving up cost significantly.
- Lack of enterprise-level capabilities: Logs often contain business-critical information and therefore require complete multi-tenant isolation and fine-grained permission control, which the free open-source ELK architecture lacks.
ARMS-based logging solution
Compared with a self-built open-source ELK stack, is there a lighter logging solution that is easier to operate and maintain?
Application Real-Time Monitoring Service (ARMS) now provides a simple and easy-to-use logging solution that lets the operations team integrate application logs with one click. Compared with open-source solutions, it offers richer functionality, lower cost, and better ease of use.
Features
1. Automatically enrich logs
Logs are automatically enriched with the associated call-chain context, including TraceID, ServerIP, spanName, parentSpanName, serviceName, and parentServiceName. This fully covers observability scenarios that require correlating Tracing with Logs, such as searching logs by TraceID or finding the upstream application and upstream interface that triggered an abnormal log.
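ARMS injects this context automatically through its agent; purely as an illustration of how trace context ends up in every log line, a hand-rolled equivalent using SLF4J's MDC might look like the sketch below (the key names and the Logback pattern are assumptions for the example):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustration of how trace context can be attached to every log line.
// ARMS does this automatically via its agent; the MDC keys here are assumptions.
public class TraceContextLogging {
    private static final Logger log = LoggerFactory.getLogger(TraceContextLogging.class);

    public void handleRequest(String traceId, String spanName) {
        MDC.put("traceId", traceId);
        MDC.put("spanName", spanName);
        try {
            // With a layout pattern such as "%d %-5level [%X{traceId}] [%X{spanName}] %msg%n",
            // every line logged here carries the trace context and can be searched by TraceID.
            log.info("order created");
        } finally {
            MDC.clear();
        }
    }
}
```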
2. Provide intelligent log clustering capability
Logs that are large in volume, complex in content, and hard to fit into a unified format are summarized and clustered into abstract patterns, so that operations personnel can quickly see the "category" difference between abnormal logs and normal logs, locate abnormal logs, and find problems faster.
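The clustering itself is performed by ARMS; to make the notion of a log "pattern" concrete, a naive sketch that masks variable tokens so lines of the same shape collapse into one pattern could look like this (purely illustrative, not the ARMS algorithm):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Naive illustration of log pattern extraction: variable tokens are replaced with
// placeholders so that structurally identical lines count toward the same pattern.
public class LogPatternSketch {
    static String toPattern(String line) {
        return line
                .replaceAll("\\b[0-9a-f]{16,}\\b", "<ID>") // long hex ids, e.g. trace ids
                .replaceAll("\\d+", "<NUM>");              // any number
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "order 1001 paid in 35 ms",
                "order 1002 paid in 41 ms",
                "timeout calling inventory after 3000 ms");
        Map<String, Long> counts = new TreeMap<>();
        for (String l : lines) {
            counts.merge(toPattern(l), 1L, Long::sum);
        }
        counts.forEach((pattern, count) -> System.out.println(count + "  " + pattern));
        // Output:
        // 2  order <NUM> paid in <NUM> ms
        // 1  timeout calling inventory after <NUM> ms
    }
}
```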
3. Provide LiveTail capability
Monitor and analyze online logs in real time, with millisecond-level reporting latency and a viewing experience closest to tail -f, effectively reducing operational pressure.
4. Based on the Arthas capability of ARMS, adjust the logger output level at runtime
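The effect is similar to changing a logger's level programmatically at runtime; a minimal sketch of that underlying idea, assuming Logback as the logging framework (this is not the actual ARMS/Arthas interaction):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;

// Sketch: lower or raise a logger's level at runtime without restarting the application,
// which is the effect the ARMS console exposes. Logback is assumed as the framework here.
public class RuntimeLogLevel {
    public static void setLevel(String loggerName, String level) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.getLogger(loggerName).setLevel(Level.toLevel(level));
    }

    public static void main(String[] args) {
        setLevel("com.example.order", "DEBUG"); // hypothetical logger name
    }
}
```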
5. Generate log-based alerts and log-derived metrics with one click (coming soon in internal beta).
Ease of use
- Activate with one click in the ARMS console to use the full set of log-related functions;
- No additional log collection components need to be installed, and no application changes are required;
- There is no need to manage log servers or the logs themselves, reducing the daily operation and maintenance workload;
- Supports both logs collected directly by ARMS and logs collected into Log Service (SLS).
Operation and maintenance cost
- The log function is currently in public beta and is completely free;
- Provides a flexible, configurable log discarding policy to cut down invalid logs at the source (an illustrative sketch follows this list);
- Provides flexible, configurable log storage policies; the log retention period can be set according to the importance of the application.
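As an illustration of the "discard invalid logs at the source" idea referenced in the list above, a Logback filter that drops noisy lines before they are stored might look like the following; ARMS exposes this as a console-side discard policy, and the keyword below is hypothetical:

```java
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;

// Illustration of dropping low-value log lines at the source with a Logback filter.
// ARMS provides this as a configurable discard policy; the "heartbeat ok" keyword is hypothetical.
public class DropNoiseFilter extends Filter<ILoggingEvent> {
    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (event.getFormattedMessage().contains("heartbeat ok")) {
            return FilterReply.DENY;    // discard noisy lines before they are stored
        }
        return FilterReply.NEUTRAL;     // let all other events pass through unchanged
    }
}
```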
ARMS Log Function Demonstration & Scenario Best Practice
Prerequisites
- Upgrade the agent to version 2.7.1.4 or later (K8s applications are upgraded to agent 2.7.1.4 after a restart; non-K8s applications require users to manually download and mount the latest agent).
- On the application list page of the ARMS console, click the application for which you want to enable log collection, click Application Settings in the lower left, open the custom configuration page, turn on the log collection switch, configure the parameters for your scenario, and click Save.
- For directly collected logs, the output of the log framework is captured by the ARMS probe and pushed directly to the ARMS log analysis center.
- If you need to collect application logs into Log Service (SLS), configure the corresponding Project and Logstore in the ARMS application configuration; ARMS will embed the Log Service page to facilitate log analysis.
Feature demonstrations
- Search logs by TraceID
- View the trend of the number of log entries containing the top keywords
- LiveTail
Click the link below to view the video in action:
https://developer.aliyun.com/live/250112
- Log clustering: In the figure below, the upper left shows how the number of logs in different patterns changes over time, the right side lists the total number of logs per pattern in descending order, and the bottom shows the raw log text of each pattern. You can search for a log pattern to view sample log text in that pattern.
For more cases of the ARMS log function, you can view the official ARMS documentation:
https://help.aliyun.com/document_detail/432298.html
Best Practices
The following briefly introduces two best practices of the Alibaba Cloud observability team using the ARMS log function in cloud-service SRE scenarios.
Case: Troubleshooting the drop of indicators
- Background
Application A is mainly responsible for receiving traffic information reported by business applications via RPC, parsing it, and writing it to storage after simple processing. The reported traffic information includes a timestamp, the business application name, the interface name, the number of requests to the interface in one minute, and the total time spent on those requests in that minute. Once written to storage, traffic monitoring for the business application can be viewed on the console. One day, business application B reported that its traffic monitoring data dropped after a scale-out, and troubleshooting began immediately.
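Purely to make the reported payload concrete, the traffic record described above could be modeled roughly as follows; the type and field names are assumptions for illustration:

```java
// Purely illustrative shape of the traffic record described above; names are assumptions.
public record TrafficReport(
        long timestamp,         // minute-aligned reporting timestamp
        String serviceName,     // reporting business application, e.g. application B
        String interfaceName,   // interface being measured
        long requestCount,      // number of requests to the interface in that minute
        long totalCostMillis) { // total time spent serving those requests in that minute
}
```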
- Troubleshooting plan
- First, open the log platform and view the logs of application A. Many write-storage throttling exceptions appear, but counting these exceptions over the last 3 hours shows no significant increase, indicating that they occur only occasionally and have no real impact.
- It is then suspected that some nodes of application A have hung, causing application B to fail to report data. Checking the log output of the different instances of application A shows that it is roughly uniform, so this suspicion is ruled out.
- At this point, a problem within application A is largely ruled out, and an abnormality in data reporting is suspected. Since application B's traffic monitoring data only drops rather than falling to 0, it is suspected that data reporting from some nodes of application B is abnormal. Through log analysis, the list of IPs of application B that are currently reporting data normally is obtained and given to the user. It turns out that the newly scaled-out machines of application B have not successfully reported data, so a network problem on the new machines is suspected.
- Viewing application B's logs on the log platform shows many network exceptions. Checking the machines where the exceptions occur, all of them are the newly scaled-out machines, consistent with the conclusion of the previous step. Logging in to one of the machines immediately confirms that the network to application A is indeed unreachable, so the network team is contacted to fix the problem.
- Scenario Summary
By combining log search with log analysis, the root cause of the problem was finally located.
Case: Log Storage Cost Reduction
- Background
Application C has many developers, unreasonable log level settings, and a huge volume of logs, so the cost of the log function is high and there is an urgent need to reduce costs and improve efficiency.
- Governance plan
- Based on past troubleshooting experience, logs from more than a week ago are rarely needed, so the log retention policy is shortened from one month to one week.
- Using the automatic log pattern identification feature of ARMS, check the current top-k log patterns and find that many of them are invalid logs. Set the log discarding policy to drop these invalid logs.
- Scenario Summary
By combining the shorter retention period with automatic log pattern identification, the overall log cost is reduced to one tenth of what it was. The ARMS application log function is now fully available, enabling the operations team to quickly gain log analysis and search capabilities!
Application Real-Time Monitoring Service (ARMS) product capability updates for July
Click here to try it now for free!