Author: Xia Ming
The Evolution of Application Architecture and Observability Technology
In the early days of software development, the monolithic architecture was widely adopted for its simple structure and ease of testing and deployment, and the corresponding monitoring and diagnosis techniques were based mainly on logs and metrics built from log keywords. As software complexity grew, monolithic applications gradually evolved toward distributed and microservice architectures, the overall call environment became more and more complex, and it became difficult to locate problems quickly by relying on logs and metrics alone.
Distributed tracing therefore emerged. Early tracing, however, was only loosely combined with logs and metrics, and mostly existed as application-layer APM software.
With the spread of cloud computing and cloud-native concepts, the boundaries from the business layer down through the application layer, containers, and infrastructure are constantly being broken, and the responsibilities of development, operations, security, and other roles keep blurring. The demand for observability has grown stronger, and the connections among Traces, Metrics, and Logs have become closer.
A typical cloud-native architecture and its observability demands
A typical cloud-native architecture often takes the form of a hybrid cloud: for security or disaster-recovery reasons, some applications may be deployed on the public cloud while others run in self-built data centers, and for development efficiency and performance, different applications may be written in multiple languages. The observability demands can therefore be summarized in the following four points:
1. Full-stack, unified monitoring and alerting: for example, business metrics such as transaction volume and payment volume, the three golden metrics of the application, and infrastructure metrics such as CPU utilization and network status can all be placed on one dashboard for overall monitoring. This is a common practice during major promotion events.
2. Front-end/back-end, multi-language, full-link tracing: a user request starts from the terminal, passes through the gateway, and is then traced along the call trajectory between back-end applications and cloud components, so that the point where the request goes wrong can be located quickly.
3. Unified visualization of cross-cloud data: unifying the visualization of different data types and of data from different environments requires powerful visualization components.
4. Secondary processing of data in open formats: business customization requires further processing and analysis of the data. If this is based on open-source data format standards, much of the work becomes easier to implement and many existing tools can be reused.
Why build a unified observability platform on OPLG
Traditional monitoring and diagnosis platforms often suffer from the following pain points:
1. Much of the instrumentation is implemented by users themselves. These closed-source implementations lead to inconsistent data formats, the instrumentation is hard to reuse across systems, and the access cost is very high.
2. Metrics sit in isolation in separate monitoring subsystems, some in the network, some in the application, some in containers. Troubleshooting a full-link problem therefore demands a great deal of experience from developers and users, and efficiency is very low.
3. Traces cannot be stitched together because of insufficient instrumentation coverage or inconsistent protocols, resulting in frequent breaks in the chain.
4. Reporting the full detail of logs and traces to the server brings very high costs even though only a small fraction is ever queried, and hot spots can still cause performance bottlenecks.
5. A self-built console has high front-end development costs, long development cycles, and poor flexibility, making it hard to keep up with the pace of business iteration.
6. Observability data across systems lacks unified label management and correlates poorly, making comprehensive analysis difficult.
To solve these problems, we gradually settled on a solution proven feasible in production: building a unified observability platform based on OPLG. This scheme has the following advantages:
1. Open source and open: the entire technology stack is open source. With the community's joint efforts, such as OpenTelemetry's trace instrumentation and Prometheus's metrics exporters, the collection, generation, and reporting of data for most common components is covered without much extra development, reducing access costs.
2. Unified specifications: being open source and built on one set of specifications makes it easy to connect and correlate internal subsystems and even external third-party systems.
3. Freedom and flexibility: the good designs of OPLG, especially Grafana, allow observability data to be organized very flexibly, and the dashboards each scenario requires can be customized freely to meet specific needs.
4. Edge computing: based on the OpenTelemetry Collector, data processing can be "shifted left" into the user's cluster. Through edge computing, value can be extracted from the data in advance and only the extracted results sent to the server, reducing public-network transmission costs and server-side storage costs.
Building a cloud-native observability platform based on OPLG
OPLG consists of the following four modules:
1. End-side data generation and reporting: traces are generated through OpenTelemetry, metrics through Prometheus, and logs through Loki.
2. Unified edge-side data processing and routing: once collected, all data can be processed and routed in a unified way at the edge through the OpenTelemetry Collector.
3. Fully managed server side: a managed backend provides better performance and more stable service without binding the technology stack, so migration stays free and flexible.
4. Unified visualization: Grafana can provide unified, flexibly customized monitoring; alternatively, cloud services such as ARMS provide refined interactive dashboards for strongly interactive scenarios to improve the query experience. If you have your own needs, you can also build your own console on top of the open data formats or the OpenAPI.
Unified data collection and edge computing based on the OpenTelemetry Collector
The OpenTelemetry Collector first performs unified data collection: it can ingest any data type and then apply general-purpose processing such as formatting, labeling, and pre-aggregation of certain metrics. Most commonly, call-chain data can be aggregated at service and IP granularity before sampling, which keeps the derived metrics accurate while reducing the cost of uploading to the server.
The Collector can also provide local storage, temporarily caching recent data so that, for example, full queries over the last 10 minutes, or retrieval of all errors and slow calls, can make better use of edge storage capacity.
For processed data, the Collector offers very flexible forwarding. It supports multiple protocols, such as the Prometheus protocol and OTLP, and multiple destinations, so data can be sent to a cloud server or dumped to edge storage as needed.
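As a concrete illustration, the sketch below shows what such an edge pipeline might look like in the Collector's YAML configuration. It is a minimal example assuming an OTLP data source; the endpoints and the cluster label are placeholders, not real ARMS values.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Tag data with a cluster label at the edge so backends can correlate it.
  resource:
    attributes:
      - key: cluster
        value: edge-cluster-1   # hypothetical label value
        action: upsert
  # Batch data before export to cut public-network transmission cost.
  batch:
    timeout: 5s

exporters:
  # Forward traces to a remote OTLP-compatible server (placeholder endpoint).
  otlp:
    endpoint: otel-server.example.com:4317
  # Expose locally pre-aggregated metrics for a Prometheus server to scrape.
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
```

The same configuration file is where routing choices live: adding another exporter to a pipeline's list is all it takes to fan data out to a second destination.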
A fast and stable query experience based on the ARMS managed server
The ARMS managed server implements many query-performance optimizations and acceleration techniques for massive-data scenarios. For example, operator pushdown improves query performance in more than 70% of scenarios by over 10x compared with the open-source baseline; for long-range queries such as 7 to 10 days, downsampling improves query performance by a further order of magnitude; for divergent dimensions such as URLs, automatic convergence solves the query freezes caused by hot spots; and for trace data, routing and scanning happen at both the application and TraceId levels, optimized for the access patterns of trace queries.
Beyond query performance over massive data, we have also built out high availability. Global deployment and multi-availability-zone disaster recovery are supported by default, avoiding the risk of a single region or AZ becoming unavailable. Under sudden traffic or rapid user growth, a self-built data center must plan capacity in advance, whereas ARMS scales adaptively with traffic, so bursts do not cause performance bottlenecks. In extreme cases, the availability of core functions is preserved through dynamic configuration pushdown and automatic flow-control degradation. Finally, full-link SLA monitoring and early warning, together with 7x24 emergency response, allow availability problems to be detected and recovered quickly.
Building a flexible and refined visual interface based on Grafana + ARMS
On top of this, Grafana + ARMS provides a flexible and refined visualization experience.
Grafana's rich dashboard plugins and broad data-source support can bring all kinds of data onto one dashboard. Through flexible query syntaxes such as PromQL and LogQL, back-end developers, testers, SREs, and others can quickly build their own scenario dashboards in a low-code way without any front-end involvement, improving observability efficiency.
Thanks to Grafana's open-source nature, if you migrate from a self-built data center to the cloud, or between clouds, the entire visualization setup can be quickly replicated through JSON files or other means, completing the end-to-end migration easily without being strongly bound to a specific vendor.
However, Grafana also has shortcomings; for example, its experience in strongly interactive scenarios is not good enough. ARMS therefore provides more refined interactive pages for strongly interactive scenarios such as call-chain correlation analysis, online diagnosis, and configuration management. ARMS will also further enhance Grafana's chart plugins, providing new plugins that strengthen the visualization capabilities of the managed Grafana.
Demo 1: Full-link tracing and application diagnosis based on OpenTelemetry and ARMS
Go to the ARMS console's access center and find the OpenTelemetry access method (other access methods can also be chosen). Taking a Java application as an example, you can generate data through the OT Java Agent and then adjust the startup parameters, such as the access point and the authentication token.
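For reference, a typical startup command looks like the sketch below. The property names are the standard OpenTelemetry Java agent options; the endpoint, token, and header key are placeholders to be replaced with the values shown in the access center.

```bash
# A sketch of attaching the OpenTelemetry Java agent; the endpoint and
# token are placeholders for the values shown in the ARMS access center.
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.service.name=demo-app \
     -Dotel.exporter.otlp.endpoint=<reporting-endpoint> \
     -Dotel.exporter.otlp.headers=Authentication=<your-token> \
     -jar demo-app.jar
```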
In addition to direct reporting, data can also be forwarded through the OT Collector to achieve lossless statistics; simply change the endpoint to the local service.
After installation is complete, you can analyze call chains on the Traces Explorer page provided by ARMS.
Call-chain analysis is a strongly interactive scenario. For example, you can quickly filter out abnormal call chains with the quick filters on the left, then select one of them to view its end-to-end full-link trace.
For Java, ARMS adds finer-grained instrumentation of local methods at interface granularity, which helps locate root causes more precisely. On the right side of the figure above, you can see additional information related to the current Span, including JVM and host monitoring metrics.
Application logs associated with a Span can also be pulled in quickly, so business problems can be investigated together with the logs for better localization.
Beyond querying individual call chains, Traces Explorer also supports real-time dynamic analysis. For example, you can check whether abnormal traces are concentrated on a specific IP, suggesting a possible single-machine failure, or concentrated on a specific interface. Many call chains can also be aggregated into a full-link view that shows the state of each branch, as well as a more intuitive application-level topology.
In addition, ARMS provides richer interactive views for Java. Beyond JVM and host monitoring, these include container (Pod) monitoring, thread-pool monitoring, and more. During business peaks, the database connection pool can easily fill up; such problems used to be hard to troubleshoot, but with connection-pool monitoring they can be located at a glance. Upstream and downstream analysis also makes it easy to understand the current state of the application's callers.
For database calls, you can see detailed SQL statistics and cache operations.
ARMS also provides advanced diagnostic capabilities such as thread analysis, which shows the CPU consumption, time consumption, and thread count of each type of thread pool, along with method stacks.
For hard problems in Java applications, the web-based Arthas diagnostics can capture JVM runtime data in real time, such as the traces and parameters of method calls.
In addition, APM metric data can be written to Prometheus and displayed through Grafana. Users can build the APM dashboards they want with PromQL and combine APM data with other metrics from business systems, infrastructure, cloud components, databases, containers, and more to compose their own dashboards.
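For example, once APM metrics land in Prometheus, one Grafana dashboard can place application and infrastructure queries side by side. The APM metric name below is purely hypothetical; the actual names depend on what the integration writes:

```promql
# Panel query A: application request rate (hypothetical metric name)
sum by (service) (rate(arms_http_requests_total{app="demo-app"}[5m]))

# Panel query B: host memory availability from node_exporter
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
```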
Demo 2: Unified monitoring and alerting based on Prometheus and Grafana
First, select the components to connect in the access center, including MySQL, Redis, Elasticsearch, and so on; many Alibaba Cloud components are supported by default.
Taking MySQL as an example: select the instance to connect, fill in the exporter name, select the address, and enter the username and password. Here you can also view the metrics collected by the current exporter.
If there is no existing instance to connect to, you can create a new one. For an ECS environment or a self-built data center, you can install the Prometheus agent through the Helm chart provided by ARMS, or report data directly through remote write. To view data sources from multiple regions together, you can display them in a unified way through a Global View aggregation instance.
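A remote-write setup is only a few lines in prometheus.yml; the sketch below uses placeholder values rather than a real ARMS endpoint:

```yaml
# A minimal remote-write sketch for prometheus.yml; the URL and
# credentials are placeholders, not a real ARMS endpoint.
remote_write:
  - url: https://<region-endpoint>/api/v1/write
    basic_auth:
      username: <username>
      password: <password>
```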
In the integration center, you can see the components installed on the current instance, view the metrics each component collects, and choose precisely which metrics to collect and which to skip.
In the dashboard list, ARMS provides many preset Grafana dashboards, such as the K8s overview and node-detail views, which show the various statuses of the current node; users can also edit new charts based on these views.
Because the data is written to Prometheus, alerting can likewise be built on PromQL. Many default alert templates are provided, such as node CPU usage. Besides customizing the alert content, you can choose a notification policy, for example sending different alerts to different on-duty staff.
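As an illustration, a node-CPU alert of the kind described above can be expressed as a standard Prometheus alerting rule; the threshold and labels below are only examples:

```yaml
# A sketch of a PromQL-based alerting rule for node CPU usage.
groups:
  - name: node-alerts
    rules:
      - alert: HighNodeCpuUsage
        # CPU usage = 100 minus the idle percentage, per instance.
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} has exceeded 80% for 5 minutes"
```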
ARMS provides two types of Grafana: a shared edition and a managed dedicated edition. We recommend the managed dedicated edition, which supports custom account management and offers better availability and security guarantees.
Demo 3: Log query and analysis based on Loki
Since ARMS does not yet offer a commercial Loki service, this article uses the example on the Grafana official website to demonstrate the effect of connecting Loki.
After logs are connected, you can filter them quickly with LogQL, the query language provided by Loki. Loki also indexes key fields, and retrieval over these indexed fields is even faster.
Based on Loki, you can also customize richer chart forms. The figure above shows the Loki Nginx dashboard provided on the official website; statistics and regional distributions can be customized through LogQL, which is very flexible.
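For example, assuming Nginx access logs ingested with a `job="nginx"` label and JSON-formatted lines (both assumptions about how the logs were shipped), queries like the following sketch can drive both quick filtering and chart panels:

```logql
# Quick filter: only lines containing "error"
{job="nginx"} |= "error"

# Metric query for a chart: rate of 5xx responses over 5 minutes,
# assuming a "status" field extracted by the json parser
sum by (status) (rate({job="nginx"} | json | status >= 500 [5m]))
```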
Demo 4: Application security protection based on RASP
- Dangerous component detection
After an application is connected to RASP, it automatically scans the application for dangerous components against the CVE standard, quickly flagging the risks the application may currently carry.
You can view instance details and vulnerability details, including the specific vulnerability description and the corresponding fix plan, and remediate the vulnerability according to that plan.
Besides known dangerous components, a full component self-check can also be run to cover unreported or proprietary components.
- Application attack protection
When a malicious attack occurs, the system automatically identifies the attack behavior and shows which node the attack happened on, the request parameters, and the call stack of the attack.
Beyond identifying attacks, active protection is also available: the protection mode can be switched from monitoring to monitoring-and-blocking.
Once the protection mode is changed, the platform not only monitors the latest attack behavior but also blocks it directly without affecting the business logic, throwing an exception that indicates the risky behavior has been blocked.
Demo 5: User experience monitoring provided by ARMS
In the access center, different types of end-side devices can be monitored. Taking the web side as an example, create a new site, select the corresponding configuration, and complete access through the CDN script or the NPM package.
After access is completed, you can view the site's PV, UV, and API request success rate, and see how requests are distributed across geographic locations, carriers, and browsers.
For example, after a new feature page goes live, you can view the access behavior of its top 20 users, or select a specific user and follow session tracking to see how that user actually uses the product, giving better insight into user behavior.
Front-end monitoring can also be combined with synthetic (dial-test) monitoring, which simulates real user behavior to assess website quality, page performance, file downloads, and API performance, discovering problems before users perceive them. When an availability anomaly is detected, notifications can be sent automatically for the affected sites.
The figure above shows a demo task that has been created. You can observe site latency, packet loss rate, and the status and result of each dial test, and use multi-dimensional reports to check which areas the problems are concentrated in.
Although OPLG does not cover every area of observability, it provides a very good foundation and set of specifications. On this basis, other observability technologies and product capabilities can be continuously integrated to meet the demands of various scenarios and the challenges of an increasingly complex and changeable environment.
Trends and technology selection for observability in the cloud-native era
At the software-architecture level, observability in the cloud-native era will develop further toward distributed cloud and hybrid cloud, while containerization, microservices, DevOps, and even DevSecOps gradually become mainstream.
Observability technology trends include the following:
1. Standards gradually form and converge. The technology stack will converge toward OpenTelemetry, Prometheus, and similar projects. With more standardized data, automation becomes cheaper and more efficient, and once the monitoring boundaries between layers are broken down, full-stack unified monitoring will grow more and more mature.
2. Non-intrusive instrumentation of the network and kernel based on eBPF opens the door to more observation techniques.
3. Decentralization. First, data is decentralized: based on the OpenTelemetry Collector, for example, much observability preprocessing can be shifted left into the user's cluster, reducing pressure on the centralized server. Second, collaboration is decentralized: observability data can not only be viewed on a unified console; the entire collaboration process can also be completed within instant-messaging software such as DingTalk.