Born for the Cloud: Understanding How to Build an End-to-End Observable System in One Article

At the beginning of 2021, the concept of observability was rarely mentioned in the domestic (Chinese) market. In the second half of 2021, however, discussions and practices related to observability began to spring up one after another, and the well-known company Grafana even repositioned its original monitoring tools as an observability stack and launched a series of services. Can observability really solve the many problems faced by traditional monitoring systems? How do you build an observable system? In this issue, Amazon Cloud Technology Tech Talk invited Jiang Shuomiao, CEO of Observation Cloud, to share "Best Practices for Building an End-to-End Observable System".

Why has observability suddenly broken "out of the circle"?

Observability may seem like a new word, but its origins are much earlier than many realize. The concept was first proposed by the Hungarian-born engineer Rudolf Kalman for linear dynamic systems. In terms of a signal flow graph, a system is observable if all of its internal states are reflected in its output signals. Observability is also mentioned in Norbert Wiener's 1948 book Cybernetics: Or Control and Communication in the Animal and the Machine. In control theory, observability refers to the degree to which a system's internal state can be inferred from its external outputs.

With the development of cloud computing, the concept of observability has gradually entered the field of computer software. Why has observability been so hot recently?

Jiang Shuomiao believes that this is largely due to the increased complexity of systems. An IT system is, in essence, a digital system. In the past, systems had a simple structure, mostly monolithic, and the infrastructure was relatively fixed, so the system could be understood through monitoring. With the arrival of the cloud-native era, however, the object being managed has changed from a single host to the cloud, and then to complex cloud-native distributed systems. Traditional infrastructure-oriented monitoring, simple logs, and simple APM can no longer solve the problem, so complete observability of the system needs to be built.

The main classes of data used in observability are metrics, logs, and traces (call links). They are often referred to as the "Three Pillars of Observability".

Metric: A record of a system value over continuous time. Basic metrics usually come in two data types: Count and Gauge.
Log: A time-stamped record output by a system or application, usually written by the system or software developers to help locate errors and determine system state.
Trace (Tracing): The direct call relationships between software modules, represented as a directed acyclic graph.
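
As a rough illustration of these three data types, the following Python sketch models a counter, a gauge, a log record, and a trace whose spans form a call graph. All class and field names are hypothetical and chosen for illustration; they are not taken from any particular product or from the OpenTelemetry API.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    name: str
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down, e.g. CPU usage."""
    name: str
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class LogRecord:
    """Time-stamped record emitted by an application."""
    timestamp: float
    level: str
    message: str
    trace_id: Optional[str] = None  # lets a log be tied back to a trace


@dataclass
class Span:
    """One call between modules; parent_id encodes the call relationship."""
    span_id: str
    trace_id: str
    name: str
    parent_id: Optional[str] = None


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[str]]:
    """Group span names by parent span id to recover the call graph."""
    graph: Dict[Optional[str], List[str]] = {}
    for span in spans:
        graph.setdefault(span.parent_id, []).append(span.name)
    return graph


requests_total = Counter("http_requests_total")
requests_total.inc()
cpu = Gauge("cpu_usage_percent")
cpu.set(42.5)
log = LogRecord(time.time(), "error", "payment failed", trace_id="t1")
spans = [
    Span("s1", "t1", "checkout"),
    Span("s2", "t1", "payment", parent_id="s1"),
    Span("s3", "t1", "inventory", parent_id="s1"),
]
call_graph = children_by_parent(spans)
```

The shared `trace_id` on the log record and the spans is what allows the three kinds of data to be joined later during analysis.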

The three pillars are crucial, and it is through these three dimensions of data that developers determine the status of the application system. Compared with traditional monitoring, the observable system has many advantages.

Traditional monitoring is oriented to known problems: it can only detect and alert on failures that were anticipated in advance, such as CPU > 90%. Its main monitoring objects are IT objects. It covers only server-side components and solves basic operation and maintenance problems.

Observability can help find and locate unknown problems. Its core is to continuously collect the key metrics and data generated by the system, and to safeguard and optimize the business through data analysis. For example, a team may discover that the payment failure rate of the mini-program (applet) client in a certain city is very high, and then determine whether the exception is caused at the code level. The main monitoring objects of observability are not only IT objects but also applications and the business itself, covering the cloud, distributed systems, and apps/mini-programs.
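
To make the contrast concrete, the sketch below compares a fixed-threshold check with a grouped analysis of raw events that can surface a problem nobody wrote a rule for, such as a high payment failure rate in one city. The event fields (`city`, `client`, `ok`) and the sample data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cpu_alert(cpu_percent: float, threshold: float = 90.0) -> bool:
    """Traditional monitoring: a rule for a failure known in advance."""
    return cpu_percent > threshold


def failure_rate_by(events: Iterable[dict], *keys: str) -> Dict[Tuple, float]:
    """Observability-style analysis: slice raw events by any dimensions."""
    totals: Dict[Tuple, int] = defaultdict(int)
    failures: Dict[Tuple, int] = defaultdict(int)
    for event in events:
        group = tuple(event[k] for k in keys)
        totals[group] += 1
        if not event["ok"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Hypothetical payment events; in practice these come from collected data.
events = [
    {"city": "Shanghai", "client": "mini-program", "ok": True},
    {"city": "Shanghai", "client": "mini-program", "ok": True},
    {"city": "Chengdu", "client": "mini-program", "ok": False},
    {"city": "Chengdu", "client": "mini-program", "ok": False},
    {"city": "Chengdu", "client": "mini-program", "ok": True},
    {"city": "Chengdu", "client": "web", "ok": True},
]
rates = failure_rate_by(events, "city", "client")
worst = max(rates, key=rates.get)  # group with the highest failure rate
```

Because the raw events are kept with their dimensions, the same function can be re-run with different keys (for example only `client`) without defining a new alert rule first.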

In his talk, Jiang Shuomiao said that as infrastructure evolves, traditional monitoring will gradually be replaced by observability.

He summarizes the value of building observability in the following five points:

Make SLOs visible, with clear targets and a clear view of the current status
Find and locate unknown problems
Reduce the cost of clarifying issues between teams
Reduce unpredictable economic losses caused by business exceptions
Improve end-user experience and satisfaction

Open source or SaaS: what is the right way to start building observability?

Compared with traditional monitoring systems, building observability has many advantages and values. So how do you build observability?

First, collect as much basic data as possible on every relevant aspect of every component system, including clouds, hosts, containers, Kubernetes clusters, applications, and the various terminals. The cost of collecting this data in real time is not high, but if it is not collected, the state of the system cannot be evaluated effectively when a failure needs to be investigated and analyzed.

Second, clarify who is responsible for building system observability. For each component, it must be clear who builds it, who defines its SLIs, who collects all the relevant basic data and builds the corresponding dashboards, and who is accountable for its SLOs.

Third, developers are responsible for observability. Developers should expose observability data from the systems they develop as part of software quality engineering. If unit testing ensures the usability of the smallest unit of code, then exposing basic observability data in a standardized way is a necessary condition for the reliability of the production system.

Fourth, establish unified metric, log, and trace specifications and unify the team's tool chain. That is, adopt the same metric naming convention, the same log format, and the same tracing system. If differences remain after following the OpenTelemetry standard, define a unified tag specification that runs through the entire system, for example: all errors are tagged state:error.
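
One way to read the unified tag specification is as a small normalization step applied before data leaves each service. The sketch below maps differing error conventions onto the single `state:error` tag mentioned above. The input field names (`level`, `status_code`, `error`), the `service` tag, and the `state:ok` value are assumptions made for illustration.

```python
from typing import Dict


def normalize_tags(record: Dict[str, object]) -> Dict[str, str]:
    """Return unified tags for a metric point, log line, or span.

    Assumed inputs (hypothetical): logs carry `level`, HTTP spans carry
    `status_code`, and some libraries set a boolean `error` flag.
    """
    is_error = (
        str(record.get("level", "")).lower() in {"error", "fatal", "critical"}
        or int(record.get("status_code", 0)) >= 500
        or record.get("error") is True
    )
    return {
        "service": str(record.get("service", "unknown")),
        "state": "error" if is_error else "ok",
    }


log_tags = normalize_tags({"service": "order", "level": "ERROR"})
span_tags = normalize_tags({"service": "payment", "status_code": 502})
ok_tags = normalize_tags({"service": "cart", "status_code": 200})
```

With every data source reduced to the same tag, a single query such as "state is error" can cover metrics, logs, and traces alike.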

Fifth, continuously optimize and improve overall observability. Data collection, view construction, and the establishment of a tag system all take time, and a team should not fall back on its old way of handling problems just because the coverage or the dashboards failed to help during an incident. Each failure that went unobserved is an excellent opportunity to extend the observable range.

It is not difficult to see from the path of observability construction that the process is very complex. So, what are the mainstream construction methods? Jiang Shuomiao introduced the two most common observability building methods, namely building through open source and building with SaaS products.

Thanks to the vigorous development of the open source ecosystem, there are many options for building observability. Building in an open source way requires the builder to understand the whole chain in detail, from front-end data capture to back-end data processing, including data display, alerting, and other peripheral functions. This approach is therefore suitable for teams that are strong enough or that can afford the learning and time costs.

Building observability with a mature SaaS product is a more efficient way than open source. Taking Observation Cloud as an example, Jiang Shuomiao introduced four advantages of this approach.

No stitched-together monster: a single agent on the server collects all relevant system data for that host, avoiding piles of agents and configuration items.

No guinea pigs: the product provides complete end-to-end coverage and works out of the box, avoiding uneven integration. For example, Observation Cloud supports more than 200 technology stacks to achieve end-to-end coverage.

Not closed, and highly programmable: any observation scenario can be built easily, and business data can even be brought into the overall observation, which gives strong flexibility. It also avoids rigid integration and has strong secondary development capabilities.

No hidden dangers: Observation Cloud permanently open-sources its user-side code, communication is one-way, and the platform cannot and will not issue instructions to the customer environment. All collected data is desensitized by default, and the entire process can be controlled by the user.

As mentioned earlier, the construction of observability is based on the "cloud". Observation Cloud itself is also a complete cloud-native product. The entire product set, including the data platform, is deployed on Amazon Cloud Technology's EKS and orchestrated with containers. The overall architecture of Observation Cloud is very simple: massive amounts of data are unified through one agent, enter the data platform, and complete observability is then provided through the platform's capabilities. The whole system is divided into a core platform layer, a web layer, and a data access layer. The core platform layer is developed entirely by Observation Cloud and is not open source. The web layer above it connects to the core data processing platform through a set of APIs. Jiang Shuomiao said: "For customers, we recommend choosing the Observation Cloud SaaS product directly. If they want, customers can also run a completely isolated deployment on Amazon, which is also very convenient, but the overall cost is much higher than that of adopting the SaaS product directly."

Why choose Amazon Cloud Technology? Mainly based on the following considerations:

The observation system itself needs reliability and an SLA an order of magnitude higher: Observation Cloud is a platform that helps customers build an observability system, so it must be highly reliable itself. If the observation system is not reliable enough and fails, it cannot alert customers in time or provide detailed analysis. In addition, by choosing a cloud service, part of the SLA of the Observation Cloud platform can be provided by Amazon.

A more mature Marketplace: users can purchase the product directly on Amazon through the China team, and product consumption is billed directly through the Amazon Cloud Technology Marketplace. Note that Observation Cloud is billed according to data volume, so it is almost free when the user has very little data.

Global reach: on Amazon Cloud Technology, Observation Cloud can offer better compatibility than overseas products, especially for technology stacks used in China, at a lower overall cost. Jiang Shuomiao revealed in his talk: "After the Spring Festival, Observation Cloud will deploy our observation platform on overseas Amazon Cloud Technology nodes. Observation Cloud hopes to use China's strength to give Chinese customers operating overseas a better and lower-cost choice than overseas products."

Joining the global network of Amazon Cloud Technology through APN: Observation Cloud hopes to use the powerful ecosystem of Amazon Cloud Technology to make observability the ultimate means of serving customers, and hopes to leverage APN to help more users understand observability. This is also one of the most important reasons why Observation Cloud chose Amazon Cloud Technology.

In addition to being a complete cloud-native product, the Observation Cloud system has several very interesting designs. First, on the collection side:

Third-party metric, log, and trace collection protocols are converted into the Observation Cloud protocol.
The collection stack is plug-in based, and plug-ins are isolated from one another with Go goroutines so that they do not affect each other.
Active resource consumption control prevents excessive resource pressure on the agent side (cgroups limit the resources used by collection).
Passive resource consumption control prevents excessive pressure on the server side (back pressure mechanism).
Distributed log parsing (pipeline) with a tidal mechanism.
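
The back pressure mechanism in the list above can be illustrated in general terms. The following is a minimal sketch of the idea, not Observation Cloud's actual implementation: a bounded buffer sits between the collectors and the uploader, and when the downstream side is slow the buffer fills up and the collector is told to slow down or drop data instead of consuming unbounded memory. The class name, buffer size, and drop policy are all assumptions.

```python
import queue
from typing import List


class BoundedBuffer:
    """Bounded buffer that signals back pressure when it is full."""

    def __init__(self, max_items: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=max_items)
        self.dropped = 0

    def offer(self, item: str) -> bool:
        """Try to enqueue without blocking; False means 'slow down'."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, batch_size: int) -> List[str]:
        """Uploader side: take up to batch_size items for sending."""
        batch: List[str] = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = BoundedBuffer(max_items=3)
accepted = [buffer.offer(f"log line {i}") for i in range(5)]
first_batch = buffer.drain(batch_size=2)
accepted_after_drain = buffer.offer("log line 5")
```

A collector that receives `False` from `offer` could lower its sampling rate or pause reading; which reaction is appropriate depends on whether losing data or delaying it is less harmful for that data source.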

Second, on the storage and query side, Observation Cloud unifies the query syntax, so users do not need to care about the underlying data storage. This keeps it simple and easy to use.

Third, on the analysis side, Observation Cloud correlates all the data and provides a unified viewer that analyzes raw data in a way similar to multi-dimensional analysis and lists. Users can also build their own viewers. In addition, because the amount of data is large, Observation Cloud can collect data at a specified percentage to avoid putting excessive pressure on the user's browser from the front end, and it provides an SLO/SLI panel to help customers measure the overall reliability of their own application systems.
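
Collecting a specified percentage of data is often done with deterministic sampling, so that all data belonging to the same session or trace is either kept or dropped together. The sketch below shows one common approach based on a hash of the identifier. The function name and the choice of hash are assumptions, not the product's documented behavior.

```python
import hashlib


def should_sample(identifier: str, percent: float) -> bool:
    """Keep roughly `percent` percent of identifiers, deterministically."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


session_ids = [f"session-{i}" for i in range(10_000)]
kept = sum(should_sample(s, 10.0) for s in session_ids)
```

Because the decision depends only on the identifier, the front end and the back end can make the same choice independently without exchanging any state.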

A practical case of building an end-to-end observable system

After a detailed introduction to the conceptual and technical levels, Jiang Shuomiao took an e-commerce customer as a case to explain how to build an end-to-end observable system.

The problem faced by the e-commerce customer in this case was as follows. In the transaction process, from the customer's order to the warehouse to the final financial accounting, a single order requires nearly 10 interface calls, and any step may fail because of program problems, network anomalies, stuck inventory, and so on. There was no effective monitoring tool for the order process. Problems were generally reported by store employees, after which the operation and maintenance staff followed the process to investigate the order in question, which was very passive and created a heavy workload. Every day, the operation and maintenance staff had to check whether the business interfaces had completed.

The process of building an end-to-end observable system for this customer is roughly divided into four steps. The first step is to identify the objects to be observed and connect them to the platform. With Observation Cloud, the entire onboarding process can be completed in about 30 minutes.

The second step is unified viewing and analysis. Specifically: first monitor the user experience, then look at the backend traces triggered by a user action, click a specific trace to open it in the trace viewer, and finally check the logs that correspond to that trace.

Third, the observability of the business is realized through the viewer.

Fourth, early warning is provided through the SLO monitor.
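
SLO-based early warning is commonly implemented by tracking how quickly the error budget is being consumed. The following is a minimal sketch, assuming an availability SLO and a simple burn-rate rule. The 99.9% target and the alert threshold of 2.0 are illustrative values, not figures from the talk.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    return observed_error_rate / error_budget(slo_target)


def should_warn(failed: int, total: int, slo_target: float,
                threshold: float = 2.0) -> bool:
    """Warn early when the budget burns faster than `threshold` times."""
    return burn_rate(failed, total, slo_target) > threshold


# 30 failed orders out of 10,000 against a 99.9% SLO.
rate = burn_rate(failed=30, total=10_000, slo_target=0.999)
warn = should_warn(failed=30, total=10_000, slo_target=0.999)
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO period; values well above 1.0 are the signal to act before the SLO is actually violated.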

After completing the end-to-end observability construction with Observation Cloud, the e-commerce customer can visualize the status of each node in the order process and retrieve that status by order number, so where the process is stuck and what the error message says are clear at a glance. From the user interface and network to the back-end services and the middleware and operating systems they depend on, any fault can be traced and analyzed clearly. Observation Cloud also provides real-time anomaly monitoring and alerts so that problems can be discovered and dealt with in time.
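
Retrieving the node status of an order by its order number amounts to filtering collected events by a shared tag. The sketch below uses invented step names and an `order_id` field to show how the stuck node and its error message could be found. It shows only three steps, whereas the order process in this case involves nearly 10 interface calls.

```python
from typing import Dict, List, Optional

# Hypothetical events tagged with the order number at each interface call.
EVENTS: List[Dict[str, str]] = [
    {"order_id": "A1001", "step": "create_order", "state": "ok", "message": ""},
    {"order_id": "A1001", "step": "reserve_stock", "state": "ok", "message": ""},
    {"order_id": "A1001", "step": "ship_from_warehouse", "state": "error",
     "message": "inventory locked"},
    {"order_id": "A1002", "step": "create_order", "state": "ok", "message": ""},
]


def order_timeline(events: List[Dict[str, str]],
                   order_id: str) -> List[Dict[str, str]]:
    """All recorded steps for one order, in the order they were collected."""
    return [e for e in events if e["order_id"] == order_id]


def first_failure(events: List[Dict[str, str]],
                  order_id: str) -> Optional[Dict[str, str]]:
    """The first step where the order got stuck, or None if all succeeded."""
    for event in order_timeline(events, order_id):
        if event["state"] == "error":
            return event
    return None


stuck = first_failure(EVENTS, "A1001")
healthy = first_failure(EVENTS, "A1002")
```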

Beyond e-commerce, Observation Cloud's SaaS product is suitable for many other application scenarios. Complete best practices for building system observability are available on the official website of Observation Cloud, where interested readers can view the corresponding documents.

Official website of Observation Cloud:
https://www.guance.com/#home
