All in one: How to build an end-to-end observability system

Authors: Xijie & Bai Yu

Observability: past and present

A system's observability and fault analysis capability is an important yardstick in system operations and maintenance (O&M). As systems evolve in architecture, resource units, resource acquisition methods, and communication methods, O&M faces huge challenges, and these challenges in turn drive the development of O&M technologies. Before getting into today's topic, let's look at the past and present of observability. Across the history of O&M, monitoring and observability have been developing for nearly 30 years.

In the late 1990s, as computing gradually moved from mainframes to desktop computers, client-server (CS) architecture became prevalent, and attention turned to network performance and host resources. To better monitor these CS applications, the first generation of APM software was born. Because application architectures were still very simple at the time, operations teams focused on network and host performance, and these tools were simply called monitoring tools.

Around 2000, the Internet developed rapidly and the browser became the new user interface. Applications evolved into a browser-based three-tier architecture: Browser-App-DB. At the same time, Java became popular as the leading programming language for enterprise software. Its "write once, run anywhere" philosophy greatly improved developer productivity, but the Java virtual machine also hid the details of how code runs, making tuning and troubleshooting more difficult. Code-level tracing and diagnosis and database tuning therefore became the new focus, and a new generation of monitoring tools, APM (application performance monitoring), was born.

After 2005, distributed applications such as SOA architectures and ESB-based applications became the first choice of many enterprises. Meanwhile, virtualization gradually became popular, and the physical server faded into an invisible, intangible virtual resource model. Third-party components such as message queues and caches also started to be used in production. In this environment, enterprises needed end-to-end tracing along with monitoring of virtual resources and third-party components, and these needs shaped the core capabilities of a new generation of APM software.

After 2010, as cloud-native architecture began to be adopted, applications gradually moved from monoliths to microservices, and business logic became calls and requests between microservices. Virtualization became more thorough, container management platforms were accepted by more and more enterprises, third-party components gradually evolved into cloud services, and the entire application architecture became cloud native. Service call paths grew longer, traffic flows became harder to control, and troubleshooting became more difficult. A new observability capability was needed: one that covers the various kinds of observable data across the full stack (metrics, logs, traces, and events) and analyzes them continuously throughout the application lifecycle of development, testing, and operations.

It can be seen that observability has become part of cloud-native infrastructure. It has expanded from a pure operations concern into testing and development. Its purpose has also expanded from supporting the normal operation of the business to accelerating business innovation, allowing the business to iterate quickly.

Monitoring, APM, and observability: similarities and differences

From the history above, we can see that the path from monitoring to APM to observability is one of continuous evolution. Next, let's look at how the three relate to each other. To explain this better, we first introduce a classic cognitive model. We can classify everything in the world along two dimensions: "awareness" (whether we know about it) and "understanding" (whether we understand it).

First, what we both know and understand is what we call facts. In this context, that corresponds to monitoring. For example, in O&M work we decide up front to monitor server CPU utilization. Whether utilization is 80% or 90%, it is an objective fact. This is what monitoring solves: knowing what to monitor, we define and collect the corresponding metrics and build monitoring dashboards.

Next, there are things we know but do not understand. For example, monitoring shows CPU utilization at 90%, but why is it so high? What caused it? Answering this is a process of verification. APM collects and analyzes application performance on the host and finds, say, a high-latency logging framework in the application's call chain that is driving up CPU utilization on the host. In this way, APM uses application-layer analysis to explain the high CPU utilization.

Then, there are things we understand but do not know. Take high CPU utilization again: if, by learning from historical data and related events, you can predict that CPU utilization will surge at some point in the future, you can raise an early warning.
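As a rough illustration of this kind of early warning, the sketch below fits a straight line to recent CPU readings and estimates how many sampling intervals remain before a threshold is crossed. It is a deliberately naive assumption: real prediction would also account for seasonality and related events, and the function names and sample values here are hypothetical.

```python
import math

def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...
    Returns (slope, intercept). Requires at least two points."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def steps_until(values, threshold):
    """Estimate how many future intervals pass before the fitted trend
    reaches `threshold`. Returns None if the trend is flat or falling."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope   # x position where the line hits the threshold
    last_index = len(values) - 1
    return max(0, math.ceil(crossing - last_index))

# Hypothetical CPU utilization readings (percent), one per interval.
cpu_history = [40, 45, 50, 55, 60]
remaining = steps_until(cpu_history, threshold=90)
if remaining is not None:
    print(f"CPU may reach 90% in about {remaining} intervals")
```

An alert rule could fire when the estimate drops below some lead time, so that the warning arrives before the threshold itself is reached.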

Finally, there are things we neither know nor understand. Continuing the example: monitoring shows that CPU usage is soaring, and APM traces it to the application's logging framework. Analyzing user access data for the same period then reveals that requests from Apple devices in the Shanghai area have response times 10 times longer than other requests, and that, because of how the logging framework is configured, these requests generate a huge volume of Info logs, which drives up the CPU on some machines. This is the process of observability. Observability addresses things you did not know in advance (the access performance problem for Apple devices in Shanghai) and did not understand (a misconfigured logging framework generating massive Info logs).
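The drill-down in this example can be sketched as a group-by over request records. The following is a minimal illustration rather than the method of any particular product; the record fields (`region`, `device`, `latency_ms`), the `factor` threshold, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request records; field names and values are made up for illustration.
requests = [
    {"region": "Shanghai", "device": "Apple", "latency_ms": 2000},
    {"region": "Shanghai", "device": "Apple", "latency_ms": 2200},
    {"region": "Shanghai", "device": "Android", "latency_ms": 200},
    {"region": "Beijing", "device": "Apple", "latency_ms": 210},
    {"region": "Beijing", "device": "Android", "latency_ms": 190},
]

def slow_segments(records, factor=5.0):
    """Return {(region, device): ratio} for segments whose mean latency
    exceeds `factor` times the mean latency of all other requests."""
    by_segment = defaultdict(list)
    for r in records:
        by_segment[(r["region"], r["device"])].append(r["latency_ms"])

    flagged = {}
    for segment, latencies in by_segment.items():
        # Latencies of every request that is NOT in this segment.
        others = [v for s, vals in by_segment.items() if s != segment for v in vals]
        if others and mean(latencies) > factor * mean(others):
            flagged[segment] = mean(latencies) / mean(others)
    return flagged

outliers = slow_segments(requests)
for (region, device), ratio in outliers.items():
    print(f"{region}/{device}: {ratio:.1f}x slower than the rest")
```

The point of the sketch is that nobody had to decide in advance to watch "Apple devices in Shanghai": the segment surfaces from exploring the data across dimensions.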

To summarize briefly: monitoring focuses on metrics, often concentrated at the infrastructure layer, such as machine and network performance metrics. Based on these metrics, we build dashboards and alert rules to watch for known kinds of events. Once monitoring finds a problem, APM uses application-level diagnostic tools such as tracing and memory and thread analysis to locate the root cause of the abnormal metrics.

Observability is application-centric. By correlating and analyzing observable data sources such as logs, traces, metrics, and events, it finds root causes more quickly and directly. It also provides an interface through which users can explore and analyze this data flexibly. In addition, observability is connected with cloud services, so capabilities such as elastic scaling and high availability can be reinforced immediately; when a problem is discovered, it can be resolved and the application service restored more quickly.

Key points of building an observability system

While observability brings great business value, building it poses considerable challenges. It is not only a matter of selecting tools or technologies but also of adopting an O&M philosophy. It involves three parts: collecting observable data, analyzing it, and delivering value from it.

Observable data collection

At present, the observable data widely promoted in the industry rests on three pillars: Logging, Tracing, and Metrics. They share some common concerns that require attention.

1) Full stack coverage

Observable data, including the corresponding metrics, traces, and events, needs to be collected from the infrastructure layer, the container layer, the cloud applications built on top, and the user terminals.

2) Unified standards

The entire industry is moving toward unified standards. For metrics, Prometheus has become the consensus data standard in the cloud-native era. For traces, a data standard is gradually becoming mainstream with the adoption of OpenTracing and OpenTelemetry. For logs, the data is less structured and a data standard is harder to establish, but rising open source projects such as Fluentd and Loki have emerged for collection, storage, and analysis. Meanwhile, Grafana has become increasingly established as the display standard for all kinds of observable data.
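To make the metrics standard concrete, the sketch below renders samples in the Prometheus text exposition format (`# HELP` and `# TYPE` lines followed by `name{label="value"} value`). It is hand-rolled for illustration only: the function name and the metric are hypothetical, label values are not escaped, and a real service would normally use an official Prometheus client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed to contain no quotes, backslashes, or newlines (no escaping here).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical counter with two label sets.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [
        ({"service": "order", "code": "200"}, 1027),
        ({"service": "order", "code": "500"}, 3),
    ],
)
print(text)
```

Because every layer can expose data in this one format, a single collector and a single query language can cover hosts, containers, middleware, and applications.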

3) Data quality

Data quality is an important aspect that is easily overlooked. On the one hand, data sources from different monitoring systems need defined data standards to ensure accurate analysis. On the other hand, a single event may produce a large number of duplicate metrics, alerts, and logs; filtering, noise reduction, and aggregation so that only data with analytical value is analyzed is an important part of ensuring data quality. This is often where the gap between open source tools and commercial tools is relatively large. As a simple example, when collecting an application's call traces: How deep should collection go? What is the trace sampling strategy? Are all traces captured when errors or slow calls occur? Can the sampling strategy be adjusted dynamically based on rules? All of these determine the quality of the observable data collected.
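The sampling questions above can be illustrated with a small decision function: keep every trace that contains an error or is slow, and keep only a fixed fraction of the remaining traces. This is a sketch under assumptions, not the strategy of any specific tracing system; the function name, parameters, and default thresholds are hypothetical.

```python
import hashlib

def keep_trace(trace_id, has_error, duration_ms, slow_ms=1000, sample_rate=0.1):
    """Decide whether a finished trace should be stored.

    - Traces with errors or slower than `slow_ms` are always kept.
    - Other traces are kept with probability `sample_rate`, decided by a
      hash of the trace ID so every collector makes the same decision.
    """
    if has_error or duration_ms >= slow_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < sample_rate * 2**64

# Hypothetical usage: errors and slow calls are never dropped.
print(keep_trace("trace-001", has_error=True, duration_ms=20))
print(keep_trace("trace-002", has_error=False, duration_ms=3500))
print(keep_trace("trace-003", has_error=False, duration_ms=20))
```

Since `slow_ms` and `sample_rate` are plain parameters, a rule engine could adjust them at runtime, for example raising the sampling rate for one service while an incident is being investigated.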

Observable data analysis

1) Horizontal and vertical association

In today's observability systems, the application is a very good vantage point for analysis. First, applications are related to one another and can be linked horizontally through call chains: how microservices call each other, how applications call cloud services, and how third-party components are called can all be correlated through traces. Second, applications can be mapped vertically to the container layer and the resource layer. With the application at the center, horizontal and vertical correlation together form a global association of observable data. When a problem needs to be located, unified analysis can be carried out from the application's perspective.
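One way to picture this association is as two small lookup structures: a call graph for the horizontal direction and a placement map for the vertical direction. The sketch below is illustrative only; the service, Pod, and node names are hypothetical, and in practice the call graph would be derived from trace data and the placement from the container platform.

```python
from collections import deque

# Horizontal: who calls whom (hypothetical, as would be derived from traces).
calls = {
    "gateway": ["order"],
    "order": ["inventory", "kafka"],
    "inventory": ["mysql"],
}
# Vertical: application -> Pods -> nodes (hypothetical placement data).
runs_on = {
    "gateway": ["pod-gw-1"],
    "order": ["pod-order-1", "pod-order-2"],
    "inventory": ["pod-inv-1"],
}
pod_node = {
    "pod-gw-1": "node-a",
    "pod-order-1": "node-a",
    "pod-order-2": "node-b",
    "pod-inv-1": "node-b",
}

def downstream(app):
    """Breadth-first walk of everything `app` calls, directly or indirectly."""
    seen, queue = [], deque([app])
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee != app and callee not in seen:
                seen.append(callee)
                queue.append(callee)
    return seen

def related_entities(app):
    """Collect the services, Pods, and nodes to inspect when `app` misbehaves."""
    services = [app] + downstream(app)
    pods = [p for s in services for p in runs_on.get(s, [])]
    nodes = sorted({pod_node[p] for p in pods})
    return {"services": services, "pods": pods, "nodes": nodes}

print(related_entities("order"))
```

Starting from one application, a single lookup yields every entity whose metrics, logs, and traces are worth pulling into the same analysis.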

2) Domain knowledge

Faced with massive amounts of data, how can problems be found more quickly and located more accurately? In addition to application-centric data correlation, domain knowledge for locating and analyzing problems is needed. For observability tools or products, the most important thing is to keep accumulating the best troubleshooting paths, ways of locating common problems, and root-cause decision paths, and to codify that experience. This is equivalent to equipping the O&M team with experienced engineers who can quickly discover problems and locate root causes. It is also what distinguishes this approach from traditional AIOps capabilities.

Observable value output

1) Unified display

As mentioned above, observability needs to cover all layers, and each layer has its own observable data. However, current observability tools are highly fragmented, and displaying the data they produce in a unified way has become a big challenge. Unifying the observable data itself is relatively difficult because of differences in format, encoding rules, dictionary values, and so on. Unifying the presentation of the results, however, is achievable. The current mainstream solution is to use Grafana to build unified monitoring dashboards.

2) Collaborative processing

After unified display and unified alerting, using collaboration platforms such as DingTalk and WeCom (Enterprise WeChat) to discover, handle, and track problems more efficiently through ChatOps has gradually become a hard requirement.

3) Cloud service linkage

Observability has become part of cloud-native infrastructure. After the observability platform finds and locates a problem, it needs to work with cloud services to quickly scale out or in or to rebalance load, so that the problem is resolved faster.

Prometheus + Grafana practice

Thanks to the vigorous development of the cloud-native open source ecosystem, we can easily build a monitoring system, for example Prometheus + Grafana for basic monitoring, SkyWalking or Jaeger for tracing, and ELK or Loki for logging. For the O&M team, however, the different types of observable data are scattered across different backends, troubleshooting still means jumping between multiple systems, and efficiency cannot be guaranteed. For this reason, Alibaba Cloud provides enterprises with a one-stop observability platform, ARMS (Application Real-Time Monitoring Service). As a product family, ARMS includes products for different observability scenarios, such as:

For the infrastructure layer, the Prometheus monitoring service monitors cloud services including ECS, VPC, and containers, as well as third-party middleware.
For the application layer, application monitoring based on Alibaba Cloud's self-developed Java probes covers application monitoring needs, with greatly improved data quality compared with open source tools. Through the tracing service, data can also be reported to the application monitoring platform when open source SDKs or probes are used.
For the user experience layer, modules such as mobile monitoring, front-end monitoring, and synthetic monitoring (cloud dial testing) provide comprehensive coverage of user experience and performance on different terminals.
Unified alerting: the data and alerts collected at each layer are consolidated for alerting and root cause analysis, and the findings are presented directly through Insight.
Unified interface: whether the data is reported by ARMS or Prometheus or comes from data sources such as Log Service, Elasticsearch, and MongoDB, it can be presented through the fully managed Grafana service to build unified monitoring dashboards, and linked with Alibaba Cloud services to provide CloudOps capabilities.

As described above, ARMS offers many capabilities as a one-stop product. Many companies have already built some ARMS-like capabilities themselves or adopted some ARMS products, such as application monitoring and front-end monitoring. A complete observability system nevertheless remains very important for enterprises, and many hope to build one that meets their own business needs on top of open source. The following example focuses on how to build an observability system with Prometheus + Grafana.

Fast data access

In ARMS, we can quickly create a dedicated Grafana instance. Data sources such as ARMS Prometheus, SLS (Log Service), and CMS (CloudMonitor) can all be synchronized very conveniently. Open Configuration to view the corresponding data sources. This gives quick access to the various data sources while minimizing the workload of day-to-day data source management.

Pre-built dashboards

After the data sources are connected, Grafana automatically creates the corresponding dashboards. For application monitoring and container monitoring, for example, basic data such as the three golden metrics and interface changes is provided by default.

Although Grafana builds these dashboards for you, what you see is still a set of scattered dashboards. In day-to-day O&M, you also need a unified dashboard organized by business domain or by application, where data from the infrastructure layer, the container layer, the application layer, and the user terminal layer is displayed together for overall monitoring.

Full-stack unified dashboard

When building the full-stack unified dashboard, we organized it by dimensions such as user experience, application performance, container layer, cloud services, and underlying resources.

1) User experience monitoring

Key data such as PV and UV, JS error rate, first render time, API request success rate, and TopN page performance is presented up front.

2) Application performance monitoring

Request volume, error rate, and response time, the three golden metrics, are shown here, broken down by application and by service.
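As a small worked example of these three metrics, the sketch below computes request volume, error rate, and average response time per service from raw request records. The record fields, the sample data, and the choice to count HTTP status codes of 500 and above as errors are assumptions made for illustration.

```python
from collections import defaultdict

def golden_metrics(records):
    """Compute request count, error rate, and mean response time per service.

    Assumption: a request counts as an error when its status is >= 500.
    """
    grouped = defaultdict(list)
    for r in records:
        grouped[r["service"]].append(r)

    result = {}
    for service, rows in grouped.items():
        total = len(rows)
        errors = sum(1 for r in rows if r["status"] >= 500)
        result[service] = {
            "requests": total,
            "error_rate": errors / total,
            "avg_ms": sum(r["duration_ms"] for r in rows) / total,
        }
    return result

# Hypothetical request records for two services.
records = [
    {"service": "order", "status": 200, "duration_ms": 100},
    {"service": "order", "status": 500, "duration_ms": 300},
    {"service": "order", "status": 200, "duration_ms": 200},
    {"service": "order", "status": 200, "duration_ms": 200},
    {"service": "cart", "status": 200, "duration_ms": 50},
    {"service": "cart", "status": 200, "duration_ms": 150},
]
metrics = golden_metrics(records)
for service, m in metrics.items():
    print(service, m)
```

In a dashboard, averages are usually complemented by percentiles such as P95 or P99, since a mean can hide a slow tail.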

3) Container layer monitoring

The performance and usage of each Pod are shown, along with the deployments that the applications run as. The Pod performance information for these deployments is all presented in this section.

4) Cloud service monitoring

Next comes cloud service monitoring. Taking the Kafka message queue as an example, this section shows common message service metrics such as message backlog and messages consumed.

5) Host node monitoring

For each host node, data such as CPU usage and running Pods is shown.

In this way, the dashboard covers overall performance monitoring from the user experience layer through the application layer down to the infrastructure and container layers. More importantly, it contains the data for all microservices. When you switch to a particular service, the performance data associated with that service is displayed on its own, filtered at each level: containers, applications, and cloud services. A brief note on how this is done: when Prometheus monitoring collects data from these cloud services, it also collects all the tags on them. By tagging, the cloud services can be distinguished by business dimension or by application. Building a unified dashboard also inevitably raises data source management problems. Here we provide the globalview capability, which aggregates all Prometheus instances under the user's account for unified queries, whether the information comes from the application layer or from cloud services.
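The idea of tag-based filtering across several Prometheus instances can be sketched as a fan-out query that merges matching samples. This is a conceptual illustration only and does not describe how globalview is implemented; the instance names, metric names, labels, and the `query_all` helper are hypothetical, and the in-memory lists stand in for real query results.

```python
def query_all(instances, metric, **matchers):
    """Return samples of `metric` from every instance whose labels match
    all `matchers`, tagging each result with the instance it came from."""
    merged = []
    for instance_name, samples in instances.items():
        for sample in samples:
            if sample["metric"] != metric:
                continue
            if all(sample["labels"].get(k) == v for k, v in matchers.items()):
                merged.append({**sample, "instance": instance_name})
    return merged

# Hypothetical samples held by two separate Prometheus instances.
instances = {
    "prom-app": [
        {"metric": "cpu_usage", "labels": {"app": "order", "layer": "container"}, "value": 0.72},
        {"metric": "cpu_usage", "labels": {"app": "cart", "layer": "container"}, "value": 0.31},
    ],
    "prom-cloud": [
        {"metric": "kafka_consumer_lag", "labels": {"app": "order", "layer": "cloud_service"}, "value": 1200},
        {"metric": "cpu_usage", "labels": {"app": "order", "layer": "host"}, "value": 0.55},
    ],
}

# One query returns everything tagged app="order", regardless of which instance holds it.
for row in query_all(instances, "cpu_usage", app="order"):
    print(row["instance"], row["labels"]["layer"], row["value"])
```

Because the same `app` tag is attached at every layer, selecting one service on the dashboard can narrow containers, applications, and cloud services in a single step.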

Building on the scenario above, we offer the following design direction for an observability platform as a starting point for discussion. From the perspective of observing systems and services, different data should be fused and analyzed in the backend, rather than deliberately emphasizing that the system supports separate queries over the three types of observability data; product features and interaction logic should shield users from the separation of Metrics, Tracing, and Logging as much as possible. The platform should establish a complete observability closed loop, from anomaly detection before an incident, through troubleshooting during it, to proactive early-warning monitoring afterward, providing an integrated platform for continuous business monitoring and service performance optimization.
