Engineers run into two questions again and again in their daily work: "Why doesn't it work?" and, just as puzzlingly, "Why does it work?"
They therefore place high hopes on observability to provide leads for answering both.
Introduction
In 2017, engineer Cindy Sridharan published an article titled "Monitoring and Observability", which brought the term observability into developers' field of vision for the first time and, half-jokingly, teased apart the difference between observability and monitoring. In the world of software products and services, monitoring tells us whether a service is working properly, while observability tells us why a service is not working properly.
As the Google Trends graph shows, interest in observability has grown year after year. Observability is also increasingly regarded as an attribute of the system itself, one that should be built in during development and design.
Observability trends
After 2020, search interest in observability exploded. A large part of the reason is that Site Reliability Engineering (SRE) gradually became popular: major domestic vendors set up SRE positions and corresponding hiring targets, which brought observability more attention in China. It also means that more and more foundational services face stability challenges, and an important means of meeting those challenges is to provide observability.
The lower-left corner of the figure above shows global search trends for observability; search interest in China is particularly high.
Defining observability
Observability is originally a concept from control theory, introduced by the Hungarian-American engineer Rudolf E. Kálmán. It refers to the degree to which a system's internal state can be inferred from its external outputs. In other words, from the data a system emits we should be able to work out the specific details of what is happening inside it.
Difficulties and Challenges
Business is booming, demands for stability surge
F6 Auto Technology is an Internet platform company focused on IT for the automotive aftermarket and is currently a leader in the industry. As the business boomed, the number of merchants F6 supports grew dozens of times over in a short period. At the same time, F6 gradually launched consumer-facing services for technicians, such as VIN decoding and data queries, which raised the requirements for stability significantly.
How Conway's Law Works
Conway's Law has guided how microservices are split along organizational lines throughout IT history: any organization that designs a system will produce a design that mirrors its own structure. As the business expands, Conway's Law means the way microservices are split tends to resemble the organizational structure: business growth drives the creation of new departments, and the microservices designed afterwards end up tracking them closely. Even if the early organizational structure and the microservice boundaries are inconsistent, the microservices will gradually converge toward the organizational structure.
Although the convergence of microservices and organizational structure makes communication more efficient, it also brings many distributed-systems problems. For example, in the interactions between microservices, no single person has a holistic, global understanding of all the services. Developers' most direct wish is to troubleshoot a distributed system as efficiently as a single-machine one, which pushes us to shift from a server-centric view of the system to a call-chain-centric one.
Stability demands surge
F6's earliest systems were built as siloed monoliths. A monolithic application is relatively simple, but it has many scalability and maintainability problems: all development happens in one codebase, code conflicts are frequent, and it is hard to know when a change can be released and how much business impact the release will cause. These pressures increasingly forced us to split into microservices, but the splitting and the calls between services produce extremely long and convoluted call chains. As shown on the right side of the figure above, analyzing such a call chain by hand is practically impossible.
So how can we make online troubleshooting as painless as possible?
Observability evolution
Traditional monitoring + microservice log collection
- ELK Stack for collecting and querying logs, with ElastAlert for log alerting
For traditional monitoring and microservice log collection we generally used the ELK Stack. ELK is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana.
We relied heavily on ELK to collect microservice logs. We also used ElastAlert, an open source Elasticsearch-based alerting system whose main job is to query Elasticsearch for data matching configured rules and raise alerts for the corresponding types of data.
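For illustration, a minimal ElastAlert rule of the kind described above might look like the sketch below. The index name, field name, threshold, and alert channel are placeholders, not F6's actual configuration.

```yaml
# Hypothetical ElastAlert frequency rule: alert when ERROR logs spike.
name: app-error-burst
type: frequency            # fire when num_events documents match within timeframe
index: app-logs-*          # placeholder index pattern for microservice logs
num_events: 50             # more than 50 matching documents ...
timeframe:
  minutes: 5               # ... within any 5-minute window
filter:
  - term:
      level: "ERROR"       # assumes each log document carries a "level" field
alert:
  - "email"                # any supported alerter could be used instead
email:
  - "oncall@example.com"
```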
The figure above shows the daily query workflow built on log collection. For example, R&D staff query online logs through a pipeline, ElastAlert picks abnormal data out of the Elasticsearch logs through its matching rules and raises alerts, and Kibana is used to query and prioritize the exceptions occurring in the system.
Architecture upgrade + observability introduction
- Grafana dashboards + Zorka for JVM monitoring
As the business grew, the system's demands on logging increased: there were many teams, and a wide variety of alert rules had to be configured. We therefore introduced Grafana to gradually replace the query functions of Kibana and Zabbix. Grafana's Elasticsearch plugin can query and alert on logs, and its alerting feature let us retire the original ElastAlert. Grafana also gives us much more intuitive dashboards for wall displays.
Beyond logs, we also wanted to collect Java application metrics, so we introduced the open source Zorka component. Zorka integrates easily with Zabbix: the information Zorka collects is reported to Zabbix for display, and Zabbix in turn can feed its data to Grafana through the Grafana Zabbix plugin, so that all the application dashboards end up in the Grafana interface.
Zorka works much like the Zabbix Java gateway: it is attached to the Java process automatically via a Java agent and collects common application-container and request-count metrics, which initially satisfied our need to observe Java processes.
Cloud native transformation
- Kubernetes orchestration and microservices complement each other, but urgently need tracing support
As the degree of microservice adoption kept increasing, the operation and maintenance cost of the traditional approach grew higher and higher, so we started a cloud-native transformation.
The first step of the cloud-native transformation was getting Kubernetes readiness and liveness probes in place. Writing liveness probes improved the services' self-healing ability: after an OOM, the service can recover automatically and start new pods to keep serving data normally. Besides Kubernetes itself, we also introduced Prometheus and ARMS application monitoring.
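As a rough sketch of the probes mentioned above (paths, ports, and timings are placeholders, not F6's real values), a Deployment might declare them like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service            # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: app
          image: registry.example.com/demo-service:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:        # keep traffic away until the service can actually serve
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
          livenessProbe:         # restart the container if it hangs or dies after an OOM
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
```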
Prometheus, the second CNCF project after Kubernetes, has earned a dominant voice in the metrics space. ARMS application monitoring, Alibaba Cloud's flagship commercial APM product, lets us get tracing in a cloud-native way that is transparent to R&D, with no code changes at all. More importantly, the Alibaba Cloud team keeps iterating and supports more and more middleware, so we believe it will become a powerful diagnostic tool.
- JMX Exporter for quickly exposing JVM metrics from cloud-native Java components
After the cloud-native transformation, the monitoring model also changed. The earliest model was push-based: Zorka sat on the same machine for every release, so it had a fixed host. After moving to the cloud, containerization means pods are no longer fixed, and applications scale out and in. We therefore gradually moved the monitoring model from push to pull, which fits Prometheus's collection model better, and gradually stripped Zorka out of the observability stack.
We did not collect JMX metrics directly with ARMS, because ARMS does not cover every online and offline Java application, the uncovered applications still need JVM data collection, and ARMS costs a bit more. For cost reasons we therefore did not use ARMS across the board, and chose the JMX Exporter component instead.
JMX Exporter is one of the exporters provided by the official Prometheus community. It uses a Java agent and the JMX mechanism to read JVM information and convert it directly into a metrics format Prometheus understands, so Prometheus can scrape it; registering the corresponding ServiceMonitor through the Prometheus Operator completes the metrics collection.
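A minimal sketch of this wiring is shown below, assuming the JMX Exporter runs in javaagent mode on a hypothetical port 9464 and that a Service labeled `app: demo-service` exposes that port as `jvm-metrics`; the exporter rules and selector labels are illustrative only.

```yaml
# jmx-exporter config file, mounted into the pod and loaded with:
#   -javaagent:/jmx_exporter/jmx_prometheus_javaagent.jar=9464:/jmx_exporter/config.yaml
lowercaseOutputName: true
rules:
  - pattern: ".*"               # expose all readable MBeans as Prometheus metrics
---
# ServiceMonitor registered with the Prometheus Operator to scrape the JVM metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-service-jvm
  labels:
    release: prometheus          # must match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: demo-service          # selects the Service that fronts the pods
  endpoints:
    - port: jvm-metrics          # named Service port mapped to container port 9464
      path: /metrics
      interval: 30s
```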
- Using the configuration center for precise, owner-targeted alerts
As the company's business boomed, headcount, microservices, and with them the number of alerts all grew dramatically. To improve alert reach and response rates, we reused the multi-language SDK of the Apollo configuration center and built a precise, Apollo-based alerting service in Go. The overall flow is as follows:
Elasticsearch alerts, and alerts from other sources, are collected through Grafana and associated with an application via metrics; each alert is then forwarded to the Go-based precise alerting service mentioned above. The service resolves the alert to the corresponding application, looks up the owner's name, phone number, and other details in the Apollo configuration center, and sends a DingTalk alert based on that information, which greatly improves the message read rate.
- ARMS: non-intrusive tracing support
In addition, we introduced Alibaba Cloud's Application Real-Time Monitoring Service (ARMS), which supports most middleware and frameworks, such as Kafka, MySQL, and Dubbo, without any code changes. In a cloud-native setup you only need to add annotations to the Deployment to load the probe, which makes the microservices very easy to maintain. ARMS also provides a fairly complete trace view: you can inspect the call logs across the entire trace of an online application, and it offers Gantt-chart views, dependency topology diagrams, upstream/downstream latency charts, and more.
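On ACK this typically boils down to a pair of pod-template annotations handled by the ARMS pilot component; the sketch below is illustrative, and the exact annotation keys may differ depending on the cluster component version.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
spec:
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
      annotations:
        armsPilotAutoEnable: "on"               # ask the pilot to inject the ARMS Java probe
        armsPilotCreateAppName: "demo-service"  # application name shown in the ARMS console
    spec:
      containers:
        - name: app
          image: registry.example.com/demo-service:1.0.0
```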
Observability upgrade
- Logs / Traces / Metrics: a conceptual upgrade
Observability has blossomed across the domestic microservices field, so F6 upgraded its thinking accordingly. Industry-wide, observability rests on three pillars: logs (events), distributed tracing, and metrics. Monitoring is needed in any era, but it is no longer the core requirement: as the figure above shows, monitoring covers only alerting and application overviews, whereas observability must also cover troubleshooting analysis and dependency analysis.
The earliest users of monitoring were operations staff, but most of them could only handle system-level alerts. In the microservices world, more problems appear between applications and need deeper investigation. For example, when a service handles requests slowly, the cause might be a code problem, lock contention, an undersized thread pool, or an exhausted connection pool; all of these ultimately show up simply as "slow and unresponsive". With so many possibilities, observability is needed to locate the real root cause. Yet locating the root cause is not the real goal in itself. The real goal is to use observability to find the node where the problem lives, and then raise the overall SLA as much as possible through measures such as replacing the offending component, circuit breaking, or rate limiting.
Observability is also very good at problem analysis: when an online service is slow, you can observe the time spent at each node and on each request. Dependency analysis is covered as well, for example whether a service's dependencies are reasonable and whether its dependency call chains are healthy.
As the number of applications grows, so do the demands on observability and stability. We therefore built a simple root-cause analysis system that uses a text-similarity algorithm to classify and cluster the current service logs.
- Simple root cause analysis goes live
The figure above shows a typical ONS failure: a service that the application depends on was undergoing an upgrade, and the log shown was obtained by intelligent analysis of the captured logs. Anyone who has done SRE for a long time knows that changes are a major source of damage to online stability, so being able to pull changes and similar events into the observability system is of great value. Likewise, if information about the upcoming ONS upgrade can be collected into the observability system, the root cause can be found by correlating events, which is extremely beneficial for system stability and troubleshooting.
- ARMS supports exposing the traceId in the response header
F6 also worked closely with the ARMS team to explore observability best practices. ARMS recently launched a new feature that exposes the traceId directly in the HTTP response header; it can be written into the access-layer logs and retrieved through Kibana.
When a fault occurs, the customer reports the log information and the traceId to technical support, and R&D can quickly locate the cause of the problem and its upstream and downstream links from the traceId. Because the traceId is unique within the whole call chain, it makes an ideal search key.
ARMS also supports passing the traceId through transparently via MDC. It supports the mainstream Java logging frameworks, including Logback, Log4j, and Log4j2, and the traceId can likewise be output for Python.
The figure above shows the ARMS back-end configuration for a typical log-output scenario. It can enable associating business logs with the traceId and supports various components; you only need to define eagleeye_traceid in the logging configuration to output the traceId.
This solution is fairly simple, requires little or even no change on the R&D side, and greatly improves the linkage between logging and tracing, reducing data silos.
- The ARMS operations alerting platform
ARMS provides a lot of data for further reducing MTTR, but how that data reaches SREs, DevOps, or R&D/operations staff still takes some thought.
ARMS therefore launched an operations alerting platform that handles event processing such as alert forwarding and triage in a visual way, and supports multiple integrations along with silencing and grouping. F6 currently integrates Prometheus, buffalo, ARMS, and Alibaba Cloud monitoring; cloud monitoring covers a lot of data, including ONS and Redis. R&D or SRE staff can claim the corresponding events in the DingTalk group. The platform also supports reports, scheduled reminders, and event escalation, which helps with post-incident review and improvement.
The screenshot above shows the interface for handling problems online. For example, alerts in the DingTalk group indicate who handled the last similar alert, show the alert list, and show the corresponding event-handling flow. The platform also provides filtering, can segment the content, and can fill templates through field enrichment or matching updates for precise alerting; this feature is gradually replacing the Go-based Apollo SDK application.
- Injecting a Java agent without code changes: JAVA_TOOL_OPTIONS
We also borrowed ARMS's approach of injecting an agent without any modification. ARMS injects much of its setup through an initContainer and also mounts a volume at home/admin/.opt to store logs. Because it uses an initContainer, it can upgrade itself automatically.
In the initContainer, the ARMS interface is called to obtain the latest version of the ARMS agent, the latest Java agent is downloaded into the mount directory, and that directory is shared with the application pod through a shared volume, so the Java agent is shared as well. The key point of the whole process is that the Java agent is loaded via JAVA_TOOL_OPTIONS.
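The general pattern looks roughly like the sketch below; the image names, agent path, and mount point are placeholders rather than the actual ARMS artifacts.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
spec:
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      volumes:
        - name: java-agent
          emptyDir: {}                      # shared scratch volume for the agent jar
      initContainers:
        - name: fetch-agent
          image: registry.example.com/agent-downloader:latest   # hypothetical downloader image
          command: ["sh", "-c", "cp -r /agent/* /shared/"]       # or download the latest agent here
          volumeMounts:
            - name: java-agent
              mountPath: /shared
      containers:
        - name: app
          image: registry.example.com/demo-service:1.0.0
          env:
            - name: JAVA_TOOL_OPTIONS        # picked up automatically by the JVM at startup
              value: "-javaagent:/opt/agent/agent.jar"
          volumeMounts:
            - name: java-agent
              mountPath: /opt/agent
```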
Following this approach, we built a similar process of our own, using the OpenKruise WorkloadSpread component to patch the Deployment. The simplest practice is to use an OpenKruise WorkloadSpread to annotate the corresponding Deployment, so that neither R&D nor the SRE team needs to do anything: we only need to write the corresponding CRD, and the CRD process directly injects JAVA_TOOL_OPTIONS (see the code in the lower-right corner of the figure above). The application scenarios are also quite rich; it can be used for traffic replay, automated testing, and so on.
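Since the code from the figure is not reproduced here, the following is a hedged sketch of what such a WorkloadSpread might look like; whether a given pod field can be patched this way depends on the OpenKruise version, and all names are placeholders.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: demo-service-agent-inject
spec:
  targetRef:                       # the existing Deployment itself stays untouched
    apiVersion: apps/v1
    kind: Deployment
    name: demo-service
  subsets:
    - name: with-agent
      patch:                       # merged into every pod created for this subset
        spec:
          containers:
            - name: app
              env:
                - name: JAVA_TOOL_OPTIONS
                  value: "-javaagent:/opt/agent/agent.jar"
```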
- Prometheus Exporter
Besides commercial products such as ARMS, we also actively embrace open source and the Prometheus community, and we have plugged in many exporters, including the SSL Exporter and the Blackbox Exporter, which greatly improve the observability of the whole system. The Blackbox Exporter performs black-box probing, for example checking whether HTTP, HTTPS, DNS, or TCP requests succeed; a typical scenario is checking whether a service's entry address is reachable. SSL certificate problems are also common; with the SSL Exporter we can poll periodically for certificates that are about to expire, further improving observability.
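A minimal sketch of such a black-box probe is shown below, assuming a Blackbox Exporter reachable at blackbox-exporter:9115; the module options and target URL are placeholders.

```yaml
# blackbox.yml: probe module definition
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      fail_if_not_ssl: true         # also surface plain-HTTP endpoints that should be HTTPS
---
# prometheus.yml: scrape job that drives the exporter
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health   # hypothetical service entry address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target         # hand the URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # scrape the exporter itself
```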
- Cost observability
Beyond day-to-day service observability, we also run optimization projects such as cost observability. In a cloud-native environment, the open source kubecost component can be used for cost optimization: it outputs resource usage and reports directly and feeds them back to R&D for optimization. It can even be used to check whether a workload's CPU-to-memory ratio is reasonable, so that resources are allocated as sensibly as possible.
Imagine the future
One-stop observability for Kubernetes based on eBPF
As cloud native moves into deeper waters, many problems are no longer confined to the application layer; they increasingly appear at the system and network layers and require lower-level data for tracing and troubleshooting. eBPF can better answer the questions raised by Google SRE around the golden signals: latency, traffic, error rate, and saturation.
For example, during a Redis failover a half-open TCP connection may form and affect the business; or, right after a TCP connection is established, whether the backlog is sized appropriately can be deduced from this kind of data.
Chaos engineering
Chaos engineering encourages and exploits observability, trying to help users preemptively discover and overcome system weaknesses. In June 2020, the CNCF proposed a Special Interest Group for Observability; in addition to the three pillars, it also covers chaos engineering and continuous optimization.
The community is still debating whether chaos engineering really belongs under observability, but the CNCF Observability Special Interest Group has included chaos engineering and continuous optimization in its scope. We think CNCF's approach has merit: chaos engineering can be regarded as an analysis tool built on observability, but its essential precondition is observability itself. Imagine a failure occurring during a chaos experiment: if we cannot even determine whether the failure was caused by the experiment, we are in serious trouble.
We can therefore use observability to minimize the blast radius and to locate problems, and use chaos engineering to continuously improve the system, discovering its weak points in advance and better safeguarding its stability.
OpenTelemetry
OpenTelemetry is an open source framework merged from several earlier projects. We need a unified, more end-user-oriented view of observability: for example, we expect logging, metrics, and tracing data to be correlated and tagged with one another, reducing data silos as much as possible, improving overall observability through correlation across data sources, and using observability to shorten online troubleshooting and buy time for business recovery.