Online failures? How to diagnose, troubleshoot, and recover quickly

Author: ten sleep

Overview

Stability comes first, so we need efficient ways to prevent online failures. Since failures can never be ruled out entirely, we also need to repair them quickly and limit their impact. With that in mind, we propose the 1-5-10 fast recovery goal: detect an online problem within 1 minute, locate it within 5 minutes, and recover within 10 minutes. This article introduces some best practices for fault diagnosis and recovery on Alibaba Cloud.

This article is excerpted from Section 3.9 of the Microservice Governance Technical White Paper

1-minute discovery

Monitoring

The role of monitoring can be summed up in one sentence: find problems in the application and alert technical staff so they can deal with them in time. Monitoring falls into two types, system monitoring and business monitoring. System problems are common software and hardware problems, such as program exceptions and full GCs. Because they carry no business-specific characteristics, the same monitoring strategy can be applied to all kinds of applications. Business problems are defined by a specific business scenario, such as a product having no coupon or benefits being over-issued, so the monitoring strategy has to be customized to the business.

Alibaba Cloud's Application Real-Time Monitoring Service (ARMS) automatically discovers and monitors common web and RPC frameworks in application code and collects metrics such as interface call counts, response time, and error counts. You can also drill down into the slow SQL report, MQ backlog analysis report, or exception classification report for an interface to analyze common problems such as errors and slow responses in more detail.

ARMS also provides business monitoring: you define business requests visually, without code changes, and get performance metrics and diagnostic capabilities that fit the business. This offers a way to measure application performance and stability from a business perspective and to monitor the full link of key business transactions. By tracing and collecting business information in applications, business monitoring displays business-level metrics in real time, such as response time, call count, and error rate, which addresses the difficulty of mapping application performance to business performance.

Monitoring has three requirements:
- Real-time: problems must be detected and reported as they happen, minimizing the delay between when a problem occurs and when it is discovered.
- Accurate: monitoring and alerting must be precise, including the problem definition, alert thresholds, alert levels, and owner configuration, to avoid false positives.
- Comprehensive: alert information must be complete enough to help troubleshoot and resolve the problem.
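To make the "real-time" and "accurate" requirements concrete, here is a minimal sketch of a sliding-window error-rate check. It is not how ARMS works internally; the class name `ErrorRateMonitor`, the 60-second window, the 5% threshold, and the minimum sample size are all illustrative assumptions.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window monitor: flags a high error rate over recent calls."""

    def __init__(self, window_seconds=60, threshold=0.05, min_calls=20):
        self.window_seconds = window_seconds  # how far back we look (assumed value)
        self.threshold = threshold            # alert when the error rate exceeds this
        self.min_calls = min_calls            # ignore tiny samples to limit false positives
        self._events = deque()                # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def error_rate(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now):
        self._evict(now)
        return len(self._events) >= self.min_calls and self.error_rate(now) > self.threshold
```

The short window keeps detection close to real time, while the minimum sample size is one simple way to avoid alerting on a handful of unlucky calls.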

"Whatever goes wrong in the application, ARMS clearly shows which line of code the problem is in. ARMS is very important to us: it has greatly shortened the time to fix failures and significantly improved the user experience. Since adopting ARMS, we can detect and fix problems promptly and are no longer troubled by user complaints." (China Resources Vanguard)

Alerting

When monitoring detects a problem, technical staff must be notified promptly through alerts of different severity levels so they can handle it. ARMS alert management improves operations efficiency in the following ways.

Integrated event management is more efficient.
- Alert management supports one-click integration with common Alibaba Cloud monitoring tools by default and manual integration of other monitoring tools, which makes unified maintenance easier.
- The event ingestion module is stable and provides uninterrupted 24/7 event processing.
- Low latency is maintained even when processing massive volumes of event data.

Alerts reach contacts promptly and accurately.
- Notification rules merge events before sending alert notifications, which reduces notification fatigue for operations staff.
- Depending on the urgency of the alert, different notification channels such as email, SMS, phone, and DingTalk can be chosen to remind contacts to handle it.
- Alerts left unhandled for a long time trigger repeated reminders through escalation notifications so that they are resolved in time.

Alerts are quick and easy to manage.
- Contacts can handle alerts at any time through DingTalk.
- A common alert format helps contacts analyze alerts.
- Multiple contacts can collaborate on an alert through DingTalk.

Alert data is collected and analyzed in real time, which improves alert handling efficiency.
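The merging and escalation behavior described above can be sketched as follows. This is an illustrative model only, not the ARMS implementation: the `AlertManager` class, the alert key format, and the escalation delays (DingTalk immediately, SMS after 5 minutes, phone after 15 minutes) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    key: str                     # e.g. "order-service/high-error-rate" (hypothetical naming)
    first_seen: float
    count: int = 1               # how many raw events were merged into this alert
    acknowledged: bool = False
    notified_channels: list = field(default_factory=list)


class AlertManager:
    # Escalation ladder: (seconds unacknowledged, channel). Values are illustrative.
    ESCALATION = [(0, "dingtalk"), (300, "sms"), (900, "phone")]

    def __init__(self):
        self.alerts = {}

    def ingest(self, key, now):
        """Merge repeated events with the same key into one alert."""
        alert = self.alerts.get(key)
        if alert is None or alert.acknowledged:
            alert = Alert(key=key, first_seen=now)
            self.alerts[key] = alert
        else:
            alert.count += 1
        return alert

    def due_notifications(self, now):
        """Return (key, channel) pairs to send now; each channel fires at most once."""
        due = []
        for alert in self.alerts.values():
            if alert.acknowledged:
                continue
            for delay, channel in self.ESCALATION:
                if now - alert.first_seen >= delay and channel not in alert.notified_channels:
                    alert.notified_channels.append(channel)
                    due.append((alert.key, channel))
        return due

    def acknowledge(self, key):
        self.alerts[key].acknowledged = True
```

Three events with the same key produce a single DingTalk notification; if nobody acknowledges the alert, later calls to `due_notifications` escalate to the next channel.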

5-minute fault location

Service instance isolation and diagnostics

In an online microservice scenario, when some instances of a service provider become abnormal, we need to stop service consumers from calling the abnormal instances while preserving the failure scene for later troubleshooting. There is a second consideration: dumping memory affects application performance to some extent and may therefore affect online business, so it is better to isolate business traffic from the instance before dumping. The service instance isolation and diagnosis feature of the MSE governance center isolates traffic to abnormal instances. It supports isolating both microservice traffic and traffic from the K8s Service, so the instance can be fully isolated from production traffic. We can then use the memory snapshot capability of ARMS to capture a memory snapshot of the abnormal instance in the online environment for later analysis and diagnosis. Together, these capabilities help us handle sudden online incidents such as memory leaks and improve the overall stability of the microservice system.
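The ordering matters here: take the instance out of routing first, wait for traffic to drain, and only then take the snapshot. A minimal sketch of that sequence, using a toy in-memory registry rather than the MSE or ARMS APIs (every name below, including `ServiceRegistry` and `isolate_and_diagnose`, is hypothetical):

```python
class ServiceRegistry:
    """Toy in-memory registry used only for illustration."""

    def __init__(self):
        self._instances = {}  # service name -> {ip: online flag}

    def register(self, service, ip):
        self._instances.setdefault(service, {})[ip] = True

    def take_offline(self, service, ip):
        # The process keeps running (the scene is preserved); it just stops receiving calls.
        self._instances[service][ip] = False

    def bring_online(self, service, ip):
        self._instances[service][ip] = True

    def routable(self, service):
        return sorted(ip for ip, online in self._instances.get(service, {}).items() if online)


def isolate_and_diagnose(registry, service, ip, in_flight_requests, take_snapshot):
    """Isolate first, snapshot only once the instance has no traffic left.

    `in_flight_requests` and `take_snapshot` are caller-supplied callbacks
    standing in for real monitoring and heap-dump tooling.
    """
    registry.take_offline(service, ip)
    if in_flight_requests(ip) > 0:
        return None  # still draining; the caller should retry later
    return take_snapshot(ip)
```

After diagnosis, `bring_online` models restoring traffic to the instance.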

Practice

We can see a list of online instances in the MSE Service Governance console.

Select the abnormal instance and perform the service offline operation to remove it from the registry. If the readiness probe provided by MSE is also configured, traffic from the K8s Service is isolated as well. We can then check in the event center whether the instance went offline successfully.

After the offline operation, the per-second node monitoring provided by MSE shows whether the instance still receives traffic. Once traffic has stopped completely, create a memory snapshot of the abnormal instance with the memory snapshot feature of ARMS for further troubleshooting.

Click the Go to Create Memory Snapshot button to create a memory snapshot.

Click Save to create a snapshot task.

After clicking Save, we can see that the instance 172.16.0.200 already has a snapshot.

Following the console prompts, we then dump, analyze, and view the core file in turn.

Clicking View jumps to the Grace analysis page, which shows the memory analysis overview, memory leak report, class loaders, and other information. The detailed memory analysis data helps further troubleshoot memory problems such as memory leaks and memory waste.

Finally, we can restore the isolated traffic by bringing the service back online.

Arthas Diagnosis

Arthas is a powerful tool for diagnosing online problems in Java applications. It uses bytecode enhancement so that you can inspect the running state of a program without restarting the JVM process.

JVM overview

The JVM overview shows JVM-related information about the application, including JVM memory, operating system information, and variable information, to give us an overall picture of the JVM.

1. JVM memory: Information about JVM memory, including heap memory usage, non-heap memory usage, GC, etc.

2. Operating system information: related information of the operating system, including average load, operating system name, operating system version, Java version, etc.

3. Variable information: information about variables, including system variables and environment variables.

Thread time-consuming analysis

Thread time-consuming analysis lists all threads of the application together with their stack information, helping us quickly locate the threads that consume the most time.

1. The thread time-consuming analysis tab obtains the time consumption of the threads in the current JVM process in real time and aggregates similar threads. You can view each thread's ID, CPU usage, and status.

2. In the operation column to the right of the target thread, click to view its real-time stack.
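One plausible way to "aggregate similar threads" is to group threads whose names differ only in numeric parts, such as pool worker indexes, and sum their CPU usage. The sketch below illustrates that idea; the grouping rule and the function name `aggregate_threads` are assumptions, not the console's documented behavior.

```python
import re
from collections import defaultdict


def aggregate_threads(threads):
    """Group threads by name pattern and sort the groups by total CPU usage.

    threads: list of dicts with 'name' and 'cpu' (percent), a hypothetical input shape.
    """
    groups = defaultdict(lambda: {"count": 0, "cpu": 0.0})
    for thread in threads:
        # Replace every run of digits so "exec-1" and "exec-2" fall into one group.
        pattern = re.sub(r"\d+", "*", thread["name"])
        groups[pattern]["count"] += 1
        groups[pattern]["cpu"] += thread["cpu"]
    return sorted(groups.items(), key=lambda item: item[1]["cpu"], reverse=True)
```

Sorting by aggregated CPU puts a busy thread pool at the top even when no single worker thread stands out.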

Method execution analysis

Method execution analysis captures the duration, input parameters, return value, and other details of a single execution of a method and lets you drill into it. This helps you quickly locate the root cause of slow calls and investigate problems that cannot be reproduced offline or for which logs are missing.

The execution time of each internal method call is displayed as a comment in the source code.

Object Viewer

The object viewer shows the current state of singleton objects, such as application configuration, blacklists and whitelists, and member variables, to help troubleshoot abnormal application state.

Real-time dashboard

The real-time dashboard shows the real-time status of key components used in the system, such as database connection pool usage and HTTP connection pool usage, which helps troubleshoot resource-related problems.

For example, the real-time status of a Druid connection pool includes its basic configuration, pool status, and the distribution of execution times.

Performance analysis

Performance analysis samples CPU time, memory allocation, and similar data over a period of time and generates the corresponding flame graphs, helping you quickly locate the application's performance bottlenecks.
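A flame graph is built from sampled call stacks: identical stacks are counted, and the width of each frame reflects how many samples contain it. As a rough sketch of that aggregation step, the code below collapses samples into the common "folded stacks" text form (frames joined by semicolons, followed by a count). The input shape and function names are assumptions for illustration, not the profiler's actual interface.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks into 'frame;frame;frame count' lines.

    samples: iterable of call stacks, each a list of frames from root to leaf.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def self_time_share(samples):
    """Fraction of samples in which each method was the leaf (currently executing) frame."""
    leaves = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaves.values())
    return {method: n / total for method, n in leaves.items()}
```

For example, if half of all samples end in the same leaf method, that method accounts for roughly half of the sampled CPU time and is the first place to look.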

10-minute recovery

Outlier Instance Removal

In a microservice architecture, when some instances of a service provider become abnormal and the service consumer cannot detect it, service calls fail or slow down, which hurts the consumer's performance and even its availability. Outlier instance removal monitors the availability of application instances and adjusts routing dynamically so that service calls succeed, improving business stability and service quality.
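A minimal sketch of the detection step, assuming the consumer keeps per-instance call statistics. The thresholds and the cap on how many instances may be removed at once are illustrative assumptions (a cap of this kind is a common safeguard against ejecting the whole cluster, but the source does not describe one), and `select_outliers` is a hypothetical name.

```python
def select_outliers(stats, error_rate_threshold=0.5, min_requests=10, max_ejection_ratio=0.5):
    """Pick instances to remove from the routing list.

    stats: {ip: (total_requests, failed_requests)} observed by the consumer.
    Returns the IPs to eject, worst error rate first.
    """
    candidates = []
    for ip, (total, failed) in stats.items():
        # Skip instances with too few requests to judge reliably.
        if total >= min_requests and failed / total >= error_rate_threshold:
            candidates.append((failed / total, ip))
    # Never eject more than a fixed share of all instances.
    limit = int(len(stats) * max_ejection_ratio)
    candidates.sort(reverse=True)
    return [ip for _, ip in candidates[:limit]]
```

Ejected instances would normally be probed again later and restored once they recover; that recovery step is omitted here.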

Service circuit breaking and degradation

During peak business periods, a downstream service provider may hit a performance bottleneck that threatens the business. We can apply circuit breaking on some service consumers so that persistently unstable calls are broken automatically, improving the stability of the overall service. When a downstream service that the application depends on becomes unavailable, business traffic is lost. With circuit breaking configured, calls to the abnormal downstream service are degraded so that traffic fails fast on the calling side, effectively preventing an avalanche.

For example, during peak business hours some downstream service providers hit performance bottlenecks that can even affect the business. We configure automatic circuit breaking on some non-critical service consumers. When the proportion of slow calls or errors within a period reaches a configured threshold, the circuit breaker trips automatically, and for a period of time service calls return a mock result directly. This keeps the caller from being dragged down by the unstable service, gives the unstable downstream service some time to "breathe", and keeps the overall business link running normally.
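The trip-and-mock behavior can be sketched as a small circuit breaker. This is a simplified illustration, not the MSE implementation: it counts only errors (not slow calls), uses cumulative counters since the last reset instead of a true sliding window, and has no half-open probing state. All parameter names and defaults are assumptions.

```python
import time


class CircuitBreaker:
    """Hypothetical error-ratio circuit breaker with a mock fallback."""

    def __init__(self, failure_ratio=0.5, min_calls=5, open_seconds=10.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio  # trip when failures / calls reaches this
        self.min_calls = min_calls          # do not trip on very small samples
        self.open_seconds = open_seconds    # how long to fail fast after tripping
        self.clock = clock                  # injectable for testing
        self.calls = 0
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.open_seconds:
                return fallback()           # open: fail fast with the mock result
            self.opened_at = None           # cool-down over: reset and try again
            self.calls = self.failures = 0
        self.calls += 1
        try:
            return func()
        except Exception:
            self.failures += 1
            if self.calls >= self.min_calls and self.failures / self.calls >= self.failure_ratio:
                self.opened_at = now
            return fallback()
```

While the breaker is open, the downstream function is not invoked at all, which is what gives the unstable service its breathing room.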

In other scenarios, service degradation helps protect important services. When some non-critical services are unstable, we can temporarily degrade calls to these weakly dependent services before an important event and reserve resources for core services, keeping the overall business running smoothly.

Outlier instance removal, circuit breaking, and service degradation differ mainly in two respects:

1. Automation: service degradation is an operations action. It has to be configured through the console, where the service name is specified, to take effect. Outlier instance removal and circuit breaking, by contrast, actively monitor whether upstream nodes are alive and whether calls are failing or slow, and automatically isolate or break the affected link to protect service quality.

2. Granularity: service degradation applies to a (service + node IP) pair. Taking Dubbo as an example, a process publishes microservices using the service interface name (Interface) as the service name. If degradation is triggered for one service, that service on that node is no longer called, but other services on the node still are. Outlier instance removal, in contrast, stops calls to the entire node.
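The granularity difference can be shown with a small filter on the consumer side. This is only an illustration of the two keys involved, (service, node IP) versus node IP alone; the function name and data shapes are hypothetical.

```python
def callable_targets(providers, degraded, ejected):
    """Return the instances a consumer may still call, per service.

    providers: {service: [ip, ...]}
    degraded:  set of (service, ip) pairs hit by service degradation
    ejected:   set of ips removed as outlier instances
    """
    result = {}
    for service, ips in providers.items():
        result[service] = [
            ip for ip in ips
            if ip not in ejected and (service, ip) not in degraded
        ]
    return result
```

Degrading `("CouponService", "10.0.0.1")` leaves other services on 10.0.0.1 callable, whereas ejecting `"10.0.0.1"` removes the node for every service it provides.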

Flow control, expansion, restart, rollback

Flow control: Shapes arbitrary incoming traffic based on metrics such as request rate, number of concurrent threads, and response time; this is known as traffic shaping. Flow control rules configured for a service interface let requests within capacity pass and reject the excess, acting like an airbag. Protection is layered: coarse-grained protection at the Nginx/Ingress gateway layer, and control at API, interface, method, and parameter granularity at the microservice layer. This prevents applications from being overwhelmed by instantaneous traffic peaks and keeps them highly available.
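One common way to let requests within capacity pass and reject the rest is a token bucket. The sketch below is a generic illustration of that technique, not the algorithm MSE uses; the class name, rate, and burst values are assumptions.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter: steady refill rate with a bounded burst."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum tokens that can accumulate
        self.tokens = float(burst)    # start full so an initial burst is allowed
        self.clock = clock            # injectable for testing
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True               # request passes
        return False                  # request is rejected (or queued by the caller)
```

A limiter like this could guard a single interface or method; a coarser one at the gateway would protect the whole application.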

Capacity expansion: Scale out horizontally to improve cluster availability.
Restart: Restart the JVM process to temporarily clear problems that accumulate over long runs, such as memory leaks.
Rollback: Revert a change to eliminate the problems it introduced.

One-click traffic switching based on same-availability-zone priority

Within the same city, RT is generally low (< 3 ms), so by default we can treat different availability zones in the same city as one large LAN and distribute our applications across them. That way, when a single AZ fails, the impact of the failure is easier to contain.

MSE service governance provides same-availability-zone priority routing at the service framework level: if instances of the target service exist in the caller's availability zone, traffic is routed to them preferentially. When an availability zone becomes unavailable, we only need to switch traffic at the gateway to isolate the failed zone, and the business recovers immediately.
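The routing preference can be sketched as a simple selection function. This is an illustrative model, not the MSE routing code: the `pick_instance` name, the `(ip, zone)` tuple shape, and the explicit `unavailable_zones` parameter are assumptions made for the example.

```python
import random


def pick_instance(instances, local_zone, unavailable_zones=frozenset(), rng=random):
    """Prefer instances in the caller's zone; fall back to other healthy zones.

    instances: list of (ip, zone) tuples for the target service.
    """
    healthy = [(ip, zone) for ip, zone in instances if zone not in unavailable_zones]
    same_zone = [ip for ip, zone in healthy if zone == local_zone]
    # Use the local zone when it has instances, otherwise any healthy instance.
    pool = same_zone or [ip for ip, _ in healthy]
    if not pool:
        raise RuntimeError("no routable instance")
    return rng.choice(pool)
```

Because calls stay within a zone under normal conditions, a failed zone affects only the traffic that entered through it, which is why switching traffic at the gateway is enough to recover.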

Summary

The 1-5-10 goal means detecting a fault within 1 minute, locating it within 5 minutes, and recovering within 10 minutes. Only by continually designing for failure and rehearsing emergency procedures can we face online faults calmly when they occur. We hope that the next generation of cloud-native microservices will have stronger self-healing capabilities: a microservice architecture that automatically senses the failure of external components and switches to a backup link, truly nipping failures in the bud.

