
Faced with complex and ever-changing business and operations environments, many teams rack their brains to keep the business running without interruption.
Yet things still go wrong: a disgruntled engineer deletes the production database and disappears, external construction work cuts through a fiber-optic cable, or a small misconfiguration in an internal or external foundational service knocks out services across half the globe. Problems large and small keep plenty of people scrambling for hours, and they damage the business and even the company's reputation.

As the saying goes: the harder it is to disrupt the steady state, the more confidence we have in the behavior of the system; and wherever a weakness can be found, we have a goal for improvement.

In the past, when critical applications were deployed and run on premises, many factors, including the infrastructure and underlying hardware, were under the enterprise's own control, so weaknesses were relatively easy to find and fix (even if doing so required significant money and manpower). But once companies move to the cloud and run these critical applications on a cloud platform, the management and maintenance of the underlying infrastructure is taken over by the platform. How, then, do we address weaknesses and build a more stable and resilient operating environment and application?

Starting from design principles, this article explains how to improve the resilience of cloud-native applications so that the business keeps running steadily even when incidents occur.


All applications that communicate with remote services and resources must be prepared for transient faults. This is especially true of applications running in the cloud, because the nature of the environment and the reliance on connections made over the Internet mean that such problems are more likely to occur. Transient faults include momentary loss of network connectivity between the client and the service, temporary unavailability of a backend service, and timeouts caused by excessive concurrency. These errors are usually self-healing, and if the impact of a fault can be contained, the effect on end users can be minimized.

Why do transient faults occur in the cloud?

Transient faults can occur in any environment, on any platform or operating system, and in any kind of application. For solutions running on local infrastructure, the performance and availability of the application and its components are usually guaranteed by expensive, under-utilized redundant hardware. This approach reduces the likelihood of failure, but unpredictable events such as external power or network problems, or other disaster scenarios, can still cause transient faults or even outages.

Managed cloud services (PaaS) achieve higher overall availability by using shared resources, redundancy, automatic failover, and dynamic resource allocation across many compute nodes. But the nature of these environments means that transient faults are more likely to occur, for several reasons:

  • Many resources in a cloud environment are shared, and access to them is usually strictly controlled so that they can be managed effectively. For example, when load rises to a certain level, or the throughput rate limit is reached, some services refuse additional connections so that they can keep processing existing requests and maintain performance for all current users. Such throttling helps preserve quality of service for neighbors and other tenants sharing the resource.
  • Cloud environments are built from large numbers of commodity hardware units. They distribute load dynamically across multiple compute units and infrastructure components for better performance, and they provide reliability by automatically recycling or replacing failed units. This dynamic nature means that transient faults or momentary connection failures occasionally occur.
  • There are often several hardware components between the application and the resources and services it uses, including network infrastructure such as routers and load balancers. These additional components occasionally introduce extra connection latency or momentary connection failures.
  • Network conditions between the client and the server vary over time, especially when communication crosses the Internet. Even on premises, heavy traffic can slow communication and cause intermittent connection failures.

The challenge

Transient faults can have a huge impact on the availability that users perceive, even if the application has been thoroughly tested under all foreseeable conditions. To run reliably, a cloud-hosted application must be able to meet the following challenges:

  • The application must be able to detect when a fault occurs and determine whether the fault is likely to be transient, long-lasting, or terminal. Different resources are likely to return different responses when a fault occurs, and those responses can also vary by operation; for example, the response returned for an error while reading from storage differs from the response for an error while writing to storage. Many resources and services document their transient-failure behavior, but where such information is not available it can be difficult to discover the nature of a fault and whether it is likely to be transient.
  • If a fault is determined to be transient, the application must be able to retry the operation, and it must track the number of times the operation has been retried.
  • The application must use an appropriate retry strategy. The strategy specifies how many times to retry, the delay between attempts, and what to do after a failed attempt. The appropriate number of attempts and the delay between them are often difficult to determine, and they vary with the type of resource as well as the current operating conditions of the application itself.

Resilience Design Guide

The following guidelines will help you design a suitable transient fault handling mechanism for your application:

Determine if there is a built-in retry mechanism

  • Many services provide SDKs or client libraries that contain a transient fault handling mechanism. The retry strategy they use is usually tailored to the nature and requirements of the target service. Alternatively, the service's REST interface may return information that helps determine whether a retry is appropriate and how long to wait before the next attempt.
  • Use the built-in retry mechanism where one is available, unless you have specific, well-understood requirements that make a different retry behavior more appropriate. A minimal sketch of configuring such a built-in mechanism follows this list.
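As an illustration of relying on a built-in mechanism, here is a minimal sketch using Python's requests library together with urllib3's Retry helper; the endpoint URL is a placeholder, and the retry count, back-off factor, status codes, and methods are assumptions you would tune to the contract of the actual service.

```python
# Minimal sketch: reuse the retry mechanism built into urllib3/requests instead
# of writing a custom loop. The URL, counts, and status codes are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                      # at most 3 retries
    backoff_factor=0.5,                           # exponential back-off between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # status codes treated as transient
    allowed_methods=["GET", "PUT", "DELETE"],     # retry only idempotent methods
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# Any request made through this session now transparently retries transient failures.
response = session.get("https://example.com/api/orders", timeout=5)
response.raise_for_status()
```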

Determine if the operation is suitable for retry

  • Retry an operation only when the fault is transient and there is at least some likelihood that the operation will succeed when reattempted. There is no point retrying operations that indicate an invalid request, such as a database update to an item that does not exist, or a request to a service or resource that has suffered a fatal error.
  • In general, retry only when the full impact of doing so can be determined and the conditions are well understood and can be verified; otherwise, leave it to the calling code to implement retries. Keep in mind that the errors returned by resources and services outside your control may evolve over time, and you may need to revisit the logic used to detect transient faults.
  • When creating a service or component, consider implementing error codes and messages that help clients decide whether a failed operation should be retried; in particular, indicate whether the client should retry and suggest a suitable delay before the next attempt. If you are building a web service, consider returning custom errors defined in the service contract. Even though generic clients may not be able to read this information, it is very useful when building custom clients. A minimal sketch of this kind of classification logic follows this list.
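To make the detection step concrete, here is a minimal sketch of such classification logic; the mapping of HTTP status codes to transient or permanent failures is an illustrative assumption, not the documented contract of any particular service.

```python
# A minimal sketch of classifying failures as transient or permanent based on
# HTTP status codes. The classification below is illustrative; a real client
# should follow the error contract documented by the target service.
TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}   # timeouts, throttling, temporary server errors
PERMANENT_STATUS_CODES = {400, 401, 403, 404, 409}        # invalid requests: retrying is pointless

def is_transient(status_code: int) -> bool:
    """Return True when a retry has at least some chance of succeeding."""
    if status_code in PERMANENT_STATUS_CODES:
        return False
    return status_code in TRANSIENT_STATUS_CODES
```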

Determine the appropriate retry count and interval

  • Optimizing the retry count and interval for the type of use case is critical. With too few retries, the application cannot complete the operation and is likely to fail; with too many retries, or intervals that are too short, the application can hold resources such as threads, connections, and memory for long periods, which harms the health of the application itself.
  • The appropriate values for the interval and the number of retries depend on the type of operation being attempted. For example, if the operation is part of a user interaction, the interval should be short and only a few attempts should be made, to avoid making users wait for a response (which also holds connections open and can reduce availability for other users). If the operation is part of a long-running or critical workflow, where canceling and restarting the process is costly and time-consuming, it is appropriate to wait longer between attempts and to retry more times.
  • Determining the appropriate interval between retries is the hardest part of designing a successful strategy. Typical strategies use the following types of retry interval (a minimal sketch combining exponential back-off with randomization appears after this list):
    (A) Exponential back-off: The application waits briefly before the first retry, and the interval between subsequent retries grows exponentially. For example, it retries the operation after 3 seconds, 12 seconds, 30 seconds, and so on.
    (B) Incremental intervals: The application waits briefly before the first retry, and the interval between subsequent retries grows incrementally. For example, it retries the operation after 3 seconds, 7 seconds, 13 seconds, and so on.
    (C) Fixed intervals: The application waits the same amount of time between every attempt. For example, it retries the operation every 3 seconds.
    (D) Immediate retry: Some transient faults are extremely short-lived, perhaps caused by an event such as a network packet collision or a spike in a hardware component. In this case it is appropriate to retry immediately, because the operation may succeed if the fault has cleared by the time the application assembles and sends the next request. However, there should never be more than one immediate retry; if the immediate retry fails, switch to an alternative strategy such as exponential back-off or fallback actions.
    (E) Randomization: Any of the strategies above may include randomization (jitter) to prevent multiple client instances from sending their subsequent retries at the same time. For example, one instance might retry after 3, 11, and 28 seconds, while another retries after 4, 12, and 26 seconds. Randomization is a useful technique that can be combined with the other strategies.
  • As a general rule, use an exponential back-off strategy for background operations, and immediate or fixed-interval retries for interactive operations. In both cases, choose the delay and the retry count so that the maximum latency of all retries stays within the required end-to-end latency.
  • Consider the combination of all the factors that contribute to the overall maximum timeout of a retried operation: the time it takes a failed connection to produce a response (usually set by a timeout value in the client), the delay between retry attempts, and the maximum number of retries. The sum of all these times can result in a very long overall operation time, especially with an exponential back-off strategy, where the retry interval grows rapidly after each failure. If the process must meet a specific service level agreement (SLA), the entire operation time, including all timeouts and delays, must fall within the limits the SLA defines.
  • Overly aggressive retry strategies (intervals that are too short or retries that are too frequent) can adversely affect the target resource or service. They may prevent the resource or service from recovering from its overloaded state, so it keeps blocking or rejecting requests. This creates a vicious circle in which ever more requests are sent to the resource or service, further reducing its ability to recover.
  • When selecting the retry interval, consider the timeout period of the operation to avoid starting subsequent attempts immediately (for example, if the timeout period is similar to the retry interval). Also consider whether you need to keep the total possible time (timeout plus retry interval) below a certain total time. Operations with unusually short or unusually long timeouts may affect how long to wait and how often to retry operations.
  • Use the type of exception and any data it contains, or the error code and message returned from the service, to optimize the interval and number of retries. For example, some exceptions or error codes (such as HTTP code 503 Service Unavailable with the Retry-After header in the response) may indicate how long the error may last, or the service has failed and will not respond to any subsequent attempts.
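The following is a minimal sketch of an exponential back-off retry loop with randomization and a bounded number of attempts; `call_service`, `TransientError`, and the delay values are hypothetical placeholders standing in for the real operation and the transient-fault detection described above.

```python
# A minimal sketch of exponential back-off with full jitter and a cap on the
# number of attempts. The operation and the exception type are placeholders.
import random
import time

class TransientError(Exception):
    """Raised when a failure has been judged to be transient."""

def call_with_retries(call_service, max_attempts=4, base_delay=1.0, max_delay=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_service()
        except TransientError:
            if attempt == max_attempts:
                raise                                        # give up after the last attempt
            # exponential back-off: 1s, 2s, 4s, ... capped at max_delay
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # full jitter so that many clients do not retry in lock-step
            time.sleep(random.uniform(0, delay))
```

Swapping in a fixed or incremental interval only changes how `delay` is computed; the surrounding structure, including the cap on attempts, stays the same.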

Avoid anti-patterns

  • In most cases, avoid implementations that add layer upon layer of retry code. Avoid designs with cascading retry mechanisms, and avoid implementing retries at every stage of an operation that spans a request hierarchy, unless there is a specific requirement to do so. In those exceptional cases, use policies that prevent excessive retries and delays, and make sure you understand the consequences. For example, if one component makes a request to another, which in turn accesses the target service, and both calls retry three times, the service ends up being retried nine times in total. Many services and resources implement a built-in retry mechanism; if you need to retry at a higher level, investigate how to disable or adjust the built-in one.
  • Never implement an endless retry mechanism. This may prevent resources or services from recovering from overload conditions and cause throttling and connection denials for longer periods of time. Use a limited number of retries, or implement a pattern (such as a circuit breaker) to allow service recovery.
  • Do not retry more than once immediately.
  • Avoid using a regular (fixed) retry interval when a large number of clients retry against services and resources in Azure, especially when many retry attempts are made; in this situation, the best approach is an exponential back-off strategy combined with circuit-breaking capability.
  • Prevent multiple instances of the same client, or multiple different clients, from sending retries at the same time. If this is likely to happen, introduce randomization into the retry intervals.

Test retry strategy and implementation

Make sure to test the implementation of your retry strategy as thoroughly as possible and under the widest practical range of conditions, especially when both the application and the target resources or services it uses are under extreme load. To check behavior during testing, you can:

  • Inject transient and non-transient faults into the service, for example by sending invalid requests, or by adding code that detects test requests and responds with different types of error.
  • Create a mock of the resource or service that returns the range of errors the real service might return, making sure to cover every type of error the retry strategy is designed to detect (a minimal sketch of such a mock follows this list).
  • If the service is a custom one that you created and deployed yourself, temporarily disable or overload it to force transient faults (of course, you should not attempt to overload any shared resources or shared services in Azure).
  • For HTTP-based APIs, you can consider using the FiddlerCore library in automated testing to change the result of the HTTP request by adding additional round-trip time or changing the response (such as HTTP status code, headers, body, or other factors). This allows deterministic testing of a subset of fault conditions, whether it is transient faults or other types of faults.
  • Perform high load factor and concurrency testing to ensure that retry mechanisms and strategies work correctly under these conditions and will not adversely affect client operations or cause cross-contamination between requests.
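One way to build such a mock is sketched below; `FlakyService`, the exception type, and the simple bounded loop standing in for the retry strategy under test are all illustrative assumptions.

```python
# A minimal sketch of a mock service that fails a scripted number of times and
# then succeeds, used to exercise a retry strategy in tests.
class TransientError(Exception):
    pass

class FlakyService:
    """Raises TransientError a fixed number of times, then succeeds."""
    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def call(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientError("simulated transient fault")
        return "ok"

def test_retry_absorbs_transient_faults():
    service = FlakyService(failures_before_success=2)
    # Drive the operation through a simple bounded retry loop, standing in for
    # the real retry strategy under test.
    result = None
    for _ in range(3):
        try:
            result = service.call()
            break
        except TransientError:
            continue
    assert result == "ok"
```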

Manage retry policy configuration

  • A retry policy is the combination of all the elements of your retry strategy. It defines the detection mechanism used to decide whether a fault is likely to be transient, the type of interval to use (such as fixed, exponential back-off, and randomization), the actual interval values, and the number of retries.
  • Even a simple application may need retries in many places, and every layer of a more complex application may need them. Rather than hard-coding the elements of each policy in multiple locations, consider storing all policies centrally. For example, store values such as the interval and retry count in the application configuration file, read them at run time, and build the retry policies programmatically. This makes the settings easier to manage and easier to modify and fine-tune as requirements and scenarios change. Design the system to cache these values rather than re-reading the configuration file on every use, and make sure sensible defaults are used when the values cannot be obtained from configuration (a minimal sketch of this approach follows this list).
  • In Azure cloud native applications, consider storing the values used to build runtime retry policies in the service configuration file so that they can be changed without restarting the application.
  • Use the built-in or default retry policies provided by client APIs, but only where they suit your scenario. These policies are deliberately general-purpose: in some scenarios they may be all you need, while in others they may not offer every option required. Testing is the only way to determine the most appropriate values, so you must understand how each setting affects your application.
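A minimal sketch of centrally configured retry settings with defaults follows; the file name `retry_policy.json`, its keys, and the default values are illustrative assumptions rather than a prescribed format.

```python
# A minimal sketch of loading retry-policy settings from one central
# configuration file, falling back to defaults when values are missing.
import json
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    strategy: str = "exponential"   # "exponential", "incremental", or "fixed"

def load_retry_policy(path="retry_policy.json") -> RetryPolicy:
    try:
        with open(path) as f:
            settings = json.load(f)
    except (OSError, json.JSONDecodeError):
        return RetryPolicy()            # fall back to safe defaults
    return RetryPolicy(
        max_attempts=settings.get("max_attempts", 3),
        base_delay_seconds=settings.get("base_delay_seconds", 1.0),
        strategy=settings.get("strategy", "exponential"),
    )

# Load once at startup and reuse, rather than re-reading the file on every call.
POLICY = load_retry_policy()
```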

Record and track transient and non-transient faults

  • Log retry attempts as part of the retry strategy, alongside exception handling and other checks. Although an occasional transient failure and retry is expected and does not indicate a problem, a regular and increasing number of retries is usually a sign of a problem that may eventually cause a failure, or a sign that application performance and availability are currently degrading.
  • Record transient faults as warning items instead of error items so that the monitoring system does not detect them as application errors that may trigger false alarms.
  • Consider storing a value in the log entry that indicates whether a retry was caused by throttling in the service or by another type of error (such as a connection failure), so that the two can be distinguished when analyzing the data. A rise in the number of throttling errors often indicates a design flaw in the application, or the need to switch to a premium service running on dedicated hardware (a minimal sketch of this kind of logging follows this list).
  • Consider measuring and recording the total time spent in operations involving retry mechanisms. This is a good indicator of the overall impact of transient errors on user response time, processing delays, and application use case efficiency. Also record the number of retries that occur in order to understand the factors that affect response time.
  • Consider implementing a telemetry and monitoring system that can alert when the number and rate of failures, the average number of retries, or the total time required for the operation to succeed increases.
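A minimal sketch of this kind of logging is shown below; the `throttled` flag, the log field names, and the use of HTTP 429 as the throttling signal are illustrative assumptions.

```python
# A minimal sketch of recording retry attempts as warnings rather than errors,
# tagging whether the failure was throttling, and measuring total elapsed time.
import logging
import time

logger = logging.getLogger("retry")

def log_retry(attempt, status_code, started_at):
    throttled = status_code == 429             # throttling vs. other transient faults
    elapsed = time.monotonic() - started_at    # total time spent so far, including retries
    logger.warning(
        "transient fault: attempt=%d status=%d throttled=%s elapsed=%.2fs",
        attempt, status_code, throttled, elapsed,
    )
```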

Manage operations that continue to fail

  • In some cases an operation will fail on every attempt, and it is crucial to consider how to handle this situation.
  • Although the retry policy defines the maximum number of times an operation should be retried, it does not stop the application from repeating the operation, with the same number of retries, all over again. For example, if an order-processing service fails with a fatal error that takes it out of action permanently, the retry policy may detect a connection timeout, treat it as a transient fault, retry the specified number of times, and then give up. Yet when another customer places an order, the operation is attempted again, even though it is certain to fail every time.
  • To prevent continual retries of operations that continually fail, consider implementing the circuit breaker pattern: if the number of failures within a given time window exceeds a threshold, requests are returned to the caller immediately as errors, without any attempt to access the failed resource or service (a minimal sketch appears after this list).
  • The application can periodically test the service (intermittently, with long intervals between requests) to detect when the service is available. The appropriate interval depends on the scenario, such as the criticality of the operation and the nature of the service, and may be anywhere from a few minutes to a few hours. When the test is successful, the application can resume normal operation and pass the request to the newly restored service.
  • In the meantime, you can fall back to another instance of the service (perhaps in a different data center or application), use a similar service that offers compatible (perhaps simpler) functionality, or perform some alternative operation in the expectation that the service will soon be available again. For example, it may be appropriate to store the requests in a queue or data store and replay them later. Alternatively, you might redirect the user to another instance of the application, degrade the application's performance while still providing acceptable functionality, or simply return a message telling the user that the application is currently unavailable.
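The following is a minimal sketch of the circuit breaker idea described above; the class name, failure threshold, and cool-down period are illustrative assumptions, and a production implementation would also need thread safety and a shared view of service health.

```python
# A minimal sketch of a circuit breaker: after too many consecutive failures the
# call fails fast, and only after a cool-down period is a single trial request
# allowed through to probe whether the service has recovered.
import time

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise CircuitBreakerOpen("failing fast; service presumed unavailable")
            # cool-down elapsed: fall through and allow one trial request (half-open state)
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open (or re-open) the circuit
            raise
        # success: close the circuit and reset the failure count
        self.failure_count = 0
        self.opened_at = None
        return result
```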

Other considerations

  • When deciding on the retry count and interval for a policy, consider whether the operation on the service or resource is part of a long-running or multi-step operation. It can be difficult or expensive to compensate all the other steps that have already succeeded when one step fails. In this case, a long interval and a large number of retries are acceptable, as long as that strategy does not block other operations by holding or locking scarce resources.
  • Consider whether retrying the same operation could cause data inconsistency. If parts of a multi-step process are repeated and the operations are not idempotent, inconsistencies can result. For example, repeating an operation that increments a value produces an invalid result, and repeatedly sending a message to a queue can cause inconsistencies among message consumers if duplicate messages cannot be detected. To prevent this, design each step as an idempotent operation (a minimal sketch follows this list).
  • Consider the scope of the operations to be retried. For example, it may be easier to implement retry code at a level that wraps several operations, and to retry them all when any one of them fails; however, doing so may lead to idempotency problems or unnecessary rollback work.
  • If you choose a retry scope that spans multiple operations, take the total latency of all the operations into account when determining the retry intervals, when monitoring the elapsed time of the operation, and when raising alerts for failures.
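As a small illustration of idempotent steps, the sketch below de-duplicates messages by ID so that replaying the same message does not apply its effect twice; the message shape and the in-memory stores are assumptions, and a real system would persist the processed IDs durably.

```python
# A minimal sketch of an idempotent message consumer that de-duplicates on a
# message ID, so a retried or redelivered message does not change state twice.
processed_ids = set()
account_balances = {"acct-1": 100}

def handle_payment(message):
    message_id = message["id"]
    if message_id in processed_ids:
        return                          # duplicate delivery: the change was already applied
    account_balances[message["account"]] += message["amount"]
    processed_ids.add(message_id)

# Delivering the same message twice changes the balance only once.
handle_payment({"id": "m-1", "account": "acct-1", "amount": 25})
handle_payment({"id": "m-1", "account": "acct-1", "amount": 25})
assert account_balances["acct-1"] == 125
```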

When designing for resilience in cloud-native applications, always consider carefully how your retry strategy may affect neighbors and other tenants that share the same resources and services. An aggressive retry strategy can cause an increasing number of transient errors for the other users and applications sharing those resources and services.

Likewise, our applications may be affected by the retry strategies implemented by other users of the same resources and services. For mission-critical applications, we may decide to use premium, non-shared services. This gives us far more control over load, and over the corresponding throttling of resources and services, which can help justify the additional cost.

Following the above ideas and making adjustments based on specific conditions, we can successfully design a cloud-native application architecture with sufficient resilience.


Hope this article is helpful to you. If you are interested in this topic, stay tuned for the follow-up articles in this series, in which we will continue to explore implementation ideas and techniques from the perspective of the retry pattern and the circuit breaker pattern. More exciting content is on the way!


微软技术栈

The official platform for the Microsoft technology ecosystem. Empowering everyone to achieve more! Microsoft is committed to changing the world with technology and helping enterprises achieve digital transformation.