In this article, we will start with a quick introduction to Resilience4j and then dive into its Retry module. We will learn when and how to use it and what features it provides. Along the way, we will also pick up some good practices for implementing retries.
Code example
This article is accompanied by working code examples on GitHub.
What is Resilience4j?
Many things can go wrong when applications communicate over the network. Operations can time out or fail because of broken connections, network glitches, unavailability of upstream services, and so on. Applications can overload one another, become unresponsive, or even crash.
Resilience4j is a Java library that helps us build resilient and fault-tolerant applications. It provides a framework for writing code to prevent and handle such problems.
Resilience4j is written for Java 8 and above and works with constructs like functional interfaces, lambda expressions, and method references.
Resilience4j modules
Let's take a quick look at these modules and their uses:
| Module | Purpose |
|---|---|
| Retry | Automatically retry a failed remote operation |
| RateLimiter | Limit how many times we call a remote operation in a certain period |
| TimeLimiter | Set a time limit when calling a remote operation |
| Circuit Breaker | Fail fast or perform default actions when a remote operation keeps failing |
| Bulkhead | Limit the number of concurrent remote operations |
| Cache | Store the results of costly remote operations |
Usage pattern
Although each module has its own abstractions, the general usage pattern is as follows:
- Create a Resilience4j configuration object
- Create a Registry object for this type of configuration
- Create or get a Resilience4j object from the registry
- Code the remote operation as a lambda expression, a functional interface, or a regular Java method
- Use one of the provided helper methods to create a decorator or wrapper around the code from step 4
- Call the decorator method to invoke the remote operation
Steps 1-5 are usually done once at application start time. Let's walk through these steps for the retry module:
RetryConfig config = RetryConfig.ofDefaults(); // ----> 1
RetryRegistry registry = RetryRegistry.of(config); // ----> 2
Retry retry = registry.retry("flightSearchService", config); // ----> 3
FlightSearchService searchService = new FlightSearchService();
SearchRequest request = new SearchRequest("NYC", "LAX", "07/21/2020");
Supplier<List<Flight>> flightSearchSupplier =
() -> searchService.searchFlights(request); // ----> 4
Supplier<List<Flight>> retryingFlightSearch =
Retry.decorateSupplier(retry, flightSearchSupplier); // ----> 5
System.out.println(retryingFlightSearch.get()); // ----> 6
When to use retry?
A remote operation can be any request made over the network. Usually, it is one of these:
- Sending an HTTP request to a REST endpoint
- Calling a remote procedure (RPC) or a web service
- Reading data from or writing data to a data store (SQL/NoSQL databases, object storage, etc.)
- Sending messages to and receiving messages from a message broker (RabbitMQ/ActiveMQ/Kafka, etc.)
We have two options when a remote operation fails - immediately return an error to our client, or retry the operation. If the retry succeeds, it's great for the client - they never even have to know that there was a temporary problem.
Which option to choose depends on the type of error (transient or permanent), the operation (idempotent or not), the client (person or application), and the use case.
Transient errors are temporary and, generally, the operation is likely to succeed if retried. Examples are requests being throttled by an upstream service, a connection drop, or a timeout due to temporary unavailability of some service.
A hardware failure or a 404 (Not Found) response from a REST API are examples of permanent errors where retrying will not help.
If we want to apply retries, the operation must be idempotent. Suppose the remote service received and processed our request, but an issue occurred while sending the response. In that case, when we retry, we don't want the service to treat the request as a new one or return an unexpected error (think money transfer in banking).
Retrying increases the response time of the API. This may not be a problem if the client is another application such as a cron job or a daemon process. If it's a person, however, it is sometimes better to be responsive, fail quickly, and give feedback rather than make the person wait while we keep retrying.
For some critical use cases, reliability can be more important than response time, and we may need to implement retries even if the client is a person. Money transfers in banking or a travel agency booking flights and hotels for a trip are good examples - users expect reliability, not an instantaneous response for such use cases. We can be responsive by immediately notifying the user that we have accepted their request and letting them know once it is completed.
Using the Resilience4j retry module
RetryRegistry, RetryConfig, and Retry are the main abstractions in resilience4j-retry. RetryRegistry is a factory for creating and managing Retry objects. RetryConfig encapsulates configuration such as how many times retries should be attempted and how long to wait between attempts. Each Retry object is associated with a RetryConfig. Retry provides helper methods to create decorators for the functional interfaces or lambda expressions containing the remote call.
Let's see how to use the various features available in the retry module. Suppose we are building a website for an airline that allows its customers to search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.
Simple retry
In a simple retry, the operation is retried if a RuntimeException is thrown during the remote call. We can configure the number of attempts, how long to wait between attempts, and so on:
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(2, SECONDS))
.build();
// Registry, Retry creation omitted
FlightSearchService service = new FlightSearchService();
SearchRequest request = new SearchRequest("NYC", "LAX", "07/31/2020");
Supplier<List<Flight>> flightSearchSupplier =
() -> service.searchFlights(request);
Supplier<List<Flight>> retryingFlightSearch =
Retry.decorateSupplier(retry, flightSearchSupplier);
System.out.println(retryingFlightSearch.get());
We created a RetryConfig specifying that we want to retry a maximum of 3 times and wait 2 seconds between attempts. If we used the RetryConfig.ofDefaults() method instead, the default values of 3 attempts and a 500 ms wait duration would be used.
We express the flight search call as a lambda expression - a Supplier of List<Flight>. The Retry.decorateSupplier() method decorates this Supplier with retry functionality. Finally, we call get() on the decorated Supplier to make the remote call.
We would use decorateSupplier() if we wanted to create a decorator and reuse it at different places in the codebase. If we want to create it and execute it immediately, we can use the executeSupplier() instance method instead:
List<Flight> flights = retry.executeSupplier(
() -> service.searchFlights(request));
Here is sample output showing the first request failing and the second attempt succeeding:
Searching for flights; current time = 20:51:34 975
Operation failed
Searching for flights; current time = 20:51:36 985
Flight search successful
[Flight{flightNumber='XY 765', flightDate='07/31/2020', from='NYC', to='LAX'}, ...]
Retry on checked exception
Now, suppose we want to retry for both checked and unchecked exceptions. Say we are calling FlightSearchService.searchFlightsThrowingException(), which can throw a checked Exception. Since a Supplier cannot throw a checked exception, we would get a compiler error on this line:
Supplier<List<Flight>> flightSearchSupplier =
() -> service.searchFlightsThrowingException(request);
We might try to handle the Exception within the lambda expression and return Collections.emptyList(), but this doesn't look good. More importantly, since we are catching the Exception ourselves, the retry no longer works:
Supplier<List<Flight>> flightSearchSupplier = () -> {
try {
return service.searchFlightsThrowingException(request);
} catch (Exception e) {
// don't do this, this breaks the retry!
}
return Collections.emptyList();
};
So what should we do when we want to retry for all exceptions that our remote call can throw? We can use Retry.decorateCheckedSupplier() (or the executeCheckedSupplier() instance method) instead of Retry.decorateSupplier():
CheckedFunction0<List<Flight>> retryingFlightSearch =
Retry.decorateCheckedSupplier(retry,
() -> service.searchFlightsThrowingException(request));
try {
System.out.println(retryingFlightSearch.apply());
} catch (...) {
// handle exception that can occur after retries are exhausted
}
Retry.decorateCheckedSupplier() returns a CheckedFunction0, which represents a function with no arguments. Notice that we call apply() on the CheckedFunction0 object to invoke the remote operation.
If we don't want to work with Suppliers, Retry provides more helper decorator methods such as decorateFunction(), decorateCheckedFunction(), decorateRunnable(), decorateCallable(), etc. to use with other language constructs. The difference between the decorate* and decorateChecked* versions is that the decorate* versions retry on RuntimeExceptions while the decorateChecked* versions retry on Exceptions.
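As a quick illustration of one of these variants, here is a minimal sketch, assuming the same retry and service objects as in the earlier examples, that wraps the remote call in a Callable using decorateCallable():
Callable<List<Flight>> retryingSearch =
    Retry.decorateCallable(retry, () -> service.searchFlights(request));
try {
    List<Flight> flights = retryingSearch.call(); // invokes the remote operation with retries
    System.out.println(flights);
} catch (Exception e) {
    // handle the failure once retries are exhausted
}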
Conditional retry
The simple retry examples above showed how to retry when we get a RuntimeException or a checked Exception while calling a remote service. In real applications, we may not want to retry for all exceptions. For example, if we get an AuthenticationFailedException, retrying the same request will not help. When we make an HTTP call, we may want to check the HTTP response status code or look for a particular application error code in the response to decide if we should retry. Let's see how to implement such conditional retries.
Predicate-based conditional retry
Suppose the airline's flight service initializes flight data in its database regularly. This internal operation takes a few seconds for a given day's flight data. If we call the flight search for that day while the initialization is in progress, the service returns a particular error code, FS-167. The flight search documentation says this is a temporary error and that the operation can be retried after a few seconds.
Let's see how we would create the RetryConfig:
RetryConfig config = RetryConfig.<SearchResponse>custom()
.maxAttempts(3)
.waitDuration(Duration.of(3, SECONDS))
.retryOnResult(searchResponse -> searchResponse
.getErrorCode()
.equals("FS-167"))
.build();
We use the retryOnResult() method and pass a Predicate that does this check. The Predicate can be as complex as we want - it could be a check against a set of error codes, or some custom logic to decide if the search should be retried.
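For instance, a check against a whole set of retryable error codes might look like the following sketch; the extra error codes are made up for illustration:
// Hypothetical set of error codes that the flight service documents as transient
Set<String> retryableErrorCodes = Set.of("FS-167", "FS-168", "FS-169");
RetryConfig config = RetryConfig.<SearchResponse>custom()
    .maxAttempts(3)
    .waitDuration(Duration.of(3, SECONDS))
    .retryOnResult(searchResponse ->
        retryableErrorCodes.contains(searchResponse.getErrorCode()))
    .build();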
Exception-based conditional retry
Suppose we have a general exception FlightServiceBaseException that is thrown when anything unexpected happens during the interaction with the airline's flight service. As a general policy, we want to retry when this exception is thrown. But we don't want to retry on SeatsUnavailableException - if there are no seats available on the flight, retrying will not help. We can do this by creating the RetryConfig like this:
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(3, SECONDS))
.retryExceptions(FlightServiceBaseException.class)
.ignoreExceptions(SeatsUnavailableException.class)
.build();
In retryExceptions(), we specify a list of exceptions. Resilience4j will retry any exception that matches or inherits from the exceptions in this list. We put the ones we want to ignore and not retry into ignoreExceptions(). If the code throws some other exception at runtime, say an IOException, it will also not be retried.
Suppose that even for a given exception we don't want to retry in all cases. Maybe we only want to retry if the exception has a particular error code or certain text in the exception message. In that case, we can use the retryOnException method:
Predicate<Throwable> rateLimitPredicate = rle ->
(rle instanceof RateLimitExceededException) &&
"RL-101".equals(((RateLimitExceededException) rle).getErrorCode());
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(1, SECONDS))
.retryOnException(rateLimitPredicate)
.build();
As with predicate-based conditional retries, the checks within the predicate can be as complex as required.
Back-off strategy
So far, our examples have used a fixed wait time for retries. Often we want to increase the wait time after each attempt - this gives the remote service enough time to recover if it is currently overloaded. We can do this with an IntervalFunction.
IntervalFunction is a functional interface - it takes the attempt count as a parameter and returns the wait time in milliseconds.
Random interval
Here we specify a random waiting time between attempts:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofRandomized(2000))
.build();
IntervalFunction.ofRandomized() has a randomizationFactor associated with it. We can set it as the second parameter to ofRandomized(); if it is not set, a default value of 0.5 is used. The randomizationFactor determines the range over which the random wait times are spread. So for the default of 0.5 above, the generated wait times will be between 1000 ms (2000 - 2000 * 0.5) and 3000 ms (2000 + 2000 * 0.5).
Sample output of this behavior is as follows:
Searching for flights; current time = 20:27:08 729
Operation failed
Searching for flights; current time = 20:27:10 643
Operation failed
Searching for flights; current time = 20:27:13 204
Operation failed
Searching for flights; current time = 20:27:15 236
Flight search successful
[Flight{flightNumber='XY 765', flightDate='07/31/2020', from='NYC', to='LAX'},...]
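If we want to control the spread explicitly, we can pass the randomizationFactor as the second argument; a small sketch with an illustrative value of 0.75:
RetryConfig config = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofRandomized(2000, 0.75))
    .build();
// wait times would then fall between 500 ms (2000 - 2000 * 0.75) and 3500 ms (2000 + 2000 * 0.75)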
Exponential interval
For exponential backoff, we specify two values - an initial wait time and a multiplier. The wait time increases exponentially between attempts because of the multiplier. For example, with an initial wait time of 1 second and a multiplier of 2, the retries would happen after 1 s, 2 s, 4 s, 8 s, 16 s, and so on. This is the recommended approach when the client is a background job or a daemon process.
Here's how we would create the RetryConfig for exponential backoff:
RetryConfig config = RetryConfig.custom()
.maxAttempts(6)
.intervalFunction(IntervalFunction.ofExponentialBackoff(1000, 2))
.build();
Sample output of this behavior is as follows:
Searching for flights; current time = 20:37:02 684
Operation failed
Searching for flights; current time = 20:37:03 727
Operation failed
Searching for flights; current time = 20:37:05 731
Operation failed
Searching for flights; current time = 20:37:09 731
Operation failed
Searching for flights; current time = 20:37:17 731
IntervalFunction also provides an exponentialRandomBackoff() method which combines both of the above approaches. We can also provide custom implementations of IntervalFunction.
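A short sketch of both options follows; the interval and multiplier values are arbitrary, and the lambda relies on IntervalFunction being a functional interface as described above:
// Exponential backoff combined with randomization
RetryConfig randomExponentialConfig = RetryConfig.custom()
    .maxAttempts(6)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(1000, 2))
    .build();

// A custom IntervalFunction: the attempt number goes in, the wait time in milliseconds comes out
RetryConfig linearBackoffConfig = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> attempt * 1000L) // 1s, 2s, 3s, ...
    .build();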
Retry asynchronous operation
The examples we have seen so far are all synchronous calls. Let's see how to retry an asynchronous operation. Suppose we search for flights asynchronously like this:
CompletableFuture.supplyAsync(() -> service.searchFlights(request))
.thenAccept(System.out::println);
The searchFlights() call happens on a different thread, and when it returns, the resulting List<Flight> is passed to thenAccept(), which just prints it.
We can retry the above asynchronous operation using the executeCompletionStage() method on the Retry object. This method takes two parameters - a ScheduledExecutorService on which the retry will be scheduled, and a Supplier<CompletionStage> that will be decorated. It decorates and executes the CompletionStage and then returns a CompletionStage on which we can call thenAccept as before:
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Supplier<CompletionStage<List<Flight>>> completionStageSupplier =
() -> CompletableFuture.supplyAsync(() -> service.searchFlights(request));
retry.executeCompletionStage(scheduler, completionStageSupplier)
.thenAccept(System.out::println);
In a real application, we would use a shared thread pool (Executors.newScheduledThreadPool()) to schedule the retries instead of the single-threaded scheduled executor shown here.
Retry events
In all these examples, the decorator has been a black box - we don't know when an attempt fails and the framework code attempts a retry. Suppose that for a given request we want to log some details, such as the attempt count or the wait time until the next attempt. We can do this using retry events that are published at different points of execution. Retry has an EventPublisher that has methods like onRetry(), onSuccess(), and so on.
We can collect and record detailed information by implementing these listener methods:
Retry.EventPublisher publisher = retry.getEventPublisher();
publisher.onRetry(event -> System.out.println(event.toString()));
publisher.onSuccess(event -> System.out.println(event.toString()));
Similarly, RetryRegistry also has an EventPublisher which publishes events when Retry objects are added to or removed from the registry.
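A minimal sketch of listening for these registry events, assuming a registry created as in the earlier examples and that the event accessors behave as their names suggest:
RetryRegistry registry = RetryRegistry.of(RetryConfig.ofDefaults());
registry.getEventPublisher()
    .onEntryAdded(event ->
        System.out.println("Retry added: " + event.getAddedEntry().getName()));
registry.getEventPublisher()
    .onEntryRemoved(event ->
        System.out.println("Retry removed: " + event.getRemovedEntry().getName()));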
Retry metrics
Retry maintains counters to track how many times an operation:
- succeeded on the first attempt
- succeeded after retrying
- failed without retrying
- failed even after retrying
Each time the decorator is executed, it will update these counters.
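These counters can also be read programmatically from the Retry instance; a minimal sketch, assuming the Retry.Metrics getters behave as their names suggest:
Retry.Metrics metrics = retry.getMetrics();
System.out.println("Successful without retry: " + metrics.getNumberOfSuccessfulCallsWithoutRetryAttempt());
System.out.println("Successful with retry: " + metrics.getNumberOfSuccessfulCallsWithRetryAttempt());
System.out.println("Failed without retry: " + metrics.getNumberOfFailedCallsWithoutRetryAttempt());
System.out.println("Failed with retry: " + metrics.getNumberOfFailedCallsWithRetryAttempt());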
Why capture metrics?
Capturing metrics and analyzing them regularly can give us insight into the behavior of upstream services. It can also help identify bottlenecks and other potential problems.
For example, if we find that an operation usually fails on the first attempt, we can investigate the cause. If we find that our requests are being throttled or that we are getting timeouts when establishing a connection, it could indicate that the remote service needs additional resources or capacity.
How to capture metrics?
Resilience4j uses Micrometer to publish metrics. Micrometer provides a facade over instrumentation clients for monitoring systems like Prometheus, Azure Monitor, New Relic, and others. So we can publish metrics to any of these systems, or switch between them, without changing our code.
First, we create RetryConfig, RetryRegistry, and Retry as usual. Then, we create a MeterRegistry and bind the RetryRegistry to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedRetryMetrics.ofRetryRegistry(retryRegistry).bindTo(meterRegistry);
After running a few retryable operations, we display the captured metrics:
Consumer<Meter> meterConsumer = meter -> {
String desc = meter.getId().getDescription();
String metricName = meter.getId().getTag("kind");
Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
.filter(m -> m.getStatistic().name().equals("COUNT"))
.findFirst()
.map(m -> m.getValue())
.orElse(0.0);
System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);
Some sample output is as follows:
The number of successful calls without a retry attempt - successful_without_retry: 4.0
The number of failed calls without a retry attempt - failed_without_retry: 0.0
The number of failed calls after a retry attempt - failed_with_retry: 0.0
The number of successful calls after a retry attempt - successful_with_retry: 6.0
Of course, in actual applications, we will export the data to the monitoring system and view it on the dashboard.
Precautions and good practices when retrying
Services often provide client libraries or SDKs that have a built-in retry mechanism. This is especially true for cloud services. For example, Azure Cosmos DB and Azure Service Bus provide client libraries with built-in retry facilities. They allow applications to set retry policies to control the retry behavior.
In such cases, it's better to use the built-in retries rather than coding our own. If we do need to write our own, we should disable the built-in default retry policy - otherwise, it could result in nested retries where each attempt by the application causes multiple attempts by the client library.
Some cloud services document their transient error codes. Azure SQL, for example, provides a list of error codes for which it expects database clients to retry. Before deciding to add retries for a particular operation, it's good to check whether the service provider has such a list.
Another good practice is to maintain the values we use in RetryConfig - such as the maximum number of attempts, the wait time, and the retryable error codes and exceptions - as configuration outside our service. If we discover new transient errors or need to tweak the interval between attempts, we can make the change without building and redeploying the service.
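A minimal sketch of what this could look like, assuming the values come from system properties; the property names here are made up for illustration:
int maxAttempts = Integer.parseInt(System.getProperty("retry.maxAttempts", "3"));
long waitMillis = Long.parseLong(System.getProperty("retry.waitDurationMillis", "500"));
String transientErrorCode = System.getProperty("retry.transientErrorCode", "FS-167");

RetryConfig config = RetryConfig.<SearchResponse>custom()
    .maxAttempts(maxAttempts)
    .waitDuration(Duration.ofMillis(waitMillis))
    .retryOnResult(response -> transientErrorCode.equals(response.getErrorCode()))
    .build();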
Usually, when retrying, there is likely a Thread.sleep() happening somewhere in the framework code. This would be the case for synchronous retries with a wait time between retries. If our code runs in the context of a web application, this thread will most likely be the web server's request handling thread. So if we do too many retries, it would reduce the throughput of our application.
Conclusion
In this article, we learned what Resilience4j is and how we can use its retry module to make our applications resilient to temporary errors. We looked at the different ways to configure retries and some examples for deciding between the various approaches. We learned some good practices to follow when implementing retries and the importance of collecting and analyzing retry metrics.
You can play around with a complete application demonstrating these ideas using the code on GitHub.
This article is translated from: Implementing Retry with Resilience4j-Reflectoring