In this article, we will start with a quick introduction to Resilience4j and then dive into its Retry module. We will learn when and how to use it and what features it provides. Along the way, we will also pick up some good practices for implementing retries.
Code example
This article is accompanied by working code examples on GitHub.
What is Resilience4j?
Many things can go wrong when applications communicate over the network. Operations can time out or fail because of broken connections, network glitches, unavailability of upstream services, and so on. Applications can overload one another, become unresponsive, or even crash.
Resilience4j is a Java library that helps us build resilient and fault-tolerant applications. It provides a framework for writing code to prevent and handle such problems.
Resilience4j is written for Java 8 and above and works with constructs like functional interfaces, lambda expressions, and method references.
Resilience4j modules
Let's take a quick look at these modules and their uses:
| Module | Purpose |
|---|---|
| Retry | Automatically retry a failed remote operation |
| RateLimiter | Limit how many times we call a remote operation in a certain period |
| TimeLimiter | Set a time limit when calling a remote operation |
| Circuit Breaker | Fail fast or perform default actions when a remote operation keeps failing |
| Bulkhead | Limit the number of concurrent remote operations |
| Cache | Store the results of costly remote operations |
Usage pattern
Although each module has its own abstractions, the general usage pattern is as follows:
- Create a Resilience4j configuration object
- Create a Registry object for this type of configuration
- Create or get a Resilience4j object from the registry
- Code the remote operation as a lambda expression, a functional interface, or a regular Java method
- Use one of the provided helper methods to create a decorator or wrapper around the code from step 4
- Call the decorator method to invoke the remote operation
Steps 1-5 are usually done once at application start time. Let's walk through these steps for the retry module:
RetryConfig config = RetryConfig.ofDefaults(); // ----> 1
RetryRegistry registry = RetryRegistry.of(config); // ----> 2
Retry retry = registry.retry("flightSearchService", config); // ----> 3
FlightSearchService searchService = new FlightSearchService();
SearchRequest request = new SearchRequest("NYC", "LAX", "07/21/2020");
Supplier<List<Flight>> flightSearchSupplier =
() -> searchService.searchFlights(request); // ----> 4
Supplier<List<Flight>> retryingFlightSearch =
Retry.decorateSupplier(retry, flightSearchSupplier); // ----> 5
System.out.println(retryingFlightSearch.get()); // ----> 6
When to use retry?
A remote operation can be any request made over the network. Usually, it is one of these:
- Sending an HTTP request to a REST endpoint
- Calling a remote procedure (RPC) or a web service
- Reading data from or writing data to a data store (SQL/NoSQL databases, object storage, etc.)
- Sending messages to and receiving messages from a message broker (RabbitMQ/ActiveMQ/Kafka, etc.)
We have two options when a remote operation fails - immediately return an error to our client, or retry the operation. If the retry succeeds, it's great for the client - they never even have to know that there was a temporary problem.
Which option to choose depends on the type of error (transient or permanent), the operation (idempotent or not), the client (person or application), and the use case.
Transient errors are temporary and, generally, the operation is likely to succeed if retried. Examples are requests being throttled by an upstream service, a connection drop, or a timeout due to temporary unavailability of some service.
A hardware failure or a 404 (Not Found) response from a REST API are examples of permanent errors where retrying will not help.
If we want to apply retries, the operation must be idempotent. Suppose the remote service received and processed our request, but an issue occurred while sending the response. In that case, when we retry, we don't want the service to treat the request as a new one or return an unexpected error (think money transfer in banking).
Retrying increases the response time of the API. This may not be a problem if the client is another application such as a cron job or a daemon process. If it's a person, however, it is sometimes better to be responsive, fail quickly, and give feedback rather than make the person wait while we keep retrying.
For some critical use cases, reliability can be more important than response time, and we may need to implement retries even if the client is a person. Money transfers in banking or a travel agency booking flights and hotels for a trip are good examples - users expect reliability, not an instantaneous response for such use cases. We can be responsive by immediately notifying the user that we have accepted their request and letting them know once it is completed.
Using the Resilience4j retry module
RetryRegistry, RetryConfig, and Retry are the main abstractions in resilience4j-retry. RetryRegistry is a factory for creating and managing Retry objects. RetryConfig encapsulates configuration such as how many times retries should be attempted and how long to wait between attempts. Each Retry object is associated with a RetryConfig. Retry provides helper methods to create decorators for the functional interfaces or lambda expressions containing the remote call.
Let's see how to use the various features available in the retry module. Suppose we are building a website for an airline that allows its customers to search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.
Simple retry
In a simple retry, the operation is retried if a RuntimeException is thrown during the remote call. We can configure the number of attempts, how long to wait between attempts, and so on:
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(2, SECONDS))
.build();
// Registry, Retry creation omitted
FlightSearchService service = new FlightSearchService();
SearchRequest request = new SearchRequest("NYC", "LAX", "07/31/2020");
Supplier<List<Flight>> flightSearchSupplier =
() -> service.searchFlights(request);
Supplier<List<Flight>> retryingFlightSearch =
Retry.decorateSupplier(retry, flightSearchSupplier);
System.out.println(retryingFlightSearch.get());
We created a RetryConfig specifying that we want to retry a maximum of 3 times and wait 2 seconds between attempts. If we used the RetryConfig.ofDefaults() method instead, the default values of 3 attempts and a 500 ms wait duration would be used.
We express the flight search call as a lambda expression - a Supplier of List<Flight>. The Retry.decorateSupplier() method decorates this Supplier with retry functionality. Finally, we call get() on the decorated Supplier to make the remote call.
We would use decorateSupplier() if we wanted to create a decorator and reuse it at different places in the codebase. If we want to create it and execute it immediately, we can use the executeSupplier() instance method instead:
List<Flight> flights = retry.executeSupplier(
() -> service.searchFlights(request));
Here is sample output showing the first request failing and the second attempt succeeding:
Searching for flights; current time = 20:51:34 975
Operation failed
Searching for flights; current time = 20:51:36 985
Flight search successful
[Flight{flightNumber='XY 765', flightDate='07/31/2020', from='NYC', to='LAX'}, ...]
Retry on checked exception
Now, suppose we want to retry for both checked and unchecked exceptions. Say we are calling FlightSearchService.searchFlightsThrowingException(), which can throw a checked Exception. Since a Supplier cannot throw a checked exception, we would get a compiler error on this line:
Supplier<List<Flight>> flightSearchSupplier =
() -> service.searchFlightsThrowingException(request);
We might try to handle the Exception within the lambda expression and return Collections.emptyList(), but this doesn't look good. More importantly, since we are catching the Exception ourselves, the retry no longer works:
Supplier<List<Flight>> flightSearchSupplier = () -> {
try {
return service.searchFlightsThrowingException(request);
} catch (Exception e) {
// don't do this, this breaks the retry!
}
return Collections.emptyList();
};
So what should we do when we want to retry for all exceptions that our remote call can throw? We can use Retry.decorateCheckedSupplier() (or the executeCheckedSupplier() instance method) instead of Retry.decorateSupplier():
CheckedFunction0<List<Flight>> retryingFlightSearch =
Retry.decorateCheckedSupplier(retry,
() -> service.searchFlightsThrowingException(request));
try {
System.out.println(retryingFlightSearch.apply());
} catch (...) {
// handle exception that can occur after retries are exhausted
}
Retry.decorateCheckedSupplier() returns a CheckedFunction0, which represents a function with no arguments. Notice that we call apply() on the CheckedFunction0 object to invoke the remote operation.
If we don't want to work with Suppliers, Retry provides more helper decorator methods such as decorateFunction(), decorateCheckedFunction(), decorateRunnable(), decorateCallable(), etc. to use with other language constructs. The difference between the decorate* and decorateChecked* versions is that the decorate* versions retry on RuntimeExceptions while the decorateChecked* versions retry on Exceptions.
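As a quick illustration of one of these variants, here is a minimal sketch, assuming the same retry and service objects as in the earlier examples, that wraps the remote call in a Callable using decorateCallable():
Callable<List<Flight>> retryingSearch =
    Retry.decorateCallable(retry, () -> service.searchFlights(request));
try {
    List<Flight> flights = retryingSearch.call(); // invokes the remote operation with retries
    System.out.println(flights);
} catch (Exception e) {
    // handle the failure once retries are exhausted
}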
Conditional retry
The simple retry examples above showed how to retry when we get a RuntimeException or a checked Exception while calling a remote service. In real applications, we may not want to retry for all exceptions. For example, if we get an AuthenticationFailedException, retrying the same request will not help. When we make an HTTP call, we may want to check the HTTP response status code or look for a particular application error code in the response to decide if we should retry. Let's see how to implement such conditional retries.
Predicate-based conditional retry
Suppose the airline's flight service initializes flight data in its database regularly. This internal operation takes a few seconds for a given day's flight data. If we call the flight search for that day while the initialization is in progress, the service returns a particular error code, FS-167. The flight search documentation says this is a temporary error and that the operation can be retried after a few seconds.
Let's see how we would create the RetryConfig:
RetryConfig config = RetryConfig.<SearchResponse>custom()
.maxAttempts(3)
.waitDuration(Duration.of(3, SECONDS))
.retryOnResult(searchResponse -> searchResponse
.getErrorCode()
.equals("FS-167"))
.build();
We use the retryOnResult() method and pass a Predicate that does this check. The Predicate can be as complex as we want - it could be a check against a set of error codes, or some custom logic to decide if the search should be retried.
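For instance, a check against a whole set of retryable error codes might look like the following sketch; the extra error codes are made up for illustration:
// Hypothetical set of error codes that the flight service documents as transient
Set<String> retryableErrorCodes = Set.of("FS-167", "FS-168", "FS-169");
RetryConfig config = RetryConfig.<SearchResponse>custom()
    .maxAttempts(3)
    .waitDuration(Duration.of(3, SECONDS))
    .retryOnResult(searchResponse ->
        retryableErrorCodes.contains(searchResponse.getErrorCode()))
    .build();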
Exception-based conditional retry
Suppose we have a general exception FlightServiceBaseException that is thrown when anything unexpected happens during the interaction with the airline's flight service. As a general policy, we want to retry when this exception is thrown. But we don't want to retry on SeatsUnavailableException - if there are no seats available on the flight, retrying will not help. We can do this by creating the RetryConfig like this:
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(3, SECONDS))
.retryExceptions(FlightServiceBaseException.class)
.ignoreExceptions(SeatsUnavailableException.class)
.build();
In retryExceptions(), we specify a list of exceptions. Resilience4j will retry any exception that matches or inherits from the exceptions in this list. We put the ones we want to ignore and not retry into ignoreExceptions(). If the code throws some other exception at runtime, say an IOException, it will also not be retried.
Suppose that even for a given exception we don't want to retry in all cases. Maybe we only want to retry if the exception has a particular error code or certain text in the exception message. In that case, we can use the retryOnException method:
Predicate<Throwable> rateLimitPredicate = rle ->
(rle instanceof RateLimitExceededException) &&
"RL-101".equals(((RateLimitExceededException) rle).getErrorCode());
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.of(1, SECONDS))
.retryOnException(rateLimitPredicate)
.build();
As with predicate-based conditional retries, the checks within the predicate can be as complex as required.
Back-off strategy
So far, our examples have used a fixed wait time for retries. Often we want to increase the wait time after each attempt - this gives the remote service enough time to recover if it is currently overloaded. We can do this with an IntervalFunction.
IntervalFunction is a functional interface - it takes the attempt count as a parameter and returns the wait time in milliseconds.
Random interval
Here we specify a random waiting time between attempts:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofRandomized(2000))
.build();
IntervalFunction.ofRandomized() has a randomizationFactor associated with it. We can set it as the second parameter to ofRandomized(); if it is not set, a default value of 0.5 is used. The randomizationFactor determines the range over which the random wait times are spread. So for the default of 0.5 above, the generated wait times will be between 1000 ms (2000 - 2000 * 0.5) and 3000 ms (2000 + 2000 * 0.5).
Sample output of this behavior is as follows:
Searching for flights; current time = 20:27:08 729
Operation failed
Searching for flights; current time = 20:27:10 643
Operation failed
Searching for flights; current time = 20:27:13 204
Operation failed
Searching for flights; current time = 20:27:15 236
Flight search successful
[Flight{flightNumber='XY 765', flightDate='07/31/2020', from='NYC', to='LAX'},...]
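If we want to control the spread explicitly, we can pass the randomizationFactor as the second argument; a small sketch with an illustrative value of 0.75:
RetryConfig config = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofRandomized(2000, 0.75))
    .build();
// wait times would then fall between 500 ms (2000 - 2000 * 0.75) and 3500 ms (2000 + 2000 * 0.75)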
Exponential interval
For exponential backoff, we specify two values - an initial wait time and a multiplier. The wait time increases exponentially between attempts because of the multiplier. For example, with an initial wait time of 1 second and a multiplier of 2, the retries would happen after 1 s, 2 s, 4 s, 8 s, 16 s, and so on. This is the recommended approach when the client is a background job or a daemon process.
Here's how we would create the RetryConfig for exponential backoff:
RetryConfig config = RetryConfig.custom()
.maxAttempts(6)
.intervalFunction(IntervalFunction.ofExponentialBackoff(1000, 2))
.build();
Sample output of this behavior is as follows:
Searching for flights; current time = 20:37:02 684
Operation failed
Searching for flights; current time = 20:37:03 727
Operation failed
Searching for flights; current time = 20:37:05 731
Operation failed
Searching for flights; current time = 20:37:09 731
Operation failed
Searching for flights; current time = 20:37:17 731
IntervalFunction also provides an exponentialRandomBackoff() method which combines both of the above approaches. We can also provide custom implementations of IntervalFunction.
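A short sketch of both options follows; the interval and multiplier values are arbitrary, and the lambda relies on IntervalFunction being a functional interface as described above:
// Exponential backoff combined with randomization
RetryConfig randomExponentialConfig = RetryConfig.custom()
    .maxAttempts(6)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(1000, 2))
    .build();

// A custom IntervalFunction: the attempt number goes in, the wait time in milliseconds comes out
RetryConfig linearBackoffConfig = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> attempt * 1000L) // 1s, 2s, 3s, ...
    .build();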
Retry asynchronous operation
The examples we have seen so far are all synchronous calls. Let's see how to retry an asynchronous operation. Suppose we search for flights asynchronously like this:
CompletableFuture.supplyAsync(() -> service.searchFlights(request))
.thenAccept(System.out::println);
The searchFlights() call happens on a different thread, and when it returns, the resulting List<Flight> is passed to thenAccept(), which just prints it.
We can retry the above asynchronous operation using the executeCompletionStage() method on the Retry object. This method takes two parameters - a ScheduledExecutorService on which the retry will be scheduled, and a Supplier<CompletionStage> that will be decorated. It decorates and executes the CompletionStage and then returns a CompletionStage on which we can call thenAccept as before:
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Supplier<CompletionStage<List<Flight>>> completionStageSupplier =
() -> CompletableFuture.supplyAsync(() -> service.searchFlights(request));
retry.executeCompletionStage(scheduler, completionStageSupplier)
.thenAccept(System.out::println);
In a real application, we would use a shared thread pool (Executors.newScheduledThreadPool()) to schedule the retries instead of the single-threaded scheduled executor shown here.
Retry events
In all these examples, the decorator has been a black box - we don't know when an attempt fails and the framework code attempts a retry. Suppose that for a given request we want to log some details, such as the attempt count or the wait time until the next attempt. We can do this using retry events that are published at different points of execution. Retry has an EventPublisher that has methods like onRetry(), onSuccess(), and so on.
We can collect and record detailed information by implementing these listener methods:
Retry.EventPublisher publisher = retry.getEventPublisher();
publisher.onRetry(event -> System.out.println(event.toString()));
publisher.onSuccess(event -> System.out.println(event.toString()));
Similarly, RetryRegistry also has an EventPublisher which publishes events when Retry objects are added to or removed from the registry.
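A minimal sketch of listening for these registry events, assuming a registry created as in the earlier examples and that the event accessors behave as their names suggest:
RetryRegistry registry = RetryRegistry.of(RetryConfig.ofDefaults());
registry.getEventPublisher()
    .onEntryAdded(event ->
        System.out.println("Retry added: " + event.getAddedEntry().getName()));
registry.getEventPublisher()
    .onEntryRemoved(event ->
        System.out.println("Retry removed: " + event.getRemovedEntry().getName()));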
Retry metrics
Retry maintains counters to track how many times an operation:
- succeeded on the first attempt
- succeeded after retrying
- failed without retrying
- failed even after retrying
Each time the decorator is executed, it will update these counters.
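These counters can also be read programmatically from the Retry instance; a minimal sketch, assuming the Retry.Metrics getters behave as their names suggest:
Retry.Metrics metrics = retry.getMetrics();
System.out.println("Successful without retry: " + metrics.getNumberOfSuccessfulCallsWithoutRetryAttempt());
System.out.println("Successful with retry: " + metrics.getNumberOfSuccessfulCallsWithRetryAttempt());
System.out.println("Failed without retry: " + metrics.getNumberOfFailedCallsWithoutRetryAttempt());
System.out.println("Failed with retry: " + metrics.getNumberOfFailedCallsWithRetryAttempt());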
Why capture metrics?
Capturing metrics and analyzing them regularly can give us insight into the behavior of upstream services. It can also help identify bottlenecks and other potential problems.
For example, if we find that an operation usually fails on the first attempt, we can investigate the cause. If we find that our requests are being throttled or that we are getting timeouts when establishing a connection, it could indicate that the remote service needs additional resources or capacity.
How to capture metrics?
Resilience4j uses Micrometer to publish metrics. Micrometer provides a facade over instrumentation clients for monitoring systems like Prometheus, Azure Monitor, New Relic, and others. So we can publish metrics to any of these systems, or switch between them, without changing our code.
First, we create RetryConfig, RetryRegistry, and Retry as usual. Then, we create a MeterRegistry and bind the RetryRegistry to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedRetryMetrics.ofRetryRegistry(retryRegistry).bindTo(meterRegistry);
After running a few retryable operations, we display the captured metrics:
Consumer<Meter> meterConsumer = meter -> {
String desc = meter.getId().getDescription();
String metricName = meter.getId().getTag("kind");
Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
.filter(m -> m.getStatistic().name().equals("COUNT"))
.findFirst()
.map(m -> m.getValue())
.orElse(0.0);
System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);
Some sample output is as follows:
The number of successful calls without a retry attempt - successful_without_retry: 4.0
The number of failed calls without a retry attempt - failed_without_retry: 0.0
The number of failed calls after a retry attempt - failed_with_retry: 0.0
The number of successful calls after a retry attempt - successful_with_retry: 6.0
Of course, in actual applications, we will export the data to the monitoring system and view it on the dashboard.
Precautions and good practices when retrying
Services often provide client libraries or SDKs that have a built-in retry mechanism. This is especially true for cloud services. For example, Azure Cosmos DB and Azure Service Bus provide client libraries with built-in retry facilities. They allow applications to set retry policies to control the retry behavior.
In such cases, it's better to use the built-in retries rather than coding our own. If we do need to write our own, we should disable the built-in default retry policy - otherwise, it could result in nested retries where each attempt by the application causes multiple attempts by the client library.
Some cloud services document their transient error codes. Azure SQL, for example, provides a list of error codes for which it expects database clients to retry. Before deciding to add retries for a particular operation, it's good to check whether the service provider has such a list.
Another good practice is to maintain the values we use in RetryConfig - such as the maximum number of attempts, the wait time, and the retryable error codes and exceptions - as configuration outside our service. If we discover new transient errors or need to tweak the interval between attempts, we can make the change without building and redeploying the service.
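A minimal sketch of what this could look like, assuming the values come from system properties; the property names here are made up for illustration:
int maxAttempts = Integer.parseInt(System.getProperty("retry.maxAttempts", "3"));
long waitMillis = Long.parseLong(System.getProperty("retry.waitDurationMillis", "500"));
String transientErrorCode = System.getProperty("retry.transientErrorCode", "FS-167");

RetryConfig config = RetryConfig.<SearchResponse>custom()
    .maxAttempts(maxAttempts)
    .waitDuration(Duration.ofMillis(waitMillis))
    .retryOnResult(response -> transientErrorCode.equals(response.getErrorCode()))
    .build();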
Usually, when retrying, there is likely a Thread.sleep() happening somewhere in the framework code. This would be the case for synchronous retries with a wait time between retries. If our code runs in the context of a web application, this thread will most likely be the web server's request handling thread. So if we do too many retries, it would reduce the throughput of our application.
Conclusion
In this article, we learned what Resilience4j is and how we can use its retry module to make our applications resilient to temporary errors. We looked at the different ways to configure retries and some examples for deciding between the various approaches. We learned some good practices to follow when implementing retries and the importance of collecting and analyzing retry metrics.
You can play around with a complete application demonstrating these ideas using the code on GitHub.
This article is translated from: Implementing Retry with Resilience4j-Reflectoring