Implementing Fault Isolation with the Resilience4j Bulkhead Module in Java Projects

So far in this series, we have learned about Resilience4j and its [Retry](https://icodewalker.com/blog/261/), [RateLimiter](https://icodewalker.com/blog/288/) and [TimeLimiter](https://icodewalker.com/blog/302/) modules. In this article, we will explore the Bulkhead module. We will find out what problem it solves, when and how to use it, and also look at a few examples.

Code example

This article is accompanied by a working code example [on GitHub](https://github.com/thombergs/code-examples/tree/master/resilience4j/bulkhead).

What is Resilience4j?

For a quick introduction to how Resilience4j works in general, please refer to the description in the previous article: [General Working Principle of Resilience4j](https://icodewalker.com/blog/261/#what-is-resilience4j).

What is fault isolation?

A few years ago, we ran into a production issue where one of the servers stopped responding to the health check, and the load balancer took the server out of the pool.

Just as we started investigating this issue, there was a second alert—another server had stopped responding to health checks and was also taken out of the pool.

Within a few minutes, every server had stopped responding to health checks, and our service was completely down.

We were using Redis to cache some data for a couple of features supported by the application. As we later discovered, the Redis cluster had run into problems at the same time and had stopped accepting new connections. We were using the Jedis library to connect to Redis, and the default behavior of that library was to block the calling thread indefinitely until a connection was established.

Our service was hosted on Tomcat, whose default request-handling thread pool size is 200 threads. So every request that went through a code path that connected to Redis ended up blocking a thread indefinitely.

Within minutes, all 2000 threads across the cluster were blocked indefinitely - there were no idle threads left even to respond to the load balancer's health checks.

The service itself supported multiple features, and not all of them required access to the Redis cache. But when this one area had a problem, it ended up affecting the entire service.

This is exactly the problem that fault isolation solves - it prevents a problem in one area of the service from affecting the entire service.

While what happened to our service was an extreme example, we can see how a slow upstream dependency can affect an unrelated area of the calling service.

If we had set a limit of 20 concurrent requests to Redis on each of the server instances, only those threads would have been affected when the Redis connectivity issue occurred. The remaining request-handling threads could have continued serving other requests.

The idea behind fault isolation is to set a limit on the number of concurrent calls we make to a remote service. We treat calls to different remote services as different, isolated pools and set a limit on how many calls can be made concurrently in each.

The term bulkhead itself comes from its use in ships, where a ship's hull is divided into separate watertight compartments. If there is a crack and water starts flowing in, only that compartment fills with water. This prevents the entire ship from sinking.
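The limit described above can be sketched with a plain java.util.concurrent.Semaphore, which is essentially the mechanism Resilience4j's semaphore-based bulkhead uses internally. The class and method names below are illustrative, not part of any library:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch: at most maxConcurrentCalls callers may run the
// guarded code at once; the rest wait briefly and then fail fast.
class SimpleBulkhead {
  private final Semaphore permits;
  private final long maxWaitMillis;

  SimpleBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
    this.permits = new Semaphore(maxConcurrentCalls, true); // fair = FIFO handling
    this.maxWaitMillis = maxWaitMillis;
  }

  <T> T execute(Supplier<T> supplier) {
    boolean acquired;
    try {
      acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      acquired = false;
    }
    if (!acquired) {
      // corresponds to Resilience4j's BulkheadFullException
      throw new IllegalStateException("Bulkhead full - call rejected");
    }
    try {
      return supplier.get();
    } finally {
      permits.release(); // free the permit for waiting callers
    }
  }
}
```

If the Redis-backed code path from the story above had gone through something like execute(), at most maxConcurrentCalls request threads could ever have been blocked on Redis at once.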

The Resilience4j bulkhead concept

resilience4j-bulkhead works like the other Resilience4j modules. We provide it the code we want to execute as a functional construct - a lambda expression that makes a remote call, or a Supplier of some value retrieved from a remote service, etc. - and the bulkhead decorates it with code to control the number of concurrent calls.

Resilience4j provides two types of bulkheads - SemaphoreBulkhead and ThreadPoolBulkhead.

SemaphoreBulkhead internally uses java.util.concurrent.Semaphore to control the number of concurrent calls and executes our code on the current thread.

ThreadPoolBulkhead uses a thread from a thread pool to execute our code. It internally uses a java.util.concurrent.ArrayBlockingQueue and a java.util.concurrent.ThreadPoolExecutor to control the number of concurrent calls.

SemaphoreBulkhead

Let us look at the configurations associated with the semaphore bulkhead and what they mean.

maxConcurrentCalls determines the maximum number of concurrent calls we can make to the remote service. We can think of this value as the number of permits the semaphore is initialized with.

Any thread that attempts to call the remote service once this limit is reached can either get a BulkheadFullException immediately or wait for some time for a permit to be released by another thread. This is determined by the maxWaitDuration value.

When there are multiple threads waiting for permits, the fairCallHandlingEnabled configuration determines whether the waiting threads acquire permits in first-in, first-out order.

Finally, the writableStackTraceEnabled configuration lets us reduce the amount of information in the stack trace when a BulkheadFullException occurs. This is useful because, without it, our logs could get filled with many similar stack traces when the exception occurs repeatedly. Usually, when reading logs, it is enough to know that a BulkheadFullException occurred.

ThreadPoolBulkhead

coreThreadPoolSize, maxThreadPoolSize, keepAliveDuration and queueCapacity are the main configurations associated with ThreadPoolBulkhead. ThreadPoolBulkhead internally uses these configurations to construct a ThreadPoolExecutor.

The internal ThreadPoolExecutor executes incoming tasks using one of the available, free threads. If no thread is free to execute an incoming task, the task is queued for execution later, when a thread becomes available. If the queueCapacity has been reached, the remote call is rejected with a BulkheadFullException.

ThreadPoolBulkhead also has the writableStackTraceEnabled configuration to control the amount of information in the stack trace of a BulkheadFullException.
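The queuing and rejection behavior described above can be sketched with a plain ThreadPoolExecutor. This is an illustration of the mechanism, not Resilience4j's actual internal code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A bounded pool plus a bounded queue: once both are full, new tasks
// are rejected - the same shape of behavior as BulkheadFullException.
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    1,                                  // coreThreadPoolSize
    2,                                  // maxThreadPoolSize
    20, TimeUnit.MILLISECONDS,          // keepAliveDuration for extra threads
    new ArrayBlockingQueue<>(1));       // queueCapacity

CountDownLatch release = new CountDownLatch(1);
Runnable blocker = () -> {
  try { release.await(); } catch (InterruptedException e) { }
};

executor.execute(blocker); // runs on the core thread
executor.execute(blocker); // sits in the queue (capacity 1)
executor.execute(blocker); // queue full, so a second thread is created
boolean rejected = false;
try {
  executor.execute(blocker); // pool and queue both full -> rejected
} catch (RejectedExecutionException e) {
  rejected = true;
  System.out.println("call rejected: pool saturated");
}
release.countDown();
executor.shutdown();
```

With one core thread, a maximum of two threads and a queue of one, the fourth concurrent task finds both the pool and the queue saturated and is rejected immediately.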

Use Resilience4j bulkhead module

Let's see how to use the various features available in the resilience4j-bulkhead module.

We will use the same example as in the previous articles in this series. Suppose we are building a website for an airline that lets its customers search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.

SemaphoreBulkhead

When using the semaphore-based bulkhead, BulkheadRegistry, BulkheadConfig and Bulkhead are the main abstractions we work with.

BulkheadRegistry is a factory for creating and managing Bulkhead objects.

BulkheadConfig encapsulates the maxConcurrentCalls , maxWaitDuration , writableStackTraceEnabled and fairCallHandlingEnabled configurations. Each Bulkhead object is associated with a BulkheadConfig .

The first step is to create a BulkheadConfig :

BulkheadConfig config = BulkheadConfig.ofDefaults();

This creates a BulkheadConfig with default values for maxConcurrentCalls (25), maxWaitDuration (0s), writableStackTraceEnabled (true) and fairCallHandlingEnabled (true).

Suppose we want to limit the number of concurrent calls to 2, and that we are willing to wait up to 2 seconds for a thread to acquire a permit:

BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(2))
  .build();

Then we create a Bulkhead :

BulkheadRegistry registry = BulkheadRegistry.of(config);

Bulkhead bulkhead = registry.bulkhead("flightSearchService");

Now let's express our code - to run a flight search - as a Supplier and decorate it with the bulkhead:

Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<List<Flight>> decoratedFlightsSupplier =
  Bulkhead.decorateSupplier(bulkhead, flightsSupplier);

Finally, let's call the decorated operation a few times to understand how the bulkhead works. We can use CompletableFuture to simulate concurrent flight search requests from users:

for (int i=0; i<4; i++) {
  CompletableFuture
    .supplyAsync(decoratedFlightsSupplier)
    .thenAccept(flights -> System.out.println("Received results"));
}

The timestamp and thread name in the output show that among the 4 concurrent requests, the first two requests passed immediately:

Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-5
Flight search successful at 11:42:13 226
Flight search successful at 11:42:13 226
Received results
Received results
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-9
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-7
Flight search successful at 11:42:14 239
Flight search successful at 11:42:14 239
Received results
Received results

The third and fourth requests were granted permission only after 1 second, after the previous request was completed.

If a thread cannot acquire a permit within the maxWaitDuration, a BulkheadFullException is thrown:

Caused by: io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
    at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:49)
    at io.github.resilience4j.bulkhead.internal.SemaphoreBulkhead.acquirePermission(SemaphoreBulkhead.java:164)
    at io.github.resilience4j.bulkhead.Bulkhead.lambda$decorateSupplier$5(Bulkhead.java:194)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    ... 6 more

Apart from the first line, the other lines of the stack trace do not add much value. If the BulkheadFullException occurs multiple times, these stack traces would repeat over and over in our log files.

We can reduce the amount of information generated in the stack trace by setting the writableStackTraceEnabled configuration to false:

BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(1))
  .writableStackTraceEnabled(false)
  .build();

Now, when BulkheadFullException occurs, there is only one line in the stack trace:

Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results

Similar to the other Resilience4j modules we have seen, Bulkhead also provides additional methods such as decorateCheckedSupplier(), decorateCompletionStage(), decorateRunnable(), decorateConsumer() etc., so we can provide our code in constructs other than a Supplier.

ThreadPoolBulkhead

When using the thread pool-based bulkhead, ThreadPoolBulkheadRegistry, ThreadPoolBulkheadConfig and ThreadPoolBulkhead are the main abstractions we work with.

ThreadPoolBulkheadRegistry is a factory for creating and managing ThreadPoolBulkhead objects.

ThreadPoolBulkheadConfig encapsulates the coreThreadPoolSize , maxThreadPoolSize , keepAliveDuration and queueCapacity configurations. Each ThreadPoolBulkhead object is associated with a ThreadPoolBulkheadConfig.

The first step is to create a ThreadPoolBulkheadConfig :

ThreadPoolBulkheadConfig config =
  ThreadPoolBulkheadConfig.ofDefaults();

This creates a ThreadPoolBulkheadConfig with default values for coreThreadPoolSize (number of available processors - 1), maxThreadPoolSize (maximum number of available processors), keepAliveDuration (20ms) and queueCapacity (100).

Suppose we want to limit the number of concurrent calls to 2:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .build();

Then we create a ThreadPoolBulkhead :

ThreadPoolBulkheadRegistry registry = ThreadPoolBulkheadRegistry.of(config);
ThreadPoolBulkhead bulkhead = registry.bulkhead("flightSearchService");

Now let's express our code - to run a flight search - as a Supplier and decorate it with the bulkhead:

Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<CompletionStage<List<Flight>>> decoratedFlightsSupplier =
  ThreadPoolBulkhead.decorateSupplier(bulkhead, flightsSupplier);

Unlike Bulkhead.decorateSupplier(), which returned a Supplier<List<Flight>>, ThreadPoolBulkhead.decorateSupplier() returns a Supplier<CompletionStage<List<Flight>>>. This is because ThreadPoolBulkhead does not execute the code synchronously on the current thread.

Finally, let's call the decorated operation a few times to understand how the bulkhead works:

for (int i=0; i<3; i++) {
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      if (r != null) {
        System.out.println("Received results");
      }
      if (t != null) {
        t.printStackTrace();
      }
    });
}

The timestamps and thread names in the output show that while the first two requests executed immediately, the third request was queued and later executed by one of the freed-up threads:

Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-1
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:00 136
Flight search successful at 16:15:00 135
Received results
Received results
Searching for flights; current time = 16:15:01 151; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:01 151
Received results

If there are no free threads and no capacity left in the queue, a BulkheadFullException is thrown:

Exception in thread "main" io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
 at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:64)
 at io.github.resilience4j.bulkhead.internal.FixedThreadPoolBulkhead.submit(FixedThreadPoolBulkhead.java:157)
... other lines omitted ...

We can use the writableStackTraceEnabled configuration to reduce the amount of information generated in the stack trace:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .writableStackTraceEnabled(false)
  .build();

Now, when BulkheadFullException occurs, there is only one line in the stack trace:

Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results

Context propagation

Sometimes we store data in ThreadLocal variables and read it in different areas of the code. We do this to avoid explicitly passing data as parameters between method chains, especially when the value is not directly related to the core business logic we are implementing.

For example, we may want to record the current user ID or transaction ID or a certain request tracking ID in each log statement to make it easier to search the log. For such scenarios, using ThreadLocal is a useful technique.

When using ThreadPoolBulkhead , since our code is not executed on the current thread, the data we store in the ThreadLocal variable will not be available in other threads.

Let us look at an example to understand this problem. First we define a RequestTrackingIdHolder class, a wrapper class ThreadLocal

class RequestTrackingIdHolder {
  static ThreadLocal<String> threadLocal = new ThreadLocal<>();


  static String getRequestTrackingId() {
    return threadLocal.get();
  }


  static void setRequestTrackingId(String id) {
    if (threadLocal.get() != null) {
      threadLocal.set(null);
      threadLocal.remove();
    }
    threadLocal.set(id);
  }


  static void clear() {
    threadLocal.set(null);
    threadLocal.remove();
  }
}

The static methods make it easy to set and get the value of the ThreadLocal. Next, we set a request tracking ID before calling the bulkhead-decorated flight search operation:

for (int i=0; i<2; i++) {
  String trackingId = UUID.randomUUID().toString();
  System.out.println("Setting trackingId " + trackingId + " on parent, main thread before calling flight search");
  RequestTrackingIdHolder.setRequestTrackingId(trackingId);
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
        // other lines omitted
    });
}

The sample output shows that this value is not available in the bulkhead-managed thread:

Setting trackingId 98ff99df-466a-47f7-88f7-5e31fc8fcb6b on parent, main thread before calling flight search
Setting trackingId 6b98d73c-a590-4a20-b19d-c85fea783caf on parent, main thread before calling flight search
Searching for flights; current time = 19:53:53 799; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:53 824
Received results
Searching for flights; current time = 19:53:54 836; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:54 836
Received results

To solve this problem, ThreadPoolBulkhead provides a ContextPropagator. ContextPropagator is an abstraction for retrieving, copying and cleaning up values across thread boundaries. It defines an interface with methods to get values from the current thread (retrieve()), copy them to the new thread of execution (copy()) and finally clean up on the executing thread (clear()).

Let's implement a RequestTrackingIdPropagator:

class RequestTrackingIdPropagator implements ContextPropagator<String> {
  @Override
  public Supplier<Optional<String>> retrieve() {
    System.out.println("Getting request tracking id from thread: " + Thread.currentThread().getName());
    return () -> Optional.ofNullable(RequestTrackingIdHolder.getRequestTrackingId());
  }

  @Override
  public Consumer<Optional<String>> copy() {
    return optional -> {
      System.out.println("Setting request tracking id " + optional.orElse("null") + " on thread: " + Thread.currentThread().getName());
      optional.ifPresent(RequestTrackingIdHolder::setRequestTrackingId);
    };
  }

  @Override
  public Consumer<Optional<String>> clear() {
    return optional -> {
      System.out.println("Clearing request tracking id on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.clear());
    };
  }
}

We provide the ContextPropagator to the ThreadPoolBulkhead by setting it on the ThreadPoolBulkheadConfig:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .contextPropagator(new RequestTrackingIdPropagator())
  .build();

Now, the sample output shows that the request tracking ID is available in the bulkhead-managed thread:

Setting trackingId 71d44cb8-dab6-4222-8945-e7fd023528ba on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting trackingId 5f9dd084-f2cb-4a20-804b-038828abc161 on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting request tracking id 71d44cb8-dab6-4222-8945-e7fd023528ba on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:56 508; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 71d44cb8-dab6-4222-8945-e7fd023528ba
Flight search successful at 20:07:56 538
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Setting request tracking id 5f9dd084-f2cb-4a20-804b-038828abc161 on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:57 542; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 5f9dd084-f2cb-4a20-804b-038828abc161
Flight search successful at 20:07:57 542
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results

Bulkhead events

Both Bulkhead and ThreadPoolBulkhead have an EventPublisher which generates events of the following types:

  • BulkheadOnCallPermittedEvent
  • BulkheadOnCallRejectedEvent and
  • BulkheadOnCallFinishedEvent

We can listen to these events and record them, for example:

Bulkhead bulkhead = registry.bulkhead("flightSearchService");
bulkhead.getEventPublisher().onCallPermitted(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallFinished(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallRejected(e -> System.out.println(e.toString()));

The sample output shows what was logged:

2020-08-26T12:27:39.790435: Bulkhead 'flightSearch' permitted a call.
... other lines omitted ...
2020-08-26T12:27:40.290987: Bulkhead 'flightSearch' rejected a call.
... other lines omitted ...
2020-08-26T12:27:41.094866: Bulkhead 'flightSearch' has finished a call.

Bulkhead metrics

SemaphoreBulkhead

Bulkhead exposes two metrics:

  • The maximum number of available permissions ( resilience4j.bulkhead.max.allowed.concurrent.calls ), and
  • The number of available permissions ( resilience4j.bulkhead.available.concurrent.calls ).

The resilience4j.bulkhead.max.allowed.concurrent.calls metric is the same as the maxConcurrentCalls we configured on the BulkheadConfig.

First, we create BulkheadConfig, BulkheadRegistry and Bulkhead as before. Then, we create a MeterRegistry and bind the BulkheadRegistry to it:

MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedBulkheadMetrics.ofBulkheadRegistry(registry)
  .bindTo(meterRegistry);

After running the bulkhead-decorated operation a few times, we display the captured metrics:

Consumer<Meter> meterConsumer = meter -> {
  String desc = meter.getId().getDescription();
  String metricName = meter.getId().getName();
  Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
    .filter(m -> m.getStatistic().name().equals("VALUE"))
    .findFirst()
    .map(m -> m.getValue())
    .orElse(0.0);
  System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);

This is some sample output:

The maximum number of available permissions - resilience4j.bulkhead.max.allowed.concurrent.calls: 8.0
The number of available permissions - resilience4j.bulkhead.available.concurrent.calls: 3.0

ThreadPoolBulkhead

ThreadPoolBulkhead exposes five metrics:

  • The current length of the queue ( resilience4j.bulkhead.queue.depth ),
  • The size of the current thread pool ( resilience4j.bulkhead.thread.pool.size ),
  • The core and maximum capacity of the thread pool ( resilience4j.bulkhead.core.thread.pool.size and resilience4j.bulkhead.max.thread.pool.size ), and
  • The capacity of the queue ( resilience4j.bulkhead.queue.capacity ).

First, we create ThreadPoolBulkheadConfig, ThreadPoolBulkheadRegistry and ThreadPoolBulkhead as before. Then, we create a MeterRegistry and bind the ThreadPoolBulkheadRegistry to it:

MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedThreadPoolBulkheadMetrics.ofThreadPoolBulkheadRegistry(registry).bindTo(meterRegistry);

After running the bulkhead-decorated operation a few times, we display the captured metrics:

The queue capacity - resilience4j.bulkhead.queue.capacity: 5.0
The queue depth - resilience4j.bulkhead.queue.depth: 1.0
The thread pool size - resilience4j.bulkhead.thread.pool.size: 5.0
The maximum thread pool size - resilience4j.bulkhead.max.thread.pool.size: 5.0
The core thread pool size - resilience4j.bulkhead.core.thread.pool.size: 3.0

In a real application, we would export the data to a monitoring system periodically and analyze it on a dashboard.

Pitfalls and good practices when implementing bulkheads

Make the bulkhead a singleton

All calls to a given remote service should go through the same Bulkhead instance. For a given remote service, Bulkhead must be a singleton.

If we don't enforce this, some areas of our codebase may make direct calls to the remote service, bypassing the Bulkhead. To prevent this, the actual call to the remote service should be in a core, internal layer, and other areas should use the bulkhead decorator exposed by the internal layer.

How do we ensure that new developers understand this intent in the future? Check out Tom's article, which shows one way of solving such problems: organizing the package structure to make the intent clear. Additionally, it shows how to enforce the intent by codifying it in ArchUnit tests.
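One minimal way to sketch the singleton intent in code: keep exactly one guard instance per remote service name in a single, application-wide holder, so every call path receives the same instance. The class and names below are hypothetical, for illustration; in a real Resilience4j application the BulkheadRegistry already plays this role:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: one shared guard instance per remote service name.
class GuardHolder<T> {
  private final ConcurrentMap<String, T> instances = new ConcurrentHashMap<>();
  private final Function<String, T> factory;

  GuardHolder(Function<String, T> factory) {
    this.factory = factory;
  }

  // computeIfAbsent guarantees exactly one instance per service name,
  // even when called concurrently from multiple threads
  T forService(String serviceName) {
    return instances.computeIfAbsent(serviceName, factory);
  }
}
```

With T being Bulkhead and the factory delegating to registry.bulkhead(name), all code that needs to call flightSearchService asks the holder for the same shared instance instead of creating its own.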

Combine with other Resilience4j modules

It is often more effective to combine a bulkhead with one or more of the other Resilience4j modules, such as retry and rate limiter. For example, we may want to retry after some delay if a BulkheadFullException occurs.
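Resilience4j's Decorators utility class can stack a retry on top of a bulkhead. The sketch below uses default retry settings and a hypothetical supplier standing in for the remote flight search call; note that the order matters - adding the retry after the bulkhead makes the retry wrap the bulkhead, so a rejected call can be retried after a delay:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

Bulkhead bulkhead = Bulkhead.ofDefaults("flightSearchService");
Retry retry = Retry.ofDefaults("flightSearchService");

// Hypothetical supplier standing in for the remote flight search call
Supplier<String> flightsSupplier = () -> "flight results";

// The retry is added after the bulkhead, so it wraps the bulkhead:
// a call that fails with BulkheadFullException is retried after a delay.
Supplier<String> decorated = Decorators.ofSupplier(flightsSupplier)
    .withBulkhead(bulkhead)
    .withRetry(retry)
    .decorate();
```

Calling decorated.get() now first asks the retry, which in turn goes through the bulkhead before reaching the remote service.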

Conclusion

In this article, we learned how to use Resilience4j's Bulkhead module to set a limit on the concurrent calls we make to a remote service. We learned why this is important and saw a few practical examples of how to configure it.

A complete application demonstrating these ideas is available [on GitHub](https://github.com/thombergs/code-examples/tree/master/resilience4j/bulkhead).


This article is translated from: Implementing Bulkhead with Resilience4j-Reflectoring

