So far in this series, we have learned about Resilience4j and its [Retry](https://icodewalker.com/blog/261/), [RateLimiter](https://icodewalker.com/blog/288/) and [TimeLimiter](https://icodewalker.com/blog/302/) modules. In this article, we will explore the Bulkhead module. We will find out what problem it solves, when and how to use it, and look at a few examples.
Code example
This article is accompanied by a working code example [on GitHub](https://github.com/thombergs/code-examples/tree/master/resilience4j/bulkhead).
What is Resilience4j?
Please refer to the description in the previous article for a quick intro to the [general working principle of Resilience4j](https://icodewalker.com/blog/261/#what-is-resilience4j).
What is fault isolation?
A few years ago, we ran into a production issue where one of the servers stopped responding to the health check, and the load balancer took the server out of the pool.
Just as we started investigating this issue, there was a second alert—another server had stopped responding to health checks and was also taken out of the pool.
Within a few minutes, every server had stopped responding to health checks, and our service was completely down.
We were using Redis to cache some data for a couple of features supported by the application. As we found out later, the Redis cluster had a problem at around the same time and had stopped accepting new connections. We were using the Jedis library to connect to Redis, and the default behavior of that library was to block the calling thread indefinitely until a connection was established.
Our service was hosted on Tomcat, whose default request-handling thread pool size is 200 threads. So every request whose code path connected to Redis ended up blocking its thread indefinitely.
Within minutes, all 2,000 threads across the cluster had blocked indefinitely - there were no free threads left even to respond to the load balancer's health checks.
The service itself supported several features, and not all of them required access to the Redis cache. But a problem in this one area ended up affecting the entire service.
This is exactly the problem that fault isolation solves - it prevents a problem in one area of the service from affecting the entire service.
While what happened to our service was an extreme example, it shows how a slow upstream dependency can affect an unrelated area of the calling service.
If we set a limit of 20 concurrent requests for Redis on each server instance, then when Redis connection problems occur, only these threads will be affected. The remaining request processing threads can continue to provide services for other requests.
One way to implement fault isolation is to set a limit on the number of concurrent calls we make to each remote service. We treat calls to different remote services as different, isolated pools and set a limit on how many calls can be made concurrently in each.
The term bulkhead itself comes from its use in ships, where the bottom of the ship is divided into separate parts. If there are cracks and water starts to flow in, only that part will be filled with water. This prevents the entire ship from sinking.
Resilience4j bulkhead concept
resilience4j-bulkhead works similarly to the other Resilience4j modules. We provide it with the code we want to execute as a functional construct - a lambda expression that makes the remote call, a `Supplier` of a value retrieved from the remote service, etc. - and the bulkhead decorates it with code to control the number of concurrent calls.
Resilience4j provides two types of bulkheads - `SemaphoreBulkhead` and `ThreadPoolBulkhead`.
`SemaphoreBulkhead` internally uses `java.util.concurrent.Semaphore` to control the number of concurrent calls and executes our code on the current thread.
`ThreadPoolBulkhead` uses a thread from a thread pool to execute our code. It internally uses a `java.util.concurrent.ArrayBlockingQueue` and a `java.util.concurrent.ThreadPoolExecutor` to control the number of concurrent calls.
SemaphoreBulkhead
Let's look at the configurations associated with the semaphore bulkhead and what they mean.
`maxConcurrentCalls` determines the maximum number of concurrent calls we can make to the remote service. We can think of this value as the number of permits the semaphore is initialized with.
Any thread that attempts to call the remote service beyond this limit can either get a `BulkheadFullException` immediately or wait for some time for a permit to be released by another thread. This is determined by the `maxWaitDuration` value.
When multiple threads are waiting for permits, the `fairCallHandlingEnabled` configuration determines whether the waiting threads acquire permits in first-in, first-out order.
Finally, the `writableStackTraceEnabled` configuration lets us reduce the amount of information in the stack trace when a `BulkheadFullException` occurs. This can be useful because, without it, our logs could get filled with many similar stack traces when the exception occurs multiple times. Usually, when reading logs, just knowing that a `BulkheadFullException` has occurred is enough.
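Putting these together, a configuration touching all four settings might look like the following sketch. The values are illustrative, not recommendations; note that in the Resilience4j builder API the fair-call-handling setting is exposed through the `fairCallHandlingStrategyEnabled()` method:

```java
import io.github.resilience4j.bulkhead.BulkheadConfig;
import java.time.Duration;

BulkheadConfig config = BulkheadConfig.custom()
    .maxConcurrentCalls(10)                   // at most 10 concurrent calls
    .maxWaitDuration(Duration.ofMillis(500))  // wait up to 500ms for a permit
    .fairCallHandlingStrategyEnabled(true)    // waiting threads get permits in FIFO order
    .writableStackTraceEnabled(false)         // keep BulkheadFullException stack traces short
    .build();
```
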
ThreadPoolBulkhead
`coreThreadPoolSize`, `maxThreadPoolSize`, `keepAliveDuration` and `queueCapacity` are the main configurations associated with `ThreadPoolBulkhead`. `ThreadPoolBulkhead` internally uses these configurations to construct a `ThreadPoolExecutor`.
The internal `ThreadPoolExecutor` executes incoming tasks using one of the available free threads. If no thread is free to execute an incoming task, the task is enqueued to be executed later, when a thread becomes available. If the `queueCapacity` has been reached, the remote call is rejected with a `BulkheadFullException`.
`ThreadPoolBulkhead` also has a `writableStackTraceEnabled` configuration to control the amount of information in the stack trace of a `BulkheadFullException`.
Using the Resilience4j bulkhead module
Let's see how to use the various features available in the resilience4j-bulkhead module.
We will use the same example as in the previous articles in this series. Suppose we are building a website for an airline that lets its customers search for and book flights. Our service talks to a remote service encapsulated by the class `FlightSearchService`.
SemaphoreBulkhead
When working with the semaphore-based bulkhead, `BulkheadRegistry`, `BulkheadConfig` and `Bulkhead` are the main abstractions we work with.
`BulkheadRegistry` is a factory for creating and managing `Bulkhead` objects.
`BulkheadConfig` encapsulates the `maxConcurrentCalls`, `maxWaitDuration`, `writableStackTraceEnabled` and `fairCallHandlingEnabled` configurations. Each `Bulkhead` object is associated with a `BulkheadConfig`.
The first step is to create a `BulkheadConfig`:
BulkheadConfig config = BulkheadConfig.ofDefaults();
This creates a `BulkheadConfig` with default values for `maxConcurrentCalls` (25), `maxWaitDuration` (0s), `writableStackTraceEnabled` (true) and `fairCallHandlingEnabled` (true).
Suppose we want to limit the number of concurrent calls to 2, and we are willing to wait 2 seconds for the thread to get permission:
BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(2))
  .build();
Then we create a `Bulkhead`:
BulkheadRegistry registry = BulkheadRegistry.of(config);
Bulkhead bulkhead = registry.bulkhead("flightSearchService");
Now let's express our code to run a flight search as a `Supplier` and decorate it using the `bulkhead`:
Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlightsTakingOneSecond(request);
Supplier<List<Flight>> decoratedFlightsSupplier =
  Bulkhead.decorateSupplier(bulkhead, flightsSupplier);
Finally, let's call the decorated operation a few times to understand how the bulkhead works. We can use `CompletableFuture` to simulate concurrent flight search requests from users:
for (int i=0; i<4; i++) {
  CompletableFuture
    .supplyAsync(decoratedFlightsSupplier)
    .thenAccept(flights -> System.out.println("Received results"));
}
The timestamp and thread name in the output show that among the 4 concurrent requests, the first two requests passed immediately:
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 11:42:13 187; current thread = ForkJoinPool.commonPool-worker-5
Flight search successful at 11:42:13 226
Flight search successful at 11:42:13 226
Received results
Received results
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-9
Searching for flights; current time = 11:42:14 239; current thread = ForkJoinPool.commonPool-worker-7
Flight search successful at 11:42:14 239
Flight search successful at 11:42:14 239
Received results
Received results
The third and fourth requests were granted permission only after 1 second, after the previous request was completed.
If a thread cannot acquire a permit within the `maxWaitDuration`, a `BulkheadFullException` is thrown:
Caused by: io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:49)
at io.github.resilience4j.bulkhead.internal.SemaphoreBulkhead.acquirePermission(SemaphoreBulkhead.java:164)
at io.github.resilience4j.bulkhead.Bulkhead.lambda$decorateSupplier$5(Bulkhead.java:194)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
... 6 more
Apart from the first line, the other lines of the stack trace don't add much value. If the `BulkheadFullException` occurs multiple times, these stack trace lines would repeat in our log files.
We can reduce the amount of information generated in the stack trace by setting the `writableStackTraceEnabled` configuration to `false`:
BulkheadConfig config = BulkheadConfig.custom()
  .maxConcurrentCalls(2)
  .maxWaitDuration(Duration.ofSeconds(1))
  .writableStackTraceEnabled(false)
  .build();
Now, when a `BulkheadFullException` occurs, only a single line is present in the stack trace:
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results
Similar to the other Resilience4j modules we have seen, `Bulkhead` also provides additional methods like `decorateCheckedSupplier()`, `decorateCompletionStage()`, `decorateRunnable()`, `decorateConsumer()` etc., so we can provide our code in constructs other than a `Supplier`.
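For instance, a sketch of `decorateRunnable()` for an operation that returns no value might look like this (`refreshFlightCache()` is a hypothetical method, used only for illustration):

```java
// Hypothetical fire-and-forget operation guarded by the same bulkhead
Runnable refreshTask = () -> service.refreshFlightCache();
Runnable decoratedTask = Bulkhead.decorateRunnable(bulkhead, refreshTask);

// Runs on the current thread; throws BulkheadFullException if no permit is available
decoratedTask.run();
```
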
ThreadPoolBulkhead
When working with the thread-pool-based bulkhead, `ThreadPoolBulkheadRegistry`, `ThreadPoolBulkheadConfig` and `ThreadPoolBulkhead` are the main abstractions we work with.
`ThreadPoolBulkheadRegistry` is a factory for creating and managing `ThreadPoolBulkhead` objects.
`ThreadPoolBulkheadConfig` encapsulates the `coreThreadPoolSize`, `maxThreadPoolSize`, `keepAliveDuration` and `queueCapacity` configurations. Each `ThreadPoolBulkhead` object is associated with a `ThreadPoolBulkheadConfig`.
The first step is to create a `ThreadPoolBulkheadConfig`:
ThreadPoolBulkheadConfig config =
ThreadPoolBulkheadConfig.ofDefaults();
This creates a `ThreadPoolBulkheadConfig` with default values for `coreThreadPoolSize` (number of available processors - 1), `maxThreadPoolSize` (number of available processors), `keepAliveDuration` (20ms) and `queueCapacity` (100).
Suppose we want to limit the number of concurrent calls to 2:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .build();
Then we create a `ThreadPoolBulkhead`:
ThreadPoolBulkheadRegistry registry = ThreadPoolBulkheadRegistry.of(config);
ThreadPoolBulkhead bulkhead = registry.bulkhead("flightSearchService");
Now let's express our code to run a flight search as a `Supplier` and decorate it using the `bulkhead`:
Supplier<List<Flight>> flightsSupplier =
() -> service.searchFlightsTakingOneSecond(request);
Supplier<CompletionStage<List<Flight>>> decoratedFlightsSupplier =
ThreadPoolBulkhead.decorateSupplier(bulkhead, flightsSupplier);
Unlike `Bulkhead.decorateSupplier()`, which returned a `Supplier<List<Flight>>`, `ThreadPoolBulkhead.decorateSupplier()` returns a `Supplier<CompletionStage<List<Flight>>>`. This is because `ThreadPoolBulkhead` does not execute the code synchronously on the current thread.
Finally, let's call the decorated operation a few times to understand how the bulkhead works:
for (int i=0; i<3; i++) {
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      if (r != null) {
        System.out.println("Received results");
      }
      if (t != null) {
        t.printStackTrace();
      }
    });
}
The timestamps and thread names in the output show that while the first two requests executed immediately, the third request was queued and executed later by one of the threads that freed up:
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-1
Searching for flights; current time = 16:15:00 097; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:00 136
Flight search successful at 16:15:00 135
Received results
Received results
Searching for flights; current time = 16:15:01 151; current thread = bulkhead-flightSearchService-2
Flight search successful at 16:15:01 151
Received results
If there are no free threads and no capacity left in the queue, a `BulkheadFullException` is thrown:
Exception in thread "main" io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
at io.github.resilience4j.bulkhead.BulkheadFullException.createBulkheadFullException(BulkheadFullException.java:64)
at io.github.resilience4j.bulkhead.internal.FixedThreadPoolBulkhead.submit(FixedThreadPoolBulkhead.java:157)
... other lines omitted ...
We can use the `writableStackTraceEnabled` configuration to reduce the amount of information generated in the stack trace:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .writableStackTraceEnabled(false)
  .build();
Now, when a `BulkheadFullException` occurs, only a single line is present in the stack trace:
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-3
Searching for flights; current time = 12:27:58 658; current thread = ForkJoinPool.commonPool-worker-5
io.github.resilience4j.bulkhead.BulkheadFullException: Bulkhead 'flightSearchService' is full and does not permit further calls
Flight search successful at 12:27:58 699
Flight search successful at 12:27:58 699
Received results
Received results
Context propagation
Sometimes we store data in a `ThreadLocal` variable and read it in a different area of the code. We do this to avoid explicitly passing the data as a parameter along a chain of methods, especially when the value is not directly related to the core business logic we are implementing.
For example, we may want to log the current user ID, transaction ID, or some request tracking ID in every log statement to make it easier to search the logs. For such scenarios, using a `ThreadLocal` is a useful technique.
When using a `ThreadPoolBulkhead`, since our code is not executed on the current thread, the data we stored in `ThreadLocal` variables will not be available in the other thread.
Let's look at an example to understand this problem. First we define a `RequestTrackingIdHolder` class, a wrapper class around a `ThreadLocal`:
class RequestTrackingIdHolder {
  static ThreadLocal<String> threadLocal = new ThreadLocal<>();

  static String getRequestTrackingId() {
    return threadLocal.get();
  }

  static void setRequestTrackingId(String id) {
    if (threadLocal.get() != null) {
      threadLocal.set(null);
      threadLocal.remove();
    }
    threadLocal.set(id);
  }

  static void clear() {
    threadLocal.set(null);
    threadLocal.remove();
  }
}
The static methods make it easy to set and get the `ThreadLocal` value. Next, we set a request tracking ID before calling the bulkhead-decorated flight search operation:
for (int i=0; i<2; i++) {
  String trackingId = UUID.randomUUID().toString();
  System.out.println("Setting trackingId " + trackingId + " on parent, main thread before calling flight search");
  RequestTrackingIdHolder.setRequestTrackingId(trackingId);
  decoratedFlightsSupplier
    .get()
    .whenComplete((r,t) -> {
      // other lines omitted
    });
}
The sample output shows that this value was not available in the bulkhead-managed thread:
Setting trackingId 98ff99df-466a-47f7-88f7-5e31fc8fcb6b on parent, main thread before calling flight search
Setting trackingId 6b98d73c-a590-4a20-b19d-c85fea783caf on parent, main thread before calling flight search
Searching for flights; current time = 19:53:53 799; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:53 824
Received results
Searching for flights; current time = 19:53:54 836; current thread = bulkhead-flightSearchService-1; Request Tracking Id = null
Flight search successful at 19:53:54 836
Received results
To solve this problem, `ThreadPoolBulkhead` provides a `ContextPropagator`. `ContextPropagator` is an abstraction for retrieving, copying, and cleaning up values across thread boundaries. It defines an interface with methods to get a value from the current thread (`retrieve()`), copy it to the new executing thread (`copy()`), and finally clean it up on the executing thread (`clear()`).
Let's implement a `RequestTrackingIdPropagator`:
class RequestTrackingIdPropagator implements ContextPropagator {
  @Override
  public Supplier<Optional> retrieve() {
    System.out.println("Getting request tracking id from thread: " + Thread.currentThread().getName());
    return () -> Optional.of(RequestTrackingIdHolder.getRequestTrackingId());
  }

  @Override
  public Consumer<Optional> copy() {
    return optional -> {
      System.out.println("Setting request tracking id " + optional.get() + " on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.setRequestTrackingId(s.toString()));
    };
  }

  @Override
  public Consumer<Optional> clear() {
    return optional -> {
      System.out.println("Clearing request tracking id on thread: " + Thread.currentThread().getName());
      optional.ifPresent(s -> RequestTrackingIdHolder.clear());
    };
  }
}
We provide the `ContextPropagator` to the `ThreadPoolBulkhead` by setting it on the `ThreadPoolBulkheadConfig`:
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
  .maxThreadPoolSize(2)
  .coreThreadPoolSize(1)
  .queueCapacity(1)
  .contextPropagator(new RequestTrackingIdPropagator())
  .build();
Now, the sample output shows that the request tracking ID was made available in the bulkhead-managed thread:
Setting trackingId 71d44cb8-dab6-4222-8945-e7fd023528ba on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting trackingId 5f9dd084-f2cb-4a20-804b-038828abc161 on parent, main thread before calling flight search
Getting request tracking id from thread: main
Setting request tracking id 71d44cb8-dab6-4222-8945-e7fd023528ba on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:56 508; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 71d44cb8-dab6-4222-8945-e7fd023528ba
Flight search successful at 20:07:56 538
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Setting request tracking id 5f9dd084-f2cb-4a20-804b-038828abc161 on thread: bulkhead-flightSearchService-1
Searching for flights; current time = 20:07:57 542; current thread = bulkhead-flightSearchService-1; Request Tracking Id = 5f9dd084-f2cb-4a20-804b-038828abc161
Flight search successful at 20:07:57 542
Clearing request tracking id on thread: bulkhead-flightSearchService-1
Received results
Bulkhead events
Both `Bulkhead` and `ThreadPoolBulkhead` have an `EventPublisher` which generates events of the following types:
- `BulkheadOnCallPermittedEvent`,
- `BulkheadOnCallRejectedEvent`, and
- `BulkheadOnCallFinishedEvent`.
We can listen to these events and log them, for example:
Bulkhead bulkhead = registry.bulkhead("flightSearchService");
bulkhead.getEventPublisher().onCallPermitted(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallFinished(e -> System.out.println(e.toString()));
bulkhead.getEventPublisher().onCallRejected(e -> System.out.println(e.toString()));
The sample output shows what was logged:
2020-08-26T12:27:39.790435: Bulkhead 'flightSearch' permitted a call.
... other lines omitted ...
2020-08-26T12:27:40.290987: Bulkhead 'flightSearch' rejected a call.
... other lines omitted ...
2020-08-26T12:27:41.094866: Bulkhead 'flightSearch' has finished a call.
Bulkhead metrics
SemaphoreBulkhead
`Bulkhead` exposes two metrics:
- the maximum number of available permissions (`resilience4j.bulkhead.max.allowed.concurrent.calls`), and
- the number of available permissions (`resilience4j.bulkhead.available.concurrent.calls`).
The `bulkhead.max.allowed` metric is the same as the `maxConcurrentCalls` that we configure on the `BulkheadConfig`.
First, we create `BulkheadConfig`, `BulkheadRegistry` and `Bulkhead` as before. Then we create a `MeterRegistry` and bind the `BulkheadRegistry` to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedBulkheadMetrics.ofBulkheadRegistry(registry)
.bindTo(meterRegistry);
After running the bulkhead-decorated operation a few times, we display the captured metrics:
Consumer<Meter> meterConsumer = meter -> {
  String desc = meter.getId().getDescription();
  String metricName = meter.getId().getName();
  Double metricValue = StreamSupport.stream(meter.measure().spliterator(), false)
    .filter(m -> m.getStatistic().name().equals("VALUE"))
    .findFirst()
    .map(m -> m.getValue())
    .orElse(0.0);
  System.out.println(desc + " - " + metricName + ": " + metricValue);
};
meterRegistry.forEachMeter(meterConsumer);
This is some sample output:
The maximum number of available permissions - resilience4j.bulkhead.max.allowed.concurrent.calls: 8.0
The number of available permissions - resilience4j.bulkhead.available.concurrent.calls: 3.0
ThreadPoolBulkhead
`ThreadPoolBulkhead` exposes five metrics:
- the current length of the queue (`resilience4j.bulkhead.queue.depth`),
- the current size of the thread pool (`resilience4j.bulkhead.thread.pool.size`),
- the core and maximum sizes of the thread pool (`resilience4j.bulkhead.core.thread.pool.size` and `resilience4j.bulkhead.max.thread.pool.size`), and
- the capacity of the queue (`resilience4j.bulkhead.queue.capacity`).
First, we create `ThreadPoolBulkheadConfig`, `ThreadPoolBulkheadRegistry` and `ThreadPoolBulkhead` as before. Then we create a `MeterRegistry` and bind the `ThreadPoolBulkheadRegistry` to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedThreadPoolBulkheadMetrics.ofThreadPoolBulkheadRegistry(registry).bindTo(meterRegistry);
After running the bulkhead-decorated operation a few times, we display the captured metrics:
The queue capacity - resilience4j.bulkhead.queue.capacity: 5.0
The queue depth - resilience4j.bulkhead.queue.depth: 1.0
The thread pool size - resilience4j.bulkhead.thread.pool.size: 5.0
The maximum thread pool size - resilience4j.bulkhead.max.thread.pool.size: 5.0
The core thread pool size - resilience4j.bulkhead.core.thread.pool.size: 3.0
In a real application, we would export this data to a monitoring system periodically and analyze it on a dashboard.
Pitfalls and good practices when implementing bulkheads
Make the bulkhead a singleton
All calls to a given remote service should go through the same `Bulkhead` instance. For a given remote service, the `Bulkhead` must be a singleton.
If we don't enforce this, some areas of our codebase may make direct calls to the remote service, bypassing the `Bulkhead`. To prevent this, the actual call to the remote service should be in a core, internal layer, with other areas using the bulkhead decorator exposed by that internal layer.
How do we ensure that a new developer understands this intent in the future? Check out Tom's article which shows one way of solving such problems: organizing the package structure to make such intents clear. Additionally, it shows how to enforce this by codifying the intent in ArchUnit tests.
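A simpler safeguard already built into Resilience4j is that `BulkheadRegistry` caches instances by name, so as long as all callers look the bulkhead up from one shared registry, they get the same object:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.BulkheadRegistry;

BulkheadRegistry registry = BulkheadRegistry.of(BulkheadConfig.ofDefaults());

// Both lookups return the same cached instance, so all callers
// share one pool of permits for the remote service
Bulkhead first = registry.bulkhead("flightSearchService");
Bulkhead second = registry.bulkhead("flightSearchService");
```

Keeping the registry itself a singleton (e.g. a Spring bean) is then enough to make the bulkhead a de facto singleton.
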
Combine with other Resilience4j modules
It is more effective to combine a bulkhead with one or more of the other Resilience4j modules like retry and rate limiter. For example, we may want to retry after some delay if there is a `BulkheadFullException`.
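As a sketch of such a combination (values are illustrative and the `bulkhead` and `flightsSupplier` from the earlier example are assumed), we could wrap the bulkhead-decorated supplier in a `Retry` that retries only on `BulkheadFullException`:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadFullException;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)                        // original call plus up to 2 retries
    .waitDuration(Duration.ofMillis(500))  // delay between attempts
    .retryExceptions(BulkheadFullException.class)
    .build();
Retry retry = Retry.of("flightSearchService", retryConfig);

// Order matters: the bulkhead guards each individual attempt,
// and the retry wraps the whole bulkhead-decorated call
Supplier<List<Flight>> withBulkhead =
    Bulkhead.decorateSupplier(bulkhead, flightsSupplier);
Supplier<List<Flight>> withBulkheadAndRetry =
    Retry.decorateSupplier(retry, withBulkhead);
```
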
Conclusion
In this article, we learned how to use Resilience4j's Bulkhead module to set a limit on the concurrent calls we make to a remote service. We learned why this is important and also saw some practical examples of how to configure it.
You can play around with a complete application demonstrating these ideas using the code [on GitHub](https://github.com/thombergs/code-examples/tree/master/resilience4j/bulkhead).
This article was translated from: Implementing Bulkhead with Resilience4j - Reflectoring