I was shocked! CompletableFuture actually has performance problems!

Hello, I'm Waiwai.

Over the National Day holiday I had some free time, so I hand-wrote a bit of code for the competition I mentioned earlier. The goal is to stay in the top 100 and pick up a competition T-shirt.

So far I am still among the top 50 teams, so that looks fairly safe.

In fact, I think everyone's ideas for flexible load balancing are much the same. It comes down to who can collect the key information and put it to use.

Since the competition is based on Dubbo, I came across this spot while debugging:

org.apache.dubbo.rpc.protocol.AbstractInvoker#waitForResultIfSync

First look at the line of code I framed. asyncResult holds a CompletableFuture, and the code calls its get() method with a timeout. The timeout is Integer.MAX_VALUE milliseconds, so in theory the effect is equivalent to calling get() without a timeout.

Intuitively, using the plain get() method should be no problem, and it is even easier to understand.

So why not use get()?

In fact, the comment on the method already gives the reason, presumably for people like me who would ask exactly this question:

These words caught my eye:

have serious performance drop.

That is, performance degrades severely.

Roughly, the comment says that we must call java.util.concurrent.CompletableFuture#get(long, java.util.concurrent.TimeUnit) instead of get(), because get() has been proven to cause a serious performance drop.
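Since the framed line is not reproduced here, the following is a minimal sketch of the same calling pattern. It is not Dubbo's code: the class name TimedGetSketch and the helper waitForResult are made up for illustration. It only shows that a timed get() with an Integer.MAX_VALUE millisecond timeout (roughly 24.8 days) can stand in for a plain get().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical class, for illustration only (not part of Dubbo).
public class TimedGetSketch {

    /**
     * Blocks until the future completes, but goes through get(long, TimeUnit)
     * with a timeout so large that in practice it behaves like get().
     */
    static <T> T waitForResult(CompletableFuture<T> future)
            throws InterruptedException, ExecutionException, TimeoutException {
        return future.get(Integer.MAX_VALUE, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> "pong");
        System.out.println(waitForResult(future));
        // Integer.MAX_VALUE milliseconds expressed in whole days.
        System.out.println(TimeUnit.MILLISECONDS.toDays(Integer.MAX_VALUE) + " days");
    }
}
```

The only visible cost of this pattern is one extra checked exception, TimeoutException, which in practice is never thrown.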

For Dubbo, waitForResultIfSync sits on the main call path. By my conservative estimate, more than 90% of requests go through this method and block there waiting for the result. So if this method has a problem, Dubbo's performance suffers.

Dubbo, as a middleware, may run in various JDK versions. For a specific JDK version, this optimization is indeed a great help for performance improvement.

Even if we don't talk about Dubbo, when we use CompletableFuture, the get() method is a method we often use.

Besides, I am very familiar with the call chain of this method.

The first article I wrote for my public account two years ago was "Asynchronous transformation of the new features of Dubbo 2.7".

Back then this part of the code definitely did not look like this; at the very least, the notice was not there.

If it had been, I would certainly have noticed it when I first wrote about it.

Sure enough, I went to look through it. Although the picture is very blurry, I can still vaguely see that the get() method was actually called before:

I also call it the most "showy" line of code.

Because this line of code is the key code for Dubbo to convert from asynchronous to synchronous.

Everything above is just an introduction. This article is not about Dubbo itself.

It is mainly about the problem with CompletableFuture's get().

Don't worry, this will definitely not come up in an interview. It is just that, once you know about it, you can pay a little attention when writing code if your JDK version happens to predate the fix.

Follow Dubbo's example and add the same NOTICE at the call site to look like an expert. When someone asks about it, you can tell the whole story.

Or, when you happen to see someone else calling get() this way, remark casually: there may be a performance problem here, you might want to look into it.

What performance problem?

According to the information in the Dubbo comment, I don't know what the problem is, but I know where to find the problem.

This kind of problem must be recorded in the openJDK bug list, so the first stop is to search for keywords here:

https://bugs.openjdk.java.net/projects/JDK/issues/

Generally these are old bugs, and it takes a lot of searching to find the information you want.

This time, though, I got lucky: the very first result was what I was looking for. It went so smoothly that I felt a little uneasy. Is this the legendary National Day gift? I hardly dare to believe it.

The title is: Performance improvements to CompletableFuture.

It mentioned a BUG numbered 8227019.

https://bugs.openjdk.java.net/browse/JDK-8227019

Let's take a look at what this BUG describes.

Roughly translated, the title says that CompletableFuture.waitingGet contains a loop that calls Runtime.availableProcessors, and that calling this method so frequently is a bad idea.

The detailed description mentions another bug, numbered 8227006. That bug explains why frequent calls to availableProcessors are a bad idea, but let's set it aside for now.

First study the line of code he mentioned:

    spins = (Runtime.getRuntime().availableProcessors() > 1) ?
        1 << 8 : 0; // Use brief spin-wait on multiprocessors

He said it is located in waitingGet, let's go and see what is going on.

But my local JDK version is 1.8.0_271, and its waitingGet source code looks like this:

java.util.concurrent.CompletableFuture#waitingGet

Never mind what these lines mean for now. What I noticed is that the code mentioned in the bug is not there; I only see spins = SPINS. SPINS is initialized from Runtime.getRuntime().availableProcessors(), but the field is static and final, so the "frequent call" described in the bug does not happen.

So I realized that my version was wrong, it should be the code after it was fixed, so I downloaded several previous versions.

Finally found such code in the JDK 1.8.0_202 version:

The difference from the source code in the previous screenshot is that the former has an extra SPINS field, which caches the return value of Runtime.getRuntime().availableProcessors().

The reason I must find this line of code is to prove that such code indeed appeared in some JDK versions.

Okay, now let's take a look at what the waitingGet method does.

First, when get() is called and the result is still null, the asynchronous computation has not finished yet, so waitingGet is called:

When we come to the waitingGet method, we only focus on the two branch judgments related to BUG:

First initialize the value of spins to -1.

Then when the result is null, the while loop continues.

So on the first pass through the loop, availableProcessors is called. On a multi-processor machine, spins is then set to 1<<8, which is 256.

On the following passes, execution takes the spins > 0 branch and draws a random number. If the random value is greater than or equal to 0, spins is decremented by one.

Only when spins has dropped to 0 does execution reach the logic I framed:

In other words, spins counts down from 256 to 0, and because of the random check the number of loop passes will exceed 256.

There is one more important condition: on every pass the loop condition is checked again, that is, whether the result is still null. Only if it is still null does the countdown continue.
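The countdown just described can be mimicked with a small simulation. This is a sketch, not the JDK source: the class and method names are made up, ThreadLocalRandom.current().nextInt() stands in for the JDK-internal random call, and the result is assumed never to arrive so that the whole spin budget is used up.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical simulation of the spin countdown; not the JDK implementation.
public class SpinWaitSketch {

    /** Counts loop passes until the spin budget is exhausted. */
    static int passesUntilSpinExhausted(int processors) {
        int spins = -1;
        int passes = 0;
        while (true) { // stands in for: while the result is still null
            passes++;
            if (spins < 0) {
                // First pass: decide the spin budget from the processor count.
                spins = (processors > 1) ? 1 << 8 : 0;
            } else if (spins > 0) {
                // Decrement only when the random value is non-negative.
                if (ThreadLocalRandom.current().nextInt() >= 0) {
                    --spins;
                }
            } else {
                // Budget used up; the real method would now queue and park.
                return passes;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("multi-processor passes:  " + passesUntilSpinExhausted(8));
        System.out.println("single-processor passes: " + passesUntilSpinExhausted(1));
    }
}
```

With one processor the budget is 0, so the loop falls through on its second pass. With several processors the count varies from run to run, since roughly half of the random draws are negative and do not decrement spins.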

So, what do you say this code is doing?

In fact, the comment has been written very clearly:

Use brief spin-wait on multiprocessors.

Brief is CET-4 vocabulary. Memorize it; it will be on the exam. It means "short". It is an adjective, and its superlative is briefest.

By the way, everyone should know the word spin too. I forgot to teach it earlier, so let's cover it now. Eyes on the blackboard:

So what the comment says is: if it is a multi-processor, use a short spin to wait.

The process of reducing from 256 to 0 is this "brief spin-wait".

But think about it carefully, in the process of spin waiting, the availableProcessors method is only called once when it enters the loop for the first time.

So why does it cost performance?

Yes, each call to get() triggers only one call to availableProcessors, but get() itself is called in a great many places.

Take Dubbo as an example. In most cases everyone uses the default synchronous invocation, so every call goes through the asynchronous-to-synchronous step and blocks waiting for the result. That means get() is called once per request, and therefore availableProcessors is called once per request.

So what is the solution?

I showed it to you earlier: cache the return value of availableProcessors in a field:

This comes with one "problem". If we cache the value, and the runtime environment changes from multi-processor to single-processor while the program is running, the cached value becomes inaccurate, although such a change is unlikely. Even if it does happen, it does not matter much; it only costs a little performance.

Hence the code you saw earlier, which is exactly "we can cache this value in a field":

The specific code changes are as follows:

http://cr.openjdk.java.net/~shade/8227018/webrev.01/src/share/classes/java/util/concurrent/CompletableFuture.java.udiff.html
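To make the idea of the change concrete, here is a minimal sketch contrasting a per-call lookup with a value cached in a static final field. The class and method names are hypothetical and the code is not copied from the webrev; it only illustrates the caching technique.

```java
// Hypothetical illustration of "cache this value in a field".
public class CachedSpinsSketch {

    // Evaluated once, when the class is initialized. CPUs brought online
    // or taken offline later are not reflected in this value.
    private static final int SPINS =
            (Runtime.getRuntime().availableProcessors() > 1) ? 1 << 8 : 0;

    /** Before the fix: asks the runtime on every call. */
    static int spinsLookedUpPerCall() {
        return (Runtime.getRuntime().availableProcessors() > 1) ? 1 << 8 : 0;
    }

    /** After the fix: reads the cached constant. */
    static int spinsFromCache() {
        return SPINS;
    }

    public static void main(String[] args) {
        System.out.println("per call: " + spinsLookedUpPerCall());
        System.out.println("cached:   " + spinsFromCache());
    }
}
```

Both methods normally return the same number. The difference is only in how often the potentially expensive runtime call is made.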

So when you read this part of the source, you will see a long comment above the SPINS field, like this:

Let me translate:

1. In the waitingGet method, spin before blocking.

2. There is no need to spin on a single processor.

3. The cost of calling the Runtime.availableProcessors method is very high, so the value is cached here. But this value is the number of CPUs available at the first initialization. If a system has only one CPU available at startup, the value of SPINS will be initialized to 0, even if more CPUs are brought online later, it will not change.

With the background from the bug description above, you can see why such a long comment was written here.

Some of you who actually go and read the code may see this instead:

What is going on? There is no SPINS-related code at all. Isn't this fooling honest people?

Don't panic and don't be impatient. I haven't finished yet.

Let us turn our attention to this sentence in the picture:

You only need to fix this in JDK 8, because the code for JDK 9 and later is not written in this way.

For example, in JDK 9 the entire SPINS logic was removed outright. The short spin-wait is gone:

http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/f3af17da360b

Although the short spin-wait was removed, we still picked up a neat trick from it.

Question: how do you get a spin-wait effect without introducing any timing?

The answer is the piece of code that was removed.

That said, the first time I saw this code it felt awkward to me. How much time can such a short spin really buy?

The spin exists to postpone the park call in the subsequent logic, which is a heavier operation. But I think the benefit of this "brief spin-wait" is minimal.

So I can understand why the whole block was simply removed later. And when it was removed, the author was not aware that there was a bug here.

The author mentioned here is in fact the great Doug Lea.

Why do I say this?

According to the BUG No. 8227018 mentioned in this BUG link, they actually describe the same thing:

There is such a conversation in which David Holmes and Doug Lea appear:

Holmes mentioned the "cache this value in a field" solution, and it was approved by Doug.

Doug said: Spin is no longer needed in JDK 9.

Therefore, my personal understanding is that Doug removed the logic of SPIN without knowing that there was a BUG in this place. As for the reasons, my guess is that the benefits are really small, and the code is somewhat confusing. It's better to take it away and make it more intuitive to understand.

Doug Lea is familiar to everyone. Who is David Holmes?


He is one of the authors of "Java Concurrency in Practice". Show some respect and serve the tea.

And if you remember my earlier articles well enough, you will recall that he already appeared in the one about a bug written by Doug Lea:

An old friend shows up again. Friends, go ahead and post "dream crossover" in the comments.

What's the reason?

All the talk above boils down to one idea: calling Runtime.availableProcessors is expensive, so it should not be called frequently in CompletableFuture.waitingGet.

But why is availableProcessors expensive to call? What is the evidence? We have to take a look.

In this section, I will show you what the basis is.

The basis is in this BUG description:

https://bugs.openjdk.java.net/browse/JDK-8227006

The title says: In the linux environment, the execution time of Runtime.availableProcessors has increased by 100 times.

A 100-fold increase implies a comparison between two versions. So which two versions are they?

On JDK versions prior to 1.8b191, the following sample program can achieve more than 4 million calls to Runtime.availableProcessors per second.

But on JDK build 1.8b191 and all subsequent major and minor versions (including 11), the maximum call volume it can achieve is about 40,000 times per second, and the performance is reduced by 100 times.

This leads to performance problems with CompletableFuture.waitingGet, which calls Runtime.availableProcessors in a loop. Because our application shows obvious performance problems in asynchronous code, waitingGet is where we first discovered the problem.

The test code is like this:

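    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;

    // Class name chosen for illustration; the original snippet shows only main().
    public class AvailableProcessorsBenchmark {

        public static void main(String[] args) throws Exception {
            AtomicBoolean stop = new AtomicBoolean();
            AtomicInteger count = new AtomicInteger();

            // Worker thread: call availableProcessors() as fast as possible.
            new Thread(() -> {
                while (!stop.get()) {
                    Runtime.getRuntime().availableProcessors();
                    count.incrementAndGet();
                }
            }).start();

            try {
                // Main thread: once per second, print the calls made since the last report.
                int lastCount = 0;
                while (true) {
                    Thread.sleep(1000);
                    int thisCount = count.get();
                    System.out.printf("%s calls/sec%n", thisCount - lastCount);
                    lastCount = thisCount;
                }
            } finally {
                // Reached only if the main thread is interrupted; stops the worker.
                stop.set(true);
            }
        }
    }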

According to the description of the BUG submitter, if you run on 64-bit Linux with JDK 1.8b182 and 1.8b191 respectively, you will find a difference of nearly 100 times.

As for why there is a 100-fold difference, a fellow named Fairoz Matte said he debugged it and found that the time is spent in the call to "OSContainer::is_containerized()":

He also pinned down the first build where the problem appears: 8u191 b02. Code from that build onward has the problem.

What the problematic build changed was to improve Docker container detection and the use of container resource configuration.

So if your JDK 8 is 8u191 b02 or later and does not yet include the fix, and your system makes these calls at very high concurrency, then congratulations: you have a chance to fall into this pit.

Then the experts in the thread proposed and discussed a number of solutions to this problem.

Some of them sound cumbersome and would require a lot of code.

In the end, in the spirit of keeping things simple, they chose the caching solution, which is the easiest to implement. It has a flaw, but the flaw is very unlikely to show up and is acceptable.

Look at the get method again

Now that we have picked up this not particularly useful piece of knowledge, let's look at why the get() method with a timeout does not have this problem.

java.util.concurrent.CompletableFuture#get(long, java.util.concurrent.TimeUnit)

First of all, you can see that the internally called methods are different:

The get() method with a timeout internally calls timedGet, passing in the timeout.

Step into timedGet to see why the timed get() is not affected:

The answer is already in the code comment: we deliberately do not spin here (as waitingGet does), because the call to nanoTime() above acts much like a spin.

It can be seen that in this method, there is no call to Runtime.availableProcessors at all, so there is no corresponding problem.
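To make the contrast concrete, here is a toy deadline-based wait. It is not the JDK's timedGet implementation: the class name, the awaitValue helper, and the 1 ms polling interval are all assumptions made for illustration. It only shows the general shape of a timed wait that computes a deadline with System.nanoTime() and never asks for the processor count.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Hypothetical toy, for illustration only; the JDK does not poll like this.
public class TimedWaitSketch {

    /** Waits until the slot holds a non-null value or the timeout elapses. */
    static <T> T awaitValue(AtomicReference<T> slot, long timeout, TimeUnit unit)
            throws TimeoutException, InterruptedException {
        // The deadline comes from nanoTime(); no spin budget, no processor lookup.
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        T value;
        while ((value = slot.get()) == null) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                throw new TimeoutException();
            }
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
            // Sleep briefly (at most 1 ms) before checking again.
            LockSupport.parkNanos(Math.min(remaining, 1_000_000L));
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<String> slot = new AtomicReference<>();
        new Thread(() -> {
            LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(50));
            slot.set("done");
        }).start();
        System.out.println(awaitValue(slot, 1, TimeUnit.SECONDS));
    }
}
```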

Now, let’s go back to where we started:

So tell me: if we changed asyncResult.get(Integer.MAX_VALUE, TimeUnit.MILLISECONDS) to asyncResult.get(), would the effect still be the same?

Definitely not.

Once again: as an open-source middleware, Dubbo may run on all sorts of JDK versions, and this method is core code on its main call path. For specific JDK versions, this optimization really does help performance a great deal.

So writing middleware is still a bit interesting.

Finally, here is an opportunity for you to contribute code to Dubbo.

In this class below:

org.apache.dubbo.rpc.AsyncRpcResult

It still has these two methods:

But the get() method above is only called from test classes:

You could change all of those calls to get(long timeout, TimeUnit unit) and then delete the get() method outright.

I think it can definitely be merged.

If you want to contribute to an open source project and familiarize yourself with the process, then this is a good little opportunity.


why技术
