
Hello, I am crooked.

A while ago, while browsing RocketMQ's issues, I came across a PR. Although I found it in the RocketMQ repository, it has nothing to do with RocketMQ itself.

It is purely a bug in the JDK.

Let me ask you a question first: Is LinkedBlockingQueue thread safe?

This is a stock interview question. If you can't answer it off the top of your head, you deserve a scolding.

The answer is: yes, it is thread-safe, because every operation on the queue goes through its two internal locks, one for puts and one for takes.

However, in one RocketMQ scenario, LinkedBlockingQueue was reliably shown to behave in a thread-unsafe way.

The conclusion first: traversing a LinkedBlockingQueue through a stream is problematic under multi-threading and can end up in an infinite loop.

It's an interesting one, so this article will walk you through it.

Make a Demo

Actually, I don't even need to write the demo myself. The link to the PR mentioned above is this:

https://github.com/apache/rocketmq/pull/3509

The discussion in that PR revolves largely around RocketMQ.

But partway through, an expert with the nickname areyouok hits the nail on the head and pinpoints the problem.

He posted a very simple piece of reproduction code with all the RocketMQ-specific parts stripped out.

As the saying goes, one generation plants the trees and the next enjoys the shade. Since areyouok's code is already there, I'll use it directly as the demo.

If he doesn't mind, and to show my respect, I'll venture to say: thank you, Mr. Lei, for the code.

I'll paste Mr. Lei's code first so that you can run it yourself while reading:

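import java.util.concurrent.LinkedBlockingQueue;

public class TestQueue {
    public static void main(String[] args) throws Exception {
        LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>(1000);
        // 10 worker threads keep adding to the tail and removing the head.
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                while (true) {
                    queue.offer(new Object());
                    queue.remove();
                }
            }).start();
        }
        // The main thread keeps scanning the queue through a stream.
        while (true) {
            System.out.println("begin scan, i still alive");
            // No element is ever null, so each scan walks the whole queue.
            queue.stream()
                    .filter(o -> o == null)
                    .findFirst()
                    .isPresent();
            Thread.sleep(100);
            System.out.println("finish scan, i still alive");
        }
    }
}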

Here is the core logic of the code above.

First, it starts 10 threads, each of which keeps calling the offer and remove methods.

Note that this is the no-argument remove method, which removes the head node.

To stress it again: LinkedBlockingQueue guards these operations with ReentrantLocks, so even when multiple threads call offer or remove concurrently, each must acquire the corresponding lock first. These operations are therefore thread-safe.
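To make that claim concrete, here is a minimal sketch of my own (the class name OfferRemoveDemo and the numbers are made up for illustration). Each worker offers two elements and removes one per round, so if offer and remove really are thread-safe, the final size must be exactly threads × rounds.

```java
import java.util.concurrent.LinkedBlockingQueue;

class OfferRemoveDemo {
    // Each worker offers two elements and removes one per round, so every
    // round leaves exactly one extra element in the queue.
    static int run(int threads, int rounds) {
        LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    queue.offer(new Object());
                    queue.offer(new Object());
                    queue.remove(); // no-argument remove: takes the head node
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        // A lost or duplicated update would make this differ from 4 * 10_000.
        System.out.println(run(4, 10_000));
    }
}
```

Note that nothing here traverses the queue. Only offer and remove are involved, and those are the operations the locks protect.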

Then the main thread runs an infinite loop that streams over the queue, looking for the first null element. Since the queue never holds a null, each scan has to traverse the whole queue.

The stream operation is only the trigger; the real key is the tryAdvance method.

Just take note of this method for now; I'll elaborate on it later.

In theory, this program should keep printing these two lines in alternation:

begin scan, i still alive
finish scan, i still alive

However, if you paste the code and run it on JDK 8, the console prints "begin scan, i still alive" and then goes quiet.

Or the two lines alternate a few times and then the output stops.

But if we leave the code untouched and only switch the JDK version (I happen to have JDK 15 at hand), the alternating output comes back.

So, based on this behavior, can I boldly guess that this is a bug in JDK 8?

Now that we have a demo that reproduces reliably on JDK 8, the next step is to track down the cause of the bug.

What is the reason?

First, let me tell you what I did after I got this problem.

It's simple. Think about it: the main thread should keep printing but prints nothing, so what is it doing?

My initial suspicion was that it was waiting for a lock.

How to verify that?

Friends, the cute little camera icon (the thread-dump button) is back:

With it, I can dump what every thread is doing at that moment.

But when I saw that the main thread's state was RUNNABLE, I was a little confused:

What's going on?

If it were waiting for a lock, it shouldn't be RUNNABLE, should it?

Dump again to verify:

If it is still RUNNABLE, the lock-wait theory can be ruled out.

There is a reason I deliberately show the thread dump being taken twice.

Many people take a single dump and try to analyze the problem from that one file. I think the right approach is to dump several times at different points in time and compare what the same thread is doing in each dump.
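The same "look twice" idea can be sketched in code. This is only an illustration of mine, not a replacement for a real thread dump: the class SampleTwice and its spinner thread are hypothetical, and Thread.getState() tells you far less than a dump does, since it has no stack trace.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SampleTwice {
    // Reads the target thread's state at two points in time, gapMillis apart.
    static Thread.State[] sample(Thread target, long gapMillis) {
        Thread.State first = target.getState();
        try {
            Thread.sleep(gapMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Thread.State second = target.getState();
        return new Thread.State[] {first, second};
    }

    // Starts a busy-looping thread (a stand-in for the stuck main thread),
    // samples it twice, then tells it to stop.
    static Thread.State[] demo() {
        AtomicBoolean running = new AtomicBoolean(true);
        Thread spinner = new Thread(() -> {
            while (running.get()) {
                // spin: never blocks, never waits
            }
        }, "spinner");
        spinner.setDaemon(true);
        spinner.start();
        Thread.State[] states = sample(spinner, 200);
        running.set(false);
        return states;
    }

    public static void main(String[] args) {
        Thread.State[] states = demo();
        System.out.println(states[0] + " -> " + states[1]);
    }
}
```

A thread that is RUNNABLE in both samples while producing no output is a strong hint that it is spinning rather than waiting.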

For example, I dumped twice at different times and the main thread was RUNNABLE both times, which means that, from the program's point of view, the main thread is not blocked.

But judging by the console output, it does look blocked.

A classic, friends. Isn't this a classic picture?

Isn't this simply an infinite loop inside the thread:

System.out.println("begin scan, i still alive");
while (true) {}
System.out.println("finish scan, i still alive");

Let's verify.

What we can observe from the dump file is that the main thread is executing this method:

at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.tryAdvance(LinkedBlockingQueue.java:950)

Remember the hint I planted earlier?

Here is what I was getting at: the stream itself is a sideshow, and the real key is the tryAdvance method.

Take a look at JDK 8's tryAdvance method and, sure enough, there is a while loop in it:

Judging from the loop, if current != null stays true and e != null stays false, the loop can never exit.

But judging from the logic in the while loop body, the current node inside will change:

current = current.next;

Come, let me piece it together from what we know so far.

  • The underlying data structure of LinkedBlockingQueue is a linked list.
  • The tryAdvance method spins forever, which means the loop condition current != null is always true and e != null is always false.
  • But the loop body does move on to the next node: current = current.next.

To sum up, the linked list must contain a node that looks like this:

Only such a node satisfies both conditions:

  • current.item == null
  • current.next != null (it points back to the node itself)
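The two conditions above can be checked with a tiny model. This is a sketch of mine, not JDK code: Node is a hypothetical stand-in for the queue's internal node class, and the loop has the same shape as the while loop described above, except for a step budget (maxSteps) that I added so the demo can terminate.

```java
class SelfLinkedNodeDemo {
    // Hypothetical stand-in for the queue's internal node class.
    static final class Node {
        Integer item;
        Node next;
        Node(Integer item) { this.item = item; }
    }

    // Same shape as the loop described above, but gives up after maxSteps.
    // Returns the number of steps taken.
    static int walk(Node current, int maxSteps) {
        int steps = 0;
        Integer e = null;
        while (current != null && steps < maxSteps) {
            e = current.item;
            current = current.next;
            steps++;
            if (e != null) break;
        }
        return steps;
    }

    public static void main(String[] args) {
        Node lonely = new Node(null);
        lonely.next = lonely;                    // item == null, next points to itself
        System.out.println(walk(lonely, 1_000)); // uses up the whole budget: 1000

        Node normal = new Node(42);
        System.out.println(walk(normal, 1_000)); // stops after one step: 1
    }
}
```

Without the budget, the first call would never return, which is exactly the symptom we saw in the dump.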

So when will such a node appear?

It is a node that has been removed from the linked list, so it must be created by one of the methods that remove nodes.

Looking at our demo code, the code related to removal is this line:

queue.remove();

As mentioned earlier, this remove method removes the head node. It has the same effect as poll and, in fact, calls poll directly in its source code:

So we mainly look at the source code of the poll method:

java.util.concurrent.LinkedBlockingQueue#poll()

The two places labeled ① acquire and release the lock, so this method is thread-safe.

The focus is the place labeled ②, the dequeue method, which removes the head node:

java.util.concurrent.LinkedBlockingQueue#dequeue

How does it remove the head node?

It's the part I framed: the old head is made to point to itself, becomes a lonely node, and that's it.

h.next=h

This is the picture I drew earlier:

So what kind of chemistry happens between the dequeue method and the while loop in tryAdvance?

This is hard to describe in words, so I'll draw some pictures below to make it easier to understand.

Illustrated walkthrough

Now that I understand the mechanism behind this bug, I simplified the example code to make debugging easier. The core logic is the same, still just a few lines, mainly there to trigger the tryAdvance method:

First, according to the code, after the elements are added the queue looks like this:

Drawn as a diagram:

Then we start the traversal, which triggers the tryAdvance method:

In the screenshot above I deliberately included one extra method.

If you look one step up the call stack, tryAdvance is called from forEachWithCancel. In the source it is simply a loop that ends when tryAdvance returns false, meaning the traversal is over.

I also framed the places where the lock is taken and released. The tryAdvance method is thread-safe because at that point both the put lock and the take lock are held.

In other words, once a thread has locked successfully inside tryAdvance, other threads that want to operate on the queue cannot acquire the lock and must wait until that thread finishes and releases it.

But the lock does not cover the entire traversal. It is held only for the duration of each tryAdvance call.

And each tryAdvance call hands back only one element of the linked list.

The groundwork is almost done. Next, I'll walk you through the core source code of the tryAdvance method step by step, that is, this part of the code:

The first time it is triggered, current is null, so it gets initialized:

current = q.head.next;

At this point current is node 1:

Then the while loop runs. The condition current != null holds, so the loop body is entered.

The loop body executes two lines of code.

The first line takes the value out of the current node:

e = current.item;

In my demo, e = 1.

The second line advances current to the next node, so that the next tryAdvance call can pick up directly from there:

current = current.next;

Then, because e != null, break ends the loop:

After the first tryAdvance call, current points to the node at this position:

Friends, here comes the interesting part.

Suppose the second tryAdvance call has been triggered and is executing somewhere in the framed part, that is, it has not yet acquired the lock or cannot acquire it:

At that moment another thread comes along, calling the remove() method and removing head nodes one after another.

After remove() has been executed three times, the linked list looks like this:

Now I merge the two diagrams, and it's time to witness the miracle:

Once the third remove() has finished, tryAdvance finally grabs the lock and starts executing. From our God's-eye view, this is the scene:

I can verify this in the Debug view:

You can see that the next node of current is still current itself: both are the object LinkedBlockingQueue$Node@701, and neither is null.

So that is where the endless loop comes from.
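The whole interleaving can be replayed in a single thread with a toy model. Everything in this sketch is my own simplification, not JDK code: MiniQueue, its fields, and the step budget in advance are hypothetical, there are no locks, and the "other thread" is simply three dequeue calls made between two traversal steps. Only the shape of dequeue and of the JDK 8 style loop follows the walkthrough above.

```java
class MiniQueue {
    static final class Node {
        Integer item;
        Node next;
        Node(Integer item) { this.item = item; }
    }

    Node head = new Node(null); // dummy head node
    Node last = head;
    Node current;               // traversal cursor, kept between steps

    void enqueue(int x) {
        last = last.next = new Node(x);
    }

    // Same shape as the dequeue method discussed above.
    Integer dequeue() {
        Node h = head;
        Node first = h.next;
        h.next = h;             // the old head now points to itself
        head = first;
        Integer x = first.item;
        first.item = null;
        return x;
    }

    // JDK 8 style step, with a step budget so that the demo can terminate.
    // Returns the element found, or null if none was found within the budget.
    Integer advance(int maxSteps) {
        if (current == null) current = head.next;
        Integer e = null;
        for (int steps = 0; current != null && steps < maxSteps; steps++) {
            e = current.item;
            current = current.next;
            if (e != null) break;
        }
        return e;
    }

    public static void main(String[] args) {
        MiniQueue q = new MiniQueue();
        for (int i = 1; i <= 4; i++) q.enqueue(i);
        System.out.println(q.advance(1_000));            // first step finds 1
        q.dequeue(); q.dequeue(); q.dequeue();           // "another thread" removes 1, 2, 3
        System.out.println(q.advance(1_000));            // budget exhausted: null
        System.out.println(q.current.next == q.current); // true: cursor is self-linked
    }
}
```

With only two dequeue calls the cursor would still sit on a live node and the second step would find 3. It takes the third removal to self-link the node the cursor is parked on.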

With the analysis done, think back over the process. The problem wasn't as difficult as you imagined, was it?

Trust me: as long as you have code that reproduces a bug reliably, any bug can be debugged.

While debugging, another question occurred to me: what if I call the other remove method, the one that removes a specified element?

Will the same problem occur?

I didn't know either, but it's easy to find out with an experiment.

Again, put a breakpoint in the tryAdvance method. After tryAdvance is triggered for the second time, open Evaluate with Alt+F8 and execute queue.remove(1), queue.remove(2) and queue.remove(3) in turn:

Then look at the current element. It does not point to itself:

Why?

There are no secrets once you read the source code.

The answer is written in the unlink method:

The parameter p is the node to be removed, and trail is the node right before it.

In the source, I only see trail.next = p.next, which makes the list skip over the node being removed.

But I don't see anything like the h.next = h in the dequeue method above, that is, nothing that points the removed node's next at itself.

Why?

The author spells it out in the comment:

p.next is not changed, to allow iterators that are traversing p to maintain their weak-consistency guarantee.

In plain words: this node must not point to itself. If an iterator happened to be sitting on this node when it became self-linked, the traversal would be ruined.

So the remove method with a parameter takes iterators into account, while the no-argument remove method really was not thought through.
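The debugger experiment can also be written as a small program against the real LinkedBlockingQueue. This is a sketch of mine (the class name RemoveObjectDemo is made up). It drives the spliterator by hand, removes 1, 2 and 3 with remove(Object) after the first step, and then lets the traversal continue. Because unlink leaves p.next alone, I expect the cursor to walk on to element 4 instead of getting stuck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.LinkedBlockingQueue;

class RemoveObjectDemo {
    static List<Integer> traverseWithRemovals() {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 1; i <= 4; i++) {
            queue.offer(i);
        }
        List<Integer> seen = new ArrayList<>();
        Spliterator<Integer> cursor = queue.spliterator();

        cursor.tryAdvance(seen::add);     // first step: sees 1

        // remove(Object) unlinks interior nodes but keeps their next pointers.
        queue.remove(Integer.valueOf(1));
        queue.remove(Integer.valueOf(2));
        queue.remove(Integer.valueOf(3));

        while (cursor.tryAdvance(seen::add)) {
            // keep stepping until the traversal reports it is finished
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(traverseWithRemovals());
    }
}
```

Swapping the three remove(Object) calls for the no-argument remove() turns this into the scenario from the walkthrough, so on JDK 8 that variant may never return.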

How to fix it?

I searched the JDK bug database. This problem was in fact reported there in 2016:

https://bugs.openjdk.org/browse/JDK-8171051

The fix landed in JDK 9.

I have the JDK 15 source locally, so let me compare it with the JDK 8 source:

The main change is inside the try block.

The JDK 15 source calls a succ method, and the comment on that method shows it was written to handle exactly this situation:

For example, back to this scenario:

Let's take a closer look at how the succ method handles it:

Node<E> succ(Node<E> p) {
    if (p == (p = p.next))
        p = head.next;
    return p;
}

p is the node that current refers to in the figure above.

First, p.next is still p, because the node points to itself. With me so far?

So p == (p = p.next) boils down to p == p, and the condition is true. Still with me?

Therefore p = head.next is executed. In the figure above, head.next is the node holding element 4, right?

In the end, element 4 is returned. It is the last element, so the traversal ends:

No endless loop. Perfect.
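Here is the same reasoning as a runnable toy. It is a sketch under my own assumptions: SuccDemo and Node are hypothetical, head is passed in as a parameter instead of being a field, and the state is built by hand to match the scenario (a stale, self-linked cursor, and a head node whose next is the node holding 4).

```java
class SuccDemo {
    static final class Node {
        Integer item;
        Node next;
        Node(Integer item) { this.item = item; }
    }

    // Same idea as the succ method quoted above: a self-linked node
    // means "removed", so restart from the node after head.
    static Node succ(Node p, Node head) {
        if (p == (p = p.next))
            p = head.next;
        return p;
    }

    // Steps from the given cursor until an element is found or the list ends.
    static Integer advance(Node current, Node head) {
        Integer e = null;
        Node p = current;
        while (p != null) {
            e = p.item;
            p = succ(p, head);
            if (e != null) break;
        }
        return e;
    }

    static Integer demo() {
        Node node4 = new Node(4);
        Node node3 = new Node(null); // current head: its item was cleared on removal
        node3.next = node4;
        Node stale = new Node(null); // old node 2: removed, so it points to itself
        stale.next = stale;
        return advance(stale, node3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 4
    }
}
```

The stale cursor is detected in one step and the traversal resumes from the live part of the list.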

Going further

Back to the question from the beginning of this article: is LinkedBlockingQueue thread-safe?

The next time this comes up in an interview, you can smile slightly and answer: thanks to its two internal locks, it is thread-safe in general. However, on JDK 8, if one thread traverses it through a stream while other threads call the no-argument remove method, there is a chance of an infinite loop.

Say it with confidence and, under normal circumstances, you'll impress the interviewer.

The fix I gave above is to upgrade the JDK, but as you know, that is a big move. As a rule, if it runs, don't touch it.

So I can think of two alternatives.

The first: don't use a stream. Just iterate with a plain iterator loop. Isn't that good enough?
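A minimal sketch of the first option, assuming a helper of my own invention (IteratorScan.anyMatch is not part of any library). The for-each loop goes through the queue's iterator rather than its spliterator. As far as I can tell from the JDK 8 source, that iterator already restarts from the head when it lands on a self-linked node, but treat that as my reading rather than a guarantee.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

class IteratorScan {
    // Scans the queue with a plain for-each loop (Iterator under the hood)
    // instead of queue.stream().
    static <T> boolean anyMatch(LinkedBlockingQueue<T> queue, Predicate<? super T> condition) {
        for (T item : queue) {
            if (condition.test(item)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        queue.offer(1);
        queue.offer(2);
        queue.offer(3);
        System.out.println(anyMatch(queue, x -> x == 2)); // true
        System.out.println(anyMatch(queue, x -> x > 5));  // false
    }
}
```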

The second option is this:

It works great, absolutely no problem.

Never mind your internal ReentrantLock: I'll escalate the locking directly and wrap the whole thing in synchronized from the outside.
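One way the external synchronized wrapping could look is sketched below. This is my guess at the shape of the idea, not a fixed recipe: GuardedQueue and its method names are hypothetical. The important assumption is that every access, both the scan and the offer/poll calls, goes through the same monitor. Synchronizing only the scanning side would not stop other threads from removing nodes in the middle of a traversal.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

class GuardedQueue<T> {
    private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>(1000);

    // All three methods lock the same monitor (this), so a scan can never
    // interleave with an offer or a poll.
    synchronized boolean offer(T item) {
        return queue.offer(item);
    }

    synchronized T poll() {
        return queue.poll();
    }

    synchronized boolean anyMatch(Predicate<? super T> condition) {
        return queue.stream().anyMatch(condition);
    }

    public static void main(String[] args) {
        GuardedQueue<String> guarded = new GuardedQueue<>();
        guarded.offer("a");
        System.out.println(guarded.anyMatch("a"::equals)); // true
        System.out.println(guarded.poll());                // a
    }
}
```

The price is that the queue's own fine-grained locking no longer buys anything, because every operation is serialized behind one lock.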

Come on, show me thread-unsafe behavior now if you can.

Now, let me ask you another question: Is ConcurrentHashMap thread-safe?

I wrote about this before: on JDK 8 this thing can also spin forever. The article was titled "Shock! ConcurrentHashMap also has an infinite loop. What about the 'Easter egg' the author left behind?"

I asked the same question at the end of that article.

Here is my answer from back then once more:

Yes, ConcurrentHashMap itself must be thread safe. However, if you use it incorrectly, it may still be thread unsafe.

Let's take a look at the source code in Spring:

org.springframework.core.SimpleAliasRegistry

In this class, aliasMap is of type ConcurrentHashMap:

The registerAlias and getAliases methods both operate on aliasMap, and both wrap those operations in a synchronized block on aliasMap.

Why lock when operating on a ConcurrentHashMap?

It depends on the scenario. This is an alias registry, and the lock here is presumably meant to keep multiple threads from operating on the ConcurrentHashMap at the same time.

Although ConcurrentHashMap itself is thread-safe, one thread doing a put while another does a get is not acceptable in this code's scenario.
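To illustrate why a lock can still be needed, here is a toy registry. It is emphatically not Spring's code: TinyAliasRegistry, its rules, and its method bodies are all invented for this sketch. The point is only that registerAlias performs several map calls that must be seen as one step (check that the alias is free, check that it would not create a cycle, then put), and ConcurrentHashMap alone cannot make a sequence of calls atomic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TinyAliasRegistry {
    // alias -> name
    private final Map<String, String> aliasMap = new ConcurrentHashMap<>();

    // Returns false if the alias is already taken or would create a cycle.
    boolean registerAlias(String name, String alias) {
        if (alias.equals(name)) {
            return false;
        }
        synchronized (aliasMap) {
            // The checks and the put must not interleave with another registration.
            if (aliasMap.containsKey(alias)) {
                return false;
            }
            if (alias.equals(aliasMap.get(name))) {
                return false; // name is already an alias for alias: a cycle
            }
            aliasMap.put(alias, name);
            return true;
        }
    }

    String resolve(String alias) {
        synchronized (aliasMap) {
            return aliasMap.get(alias);
        }
    }

    public static void main(String[] args) {
        TinyAliasRegistry registry = new TinyAliasRegistry();
        System.out.println(registry.registerAlias("dataSource", "ds")); // true
        System.out.println(registry.registerAlias("ds", "dataSource")); // false: cycle
        System.out.println(registry.resolve("ds"));                     // dataSource
    }
}
```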

Specific situations require specific analysis.

If that is hard to grasp, here is an example with Redis.

Redis's get and set commands are each thread-safe. But if you get first and then set, you will still have problems under multi-threading.

That is because the two operations together are not atomic. This is why incr exists.

I give this example to show that thread safety is not absolute; it depends on the scenario. Hand you a thread-safe container, and you can still have thread-safety problems if you use it incorrectly.
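The same get-then-set trap exists inside the JVM, so here is an analogue that needs no Redis client. This is my own illustration, with AtomicInteger standing in for a Redis key: get() and set() are each thread-safe, yet set(get() + 1) can lose updates, while incrementAndGet() plays the role of incr.

```java
import java.util.concurrent.atomic.AtomicInteger;

class GetThenSetDemo {
    // Runs `threads` workers that each increment a shared counter perThread times,
    // either atomically or with a separate get followed by a set.
    static int run(int threads, int perThread, boolean atomic) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    if (atomic) {
                        counter.incrementAndGet();      // one indivisible step
                    } else {
                        counter.set(counter.get() + 1); // two steps: updates can be lost
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("atomic:       " + run(4, 100_000, true));
        System.out.println("get-then-set: " + run(4, 100_000, false));
    }
}
```

The atomic run always ends at threads × perThread. The get-then-set run can end anywhere at or below that, and the exact number varies from run to run.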

Another example: is HashMap necessarily thread-unsafe?

You can't put it that absolutely. It is a thread-unsafe container, but what if my usage is read-only?

In a read-only scenario, it is thread-safe.
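A small sketch of that read-only case, with names of my own choosing (ReadOnlyMapDemo is hypothetical). The map is filled completely before any reader thread starts and is never modified afterwards. Wrapping it with Collections.unmodifiableMap is just a guard against accidental writes through that reference.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class ReadOnlyMapDemo {
    static int concurrentReads(int threads) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        // Fully built before any reader starts; nobody writes to it afterwards.
        Map<String, Integer> readOnly = Collections.unmodifiableMap(map);

        AtomicInteger total = new AtomicInteger();
        Thread[] readers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            readers[t] = new Thread(() -> total.addAndGet(readOnly.get("a") + readOnly.get("b")));
            readers[t].start();
        }
        for (Thread reader : readers) {
            try {
                reader.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(concurrentReads(8)); // 8 readers, each adds 1 + 2
    }
}
```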

In short, look at the scenario; don't discuss the problem out of context.

That's all there is to it.

Finally, the conclusion once more: traversing a LinkedBlockingQueue through a stream is problematic under multi-threading and can end up in an infinite loop.


why技术