How satisfying: stumbling on some unusual bugs in distributed locks!

Hello, I am crooked.

When it comes to distributed locks, everyone usually thinks of Redis.

And once Redis comes up, some readers will bring up Redisson.

So when it comes to Redisson, you have to talk about its "watchdog" mechanism.

So you thought I was going to tell you about "watchdogs" in this article?

No. What I mainly want to report are two bugs I recently dug into, both tied to Redisson's watchdog mechanism:

A bug where the watchdog does not work.
A bug where the watchdog causes a deadlock.

To help you get into the story, let me first briefly lay the groundwork: what exactly is Redisson's watchdog?

Watchdog description

Go to Redisson's wiki. The section on locks mentions a word right at the beginning: watchdog.

https://github.com/redisson/redisson/wiki/8.-distributed-locks-and-synchronizers

A watchdog is just what the name says: a dog that keeps watch.

What is it for?

If you can't answer that, you will be stumped when you run into the following interview question.

Interviewer: When you use Redis as a distributed lock, the lock is released once the specified expiration time is up. But if the task has not finished by then, the task can end up being executed again. How would you deal with this situation?

At this point, 99% of interviewers want the answer to be a watchdog, or a mechanism similar to a watchdog.

If you say: I have encountered this problem, but I just set the expiration time longer.

How long is long enough is a very subjective judgment. Setting a longer time mitigates the problem to some extent, but does not eliminate it.

So please go home and wait to hear from us.

Or you answer: I have encountered this problem. I do not set an expiration time; the program guarantees the release by calling unlock.

Well, relying on the program to call unlock is fine in itself: that is controllable and guaranteed at the program level. But if the server your program runs on goes down just before unlock executes, you can't guarantee it anymore, right?

Isn't the lock then held forever, in other words deadlocked?

So...

To solve both problems, an expiration time that is hard to choose and an accidental deadlock, Redisson runs a scheduled task for each lock, based on a time wheel. This scheduled task is the watchdog.

Until the Redisson instance is shut down, the watchdog keeps extending the lock's validity period through this scheduled task.

Because you don't need to set the expiration time at all, this fundamentally solves the problem of "the expiration time is not easy to set". By default, the watchdog's lock timeout is 30 seconds, and it can be changed through the lockWatchdogTimeout configuration parameter.

If, unfortunately, the node is down and the unlock is not executed, the lock will be automatically released after a maximum of 30s in the default configuration.

Then the interviewer follows up: how is it released automatically?

At this point you only need to lean back tactically: the program is gone, so do you think its scheduled task is still there? With no scheduled task, nothing renews the lock, it expires, and there is no deadlock.

A demo

Having briefly introduced the principle, I'll give you a simple demo to run, which is more intuitive.

I'll skip adding the dependency and starting Redis. Just look at the code.

The sample code is very simple, just a little bit of content, very common usage:

After starting the project and calling the endpoint, inspect the whyLock key in Redis with a client tool. It looks like this:

In my screenshot you can see that the key has an expiration time, right where the arrow points.

Here is an animation as well. Watch the expiration time (TTL) carefully: at one point it jumps from 20s back to 30s:

Note that our code neither sets an expiration time nor updates one.

So what's going on with this thing?

It's very simple, Redisson does these things for us, out of the box, and it's done as a black box.

Next, I will take you through turning the black box into a white box, which will lead us to the two bugs mentioned above.

My test case uses Redisson version 3.16.0. Let's first find the source code that sets the expiration time.

First of all, you can see that although I am calling the lock method without parameters, it is just a thin wrapper. The lock method with parameters is still called, only with several default values, among which leaseTime is -1:

The source code of the lock method with parameters looks like this. Focus mainly on the line of code I framed:

The tryAcquire method is its core logic, so what is this method doing?

Click into it, and this part of the source code looks like this:

The tryLockInnerAsync method executes a Lua script in Redis to acquire the lock.

Since it acquires the lock, the expiration time must be set here, and that is the leaseTime here:

The leaseTime here is initialized in the constructor. In my demo, the default value from the configuration is used, which is 30s:

So, why is there no action to set the expiration time in our code, but the corresponding key has an expiration time?

The source code here answers this question.

In addition, this time is obtained from the configuration, so it must be customizable, not necessarily 30s.

Another thing to note is that at this point, we have two different leaseTimes.

They are as follows:

The argument leaseTime of the tryAcquireOnceAsync method is -1 in our example.
The input parameter leaseTime of the tryLockInnerAsync method is the default value of 30 * 1000 in our example.

Once the lock has been acquired, it is the watchdog's turn to work:

As I said earlier, the leaseTime here is -1, so what runs is the scheduleExpirationRenewal call in the else branch.

And this code is the code that starts the watchdog.

In other words, if the leaseTime here is not -1, then the watchdog will not start.

So how to make leaseTime not -1?

Specify the lock time yourself:

In other words, if the expiration time is specified when locking, Redisson will not start the watchdog for you.
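To make this branch easier to remember, here is a minimal sketch of the decision in plain Java. It is a simplified model, not Redisson's code: the class and method names are made up for illustration, and the 30s default is hard-coded as an assumption instead of being read from the configuration.

```java
import java.util.concurrent.TimeUnit;

// Simplified model of the leaseTime branch; not Redisson's actual code.
class LeaseTimeDecision {
    // Assumption: the default watchdog timeout of 30 seconds, hard-coded here.
    public static final long WATCHDOG_TIMEOUT_MS = 30_000L;

    /** TTL in milliseconds that is written to Redis when the lock is acquired. */
    public static long ttlMillis(long leaseTime, TimeUnit unit) {
        return leaseTime != -1 ? unit.toMillis(leaseTime) : WATCHDOG_TIMEOUT_MS;
    }

    /** The watchdog is scheduled only when the caller passed no lease time (-1). */
    public static boolean startsWatchdog(long leaseTime) {
        return leaseTime == -1;
    }

    public static void main(String[] args) {
        // lock(): no lease time given
        System.out.println(ttlMillis(-1, null) + " " + startsWatchdog(-1));
        // lock(10, TimeUnit.SECONDS): explicit lease time
        System.out.println(ttlMillis(10, TimeUnit.SECONDS) + " " + startsWatchdog(10));
    }
}
```

The point of the sketch is that the two outcomes are mutually exclusive: either you choose the TTL and get no renewal, or you leave it at -1 and get the default TTL plus the watchdog.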

This is something that countless people who are hazy about the watchdog mechanism get wrong. I once argued my case confidently in a group chat, only to be refuted by someone quoting the source code.

Yes, I was the one who thought the watchdog would keep working even when an expiration time was specified.

It hurts to be slapped in the face. I hope you don't follow in my footsteps.

Let's take a look at the code of scheduleExpirationRenewal:

Inside, it wraps the current thread in an object and then keeps that object in a MAP.

This MAP is very important. I will put it here first, and I will talk about it later:

You just need to remember that the key of this MAP is the entry name of the lock (the getEntryName() value that shows up in the remove call later), and the value is the ExpirationEntry object, which records the threads holding the lock and how many times each has locked it.

Then, let's look at the scheduleExpirationRenewal method. Suppose that after the MAP's putIfAbsent method is called, the returned oldEntry is null.

This means the lock is being taken for the first time, so the renewExpiration method is triggered, which is the core logic of the watchdog.

In the scheduleExpirationRenewal method, regardless of whether the oldEntry mentioned earlier is null, the addThreadId method is called:

From the source code you can see that all it does is maintain the number of times the current thread has taken the lock.

This bookkeeping is easy to understand: to support reentrant locks, the number of reentries has to be recorded.

Once locked, the number of times is increased by one. Once unlocked, the number of times is reduced by one.
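The counting can be sketched in a few lines. This is a simplified stand-in for the bookkeeping described above, not a copy of ExpirationEntry: only addThreadId is a name taken from the article, while the class name and the other helper methods are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of per-thread reentry counting.
class ReentryCounter {
    // thread id -> how many times that thread currently holds the lock
    private final Map<Long, Integer> threadIds = new LinkedHashMap<>();

    /** Called on every successful lock: the count goes up by one. */
    public synchronized void addThreadId(long threadId) {
        threadIds.merge(threadId, 1, Integer::sum);
    }

    /** Called on unlock: the count goes down by one and the thread is dropped at zero. */
    public synchronized void removeThreadId(long threadId) {
        Integer counter = threadIds.get(threadId);
        if (counter == null) {
            return;
        }
        if (counter <= 1) {
            threadIds.remove(threadId);
        } else {
            threadIds.put(threadId, counter - 1);
        }
    }

    public synchronized int holdCount(long threadId) {
        return threadIds.getOrDefault(threadId, 0);
    }

    public synchronized boolean hasNoThreads() {
        return threadIds.isEmpty();
    }

    public static void main(String[] args) {
        ReentryCounter counter = new ReentryCounter();
        counter.addThreadId(1L);
        counter.addThreadId(1L);      // same thread locks again
        counter.removeThreadId(1L);   // one unlock is not enough to clear it
        System.out.println(counter.holdCount(1L) + " " + counter.hasNoThreads());
    }
}
```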

Then look at the renewExpiration method, this is the true face of the watchdog:

First of all, this piece of logic is mainly a scheduled task based on a time wheel.

The place labeled ④ is the time condition that this timed task triggers: internalLockLeaseTime / 3.

As I said earlier, internalLockLeaseTime is 30*1000 by default, so by default a renewal task runs every 10 seconds. You can also see this in the animation I showed earlier: the TTL first counts down from 30 to 20, and then jumps from 20 back to 30 in one step.
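The arithmetic behind that animation is worth writing down once. The sketch below is an idealized model with hypothetical names: it assumes the lock stays held, every renewal succeeds, and a renewal takes no time at all.

```java
// Idealized model of the TTL seen in Redis while the watchdog keeps renewing.
class WatchdogTtl {
    /** The renewal task fires every internalLockLeaseTime / 3 (integer division). */
    public static long renewalPeriodMillis(long leaseMillis) {
        return leaseMillis / 3;
    }

    /**
     * Remaining TTL after elapsedMillis of holding the lock, assuming each
     * renewal instantly resets the TTL to the full lease time.
     */
    public static long remainingTtlMillis(long leaseMillis, long elapsedMillis) {
        long period = renewalPeriodMillis(leaseMillis);
        return leaseMillis - (elapsedMillis % period);
    }

    public static void main(String[] args) {
        System.out.println(renewalPeriodMillis(30_000));        // default configuration
        System.out.println(remainingTtlMillis(30_000, 9_999));  // just before a renewal
        System.out.println(remainingTtlMillis(30_000, 10_000)); // right after a renewal
    }
}
```

With the default 30s the TTL therefore never drops much below 20s while the holder is alive, which matches what the animation shows.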

The places labeled ① and ② are doing the same thing: checking whether the current thread is still valid.

How is that judged?

By checking whether there is still an ExpirationEntry object for it in the MAP mentioned above.

If there isn't, it has been removed.

That raises a question you should naturally ask when reading the source code: when is the remove method of this MAP called?

You will soon see the corresponding remove in the lock-release code. I mention it here first and will come back to it later.

The core logic is the place labeled ③. I'll take you through it, focusing mainly on the parts I've underlined.

If execution reaches ③, it means the business logic of the current thread has not finished yet, and it needs to keep holding the lock.

First look at the renewExpirationAsync method. We can also see from the method name that this is resetting the expiration time:

The above source code is mainly a lua script, and the logic of this script is very simple. It is to determine whether the lock still exists, and whether the thread holding the lock is the current thread. If it is the current thread, reset the expiration time of the lock and return 1, that is, return true.

If the lock does not exist, or the current thread is not holding the lock, then return 0, that is, return false.

Now back to the place labeled ③, which first checks whether the renewExpirationAsync call ended with an exception.

So what could cause an exception?

The command has to be executed in Redis, so if something is wrong with Redis, for example it hangs, goes down, or the connection pool cannot obtain a connection, the command may fail to execute, resulting in an exception.

If an exception occurs, execute the following line of code:

EXPIRATION_RENEWAL_MAP.remove(getEntryName());

Then it returns, and this scheduled task is over.

Remember this remove operation, it is very important. Just get familiar with it for now; I will come back to it later.

If renewExpirationAsync executes without an exception, the return value is either true or false.

If it is true, the renewal succeeded, so the renewExpiration method is called again to wait for the time wheel to trigger the next run.

If it is false, the lock no longer exists or has changed hands. Then there is nothing left for the current thread to do, and the task simply ends silently.
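The renewal step and its three outcomes can be condensed into a small in-memory model. This is only a sketch of the behavior described above for this version: there is no Redis and no Lua here, and the class, the fields, and the Next enum are hypothetical names, not Redisson's.

```java
// In-memory model of one watchdog renewal step.
class RenewStepModel {
    /** What the watchdog task does after one renewal attempt. */
    public enum Next { REMOVE_ENTRY_AND_STOP, RESCHEDULE, STOP }

    // Stand-in for the lock key in Redis.
    public String holderId;   // null means the key does not exist
    public long ttlMillis;

    /** Resets the TTL only if callerId still holds the lock, like the script described above. */
    public boolean renew(String callerId, long leaseMillis) {
        if (holderId != null && holderId.equals(callerId)) {
            ttlMillis = leaseMillis;
            return true;
        }
        return false;
    }

    /** The decision taken once the renewal call completes. */
    public static Next afterRenew(Throwable error, Boolean renewed) {
        if (error != null) {
            return Next.REMOVE_ENTRY_AND_STOP;   // the entry is removed from the MAP
        }
        return Boolean.TRUE.equals(renewed) ? Next.RESCHEDULE : Next.STOP;
    }

    public static void main(String[] args) {
        RenewStepModel redis = new RenewStepModel();
        redis.holderId = "client-a:1";
        redis.ttlMillis = 20_000;
        boolean renewed = redis.renew("client-a:1", 30_000);
        System.out.println(renewed + " " + redis.ttlMillis + " " + afterRenew(null, renewed));
    }
}
```

Notice that in this model the false case stops without touching the MAP. Keep that in mind for the bugs that follow.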

Those are the basic principles of locking and the watchdog.

Now let's briefly look at what happens in the unlock method.

First comes the unlockInnerAsync method, which is the Lua script logic for releasing the lock:

This method returns a Boolean, and there are three cases.

Returns null: the lock does not exist, or it exists but the value does not match, meaning the lock is held by another thread.
Returns true: the lock exists, the thread matches, and the reentry count has dropped to zero, so the lock can be released.
Returns false: the lock exists and the thread matches, but the reentry count is not yet zero, so the lock cannot be released.
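The three return values above are easier to keep apart with a tiny in-memory model. Again, this is a sketch under assumptions and not the Lua script itself: the class and method names are invented, and lock() here only counts reentries without enforcing mutual exclusion between holders.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the three-valued unlock result.
class UnlockModel {
    // holder id -> reentry count; no entry means this holder does not own the lock
    private final Map<String, Integer> holders = new HashMap<>();

    /** Simplified lock: just counts reentries for the given holder. */
    public void lock(String holderId) {
        holders.merge(holderId, 1, Integer::sum);
    }

    /**
     * null  : the lock is gone or is not held by this holder
     * false : still held, the reentry count was decremented but is above zero
     * true  : the reentry count reached zero and the lock was released
     */
    public Boolean unlock(String holderId) {
        Integer count = holders.get(holderId);
        if (count == null) {
            return null;
        }
        if (count > 1) {
            holders.put(holderId, count - 1);
            return false;
        }
        holders.remove(holderId);
        return true;
    }

    public static void main(String[] args) {
        UnlockModel model = new UnlockModel();
        model.lock("client-a:1");
        model.lock("client-a:1");
        System.out.println(model.unlock("client-a:1"));
        System.out.println(model.unlock("client-a:1"));
        System.out.println(model.unlock("client-a:1"));
    }
}
```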

But look at how the return value of unlockInnerAsync is handled:

For the return value, opStatus, only the null case is checked: an exception is thrown to say that the lock is not held by the current thread, and that's it.

It doesn't care whether the value is true or false.

Then look at the method I framed, cancelExpirationRenewal(threadId):

There is a remove method here.

All that foreshadowing was really to lead up to this cancelExpirationRenewal method.

To see how locking and unlocking operate on the MAP, take a look at the picture below:

The place labeled ① is locking, where the put method of the MAP is called.

The place labeled ② is releasing the lock, where the remove method of the MAP is called.

Remember the analysis above and when this MAP is operated on. The bugs discussed below are caused by operating on this MAP incorrectly.

Bug: the watchdog does not work

Earlier I picked a released version for reading the source code, mainly so that everyone could run the demo. After all, adding a Maven dependency costs much less effort.

But if you really want to study the source code, you still have to pull the repository down first and chew through it slowly.

I have talked many times in previous articles about the benefits of pulling a project's source code directly. For me, there are three:

Guaranteed to be the latest source code
You can see the commit record of the code
Official test cases can be found

Well, without further ado, let's first look at the first bug mentioned at the beginning: the problem that the watchdog does not take effect.

It comes from this issue:

https://github.com/redisson/redisson/issues/2515

In this issue, the reporter gives a piece of code and describes the expected result: if a connection problem between the program and Redis occurs while the watchdog is renewing and the lock expires as a result, then when the same lock is acquired again, the watchdog should simply start working again.

But what actually happens is that after the previous lock expires because of the connection problem, the program does acquire a new lock, yet the new lock expires automatically after 30s. In other words, the watchdog does not work.

The PR corresponding to this issue is:

https://github.com/redisson/redisson/pull/2518

In this pr, a test case is provided, which we can find directly in the source code:

org.redisson.RedissonLockExpirationRenewalTest

This is the benefit of pulling source code.

In this test case, the core logic is as follows:

The first thing to note is that in this test case, the watchdog's lockWatchdogTimeout parameter is changed to 1000 ms:

That is to say, the watchdog's scheduled task is triggered every 333ms.

Then look at the place labeled ①. A lock is acquired first, and then Redis is restarted. The restart makes the lock disappear: either it had not been persisted yet, or it was persisted but the restart took longer than 1s, so the lock expired.

Therefore, when the unlock method is called, an IllegalMonitorStateException is certain to be thrown, indicating that the lock is gone.

So far everything is normal and understandable.

But look at the place marked ②.

After locking, the business logic runs for 2s, which will certainly trigger the watchdog's renewal.

Before this bug was fixed, calling the unlock method here would also throw an IllegalMonitorStateException, indicating that the lock was lost:

Never mind why for now; this is clearly a bug.

Because by normal logic, this lock should keep being renewed and should not be released until unlock is called.

Well, you have seen the demo of the bug, and it can be reproduced. Can you guess the cause?

In fact, I already gave you the answer earlier. It depends on whether you caught the foreshadowing.

The premise is that both lock calls come from the same thread. And didn't I specifically emphasize the oldEntry thing earlier:

The fact that this bug can appear means that oldEntry exists in the MAP during the second lock, so the code wrongly assumes a watchdog is already running and goes straight into the reentrant-lock logic.

Why does oldEntry exist in the MAP during the second lock?

Because when unlocking for the first time, the ExpirationEntry object of the current thread is not removed from the MAP.

Why wasn't it removed?

Take a look at the Redisson version this person tested with:

In this version, the logic for releasing the lock is as follows:

Wait, isn't the cancelExpirationRenewal(threadId) logic right there?

Yes, it is.

But look at the circumstances under which this logic is executed.

The first is the case of an exception, but in our test case, Redis is normal when calling unlock twice, and no exception will be thrown.

The second is when opStatus is not null.

That is to say, when opStatus is null, that is, when the current lock is gone, or the owner is changed, the logic of cancelExpirationRenewal(threadId) will not be triggered.

Coincidentally, in our scenario, when the unlock method is called for the first time, the lock is lost due to the restart of Redis, so the returned opStatus here is null, and the logic of cancelExpirationRenewal method is not triggered.

As a result, when I call lock in the current thread for the second time and execution reaches the code below, oldEntry is not null:

Therefore the reentrancy path is taken, and the watchdog is not started.

Since the watchdog is not started, the lock is automatically released after 1000ms, and other threads can then grab and use it.

Then the business logic of the current thread is executed, and the second call to unlock will of course throw an exception.

This is the root cause of the bug.

Once the problem is found, it can be solved with a single line of code:

Whenever the unlock method is called, call the cancelExpirationRenewal(threadId) method first, no matter what. That's all.

This is a bug caused by not removing the object corresponding to the current thread from the MAP in time.
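The whole bug fits into a few lines once Redis is taken out of the picture. The model below is a sketch under assumptions, not Redisson's code: the class name, the alwaysCancelOnUnlock flag, and the lockLostInRedis parameter are all made up, and reentry counting is left out so that only the MAP handling remains.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Model of the renewal MAP before and after the fix.
class RenewalMapModel {
    private final ConcurrentMap<String, Object> renewalMap = new ConcurrentHashMap<>();
    private final String entryName;
    private final boolean alwaysCancelOnUnlock;   // false = old behavior, true = fixed

    public RenewalMapModel(String entryName, boolean alwaysCancelOnUnlock) {
        this.entryName = entryName;
        this.alwaysCancelOnUnlock = alwaysCancelOnUnlock;
    }

    /** Returns true if this lock() call starts a watchdog, i.e. oldEntry was null. */
    public boolean lock() {
        Object oldEntry = renewalMap.putIfAbsent(entryName, new Object());
        return oldEntry == null;
    }

    /** lockLostInRedis models the "opStatus is null" case: the lock is already gone. */
    public void unlock(boolean lockLostInRedis) {
        if (alwaysCancelOnUnlock || !lockLostInRedis) {
            renewalMap.remove(entryName);   // stands in for cancelExpirationRenewal
        }
        // The real unlock then throws IllegalMonitorStateException when the lock was lost.
    }

    public static void main(String[] args) {
        RenewalMapModel buggy = new RenewalMapModel("id:whyLock", false);
        buggy.lock();
        buggy.unlock(true);                 // lock lost, stale entry stays behind
        System.out.println("buggy starts watchdog again: " + buggy.lock());

        RenewalMapModel fixed = new RenewalMapModel("id:whyLock", true);
        fixed.lock();
        fixed.unlock(true);
        System.out.println("fixed starts watchdog again: " + fixed.lock());
    }
}
```

In the old behavior the second lock() finds the stale entry and skips the watchdog; with the unconditional cancel the entry is gone and the watchdog starts again.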

Take a look at another issue:

https://github.com/redisson/redisson/issues/3714

The complaint is that if my lock is lost for some reason, the watchdog should keep working after I acquire the lock again in the program.

Sounds like the same problem, right?

Yes, it is the same problem.

But the code submitted for this one looks like this:

In the watchdog, if a renewal fails because the lock no longer exists, that is, res is false, the cancelExpirationRenewal method is now also called proactively. This clears the way for whichever thread locks next, so that it is not prevented from starting its own watchdog.

This gives a double guarantee: cancelExpirationRenewal is triggered both in unlock and in the watchdog, and the two paths do not conflict.

One more reminder: the code as finally committed looks like this, and the arguments of the two calls differ:

Why was threadId changed to null?

I'll leave that as an exercise. Think about it from the perspective of reentrancy. It is simple enough to work out yourself.

Bug: the watchdog causes a deadlock

This bug is very simple to explain.

Check out this issue:

https://github.com/redisson/redisson/issues/1966

The steps to reproduce are clearly written here.

The test program looks like this: a scheduled task fires once every 1s, but each run takes 2s, which leads to reentrant locking:

Here he mentions a command:

CLIENT PAUSE 5000

Its main purpose is to simulate Redis timing out on requests: Redis plays dead for 5s, so the requests sent by the program time out.

In this way, the reentrancy logic gets confused.

Take a look at one of the key pieces of code corresponding to this bug fix:

The cancelExpirationRenewal logic is executed regardless of whether opStatus returns false or true.

The solution to the problem lies in the operation of the MAP.

One more thing.

It is also in this commit that the logic for tracking reentrancy was encapsulated into the ExpirationEntry object, which is much more elegant than the previous approach. If you are interested, you can pull down the source code, compare the two, and get a feel for what elegant refactoring looks like:

Thread interruption

While writing this article, I also found an interesting but unsolvable bug in Redisson.

Right here:

The first time I saw this piece of code, it was very strange. There must be a story behind such a strange way of writing.

The corresponding story behind this is hidden in this issue:

https://github.com/redisson/redisson/issues/2714

In short, when the tryLock method is interrupted, the watchdog keeps renewing the lock, which produces a lock that never expires, that is, a deadlock.

Let's take a look at the corresponding test case:

A child thread is started and runs the tryLock method, and then the main thread calls interrupt on the child thread.

What do you think the child thread should do at this point?

It stands to reason that once the thread is interrupted, the watchdog should stop working too, shouldn't it?

Yes, so code like this appears:

However, if you look closely, these few lines of code do not completely solve the watchdog problem. They only help, with some probability, when the interrupt arrives in the short window after the renewExpiration method is first called and before the scheduled task starts.
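To show why such a check only covers a narrow window, here is a small model in plain Java. It rests on an assumption: that the fix amounts to checking the thread's interrupt flag right after the renewal is scheduled and cancelling if it is set. All names are hypothetical and nothing here is Redisson's code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Model of "check the interrupt flag once, right after scheduling the renewal".
class InterruptCheckModel {
    private final ConcurrentMap<String, Boolean> renewalMap = new ConcurrentHashMap<>();
    private int scheduledTasks;

    public void scheduleRenewal(String entryName) {
        renewalMap.put(entryName, Boolean.TRUE);
        try {
            scheduledTasks++;   // placeholder for starting the first renewal task
        } finally {
            // The flag is inspected exactly once; a later interrupt goes unnoticed.
            if (Thread.currentThread().isInterrupted()) {
                renewalMap.remove(entryName);
            }
        }
    }

    public boolean isRenewing(String entryName) {
        return renewalMap.containsKey(entryName);
    }

    public static void main(String[] args) {
        InterruptCheckModel model = new InterruptCheckModel();

        Thread.currentThread().interrupt();   // interrupt arrives before the check
        model.scheduleRenewal("id:early");
        Thread.interrupted();                 // clear the flag again
        System.out.println("early interrupt, still renewing: " + model.isRenewing("id:early"));

        model.scheduleRenewal("id:late");
        Thread.currentThread().interrupt();   // interrupt arrives after the check
        Thread.interrupted();
        System.out.println("late interrupt, still renewing: " + model.isRenewing("id:late"));
    }
}
```

An interrupt that lands after the single check leaves the renewal running, which is the part such a check cannot cover.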

Therefore, the sleep time in the test case is only 5ms: