One: background
1. Tell a story
Earlier this month, a friend from my community reached out and said his program had deadlocked. He suspected that something in his own code had put MongoDB into this awkward situation. The screenshot is as follows:
To be honest, after analyzing so many dumps, this is the first time I have run into a genuine deadlock, and it immediately piqued my interest... Let's let WinDbg do the talking.
Two: Windbg analysis
1. Is it really a deadlock?
Since my friend said it was a deadlock, I have to verify that first. You can use the !syncblk command to view the synchronization block table.
0:000> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
97 000000F7B77CA1B8 107 1 000000F7D37A7210 8848 25 000000f7b853d480 System.Object
144 000000F7D39BA2A8 495 1 000000F7DA4CDA70 75e0 63 000000f7b853de48 System.Object
-----------------------------
Total 603
CCW 1
RCW 1
ComClassFactory 0
Free 490
The following information can be obtained from the synchronization block table:

- Thread 25 is holding the 000000f7b853d480 lock object.
- Thread 63 is holding the 000000f7b853de48 lock object.
As we know, a deadlock occurs when two threads each try to acquire the lock resource held by the other, and neither gives in. If that is unclear, here is a diagram:
The picture above shows one form of deadlock. By the way, this situation is often encountered in SQL Server. How does SQL Server deal with it? This is a bit interesting: a monitor thread inside SQL Server runs periodically, and when it detects such a deadlock it kills the lower-priority thread so that the other thread can acquire the lock. The killed thread gets the following exception message:
System.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 112) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async, Int32 timeout)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at Microsoft.ApplicationBlocks.Data.SqlHelper.ExecuteNonQuery(SqlConnection connection, CommandType commandType, String commandText, SqlParameter[] commandParameters)
at Microsoft.ApplicationBlocks.Data.SqlHelper.ExecuteNonQuery(String connectionString, CommandType commandType, String commandText, SqlParameter[] commandParameters)
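As an aside, when your command is the one chosen as the deadlock victim like this, the standard application-side response is exactly what the message says: rerun the transaction. Below is a rough sketch of such a retry loop, not code from the case above; SQL Server reports the deadlock victim as error number 1205, while the helper name and back-off values are made up for illustration.

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;

static class SqlDeadlockRetryDemo
{
    // Reruns the command when SQL Server picks it as the deadlock victim (error number 1205).
    public static void ExecuteWithDeadlockRetry(string connectionString, string sql, int maxRetries = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand(sql, conn))
                {
                    conn.Open();
                    cmd.ExecuteNonQuery();
                    return;
                }
            }
            catch (SqlException ex) when (ex.Number == 1205 && attempt < maxRetries)
            {
                // Chosen as the deadlock victim: back off briefly, then rerun the transaction.
                Thread.Sleep(100 * attempt);
            }
        }
    }
}
```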
Haha, does that look familiar? Now that we have some understanding of deadlocks, let's assume the following:

- Thread 25 wants to acquire the 000000f7b853de48 lock object.
- Thread 63 wants to acquire the 000000f7b853d480 lock object.

If that is the case, a deadlock is inevitable, right? So how do we verify it with WinDbg? Switch to thread 25 and look at its thread stack and stack objects.
0:000> ~25s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffb`9f230c7a c3 ret
0:025> !clrstack
OS Thread Id: 0x8848 (25)
Child SP IP Call Site
000000F782904838 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f782904838] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F782904990 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()
0:025> !dso
OS Thread Id: 0x8848 (25)
RSP/REG Object Name
000000F782904648 000000f7b853de48 System.Object
000000F7829046D8 000000f7b84cb508 MongoDB.Driver.ReadPreference
You can clearly see that ReliableEnter is stuck acquiring the 000000f7b853de48 lock object. Next, switch to thread 63 and take a look.
0:025> ~63s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffb`9f230c7a c3 ret
0:063> !clrstack
OS Thread Id: 0x75e0 (63)
Child SP IP Call Site
000000F787774EE8 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f787774ee8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F787775040 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()
0:063> !dso
OS Thread Id: 0x75e0 (63)
RSP/REG Object Name
000000F787774A38 000000f7b82dc750 MongoDB.Bson.BsonBoolean
000000F787774BA0 000000f7b83a9a10 System.RuntimeType
000000F787774CF8 000000f7b853d480 System.Object
You can clearly see that ReliableEnter is trying to acquire 000000f7b853d480, which confirms that a deadlock has indeed occurred, no doubt about it.
2. Analyzing the cause of the deadlock
To investigate the cause of the deadlock, all we can do is carefully study the thread stack plus the thread's stack objects.
0:063> !clrstack
OS Thread Id: 0x75e0 (63)
Child SP IP Call Site
000000F787774EE8 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f787774ee8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F787775040 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()
...
000000F78777BD10 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BDA0 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BE30 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BEC0 00007ffb1e325d55 MongoDB.Driver.Core.Misc.SemaphoreSlimSignalable.Signal()
000000F78777BF10 00007ffb1e35b98e MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Return(PooledConnection)
000000F78777BF70 00007ffb1e35b728 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ReleaseConnection(PooledConnection)
000000F78777C0A0 00007ffb1e35b4f9 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquiredConnection.Dispose()
000000F78777C0D0 00007ffb1e384164 MongoDB.Driver.Core.Operations.RetryableReadContext.Dispose()
000000F78777C100 00007ffb1e37ee31 MongoDB.Driver.Core.Operations.FindOperation`1+<ExecuteAsync>d__120[[System.__Canon, System.Private.CoreLib]].MoveNext()
...
000000F78777E980 00007ffb1dd807bf System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean) [/_/src/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs @ 1019]
000000F78777E9F0 00007ffb1dd80519 System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) [/_/src/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs @ 858]
000000F78777EA50 00007ffb1dd7f751 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [/_/src/System.Private.CoreLib/src/System/Threading/Overlapped.cs @ 59]
0:063> !dso
OS Thread Id: 0x75e0 (63)
RSP/REG Object Name
000000F787774A38 000000f7b82dc750 MongoDB.Bson.BsonBoolean
000000F787774BA0 000000f7b83a9a10 System.RuntimeType
000000F787774CF8 000000f7b853d480 System.Object
...
000000F78777B658 000000f7be3407c8 System.String OperationCanceled
000000F78777B670 000000f7b82c07e0 System.Resources.RuntimeResourceSet
000000F78777B698 000000f7b82bff38 System.Globalization.CultureInfo
000000F78777B730 000000f7a36525c8 System.Char[]
000000F78777B7E0 000000f7be3407c8 System.String OperationCanceled
...
000000F78777BEF8 000000f7b853de10 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder
000000F78777BF10 000000f79f79e398 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Driver.IAsyncCursor`1[[Czkj.AlibabaTools.Models.mongo.CopyRecords, Czkj.AlibabaTools.Models]], MongoDB.Driver.Core],[MongoDB.Driver.Core.Operations.FindOperation`1+<ExecuteAsync>d__120[[Czkj.AlibabaTools.Models.mongo.CopyRecords, Czkj.AlibabaTools.Models]], MongoDB.Driver.Core]]
000000F78777BF40 000000f7b853de48 System.Object
...
Because this code is full of async/await, the decompiled thread stack is really large. After careful comparison, the code flow is roughly this:

- It starts by processing MongoDB's asynchronous request callback (System.Threading.OverlappedData).
- In MongoDB.Driver.Core.Operations.FindOperation, for some reason an OperationCanceled exception is thrown, which then triggers RetryableReadContext.Dispose().
- ListConnectionHolder.Return() acquires the 000000f7b853de48 lock object.
- SignalOrReset() -> SemaphoreSlimSignalable.Signal() then runs some registered handler logic, and the lock is still held while those handlers fire.
- One of those handlers ends up in the other connection pool's ListConnectionHolder.Acquire() method, hoping to acquire that pool's 000000f7b853d480 lock object.

At the same time, thread 25 is doing the same thing in the opposite direction. Since each thread holds one lock while waiting for the other's, the result is a deadlock.
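To make that pattern concrete, here is a minimal, self-contained C# sketch of the same shape of bug. The Holder class and the wiring below are purely illustrative stand-ins, not the driver's actual code: each holder fires its cancellation-token callbacks while still inside its own lock (playing the role of SemaphoreSlimSignalable.Signal()), and a registered callback then tries to enter the other holder's lock. Two threads doing this in mirror image produce exactly the ABBA deadlock seen on threads 25 and 63.

```csharp
using System;
using System.Threading;

// Purely illustrative stand-in for a connection pool's ListConnectionHolder -- not driver code.
class Holder
{
    private readonly object _gate = new object();
    // Cancelling this token plays the role of SemaphoreSlimSignalable.Signal():
    // registered callbacks run synchronously on the cancelling thread.
    public readonly CancellationTokenSource Signal = new CancellationTokenSource();

    public void Return()
    {
        lock (_gate)            // like ListConnectionHolder.Return() taking its lock
        {
            Signal.Cancel();    // fires the registered callbacks while the lock is still held
        }
    }

    public void Acquire()
    {
        lock (_gate) { /* like ListConnectionHolder.Acquire() */ }
    }
}

class Program
{
    static void Main()
    {
        var poolA = new Holder();
        var poolB = new Holder();

        // Each holder's callback reaches into the *other* holder, as in the dump above.
        poolA.Signal.Token.Register(() => poolB.Acquire());
        poolB.Signal.Token.Register(() => poolA.Acquire());

        var t1 = new Thread(poolA.Return);   // holds A's lock, then wants B's lock
        var t2 = new Thread(poolB.Return);   // holds B's lock, then wants A's lock
        t1.Start();
        t2.Start();
        t1.Join();   // with unlucky timing neither Join returns: a classic ABBA deadlock
        t2.Join();
        Console.WriteLine("finished (only when the race was not hit)");
    }
}
```

As general advice (not necessarily how the driver fixed it), the way out of this shape of bug is to avoid invoking arbitrary callbacks while holding a lock: release the lock first, then signal.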
Three: A few questions that need to be answered
1. Why are there two connection pools?
From the stack objects you may also have noticed that there are two ExclusiveConnectionPool connection pools.
0:000> !dumpheap -type ExclusiveConnectionPool -stat
Statistics:
MT Count TotalSize Class Name
00007ffb1d25eca0 2 464 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool
This is because my friend's MongoDB connection string uses a replica set with two hosts (two IPs).
{
"mongo": {
"Connection": "mongodb://xxx.aliyuncs.com:3717,xxx.aliyuncs.com:3717/admin?replicaSet=mgset-500180544&readPreference=secondaryPreferred&maxPoolSize=1000&wtimeoutMS=10000"
}
}
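As a quick illustration of why that yields two pools: the driver keeps one connection pool per server it talks to, so a connection string listing two hosts produces two ExclusiveConnectionPool instances. Below is a small sketch using the driver's public MongoUrl/MongoClient types; the host names are placeholders, not my friend's real addresses.

```csharp
using System;
using MongoDB.Driver;

class Program
{
    static void Main()
    {
        // Placeholder hosts standing in for the real replica set members in the config above.
        var url = new MongoUrl(
            "mongodb://host1:3717,host2:3717/admin?replicaSet=mgset-xxx&readPreference=secondaryPreferred&maxPoolSize=1000");

        // Every server listed in the connection string is a separate endpoint for the driver...
        foreach (var server in url.Servers)
            Console.WriteLine(server);       // host1:3717 and host2:3717

        // ...and the client keeps one connection pool per endpoint, hence the two
        // ExclusiveConnectionPool instances seen in !dumpheap.
        var client = new MongoClient(url);
    }
}
```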
2. Is it the programmer's fault?
Judging from the stack, it is not the programmer's fault. When MongoDB handles an asynchronous callback, under certain circumstances an OperationCanceled exception is raised, and the driver's subsequent handling of that exception runs into a deadlock bug.
3. What is the current MongoDB SDK version?
The official MongoDB driver in use is 2.13.1.0, released on 2021-08-15; as of this writing, the latest is 2.13.2.0, released in October.
4. Feedback
After working all this out, I passed the information along to my friend. He said he had submitted an issue to the MongoDB community, and a few days later the official answer was that it has been fixed in the latest v2.14.0-beta1:
- https://github.com/mongodb/mongo-csharp-driver/commit/b961b81cb7dc1ffe7262c55a227afad0aab5a994
- https://jira.mongodb.org/browse/CSHARP-3815
In other words, the fix will ship in the v2.14.0 release, and for now all we can do is wait and look forward to it... Haha 😁😁😁
Four: Summary
Overall, this is a deadlock caused by a bug deep inside the MongoDB driver, and the dump analysis took quite a few twists and turns along the way. Even the official, authoritative MongoDB.Driver deserves scrutiny, so don't blindly turn all the doubt on yourself... Finally, I look forward to the upcoming v2.14.0 release.