One: Background

1. The story

An old friend from my community reached out in October and said that their company's application had become unresponsive during an online celebrity's livestream, and that the hang was accompanied by a flood of RabbitMQ timeouts. He wanted to know how to dig out the root cause; I'll spare you the chat screenshot.

Since the program is unresponsive, a large number of its threads must be suspended, either actively or passively. Luckily, my friend "drew a tube of blood" from the process in time, that is, captured a memory dump, so let's head into WinDbg and see what happened.

Two: Windbg analysis

1. What's wrong with the threads

To see all the threads, start with the old familiar !t command.


0:000> !t
ThreadCount:      5221
UnstartedThread:  0
BackgroundThread: 5199
PendingThread:    0
DeadThread:       21
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  20    1     74e4 00000276CB778180  202a020 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 MTA 
  31    2     42cc 00000276CB6CA830    2b220 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 MTA (Finalizer) 
  32    3     2b40 00000276CB85D1B0  102a220 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 MTA (Threadpool Worker) 
   2    6     bccc 00000276CBA5D2F0    20220 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 Ukn 
  33    9     7224 00000276CBA5C0C0  3029220 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 MTA (Threadpool Worker) System.IO.IOException 00000279ccc56cd0
   9   23     29e0 0000027BD86FD180    20220 Preemptive  0000000000000000:0000000000000000 00000276cb77c9d0 -00001 Ukn 
...

From this brief summary, the process currently has 5,000+ threads; all the processes on a machine put together usually don't have that many... That said, the most I have ever seen is 10,000+ threads 😂😂😂, a kind of thread garbage caused by creating Thread objects directly instead of using the thread pool.

You can also see in the thread list that thread 33 has thrown a managed exception. Next, let's look at what the error is with the !wpe 00000279ccc56cd0 command.


0:000> !wpe 00000279ccc56cd0
Address: 00000279ccc56cd0
Exception Type: System.IO.IOException
Message: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
Inner Exception: 00000279ccc56b20 System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
Stack:
SP               IP               Function
000000791b88c970 00007ffd844a1b31 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)
000000791b88ee80 00007ffd849e6f8a System.IO.BufferedStream.ReadByteSlow()
000000791b88eeb0 00007ffd8312950a RabbitMQ.Client.Impl.InboundFrame.ReadFrom(System.IO.Stream, Byte[])
000000791b88ef40 00007ffd849e6d7d RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
000000791b88efa0 00007ffd8312832f RabbitMQ.Client.Framing.Impl.Connection.MainLoop()

HResult: 80131620

From the stack, the program was reading from RabbitMQ over the network when the TCP connection was forcibly closed by the remote host, which lines up with the flood of RabbitMQ timeouts my friend described.

The next step is to walk the various thread stacks and see what these threads are doing right now. Running ~*e !clrstack and studying the output carefully, I found a huge number of threads stuck in the xxx.RabbitMq.RabbitMqConnection.GetInstance method.


        Child SP               IP Call Site
0000008B8A9ED6A8 00007ffdf5246594 [HelperMethodFrame_1OBJ: 0000008b8a9ed6a8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
0000008B8A9ED800 00007ffd84a6a4a9 xxx.RabbitMq.RabbitMqConnection.GetInstance(Microsoft.Extensions.Options.IOptions`1<xxx.RabbitMq.RabbitMqConfig>, Microsoft.Extensions.Logging.ILogger`1<System.Object>)
0000008B8A9ED860 00007ffd84a6a317 xxx.RabbitMq.RabbitMqProducer..ctor(Microsoft.Extensions.Options.IOptionsSnapshot`1<xxx.RabbitMq.RabbitMqConfig>, Microsoft.Extensions.Logging.ILogger`1<xxx.RabbitMq.RabbitMqProducer>)
0000008B8A9ED8A0 00007ffd8334817b DynamicClass.ResolveService(ILEmitResolverBuilderRuntimeContext, Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope)
0000008B8A9ED930 00007ffd83347d76 DynamicClass.ResolveService(ILEmitResolverBuilderRuntimeContext, Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope)
0000008B8A9EDE90 00007ffd844f3cb3 Microsoft.Extensions.DependencyInjection.ActivatorUtilities.GetService(System.IServiceProvider, System.Type, System.Type, Boolean) [/_/src/libraries/Common/src/Extensions/ActivatorUtilities/ActivatorUtilities.cs @ 173]
DynamicClass.lambda_method196(System.Runtime.CompilerServices.Closure, System.IServiceProvider, System.Object[])
0000008B8A9EDF20 00007ffd84a0fc9c Microsoft.AspNetCore.Mvc.Controllers.ControllerFactoryProvider+c__DisplayClass5_0.g__CreateController|0(Microsoft.AspNetCore.Mvc.ControllerContext)
0000008B8A9EDF70 00007ffd8452ce7f Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State ByRef, Scope ByRef, System.Object ByRef, Boolean ByRef) [/_/src/Mvc/Mvc.Core/src/Infrastructure/ControllerActionInvoker.cs @ 285]
0000008B8A9EE030 00007ffd84a0fac8 Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync() [/_/src/Mvc/Mvc.Core/src/Infrastructure/ControllerActionInvoker.cs @ 490]
0000008B8A9EE0B0 00007ffd845346cd Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State ByRef, Scope ByRef, System.Object ByRef, Boolean ByRef) [/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs @ 883]
0000008B8A9EE240 00007ffd84a0f9ad Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeNextExceptionFilterAsync() [/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs @ 1024]
0000008B8A9EE2C0 00007ffd84534272 Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State ByRef, Scope ByRef, System.Object ByRef, Boolean ByRef) [/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs @ 883]
0000008B8A9EE450 00007ffd84a0f850 Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeNextResourceFilter() [/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs @ 976]
...

From the call stack, the trigger is an HTTP request: resolving the controller constructs a RabbitMqProducer, whose constructor calls GetInstance, and the thread then sits in Monitor.ReliableEnter, i.e. waiting on a lock. That piqued my curiosity, so the next step is to export the method's source code from its instruction pointer and decompile it.


public sealed class RabbitMqConnection
{
    // Fields/properties omitted from the export; the names below are taken from the dump output further down.
    private static RabbitMqConnection _uniqueInstance;
    private static readonly object _objLock = new object();
    private readonly ILogger<dynamic> _logger;
    public IConnection Connection { get; private set; }

    public static RabbitMqConnection GetInstance(IOptions<RabbitMqConfig> options, ILogger<dynamic> logger)
    {
        if (_uniqueInstance == null || _uniqueInstance.Connection == null || !_uniqueInstance.Connection.IsOpen)
        {
            lock (_objLock)
            {
                if (_uniqueInstance == null || _uniqueInstance.Connection == null || !_uniqueInstance.Connection.IsOpen)
                {
                    _uniqueInstance = new RabbitMqConnection(options.Value, logger);
                }
            }
        }
        return _uniqueInstance;
    }

    private RabbitMqConnection(RabbitMqConfig config, ILogger<dynamic> logger)
    {
        // ConnectionFactory setup for the local 'factory' is elided from this export.
        Policy.Handle<SocketException>().Or<BrokerUnreachableException>().WaitAndRetry(6, (int retryAttempt) => TimeSpan.FromSeconds(1.0), delegate (Exception ex, TimeSpan time, int retryCount, Context content)
        {
            if (6 == retryCount)
            {
                throw ex;
            }
            _logger.LogError(ex, $"{retryCount}:{ex.Message}");
        })
            .Execute(delegate
            {
                Connection = factory.CreateConnection();
            });
    }
}

From the code, my friend used double-checked locking to make the RabbitMQ connection a singleton, and if creating the connection fails, Polly retries up to 6 times with a 1-second wait between attempts. At first glance there is nothing wrong with this way of writing it.

2. Is this singleton really okay?

If the singleton is written correctly, why are so many threads stuck on the lock? After all, with a singleton only the very first caller should pay the cost of instantiating RabbitMQ, and everyone afterwards simply rides on that instance. With this question in mind I went back over the double-checked locking and, good grief, the singleton condition also tests _uniqueInstance.Connection.IsOpen. Do you realize what that means?

It means that the moment RabbitMQ goes down, the singleton condition is broken and countless threads start queuing at the lock to re-instantiate the connection. Keep in mind that every one of them is a warrior who never gives up on instantiation (six retries, one second apart, all spent holding the lock), and that is exactly what hangs the program... 🤣🤣🤣
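
To make the failure mode concrete, here is a minimal sketch (hypothetical types, not my friend's actual code) that mimics the pattern: once the fake connection reports closed and the fake broker stays down, every caller falls through the first check, queues on the lock, and burns the full retry time inside it before failing.


// Minimal repro sketch of the anti-pattern; FakeConnection/FlawedSingleton are hypothetical stand-ins.
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class FakeConnection
{
    public volatile bool IsOpen = true;
}

class FlawedSingleton
{
    private static FlawedSingleton _instance;
    private static readonly object _lock = new object();
    public static volatile bool BrokerUp = true;

    public FakeConnection Connection { get; } = new FakeConnection();

    public static FlawedSingleton GetInstance()
    {
        // The extra IsOpen test re-opens the "create" path for every caller once the broker dies.
        if (_instance == null || !_instance.Connection.IsOpen)
        {
            lock (_lock)
            {
                if (_instance == null || !_instance.Connection.IsOpen)
                {
                    if (!BrokerUp)
                    {
                        Thread.Sleep(6000);                       // stand-in for 6 x 1s connect retries, spent holding the lock
                        throw new TimeoutException("broker unreachable");
                    }
                    _instance = new FlawedSingleton();
                }
            }
        }
        return _instance;
    }
}

class Demo
{
    static void Main()
    {
        FlawedSingleton.GetInstance();                            // first call creates the singleton instantly
        FlawedSingleton.GetInstance().Connection.IsOpen = false;  // simulate the broker connection dropping
        FlawedSingleton.BrokerUp = false;                         // ...and the broker staying unreachable

        var sw = Stopwatch.StartNew();
        var requests = Enumerable.Range(0, 20).Select(_ => Task.Run(() =>
        {
            try { FlawedSingleton.GetInstance(); }
            catch (TimeoutException) { /* each caller wasted ~6s inside the lock before failing */ }
        })).ToArray();
        Task.WaitAll(requests);

        Console.WriteLine($"20 concurrent 'requests' took {sw.Elapsed}"); // roughly 20 x 6s, fully serialized
    }
}

On the real service the callers are ASP.NET Core request threads, so as each one blocks on the lock the thread pool keeps injecting more workers, which is consistent with the 5,000+ threads seen in !t.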

3. Verify my ideas

Talk is cheap, though, so I have to show evidence. Verifying it is simple: go to the managed heap, find the RabbitMqConnection instances, and check whether the underlying Connection's IsOpen is false at this moment. Checking the RabbitMQ.Client source with ILSpy, IsOpen is simply the test CloseReason == null.
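
For reference, the relevant property looks roughly like this in the decompiled client (paraphrased from ILSpy; the exact shape varies by RabbitMQ.Client version):


// RabbitMQ.Client.Framing.Impl.Connection (simplified)
public bool IsOpen => CloseReason == null;

// CloseReason surfaces the _closeReason field, a ShutdownEventArgs that becomes non-null once the connection shuts down.
public ShutdownEventArgs CloseReason => _closeReason;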

So next, let's see whether _closeReason is non-null.


0:000> !dumpheap -type RabbitMqConnection
         Address               MT     Size
00000277cbd7aa68 00007ffd831f1570       32  
00000277ccb13068 00007ffd831f1570       32   

0:000> !DumpObj /d 00000277cbd7b858
Name:        RabbitMQ.Client.Framing.Impl.AutorecoveringConnection
MethodTable: 00007ffd83235db0
EEClass:     00007ffd83242898
Size:        200(0xc8) bytes
File:        G:\xxx\RabbitMQ.Client.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd82397238  4000180       bc       System.Boolean  1 instance                0 _disposed
00007ffd82390c68  4000181        8        System.Object  0 instance 00000277cbd7b920 _eventLock
00007ffd831fc230  4000182       10 ...g.Impl.Connection  0 instance 00000277cbd7d5f8 _delegate

0:000> !DumpObj /d 00000277cbd7d5f8
Name:        RabbitMQ.Client.Framing.Impl.Connection
MethodTable: 00007ffd831fc230
EEClass:     00007ffd8322cd70
Size:        232(0xe8) bytes
File:        G:\xxx\RabbitMQ.Client.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd82397238  4000198       b8       System.Boolean  1 instance                0 _disposed
00007ffd82390c68  4000199        8        System.Object  0 instance 00000277cbd7d6e0 _eventLock
00007ffd82d93d00  400019a       10 ...ualResetEventSlim  0 instance 00000277cbd7d6f8 _appContinuation
00007ffd83276028  400019b       18 ...ShutdownEventArgs  0 instance 00000279ccc56e28 _closeReason

0:000> !do 00000277ccb13068
Name:        xxx.RabbitMq.RabbitMqConnection
MethodTable: 00007ffd831f1570
EEClass:     00007ffd831eb920
Size:        32(0x20) bytes
File:        G:\xxx\xxx.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd831f13f8  400001f        8 ...Private.CoreLib]]  0 instance 00000278cbe4c2a0 _logger
00007ffd831f2ab0  4000020       10 ...lient.IConnection  0 instance 0000000000000000 <Connection>k__BackingField
00007ffd831f1570  400001d        8 ...abbitMqConnection  0   static 00000277cbd7aa68 _uniqueInstance
00007ffd82390c68  400001e       10        System.Object  0   static 00000277cbd7aa50 _objLock

From the output it is clear that there are two RabbitMqConnection objects on the managed heap: the old one whose connection really did fail (_closeReason = 00000279ccc56e28, i.e. non-null), and a brand-new one that is still working hard at it (<Connection>k__BackingField = 0000000000000000, meaning its constructor has not managed to connect yet). That confirms the hypothesis.

4. Follow-up

With this information in hand, I talked it through with my friend and suggested tightening up the exception handling in RabbitMqConnection, for example returning or rethrowing cleanly when creation fails, or better still dropping the lazy detection altogether so the code no longer re-creates the connection whenever IsOpen is false.

As for what triggered it in the first place, by comparing against the open-source project the code was borrowed from, my friend found that a registration that should have been AddSingleton had been changed to AddScoped.

That change means every HTTP request re-resolves the producer and re-runs the singleton check in GetInstance. In the original project, once the upper layer registers the type as a singleton, the lower layer is only constructed once and the IsOpen branch is effectively never exercised again.
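
In dependency-injection terms the fix is a one-liner; here is a sketch assuming the producer is registered in ConfigureServices (the registration below is illustrative, not copied from the project):


// Before: a new RabbitMqProducer per HTTP request, so GetInstance (and its lock) runs on every request.
// services.AddScoped<RabbitMqProducer>();

// After: one producer for the lifetime of the application, so GetInstance effectively runs once.
services.AddSingleton<RabbitMqProducer>();

Note that the producer's constructor takes IOptionsSnapshot<RabbitMqConfig>, which is itself a scoped service, so promoting the producer to a singleton would typically also mean switching that dependency to IOptions<RabbitMqConfig> or IOptionsMonitor<RabbitMqConfig>.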

Three: Summary

Two main factors combined to freeze this program:

  1. The IsOpen check in the RabbitMQ singleton logic. It is better to rely on the client's automatic recovery options, AutomaticRecoveryEnabled and TopologyRecoveryEnabled, or to use static initialization instead of hand-rolled double-checked locking; see the sketch after this list.
  2. Every HTTP request was exercising the singleton logic; following my friend's fix, changing the Scoped registration to Singleton avoids that.
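
As a reference for point 1, here is a minimal sketch combining both suggestions, static initialization plus the client's built-in recovery. The config values are illustrative placeholders; in real code they would come from RabbitMqConfig.


using System;
using RabbitMQ.Client;

public sealed class RabbitMqConnection
{
    // Static initialization: the CLR type initializer runs exactly once, thread-safely,
    // so no hand-rolled double-checked locking is needed.
    private static readonly RabbitMqConnection _instance = new RabbitMqConnection();

    // Explicit static constructor keeps initialization lazy until Instance is first touched.
    static RabbitMqConnection() { }

    public static RabbitMqConnection Instance => _instance;

    public IConnection Connection { get; }

    private RabbitMqConnection()
    {
        var factory = new ConnectionFactory
        {
            HostName = "rabbit-host",                          // placeholder; read from RabbitMqConfig in real code
            AutomaticRecoveryEnabled = true,                   // the client re-establishes the connection after an outage
            TopologyRecoveryEnabled = true,                    // and re-declares exchanges/queues/bindings on recovery
            NetworkRecoveryInterval = TimeSpan.FromSeconds(10) // how long to wait between recovery attempts
        };
        Connection = factory.CreateConnection();               // returns an AutorecoveringConnection, as seen in the dump
    }
}

With this shape there is nothing for callers to test: if the broker drops, IsOpen simply stays false for a while and the autorecovering connection re-establishes itself, instead of hundreds of request threads fighting over a lock to rebuild it.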
