一:背景
1. 講故事
前段時間收到一個朋友的求助,說他的程序線程數瘋漲,尋求如何解決。

等我分析完之后,我覺得這個問題很有代表性,所以拿出來和大家分享下,還是上老工具 WinDbg。
二:WinDbg 分析
1. 線程真的在瘋漲嗎
要想查線程有沒有瘋漲,可以用 !t
命令看一下。
0:000:x86>?!t
ThreadCount:??????382
UnstartedThread:??1
BackgroundThread:?376
PendingThread:????0
DeadThread:???????2
Hosted?Runtime:???noLock??ID?OSID?ThreadOBJ????State?GC?Mode?????GC?Alloc?Context??Domain???Count?Apt?Exception0????1??59c?00e52fb0?????26020?Preemptive??12D67610:00000000?00e4b408?0?????STA?2????2?2b30?00e61aa0?????2b220?Preemptive??00000000:00000000?00e4b408?0?????MTA?(Finalizer)?3????3?18cc?00ea72b8???202b220?Preemptive??00000000:00000000?00e4b408?0?????MTA?5????4?1f18?00f02998???1020220?Preemptive??00000000:00000000?00e4b408?0?????Ukn?(Threadpool?Worker)?
XXXX????6????0?00f056f8?????39820?Preemptive??00000000:00000000?00e4b408?0?????MTA?6????7?2154?09052448???202b020?Preemptive??12E353E0:00000000?00e4b408?0?????MTA?...377??373?2ee8?21a90958???1029220?Preemptive??12D1FCCC:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?378??374?227c?21b1d510???1029220?Preemptive??12DCBFC8:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?379??375??7e8?21b1baa8???1029220?Preemptive??12D39ADC:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?380??376?1d1c?21a8fec8???1029220?Preemptive??12D11F40:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?381??366?19ec?215c1bd0???1029220?Preemptive??12DB42D8:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?382??377?1dc8?21b1bff0???1029220?Preemptive??12C71F9C:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?383??378??f94?215bc750???1029220?Preemptive??12E10568:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?384??379?17d4?21ac5580???1029220?Preemptive??12D8EE98:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?385??381?2c1c?21b1b018???1029220?Preemptive??12D0DD00:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?386??380?309c?21b1da58???1029220?Preemptive??12E25028:00000000?00e4b408?0?????MTA?(Threadpool?Worker)?387??382?3048?21ac6aa0???1029220?Preemptive??12DFA918:00000000?00e4b408?0?????MTA?(Threadpool?Worker)
從卦中看,主線程是一個 STA,說明是一個窗體程序,一個窗體能做到 387 個線程,也是挺牛的,同時也能觀察到大多都是 ThreadPool Worker
,也就是線程池工作線程。
2. 這些線程都在干嘛
這里有一個小技巧,那就是線程號越大,往往都是最新創建的,往往從這里面就能套出來一些有用的東西,言外之意就扒一扒 380 ~ 387
這些線程的調用棧。
0:387:x86>?~387s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc?c20c00??????????ret?????0Ch
0:387:x86>?k
CvRegToMachine(x86)?conversion?failure?for?0x14f
X86MachineInfo::SetVal:?unknown?register?0?requested#?ChildEBP?RetAddr??????
00?31fef104?755a1539?????ntdll_77380000!NtWaitForSingleObject+0xc
01?31fef104?74b3ee3b?????KERNELBASE!WaitForSingleObjectEx+0x99
02?31fef168?74b3efed?????clr!CLRSemaphore::Wait+0xbe
03?31fef19c?74b3eee2?????clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04?31fef204?74a54c27?????clr!ThreadpoolMgr::WorkerThreadStart+0x328
05?31feff24?7649fa29?????clr!Thread::intermediateThreadProc+0x58
06?31feff34?773e7a7e?????kernel32!BaseThreadInitThunk+0x19
07?31feff90?773e7a4e?????ntdll_77380000!__RtlUserThreadStart+0x2f
0:387:x86>?~386s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc?c20c00??????????ret?????0Ch
0:386:x86>?k
CvRegToMachine(x86)?conversion?failure?for?0x14f
X86MachineInfo::SetVal:?unknown?register?0?requested#?ChildEBP?RetAddr??????
00?31d6ede4?755a1539?????ntdll_77380000!NtWaitForSingleObject+0xc
01?31d6ede4?74b3ee3b?????KERNELBASE!WaitForSingleObjectEx+0x99
02?31d6ee48?74b3efed?????clr!CLRSemaphore::Wait+0xbe
03?31d6ee7c?74b3eee2?????clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04?31d6eee4?74a54c27?????clr!ThreadpoolMgr::WorkerThreadStart+0x328
05?31d6fb84?7649fa29?????clr!Thread::intermediateThreadProc+0x58
06?31d6fb94?773e7a7e?????kernel32!BaseThreadInitThunk+0x19
07?31d6fbf0?773e7a4e?????ntdll_77380000!__RtlUserThreadStart+0x2f
0:386:x86>?~385s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc?c20c00??????????ret?????0Ch
0:385:x86>?k
CvRegToMachine(x86)?conversion?failure?for?0x14f
X86MachineInfo::SetVal:?unknown?register?0?requested#?ChildEBP?RetAddr??????
00?31eaee64?755a1539?????ntdll_77380000!NtWaitForSingleObject+0xc
01?31eaee64?74b3ee3b?????KERNELBASE!WaitForSingleObjectEx+0x99
02?31eaeec8?74b3efed?????clr!CLRSemaphore::Wait+0xbe
03?31eaeefc?74b3eee2?????clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04?31eaef64?74a54c27?????clr!ThreadpoolMgr::WorkerThreadStart+0x328
05?31eafb7c?7649fa29?????clr!Thread::intermediateThreadProc+0x58
06?31eafb8c?773e7a7e?????kernel32!BaseThreadInitThunk+0x19
07?31eafbe8?773e7a4e?????ntdll_77380000!__RtlUserThreadStart+0x2f
從線程棧上看,這些線程都在 UnfairSemaphore
處等待,這是一個正常現象,因為這些線程都是通過 UnfairSemaphore
鎖來喚醒,不過奇怪的是,這些線程為什么產生,又為什么不被消亡?
根據經驗預測:肯定有代碼在不斷的調度
線程池
線程,然后又做了一個短命的操作,導致線程池線程不斷新增,又得不到線程可以被消亡的閾值。
3. 程序真的在頻繁調度線程嗎
既然猜測是程序在頻繁的調用線程池線程,能做的只能是觀察此時 dump 中的所有線程的線程棧,看能不能挖到點有價值的東西,可以使用 ~*e !clrstack
命令。

經過仔細觀察這近400個線程棧,發現有 37 處都是 System.Threading.Thread.Sleep(Int32)
, 而且大多都是 HslCommunication.Core.Net.NetworkBase.ThreadPoolCheckTimeOut(System.Object)
函數,能清楚的看到是由線程池發起的,接下來就是用 ILSpy 反編譯下這個函數看下到底是怎么回事。
protected?void?ThreadPoolCheckTimeOut(object?obj){HslTimeOut?hslTimeOut;if?((hslTimeOut?=?obj?as?HslTimeOut)?==?null){return;}while?(!hslTimeOut.IsSuccessful){Thread.Sleep(100);if?((DateTime.Now?-?hslTimeOut.StartTime).TotalMilliseconds?>?(double)hslTimeOut.DelayTime){if?(!hslTimeOut.IsSuccessful){LogNet?.WriteWarn(ToString(),?"Wait?Time?Out?:?"?+?hslTimeOut.DelayTime);hslTimeOut.Operator?.Invoke();hslTimeOut.WorkSocket?.Close();}break;}}}
接下來通過 ILSpy 查看這個方法的引用,發現有很多處,抽幾個如下:
protected?OperateResult<TNetMessage>?ReceiveMessage<TNetMessage>(Socket?socket,?int?timeOut,?TNetMessage?netMsg)?where?TNetMessage?:?INetMessage{...if?(timeOut?>?0){ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut,?hslTimeOut);}...}protected?OperateResult<Socket>?CreateSocketAndConnect(IPEndPoint?endPoint,?int?timeOut){...ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut,?hslTimeOut);...}protected?void?CreateSocketAndConnect(IPEndPoint?endPoint,?int?timeOut,?Action<OperateResult<Socket>>?connectCallback){...ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut,?hslTimeOut);...}
從上面代碼看,確實存在一些商榷的地方,很多的 socket 操作都用線程池來處理 ThreadPoolCheckTimeOut()
函數,而在這個函數內當 hslTimeOut.IsSuccessful =false
的時候,在 if ((DateTime.Now - hslTimeOut.StartTime).TotalMilliseconds > (double)hslTimeOut.DelayTime)
不滿足的時間區間內會一直 sleep,這就導致當 socket 請求量上去后,導致很多線程處于 sleep 狀態, 線程池又不得不生成更多的線程來處理 ThreadPoolCheckTimeOut()
邏輯。
到這里終于就找到了符合 線程池線程 瘋漲的底層邏輯,接下來看看 HslCommunication.dll
為何物,去找一下它的類庫聲明。
[assembly:?CompilationRelaxations(8)]
[assembly:?RuntimeCompatibility(WrapNonExceptionThrows?=?true)]
[assembly:?Debuggable(DebuggableAttribute.DebuggingModes.IgnoreSymbolStoreSequencePoints)]
[assembly:?AssemblyTitle("HslCommunication")]
[assembly:?AssemblyDescription("一個框架庫,包含完善的網絡通信及日志組件")]
[assembly:?AssemblyConfiguration("")]
[assembly:?AssemblyCompany("")]
[assembly:?AssemblyProduct("HslCommunication")]
[assembly:?AssemblyCopyright("Copyright???By?Richard.Hu?2018")]
[assembly:?AssemblyTrademark("")]
[assembly:?ComVisible(false)]
[assembly:?Guid("d3710b78-1b32-4d53-9604-0451a795a2f5")]
[assembly:?AssemblyFileVersion("5.3.2.0")]
[assembly:?AssemblyVersion("5.3.2.0")]
可以看到,這是一個商業組件。

三:總結
由于定位到疑似是 HslCommunication
組件的問題,看了下還是 商業版
, 這就尷尬了,建議的解決辦法如下:
1) 短期:
用 ThreadPool.SetMaxThreads
限定線程上限。
2) 長期:
找作者看看有沒有最新版,或者到 https://github.com/dathlin/HslCommunication
上提一個 issue,讓別人系統性解決一下。