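I: Background
1. The story
Some time ago I ran into several desktop GUI applications in which the process loader lock caused the program to hang and the thread count to balloon. Dumps like these are hard to analyze because they mainly draw on Windows operating system and C++ fundamentals, so it is worth writing up a short summary to share. Let WinDbg do the talking.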
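II: WinDbg analysis
1. What is the main thread doing?
When a GUI application hangs, the analysis starts with the main thread. The ~0s; k command is all you need.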
0:000> ~0s; k
ntdll!NtWaitForSingleObject+0x14:
00007ffc`6010e614 c3              ret
 # Child-SP          RetAddr               Call Site
00 0000008c`107fe5d8 00007ffc`5cda4313     ntdll!NtWaitForSingleObject+0x14
01 0000008c`107fe5e0 00007ffc`257b2fe8     KERNELBASE!WaitForSingleObjectEx+0x93
02 0000008c`107fe680 00007ffc`257b2f9e     clr!CLREventWaitHelper2+0x3c
03 0000008c`107fe6c0 00007ffc`257b2efc     clr!CLREventWaitHelper+0x1f
04 0000008c`107fe720 00007ffc`256beed2     clr!CLREventBase::WaitEx+0x71
05 0000008c`107fe7b0 00007ffc`25687e44     clr!WKS::GCHeap::WaitUntilGCComplete+0x2e
06 0000008c`107fe7e0 00007ffc`25688092     clr!Thread::RareDisablePreemptiveGC+0x18f
07 0000008c`107fe880 00007ffc`255d44f4     clr!JIT_RareDisableHelperWorker+0x42
08 0000008c`107fe9d0 00007ffc`22544314     clr!JIT_RareDisableHelper+0x14
09 0000008c`107fea10 00007ffc`22525f32     WindowsBase_ni+0x184314
0a 0000008c`107fead0 00007ffc`22520298     WindowsBase_ni+0x165f32
0b 0000008c`107feb10 00007ffc`2251edaf     WindowsBase_ni+0x160298
0c 0000008c`107feba0 00007ffc`202b6421     WindowsBase_ni+0x15edaf
...
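The WaitUntilGCComplete function in the output shows that the main thread is waiting for a GC to complete. So which thread triggered the GC? Next, use the !t command and look for the GC marker.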
0:000> !t
ThreadCount:      58
UnstartedThread:  9
BackgroundThread: 39
PendingThread:    9
DeadThread:       5
Hosted Runtime:   no
  42   41  cd8 000001ec5f7f7c90  202b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA
  43   34 1160 000001ec5f7f4db0    21220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     Ukn
  44   33 218c 000001ec5f7f5580    2b220 Cooperative 0000000000000000:0000000000000000 000001ec3353c710 1     MTA (GC)
  45   36 1110 000001ec5f7f8460  202b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA
  48   32 26a8 000001ec545813e0    2b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA
  49   31  4b4 000001ec54581bb0    2b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA
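With dozens of threads in the list, the (GC) marker is easy to miss. If you save the !t output to a text file, a small helper can pull out the row for you. The sketch below is only an illustration: the function name FindGcThreadOsId is made up for this post, and it assumes each thread row starts with the debugger thread number, the managed ID and the OSID, in that order, as in the rows above.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the OSID (third column) of the thread row
// tagged "(GC)" in !t-style text, or an empty string if no row has the tag.
std::string FindGcThreadOsId(const std::string& threadsOutput) {
    std::istringstream lines(threadsOutput);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.find("(GC)") == std::string::npos) continue;  // not the GC thread
        std::istringstream columns(line);
        std::string dbgId, managedId, osId;
        if (columns >> dbgId >> managedId >> osId) return osId;
    }
    return "";
}
```

The returned OSID is the value you pass to the ~~[osid]s command to switch to that thread.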
From the output, thread 44 triggered the GC. Next, look at its call stack.
0:000> ~~[218c]s
ntdll!NtWaitForSingleObject+0x14:
00007ffc`6010e614 c3              ret
0:044> k
 # Child-SP          RetAddr               Call Site
00 0000008c`0a0bd9b8 00007ffc`5cda4313     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0a0bd9c0 00007ffc`257b2fe8     KERNELBASE!WaitForSingleObjectEx+0x93
02 0000008c`0a0bda60 00007ffc`257b2f9e     clr!CLREventWaitHelper2+0x3c
03 0000008c`0a0bdaa0 00007ffc`257b2efc     clr!CLREventWaitHelper+0x1f
04 0000008c`0a0bdb00 00007ffc`256c821d     clr!CLREventBase::WaitEx+0x71
05 0000008c`0a0bdb90 00007ffc`256c8120     clr!standalone::`anonymous namespace'::CreateSuspendableThread+0x10c
06 0000008c`0a0bdc50 00007ffc`257b9e4c     clr!GCToEEInterface::CreateThread+0x170
07 0000008c`0a0bde40 00007ffc`257b8543     clr!WKS::gc_heap::prepare_bgc_thread+0x4c
08 0000008c`0a0bde70 00007ffc`256be9f7     clr!WKS::gc_heap::garbage_collect+0xfbb37
09 0000008c`0a0bdeb0 00007ffc`256c0c47     clr!WKS::GCHeap::GarbageCollectGeneration+0xef
0a 0000008c`0a0bdf00 00007ffc`255dc7b3     clr!WKS::GCHeap::Alloc+0x29c
0b 0000008c`0a0bdf50 00007ffb`c631853d     clr!JIT_New+0x339
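The call stack shows that, while the GC was being triggered, it went on to create a thread through CreateThread. Some readers may wonder why triggering a GC would involve creating a thread. This comes down to a bit of CLR background: workstation GC in background (bgc) mode uses a dedicated background thread, and that thread is created and destroyed at certain points during the runtime's lifetime. So what the stack shows is the GC waiting for the bgc thread to finish initializing.
The odd part is that across several dumps taken in the hung state, the GC was always stuck in this CreateThread function. In other words, the thread was waiting there indefinitely.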
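2. Why can't CreateThread finish initializing?
If you have worked with C++, you will know that a C++ DLL can have a DllMain function. It serves as the DLL's entry point, much as Main does for an executable. The code skeleton looks like this: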
// dllmain.cpp : Defines the entry point for the DLL application.
#include "pch.h"

BOOL APIENTRY DllMain(HMODULE hModule, DWORD ul_reason_for_call, LPVOID lpReserved)
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;
}
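Judging from the reason codes in the switch, every DLL that has a DllMain function is notified when the DLL is loaded or unloaded and when a thread is created or exits. Before DllMain is entered, a global process loader lock (LdrpLoaderLock) is acquired first.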
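To make the consequence concrete, here is a minimal portable model of that rule. It is not the real loader: std::timed_mutex stands in for LdrpLoaderLock, and the names g_loaderLock, InitializeThread and NewThreadStallsWhileLockHeld are invented for this sketch. It only shows that when one thread blocks inside its DllMain while holding the lock, a newly created thread cannot get through its own initialization.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for ntdll!LdrpLoaderLock (a model, not the real API).
std::timed_mutex g_loaderLock;

// Models thread start-up: the DLL_THREAD_ATTACH notifications run only while
// the loader lock is held, so a new thread has to acquire it first.
bool InitializeThread(std::chrono::milliseconds timeout) {
    if (!g_loaderLock.try_lock_for(timeout)) return false;  // lock owned elsewhere
    // ... DllMain(DLL_THREAD_ATTACH) callbacks would run here ...
    g_loaderLock.unlock();
    return true;
}

// Returns true when a new thread fails to initialize because another thread
// is blocked inside its DllMain with the loader lock held.
bool NewThreadStallsWhileLockHeld() {
    std::promise<void> lockTaken, release;
    std::future<void> lockTakenSignal = lockTaken.get_future();
    std::future<void> releaseSignal = release.get_future();
    std::thread holder([&] {
        std::lock_guard<std::timed_mutex> guard(g_loaderLock);
        lockTaken.set_value();
        releaseSignal.wait();  // simulates a blocking call made inside DllMain
    });
    lockTakenSignal.wait();    // the holder now owns the lock
    const bool stalled = !InitializeThread(std::chrono::milliseconds(50));
    release.set_value();       // let the holder leave its "DllMain"
    holder.join();
    return stalled;
}
```

In the model the wait gives up after 50 ms so that the example terminates. A real thread has no such timeout and simply stays blocked.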
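Since the GC cannot get its CreateThread call to complete, some thread must be holding LdrpLoaderLock. The next question is how to find out which thread holds it, and that takes some Windows operating system fundamentals.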
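3. Who is holding LdrpLoaderLock?
The LdrpLoaderLock variable lives in ntdll.dll, the user-mode gateway DLL. You can locate it with the x ntdll!LdrpLoaderLock command and then dump it as a critical section to see which thread owns it.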
0:044> x ntdll!LdrpLoaderLock
00007ffc`601cf4f8 ntdll!LdrpLoaderLock = <no type information>
0:044> dt _RTL_CRITICAL_SECTION 00007ffc`601cf4f8
atl100!_RTL_CRITICAL_SECTION
   +0x000 DebugInfo        : 0x00007ffc`601cf978 _RTL_CRITICAL_SECTION_DEBUG
   +0x008 LockCount        : 0n-2
   +0x00c RecursionCount   : 0n1
   +0x010 OwningThread     : 0x00000000`0000138c Void
   +0x018 LockSemaphore    : (null)
   +0x020 SpinCount        : 0x4000000
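The LockCount value of -2 is not a plain counter. As far as I understand the Debugging Tools for Windows documentation on displaying critical sections (Windows Server 2003 SP1 and later), bit 0 is clear when the section is locked, bit 1 is clear when a waiter has been woken, and the remaining bits hold the ones-complement of the number of waiting threads. The decoder below is a sketch under that assumption. LockCountInfo and DecodeLockCount are names made up for this post, not Windows APIs.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration only.
struct LockCountInfo {
    bool locked;                  // bit 0 clear: the critical section is owned
    bool waiterWoken;             // bit 1 clear: a waiting thread has been woken
    std::uint32_t waitingThreads; // ones-complement of the remaining bits
};

// Decodes the LockCount field as laid out on Windows Server 2003 SP1 and later
// (assumed layout, see the paragraph above).
LockCountInfo DecodeLockCount(std::int32_t lockCount) {
    const std::uint32_t raw = static_cast<std::uint32_t>(lockCount);
    LockCountInfo info;
    info.locked = (raw & 0x1u) == 0;
    info.waiterWoken = (raw & 0x2u) == 0;
    info.waitingThreads = (~raw) >> 2;
    return info;
}
```

For the value in this dump, DecodeLockCount(-2) reports the section as locked, with no woken waiter and no thread queued on the critical section itself. That agrees with RecursionCount being 1 and OwningThread being set.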
From the output, thread 138c currently owns this critical section. Next, switch to that thread and look at its call stack.
0:044> ~~[138c]s
win32u!NtUserMessageCall+0x14:
00007ffc`5c891184 c3              ret
0:061> k
 # Child-SP          RetAddr               Call Site
00 0000008c`00ffec68 00007ffc`5f21bfbe     win32u!NtUserMessageCall+0x14
01 0000008c`00ffec70 00007ffc`5f21be38     user32!SendMessageWorker+0x11e
02 0000008c`00ffed10 00007ffc`124fd4af     user32!SendMessageW+0xf8
03 0000008c`00ffed70 00007ffc`125e943b     cogxImagingDevice!DllUnregisterServer+0x3029f
04 0000008c`00ffeda0 00007ffc`125e9685     cogxImagingDevice!DllUnregisterServer+0x11c22b
05 0000008c`00ffede0 00007ffc`600b50e7     cogxImagingDevice!DllUnregisterServer+0x11c475
06 0000008c`00ffee20 00007ffc`60093ccd     ntdll!LdrpCallInitRoutine+0x6f
07 0000008c`00ffee90 00007ffc`60092eef     ntdll!LdrpProcessDetachNode+0xf5
08 0000008c`00ffef60 00007ffc`600ae319     ntdll!LdrpUnloadNode+0x3f
09 0000008c`00ffefb0 00007ffc`600ae293     ntdll!LdrpDecrementModuleLoadCountEx+0x71
0a 0000008c`00ffefe0 00007ffc`5cd7c00e     ntdll!LdrUnloadDll+0x93
0b 0000008c`00fff010 00007ffc`5d47cf78     KERNELBASE!FreeLibrary+0x1e
0c 0000008c`00fff040 00007ffc`5d447aa3     combase!CClassCache::CDllPathEntry::CFinishObject::Finish+0x28 [onecore\com\combase\objact\dllcache.cxx @ 3420]
0d 0000008c`00fff070 00007ffc`5d4471a9     combase!CClassCache::CFinishComposite::Finish+0x4b [onecore\com\combase\objact\dllcache.cxx @ 3530]
0e 0000008c`00fff0a0 00007ffc`5d3f1499     combase!CClassCache::FreeUnused+0xdd [onecore\com\combase\objact\dllcache.cxx @ 6547]
0f 0000008c`00fff650 00007ffc`5d3f13c7     combase!CoFreeUnusedLibrariesEx+0x89 [onecore\com\combase\objact\dllapi.cxx @ 117]
10 (Inline Function) --------`--------     combase!CoFreeUnusedLibraries+0xa [onecore\com\combase\objact\dllapi.cxx @ 74]
11 0000008c`00fff690 00007ffc`6008a019     combase!CDllHost::MTADllUnloadCallback+0x17 [onecore\com\combase\objact\dllhost.cxx @ 929]
12 0000008c`00fff6c0 00007ffc`6008bec4     ntdll!TppTimerpExecuteCallback+0xa9
13 0000008c`00fff710 00007ffc`5f167e94     ntdll!TppWorkerThread+0x644
14 0000008c`00fffa00 00007ffc`600d7ad1     kernel32!BaseThreadInitThunk+0x14
15 0000008c`00fffa30 00000000`00000000     ntdll!RtlUserThreadStart+0x21
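As you can see, cogxImagingDevice issued a synchronous user32!SendMessageW call while its DLL was being unloaded, and that left the program completely deadlocked. Some readers may find this confusing, so here is a brief breakdown.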
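Thread 44 uses CreateThread to create a thread, but thread creation must first acquire the loader lock, so thread 44 keeps waiting for thread 61 to release it.
Thread 61 uses the synchronous user32!SendMessageW call to send a message to the main thread's window procedure (WndProc) and waits for the main thread to respond, but the main thread is waiting for the GC to complete.
The main thread waits indefinitely for the GC to finish.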
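The three waits above form a cycle, which is what makes this a deadlock rather than a slow operation. A tiny wait-for graph makes that explicit. This is a sketch for illustration only: the names WaitGraph, FindWaitCycle and DumpWaitGraph are made up for this post, and the edges are typed in by hand from the analysis rather than extracted from the dump.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Each waiter maps to the single party it is blocked on.
using WaitGraph = std::map<std::string, std::string>;

// Walks the "waits for" chain from `start`. Returns the members of the cycle
// if the chain loops back on itself, or an empty vector if the chain ends.
std::vector<std::string> FindWaitCycle(const WaitGraph& graph, const std::string& start) {
    std::vector<std::string> path;
    std::string current = start;
    while (true) {
        auto seen = std::find(path.begin(), path.end(), current);
        if (seen != path.end()) return std::vector<std::string>(seen, path.end());
        path.push_back(current);
        auto next = graph.find(current);
        if (next == graph.end()) return {};  // chain ends: no deadlock from here
        current = next->second;
    }
}

// The three waits from this dump, written as edges (entered by hand).
WaitGraph DumpWaitGraph() {
    return {
        {"thread 44 (GC)", "thread 61 (loader lock owner)"},
        {"thread 61 (loader lock owner)", "thread 0 (main)"},
        {"thread 0 (main)", "thread 44 (GC)"},
    };
}
```

Calling FindWaitCycle(DumpWaitGraph(), "thread 0 (main)") returns all three threads, meaning none of them can make progress until one of the waits is broken from outside.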
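Putting it together: as long as the main thread does not respond to thread 61, thread 61 will not release the loader lock, and while the loader lock is held, many other threads cannot finish initializing. This also shows up in the dump: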
  70  Id: 300.4f0 Suspend: 0 Teb: 0000008c`102e1000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0ecff388 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0ecff390 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0ecff3d0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0ecff4b0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0ecff550 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0ecff580 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  71  Id: 300.1c88 Suspend: 0 Teb: 0000008c`102e5000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0f4ff268 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0f4ff270 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0f4ff2b0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0f4ff390 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0f4ff430 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0f4ff460 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  72  Id: 300.15c0 Suspend: 0 Teb: 0000008c`102e7000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0f8ff278 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0f8ff280 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0f8ff2c0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0f8ff3a0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0f8ff440 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0f8ff470 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  73  Id: 300.764 Suspend: 0 Teb: 0000008c`102ef000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0fcff388 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0fcff390 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0fcff3d0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0fcff4b0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0fcff550 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0fcff580 00000000`00000000     ntdll!LdrInitializeThunk+0xe
As you can see, many threads are stuck at ntdll!LdrpInitializeThread+0x8b and cannot get any further. So what is this function doing at that point? The disassembly tells us.
0:000> u ntdll!LdrpInitializeThread+0x8b
ntdll!LdrpInitializeThread+0x8b:
00007ffc`600b29a7 e874a50000      call    ntdll!LdrpAcquireLoaderLock (00007ffc`600bcf20)
00007ffc`600b29ac 90              nop
00007ffc`600b29ad 488b1d1c2a1200  mov     rbx,qword ptr [ntdll!PebLdr+0x10 (00007ffc`601d53d0)]
00007ffc`600b29b4 488d05152a1200  lea     rax,[ntdll!PebLdr+0x10 (00007ffc`601d53d0)]
00007ffc`600b29bb 483bd8          cmp     rbx,rax
00007ffc`600b29be 0f84c5000000    je      ntdll!LdrpInitializeThread+0x16d (00007ffc`600b2a89)
....
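The disassembly makes it clear: the instruction at that return address is the call to ntdll!LdrpAcquireLoaderLock, so these threads are all stuck on the way to acquiring the loader lock.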
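III: Summary
This is a deadlock caused by cogxImagingDevice.dll. A Baidu search shows that it belongs to a commercial machine-vision imaging library, so I cannot fix it myself and can only offer my friend some suggestions.
Get familiar with this DLL's configuration. If configuration is not the cause, report the problem to the vendor.
Try to isolate the C# and C++ code at the process level, or simply replace cogxImagingDevice.dll, although that is fairly difficult.
The lesson from this dump: when C# and C++ are mixed together, isolate them as much as you can. Once something goes wrong, it really tests your understanding of Windows internals, and the bar for analysis and troubleshooting is high 😂😂😂.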