I: Background
1. The story
A while back a friend added me on WeChat and said his program was sitting at CPU = 100%, and the only way to recover each time was to step in and restart it manually. He asked me to help figure out what was going on. Incidents where the CPU is pegged put a lot of pressure on the developer, so I asked him to capture two dumps while the CPU was high and send them over for analysis.
II: WinDbg Analysis
1. Is the CPU really maxed out?
To avoid heading in the wrong direction, always use the !tp command first to verify that the CPU really is spiking.
0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 21 Running: 7 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 3
    AsyncTimerCallbackCompletion TimerInfo@00000000042d2430
    AsyncTimerCallbackCompletion TimerInfo@00000000042d2f90
    AsyncTimerCallbackCompletion TimerInfo@000000000420c150
--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 18 Free: 9 MaxFree: 8 CurrentLimit: 18 MaxLimit: 1000 MinLimit: 4
The output confirms it: the CPU really is at 100%, and there is even a backlog of WorkRequest items in the queue. With that confirmed, the next question is who is driving it.
2. Who is driving the CPU spike?
As usual, the first suspect is a GC being triggered. Use !t to list the threads and look for any GC markers.
0:000> !t
ThreadCount:      53
UnstartedThread:  0
BackgroundThread: 53
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                   Lock  
      ID OSID ThreadOBJ           State    GC Mode    GC Alloc Context                  Domain           Count Apt Exception
   4   1 1240 00000000021cdf30    2a220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA 
  23   2 4db4 00000000041cdaa0    2b220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA (Finalizer) 
  ...
  65 156 22f4 000000000b1a3f60  8029220 Preemptive  00000004527751F0:0000000452775EE8 00000000021d94c0 0     MTA (Threadpool Completion Port) 
  66 205 2ef8 000000000b1a1080  8029220 Preemptive  0000000157641DE0:00000001576435B0 00000000021d94c0 0     MTA (Threadpool Completion Port) 
  ...
There are no GC markers in the output, which shows the spike was not triggered by the GC. What next? A CPU spike is generally driven up by threads, so the next step is to check the CPU's specs with !cpuid and then walk every thread's stack with ~*e !clrstack to look for new clues.
0:000> !cpuid
CP  F/M/S   Manufacturer     MHz
 0  6,79,1  <unavailable>   2299
 1  6,79,1  <unavailable>   2299
 2  6,79,1  <unavailable>   2299
 3  6,79,1  <unavailable>   2299

0:000> ~*e !clrstack 
OS Thread Id: 0x2cc4 (68)
Child SP               IP Call Site
000000000c14e758 00007ffadeb86e4a [GCFrame: 000000000c14e758] 
000000000c14e840 00007ffadeb86e4a [GCFrame: 000000000c14e840] 
000000000c14e878 00007ffadeb86e4a [HelperMethodFrame: 000000000c14e878] System.Threading.Monitor.Enter(System.Object)
000000000c14e970 00007ffaceb40491 System.Net.ConnectionGroup.Disassociate(System.Net.Connection) [f:\dd\NDP\fx\src\net\System\Net\_ConnectionGroup.cs @ 148]
000000000c14e9d0 00007ffaceb3fc93 System.Net.Connection.PrepareCloseConnectionSocket(System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3048]
000000000c14eaa0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
000000000c14eb00 00007ffacebc4118 System.Net.Connection.ReadComplete(Int32, System.Net.WebExceptionStatus) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3387]
000000000c14eb80 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14ebe0 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14ec20 00007ffacf3a476f System.Net.Security._SslStream.StartFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
000000000c14ec80 00007ffacebb3ed8 System.Net.Security._SslStream.ReadHeaderCallback(System.Net.AsyncProtocolRequest) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_SslStream.cs @ 1007]
000000000c14ece0 00007ffacebae5ee System.Net.AsyncProtocolRequest.CompleteRequest(Int32) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_HelperAsyncResults.cs @ 142]
000000000c14ed10 00007ffacf3a3567 System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
000000000c14ed40 00007ffacebae507 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_FixedSizeReader.cs @ 148]
000000000c14ed90 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14edf0 00007ffadcbe3a63 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
000000000c14eec0 00007ffadcbe38f4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
000000000c14eef0 00007ffadcbe38c2 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
000000000c14ef40 00007ffaceba60cf System.Net.ContextAwareResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_ContextAwareResult.cs @ 463]
000000000c14ef90 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14efd0 00007ffaceba5e2f System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\NDP\fx\src\net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 399]
000000000c14f040 00007ffadcc2ffef System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\BCL\system\threading\overlapped.cs @ 135]
000000000c14f1f0 00007ffade9a6d93 [GCFrame: 000000000c14f1f0] 

OS Thread Id: 0x5d6c (70)
Child SP               IP Call Site
000000000943e920 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) 

OS Thread Id: 0x4ad4 (75)
Child SP               IP Call Site
...
000000000c94e5a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
...

OS Thread Id: 0x1d70 (80)
Child SP               IP Call Site
...
000000000d24e3a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
...
From the thread stacks, this machine has 4 cores, which exactly matches the 4 threads reporting errors in HandleError, so it looks like something has gone wrong on the network side. Next, switch to thread 80 and see whether there is an exception object on it.
0:000> ~80s
clr!AwareLock::Contention+0x194:
00007ffa`deb86e40 4883e801        sub     rax,1
0:080> !mdso
Thread 80:
Location          Object            Type
------------------------------------------------------------
000000000d24e098  000000015765e028  System.Net.WebException
000000000d24e0f8  0000000340b07110  System.Collections.ArrayList
000000000d24e110  000000015765e2b8  System.Net.HttpWebRequest[]
000000000d24e1c0  0000000340b070b8  System.Net.ConnectionGroup
000000000d24e258  0000000144a79678  System.Net.Connection

0:080> !mdt 000000015765e028
000000015765e028 (System.Net.WebException)
    _className:NULL (System.String)
    _exceptionMethod:NULL (System.Reflection.MethodBase)
    _exceptionMethodString:NULL (System.String)
    _message:000000015765df70 (System.String) Length=77, String="The underlying connection was closed: The connection was closed unexpectedly."
    ...
Sure enough, there is a System.Net.WebException, and the message says the underlying connection was closed. At this point I made a bold guess: could the CPU spike be caused by exceptions being thrown at a high rate? To verify, count the WebException instances on the managed heap.
0:080> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
...
00007ffacecc38b0    13315      2343440 System.Net.WebException
00007ffadcdf6570    11369      1909992 System.IO.IOException
00007ffadcdf5fb8    13380      2247840 System.ObjectDisposedException
...
That many exceptions is already alarming. Conveniently, my friend captured two dumps, so they can be compared.
0:048> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffacecc38b0    26745      4707120 System.Net.WebException
00007ffadcdf6570    26722      4489296 System.IO.IOException
00007ffadcdf5fb8    28745      4829160 System.ObjectDisposedException
Between the two dumps, roughly 2 minutes apart, the exception count grew by about 40,000 in total, which confirms the program really is throwing exceptions at a furious rate. On Windows, both hardware and software exceptions are handled uniformly by the SEH (Structured Exception Handling) framework, which involves switching between user mode and kernel mode, so throwing at this rate will inevitably drive the CPU up. The cause is found; the next step is to look for the trigger.
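To make that cost concrete, here is a minimal standalone sketch, not taken from the friend's program, that throws and catches exceptions in a tight loop on one worker per core; the class name and iteration count are my own choices for illustration. Running something like this pins the cores in much the same way, because every throw goes through the exception-dispatch machinery instead of a cheap normal return.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ExceptionCostDemo
{
    const int ThrowsPerWorker = 100000; // arbitrary count for the demo

    static void Main()
    {
        var sw = Stopwatch.StartNew();

        // One thread-pool worker per core, each throwing in a tight loop.
        var tasks = new Task[Environment.ProcessorCount];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                for (int n = 0; n < ThrowsPerWorker; n++)
                {
                    try
                    {
                        // Each throw is dispatched through the runtime's
                        // exception handling path, far costlier than a return.
                        throw new InvalidOperationException("simulated failure");
                    }
                    catch (InvalidOperationException)
                    {
                        // Swallowed, as a retry loop often does.
                    }
                }
            });
        }
        Task.WaitAll(tasks);

        Console.WriteLine("Threw " + (tasks.Length * ThrowsPerWorker) + " exceptions in " + sw.ElapsedMilliseconds + " ms");
    }
}

Watching Task Manager (or !tp in a dump) while this runs shows all cores busy doing nothing but exception dispatch, which is essentially what the friend's process was doing.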
3. What triggered the exceptions?
Looking back at the HandleError call stack, the frames are all low-level library code. The PerformIOCompletionCallback frame at the bottom tells us the exception surfaces on an IO thread, which means the requests were made asynchronously. And since this is async IO, there should naturally be a large number of OverlappedData objects as well.
0:080> !gchandles -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffadc6f7b98    14511      1625232 System.Threading.OverlappedData
Total 17550 objects

Handles:
    Strong Handles:       426
    Pinned Handles:       23
    Async Pinned Handles: 14511
    Ref Count Handles:    24
    Weak Long Handles:    2430
    Weak Short Handles:   132
    SizedRef Handles:     4
This shows there are roughly 15,000 asynchronous requests still waiting for their completions, so the request volume is substantial. We still haven't found the user code behind the exceptions, so all that's left is to find out who is issuing these requests and what they are.
0:080> !mdso
Thread 80:
Location          Object            Type
------------------------------------------------------------
...
000000000d24e488  0000000358c57918  System.Net.HttpWebRequest
000000000d24e2e8  00000001407b5b40  System.String  "net_io_readfailure"
...

0:080> !mdt -r:2 0000000358c57918
0000000358c57918 (System.Net.HttpWebRequest)
    _Uri:0000000358c57210 (System.Uri)
        m_String:00000002407ee430 (System.String) Length=98, String="https://api.xxxx/peer_messages"...
    ...
The request URL is https://api.xxxx/peer_messages, a third-party API. Because the underlying connection was closed, the read ultimately fails with net_io_readfailure.
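For context, the kind of calling code that produces this stack looks roughly like the sketch below. This is a hypothetical reconstruction under my own assumptions (the class name PeerMessageClient and method Send are invented; only the endpoint comes from the dump), not the friend's actual code: an HttpWebRequest issued with the Begin/End async pattern, whose completion is delivered on an IO completion-port thread, where a dropped connection surfaces as a WebException.

using System;
using System.IO;
using System.Net;

class PeerMessageClient
{
    // Hypothetical sketch of the calling pattern; endpoint taken from the dump.
    public static void Send()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://api.xxxx/peer_messages");

        // Begin/End APM pattern: the callback runs on an IO completion-port thread,
        // which is why PerformIOCompletionCallback sits at the bottom of the stack.
        request.BeginGetResponse(ar =>
        {
            var req = (HttpWebRequest)ar.AsyncState;
            try
            {
                using (var response = (HttpWebResponse)req.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    reader.ReadToEnd();
                }
            }
            catch (WebException ex)
            {
                // "The underlying connection was closed" lands here when the
                // remote side drops the socket; hitting this tens of thousands
                // of times in a couple of minutes is what burned the CPU.
                Console.WriteLine(ex.Status + ": " + ex.Message);
            }
        }, request);
    }
}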
Putting all the information together: once the request volume grows, calls to https://api.xxxx/peer_messages start to fail. The remote side closes the underlying connection, so the client sees a flood of IO callback exceptions such as IOException: Unable to read data from the transport connection: The connection was closed. Roughly 40,000 exceptions were thrown within about 2 minutes, which in turn drove the CPU to 100%. I passed this on to my friend and asked him to take a close look at the https://api.xxxx/peer_messages dependency.
III: Summary
This production incident mainly came down to excessive request volume at peak time: for some reason the Socket connections were closed, which produced a large number of exceptions in the asynchronous callbacks.
The fix is to apply throttling on the calling side. According to my friend, after cutting the unnecessary calls to https://api.xxxx/peer_messages, the CPU spike has not recurred so far.
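As a closing illustration, client-side throttling can be as simple as capping the number of in-flight calls with a SemaphoreSlim. The sketch below is a generic pattern under my own assumptions (MaxConcurrentCalls, CallPeerMessagesAsync, and the use of HttpClient are all hypothetical), not the fix my friend actually deployed.

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledApiClient
{
    // Hypothetical cap; tune it to what the third-party endpoint can sustain.
    private const int MaxConcurrentCalls = 16;

    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(MaxConcurrentCalls);
    private static readonly HttpClient Client = new HttpClient();

    // At most MaxConcurrentCalls requests are in flight; extra callers wait on
    // the semaphore instead of piling more load onto the remote API.
    public static async Task<string> CallPeerMessagesAsync(string payload)
    {
        await Gate.WaitAsync();
        try
        {
            using (var content = new StringContent(payload))
            using (var response = await Client.PostAsync("https://api.xxxx/peer_messages", content))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        finally
        {
            Gate.Release();
        }
    }
}

The design point is simply that back-pressure belongs on the caller: if the remote side starts closing connections under load, queuing locally is far cheaper than letting tens of thousands of failed async reads turn into exceptions.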