一:背景
1. 講故事
前段時間有位朋友在微信上找到我,說他的 web 系統 cpu 運行一段時候后就爆高了,讓我幫忙看一下是怎么回事,那就看吧,聲明一下,我看 dump 是免費的,主要是錘煉自己技術,沒有某軟工程師高額的技術分析費。😅😅😅
閑話不多說,我們上 windbg 說話。
二:WinDbg 分析
1. CPU 真的爆高嗎
是否真的爆高,我們得自己先驗證下,使用 !tp
命令看一下即可。
0:065>?!tp
CPU?utilization:?81%
Worker?Thread:?Total:?32?Running:?7?Idle:?25?MaxLimit:?8191?MinLimit:?32
Work?Request?in?Queue:?1AsyncTimerCallbackCompletion?TimerInfo@018eedc8
--------------------------------------
Number?of?Timers:?1
--------------------------------------
Completion?Port?Thread:Total:?4?Free:?4?MaxFree:?64?CurrentLimit:?4?MaxLimit:?1000?MinLimit:?32
從卦象看,確實存在 CPU 爆高的情況,根據過往經驗,托管程序爆高大多是因為GC
觸發所致,但觸發 GC 的原因千奇百怪,畢竟在 clr 層面 GC 觸發的原因高達 14 種,代碼如下:
static?const?char*?const?str_gc_reasons[]?=
{"alloc_soh","induced","lowmem","empty","alloc_loh","oos_soh","oos_loh","induced_noforce","gcstress","induced_lowmem","induced_compacting","lowmemory_host","pm_full_gc","lowmemory_host_blocking"
};
2. 真的是 GC 觸發嗎
驗證當前程序是否為 GC 觸發,方式有很多,可以用 !t
或者 !t -special
,但這兩種方式不是特別準,最準的就是根據GC模式直接到 CLR 里去搜全局變量 clr!SVR::gc_heap::gc_started
的值就可以了,參考如下:
0:038>?dp?clr!SVR::gc_heap::gc_started?L1
712d3190??00000001
可以看到,此時的 gc_started=1
,說明 GC 是觸發狀態,接下來可以從所有的線程棧中搜 garbage_collect
或者 gc1
什么的關鍵詞即可。
0:038>?k#?ChildEBP?RetAddr??????
00?0318f934?70de8248?????clr!SVR::gc_heap::relocate_survivor_helper+0x1ea
01?0318f944?70de83df?????clr!SVR::gc_heap::relocate_survivors_in_plug+0x24
02?0318f970?70de84ac?????clr!SVR::gc_heap::relocate_survivors_in_brick+0x70
03?0318f9a8?70de830b?????clr!SVR::gc_heap::relocate_survivors+0xe4
04?0318fa00?70de218a?????clr!SVR::gc_heap::relocate_phase+0xb9
05?0318fbb4?70de18bf?????clr!SVR::gc_heap::plan_phase+0x136e
06?0318fbec?70de1d49?????clr!SVR::gc_heap::gc1+0x101
07?0318fc3c?70de1421?????clr!SVR::gc_heap::garbage_collect+0x746
08?0318fc58?70ddacde?????clr!SVR::gc_heap::gc_thread_function+0x14a
09?0318fc6c?70ddac6f?????clr!SVR::gc_heap::gc_thread_stub+0x72
0a?0318fc80?770a6a14?????clr!GCThreadStub+0x1f
0b?0318fc94?77e4a9ef?????kernel32!BaseThreadInitThunk+0x24
0c?0318fcdc?77e4a9ba?????ntdll!__RtlUserThreadStart+0x2f
0d?0318fcec?00000000?????ntdll!_RtlUserThreadStart+0x1b
從卦象看,gc_thread_stub
表示當前是一個 GC 線程,它正在處于 relocate_phase
階段,這表明當前是一個 壓縮回收
,GC回收流程圖可以看下官方文檔。
GarbageCollectGeneration(){SuspendEE();garbage_collect();RestartEE();}garbage_collect(){generation_to_condemn();gc1();}gc1(){mark_phase();plan_phase();}plan_phase(){//?actual?plan?phase?work?to?decide?to?//?compact?or?notif?(compact){relocate_phase();compact_phase();}elsemake_free_lists();}
在這個階段,托管堆也會是損壞狀態,你可以用 ?!dumpheap -stat
驗證下。
0:038>?!dumpheap?-stat
The?garbage?collector?data?structures?are?not?in?a?valid?state?for?traversal.
It?is?either?in?the?"plan?phase,"?where?objects?are?being?moved?around,?or
we?are?at?the?initialization?or?shutdown?of?the?gc?heap.?Commands?related?to?
displaying,?finding?or?traversing?objects?as?well?as?gc?heap?segments?may?not?
work?properly.?!dumpheap?and?!verifyheap?may?incorrectly?complain?of?heap?
consistency?errors.
Object?<exec?cmd="!ListNearObj?/d?03301000">03301000</exec>?has?an?invalid?method?table.
3. 為什么會出現壓縮回收
一般來說,GC 分清除和壓縮回收,后者屬于一種重量級操作,很傷GC,在臨時段上還稍微好一些,接下來我們看下當前 GC 是在回收哪一代?可以到 CLR 里面去查一下判決代字段。clr!WKS::GCHeap::GcCondemnedGeneration
。
0:038>?dp?clr!SVR::GCHeap::GcCondemnedGeneration?L1
712d79d8??00000002
糟了,結果是個 2,這個 2 表示 fullGC
, 也就是全量回收,大多對應著 gc_reason=lowmem
的情況,也就是內存不足。
4. 真的內存不足嗎
要想找到答案,我們用 !address -summary
看下當前的虛擬內存情況。
0:038>?!address?-summary---?Usage?Summary?----------------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
<unknown>??????????????????????????????1835??????????dce6e000?(???3.452?GB)??91.56%???86.29%
Image???????????????????????????????????842???????????f436000?(?244.211?MB)???6.33%????5.96%
Free????????????????????????????????????312???????????eba5000?(?235.645?MB)????????????5.75%
Stack???????????????????????????????????451???????????2d80000?(??45.500?MB)???1.18%????1.11%
Heap?????????????????????????????????????72???????????2342000?(??35.258?MB)???0.91%????0.86%
TEB?????????????????????????????????????150?????????????96000?(?600.000?kB)???0.02%????0.01%
Other?????????????????????????????????????7?????????????4e000?(?312.000?kB)???0.01%????0.01%
PEB???????????????????????????????????????1??????????????1000?(???4.000?kB)???0.00%????0.00%---?Type?Summary?(for?busy)?------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
MEM_PRIVATE????????????????????????????2051??????????dd635000?(???3.459?GB)??91.76%???86.48%
MEM_IMAGE??????????????????????????????1267??????????11ad1000?(?282.816?MB)???7.33%????6.90%
MEM_MAPPED???????????????????????????????40???????????2345000?(??35.270?MB)???0.91%????0.86%---?State?Summary?----------------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
MEM_COMMIT?????????????????????????????2604??????????cbe7d000?(???3.186?GB)??84.51%???79.65%
MEM_RESERVE?????????????????????????????754??????????255ce000?(?597.805?MB)??15.49%???14.60%
MEM_FREE????????????????????????????????312???????????eba5000?(?235.645?MB)????????????5.75%---?Largest?Region?by?Usage?-----------?Base?Address?--------?Region?Size?----------
<unknown>????????????????????????????????????3300000??????????20087000?(?512.527?MB)
Image???????????????????????????????????????6f819000????????????f5f000?(??15.371?MB)
Free????????????????????????????????????????fea50000???????????1590000?(??21.562?MB)
Stack????????????????????????????????????????3110000?????????????7a000?(?488.000?kB)
Heap????????????????????????????????????????3bc80000????????????621000?(???6.129?MB)
TEB?????????????????????????????????????????fe6e5000??????????????1000?(???4.000?kB)
Other???????????????????????????????????????fea10000?????????????33000?(?204.000?kB)
PEB?????????????????????????????????????????fea49000??????????????1000?(???4.000?kB)
從卦象看,當前的 MEM_COMMIT=3.186G
, 最大的Free塊 Free=15.371MB
,再根據之前展示的內存地址,我們發現這個程序是 32bit ,跑了 64bit 機器上,這種情況下程序最多可占用 4G
內存空間,雖然 MEM_RESERVE= 597.805 MB
, 但這種 RESERVE
是零散的,本質上來說此時的程序處于虛擬地址緊張,由于 虛擬地址 緊張,導致 GC 在不斷的做 全量內存 回收。
三:總結
根據上面的分析, GC 觸發的原因主要還是 32bit 程序的 4G 內存限制所致, 導致 GC 在不停的做全量回收,這種場景真的讓 GC 很尷尬,優先解決辦法就是將程序改成 64bit,后面再看看如何優化程序內存,畢竟現在托管堆
處于損壞狀態,也不好分析啦。