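Investigating Kernel Panics on Linux
How to tell whether a kernel panic has occurred (CentOS 7.9 is used as the example)
# Check whether a directory has been created under /var/crash. With kdump enabled, a kernel panic writes a new directory there, so its presence means a panic occurred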
ls /var/crash
# Example of a generated dump: /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore
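The check above can be scripted. The following is a minimal sketch, assuming the kdump default layout of /var/crash/<host>-<date>-<time>/vmcore; the function name list_vmcores is my own and is not part of any tool.

```shell
#!/usr/bin/env bash
# list_vmcores is a hypothetical helper, not part of kdump or crash.
# It prints every vmcore found one level below the given directory
# (default: /var/crash), sorted by directory name.
list_vmcores() {
    local crash_dir="${1:-/var/crash}"
    # No directory means no dump has been written; print nothing.
    [ -d "$crash_dir" ] || return 0
    find "$crash_dir" -mindepth 2 -maxdepth 2 -type f -name vmcore 2>/dev/null | sort
}

# Usage:
#   list_vmcores              # scan /var/crash
#   list_vmcores /some/dir    # scan another dump directory
```

Because the directory names contain the date and time, the sorted output is in chronological order for a given host, and the last line is the most recent dump.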
Setting up the debug environment
# Having the vmcore is not enough; analyzing it requires matching tools. The steps are as follows
# Install crash
yum install crash
# Check the kernel version
uname -r
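# Download the kernel debuginfo packages. 3.10.0-693.el7.x86_64 stands for the version reported by uname -r; the packages must match the kernel that produced the vmcore exactly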
wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-693.el7.x86_64.rpm
wget http://linuxsoft.cern.ch/centos-debuginfo/7/x86_64/kernel-debuginfo-3.10.0-693.el7.x86_64.rpm
# If wget is slow, open the same sites in a browser and download the files there
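Since the package names follow directly from the uname -r string, they can be generated instead of typed by hand. This is a sketch only: debuginfo_urls is a hypothetical helper, the base URL is the one used in the first wget above, and whether a mirror actually hosts a given version is not guaranteed.

```shell
#!/usr/bin/env bash
# debuginfo_urls is a hypothetical helper. Given a kernel release string
# (default: the running kernel), it prints the URLs of the two debuginfo
# packages. The base URL is an assumption taken from the wget example.
debuginfo_urls() {
    local release="${1:-$(uname -r)}"
    local base="http://debuginfo.centos.org/7/x86_64"
    # The architecture is the last dot-separated field, e.g. x86_64.
    local arch="${release##*.}"
    printf '%s/kernel-debuginfo-common-%s-%s.rpm\n' "$base" "$arch" "$release"
    printf '%s/kernel-debuginfo-%s.rpm\n' "$base" "$release"
}

# Usage:
#   debuginfo_urls 3.10.0-693.el7.x86_64
#   debuginfo_urls | xargs -n 1 wget     # download for the running kernel
```

Install the common package together with the main one, because kernel-debuginfo depends on kernel-debuginfo-common.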
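# Once downloaded, install both packages with rpm -ivh xxx.rpm
# After installation, running crash with no arguments (which attaches to the live kernel) should print something like this:
[root@localhost vmcore]# crash
crash 7.2.3-11.el7_9.1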
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [184MB]: patching 87476 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
    DUMPFILE: /dev/crash
        CPUS: 12
        DATE: Mon Dec 4 10:10:19 2023
      UPTIME: 00:13:14
LOAD AVERAGE: 0.29, 0.32, 0.29
       TASKS: 987
    NODENAME: localhost.localdomain
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
     MACHINE: x86_64 (2096 Mhz)
      MEMORY: 15.4 GB
         PID: 4240
     COMMAND: "crash"
        TASK: ffff9e0d1eefc200 [THREAD_INFO: ffff9e0d083f4000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash>
This output is normal, and you can move on to loading the dump:
crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore
Here /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore is the dump file inside the directory that was generated after the kernel panic.
Loading it in crash shows the reason for the panic.
Example 1:
Creating a kernel panic on purpose
The following command triggers one directly; the system reboots within a few seconds
echo c > /proc/sysrq-trigger
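Example 2:
Triggering a panic through OOM:
As mentioned in an earlier post, one of my fio commands ran the system out of memory and triggered the oom-killer. The kernel can be configured to panic and reboot as soon as OOM occurs. The commands are: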
sysctl -w vm.panic_on_oom=1
sysctl -w kernel.panic=10
echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
echo "kernel.panic=10" >> /etc/sysctl.conf
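With these settings the system panics as soon as OOM is hit and reboots 10 seconds later (kernel.panic=10), and a new directory is generated under /var/crash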
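One caveat with appending through echo >>: running the commands twice leaves duplicate lines in /etc/sysctl.conf. The sketch below shows one way to make the persistent setting idempotent. set_sysctl_conf is a hypothetical helper, it relies on GNU sed -i, and it assumes simple keys and values without | or & characters.

```shell
#!/usr/bin/env bash
# set_sysctl_conf is a hypothetical helper: it sets key=value in a
# sysctl-style file, replacing an existing line for the key or appending
# a new one, so repeated runs do not create duplicates.
set_sysctl_conf() {
    local key="$1" value="$2" file="${3:-/etc/sysctl.conf}"
    local pattern
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    touch "$file"
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file"; then
        sed -i -E "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (as root):
#   set_sysctl_conf vm.panic_on_oom 1
#   set_sysctl_conf kernel.panic 10
#   sysctl -p    # reload /etc/sysctl.conf
```

The current runtime values can be read back with sysctl vm.panic_on_oom kernel.panic.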
Below is how I triggered the OOM.
First the fio configuration. For why it runs out of memory, see my earlier post:
https://blog.csdn.net/weixin_44517278/article/details/131661105
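Save the following configuration as fio.conf. Note that filename=/dev/nvme0n1 makes fio write to the raw device, which destroys any data on it.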
[JEDEC-219]
ioengine=libaio
direct=1
rw=randrw
norandommap
randrepeat=0
rwmixread=40
iodepth=128
numjobs=4
bssplit=512/4:1024/1:1536/1:2048/1:2560/1:3072/1:3584/1:4k/67:8k/10:16k/7:32k/3:64k/3
blockalign=4k
random_distribution=zoned:50/5:30/15:20/80
loops=10000
filename=/dev/nvme0n1
group_reporting
write_iops_log=iops.log
write_bw_log=bw.log
write_lat_log=lat.log
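Then, to trigger it quickly, I launch fio repeatedly in a for loop:
for i in {0..100}; do nohup fio fio.conf & sleep 1; done
This triggers the OOM panic very quickly and the system reboots. After the reboot, /var/crash contains a directory named with the date and time of the crash; in my test it was /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore. It can then be analyzed with the command shown earlier: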
[root@localhost vmcore]# crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore
crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [932MB]: patching 87476 gdb minimal_symbol values

      KERNEL: /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2023-12-04-09:56:53/vmcore [PARTIAL DUMP]
        CPUS: 12
        DATE: Mon Dec 4 09:56:51 2023
      UPTIME: 00:11:41
LOAD AVERAGE: 60.96, 26.74, 13.81
       TASKS: 915
    NODENAME: localhost.localdomain
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
     MACHINE: x86_64 (2095 Mhz)
      MEMORY: 15.4 GB
       PANIC: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled"
         PID: 7020
     COMMAND: "fio"
        TASK: ffff9b237633c200 [THREAD_INFO: ffff9b24f013c000]
         CPU: 8
       STATE: TASK_RUNNING (PANIC)
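As shown above, the PANIC field gives the reason for the panic: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled". The PID and COMMAND fields show that the task running at that moment was fio.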
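When several dumps need to be compared, it helps to pull just the PANIC field out of saved crash output. A minimal sketch follows, assuming the summary has been saved to a text file with one field per line; panic_reason is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# panic_reason is a hypothetical helper. It reads crash summary output on
# stdin and prints the text after "PANIC:" from the first matching line.
panic_reason() {
    sed -n 's/^[[:space:]]*PANIC:[[:space:]]*//p' | head -n 1
}

# Usage, with summary.txt being a saved copy of the crash banner:
#   panic_reason < summary.txt
```

The output keeps the surrounding double quotes exactly as crash printed them.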
Example 3:
WARNING: kernel relocated [54MB]: patching 87292 gdb minimal_symbol values

      KERNEL: /lib/debug/lib/modules/3.10.0-1160.el7.x86_64/vmlinux
    DUMPFILE: ./127.0.0.1-2023-10-15-12.14.31/vmcore [PARTIAL DUMP]
        CPUS: 112
        DATE: Mon Oct 16 00:13:16 2023
      UPTIME: 2 days, 04:29:35
LOAD AVERAGE: 9.20, 8.26, 8.13
       TASKS: 990
    NODENAME: sh-dell01
     RELEASE: 3.10.0-1160.el7.x86_64
     VERSION: #1 SMP Mon Oct 19 16:18:59 UTC 2020
     MACHINE: x86_64 (2000 Mhz)
      MEMORY: 63.3 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
         PID: 44133
     COMMAND: "umount"
        TASK: ffff8b476bc9b180 [THREAD_INFO: ffff8b40973d8000]
         CPU: 61
       STATE: TASK_RUNNING (PANIC)
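Analysis command: ps lists the processes that existed when the system crashed. Entries marked with > were active on a CPU at that moment and are candidates for having caused the crash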
crash> ps
  44118      2  15  ffff8b4f4851e300  IN   0.0       0      0  [kworker/15:2]
  44125      2  68  ffff8b479fb29080  IN   0.0       0      0  [kworker/68:2]
> 44133      1  61  ffff8b476bc9b180  RU   0.0  123620   1220  umount
  44136      2  58  ffff8b4789091080  IN   0.0       0      0  [kworker/58:2]
  44139      1  58  ffff8b476bc9c200  UN   0.0  123620   1224  umount
  44141      1  59  ffff8b476bc9d280  UN   0.0  123608    996  swapoff
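On a machine with many tasks the ps listing is long, so filtering for the > marker is handy. The sketch below works on ps output that has been saved to a file. active_tasks is a hypothetical helper, and it assumes the command name is the last whitespace-separated field, which does not hold for names containing spaces.

```shell
#!/usr/bin/env bash
# active_tasks is a hypothetical helper. It reads saved "ps" output from
# crash on stdin and prints "PID COMMAND" for each line marked with ">".
active_tasks() {
    awk '/^>/ {
        sub(/^>[[:space:]]*/, "")   # drop the marker; awk re-splits the fields
        print $1, $NF
    }'
}

# Usage, with ps.txt being saved ps output:
#   active_tasks < ps.txt
```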
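Analysis command: log prints the full dmesg buffer as it was at crash time (the crash reboots the system, so this is where the dmesg from before the reboot can be read)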
crash> log
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
..............................
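The log output is usually long, so it can be convenient to save it, together with the output of the other commands, to files instead of reading it at the prompt. The sketch below only generates a command file. write_crash_cmds is a hypothetical helper, and it relies on two things worth verifying against crash -h on your version: that crash commands accept > redirection, and that crash -i FILE runs the commands in FILE at startup.

```shell
#!/usr/bin/env bash
# write_crash_cmds is a hypothetical helper. It writes a crash command
# file that saves the output of sys, log, bt and ps under out_dir.
write_crash_cmds() {
    local out_dir="$1" cmd_file="$2"
    mkdir -p "$out_dir"
    {
        printf 'sys > %s/sys.txt\n' "$out_dir"
        printf 'log > %s/log.txt\n' "$out_dir"
        printf 'bt > %s/bt.txt\n' "$out_dir"
        printf 'ps > %s/ps.txt\n' "$out_dir"
        printf 'quit\n'
    } > "$cmd_file"
}

# Usage (assumes crash supports -s for a quiet start and -i for an input file):
#   write_crash_cmds /tmp/panic-report /tmp/crash-cmds.txt
#   crash -s -i /tmp/crash-cmds.txt <vmlinux> <vmcore>
```

The saved log.txt can then be searched with ordinary tools, for example grep -n BUG log.txt.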
Analysis command: bt prints the kernel stack backtrace of the task that was running when the system crashed
crash> bt
PID: 44133  TASK: ffff8b476bc9b180  CPU: 61  COMMAND: "umount"
 #0 [ffff8b40973db980] machine_kexec at ffffffff84666294
 #1 [ffff8b40973db9e0] __crash_kexec at ffffffff84722562
 #2 [ffff8b40973dbab0] crash_kexec at ffffffff84722650
 #3 [ffff8b40973dbac8] oops_end at ffffffff84d8b798
 #4 [ffff8b40973dbaf0] no_context at ffffffff84675d14
 #5 [ffff8b40973dbb40] __bad_area_nosemaphore at ffffffff84675fe2
 #6 [ffff8b40973dbb90] bad_area_nosemaphore at ffffffff84676104
 #7 [ffff8b40973dbba0] __do_page_fault at ffffffff84d8e750
 #8 [ffff8b40973dbc10] do_page_fault at ffffffff84d8e975
 #9 [ffff8b40973dbc40] page_fault at ffffffff84d8a778
    [exception RIP: jbd2_superblock_csum+58]
    RIP: ffffffffc06f969a  RSP: ffff8b40973dbcf8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff8b4778d39000  RCX: ffff8b40973dbfd8
    RDX: 0000000000000000  RSI: ffff8b4778d39000  RDI: ffff8b4f72e9d800
    RBP: ffff8b40973dbd28   R8: ffff8b40973dbdc8   R9: 0000000000000001
    R10: 0000000000000001  R11: ffff8b47895d3200  R12: 000000000e33f513
    R13: ffff8b4f72e9d800  R14: 0000000000001c11  R15: ffff8b4778d39000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8b40973dbd30] jbd2_write_superblock at ffffffffc06fa61c [jbd2]
#11 [ffff8b40973dbd70] jbd2_mark_journal_empty at ffffffffc06facbd [jbd2]
#12 [ffff8b40973dbda0] jbd2_journal_destroy at ffffffffc06faf6e [jbd2]
#13 [ffff8b40973dbe10] ext4_put_super at ffffffffc0913680 [ext4]
#14 [ffff8b40973dbe50] generic_shutdown_super at ffffffff8485051d
#15 [ffff8b40973dbe70] kill_block_super at ffffffff84850997
#16 [ffff8b40973dbe90] deactivate_locked_super at ffffffff84850cfe
#17 [ffff8b40973dbeb0] deactivate_super at ffffffff84851486
#18 [ffff8b40973dbec8] cleanup_mnt at ffffffff84870b0f
#19 [ffff8b40973dbee0] __cleanup_mnt at ffffffff84870ba2
#20 [ffff8b40973dbef0] task_work_run at ffffffff846c275b
#21 [ffff8b40973dbf30] do_notify_resume at ffffffff8462cc65
#22 [ffff8b40973dbf50] int_signal at ffffffff84d942ef
    RIP: 00007f785a783a07  RSP: 00007ffc82e094e8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00005597e34bc040  RCX: ffffffffffffffff
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 00005597e34c2280
    RBP: 00005597e34c2280   R8: 00005597e34c21f0   R9: 0000000000000000
    R10: 00007ffc82e08920  R11: 0000000000000246  R12: 00007f785b301d78
    R13: 0000000000000000  R14: 00005597e34bc140  R15: 00005597e34bc040
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
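The backtrace lists frames from the innermost (#0) to the outermost (#22). The faulting instruction is the one reported as [exception RIP: jbd2_superblock_csum+58] (RIP ffffffffc06f969a), reached from umount through ext4_put_super and jbd2_journal_destroy, which fits the NULL pointer dereference in the PANIC message. Frame #22, int_signal at ffffffff84d942ef, is the outermost kernel frame on the return path to user space, not the point of failure.
Any address in the backtrace can be examined further. Here I look up ffffffff84d942ef from frame #22; the same steps apply to the exception RIP.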
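In a backtrace like this one, the line to look for first is [exception RIP: ...], since it names the instruction that faulted. The sketch below extracts that symbol from saved bt output so it can be passed to dis. exception_rip is a hypothetical helper name, and whether dis -l accepts the symbol+offset form on a given crash version is an assumption to verify.

```shell
#!/usr/bin/env bash
# exception_rip is a hypothetical helper. It reads saved "bt" output on
# stdin and prints the symbol+offset from the first "[exception RIP: ...]"
# line, or nothing if there is no such line.
exception_rip() {
    sed -n 's/.*\[exception RIP: \([^]]*\)\].*/\1/p' | head -n 1
}

# Usage, with bt.txt being saved bt output:
#   exception_rip < bt.txt
# The result can then be examined at the crash prompt, for example:
#   crash> dis -l jbd2_superblock_csum+58
```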
Analysis command: dis disassembles the given address; with -l it also prints the source file and line number for it
crash> dis -l ffffffff84d942ef
/usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
0xffffffff84d942ef <int_signal+18>: mov $0xfe0e,%edi
The output points to /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
The source at that location can be viewed directly:
crash> cat -n /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S
# Output trimmed to the relevant lines
.....
695 int_signal:
696     testl $_TIF_DO_NOTIFY_MASK,%edx
697     jz 1f
698     movq %rsp,%rdi          # &ptregs -> arg1
699     xorl %esi,%esi          # oldset -> arg2
700     call do_notify_resume
701 1:  movl $_TIF_WORK_MASK,%edi
702 int_restore_rest:
703     RESTORE_REST
704     DISABLE_INTERRUPTS(CLBR_NONE)
705     TRACE_IRQS_OFF
706     jmp int_with_check
707     CFI_ENDPROC
708 END(system_call)
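Awkwardly, getting this far was not much use to me, because I could not follow the source...
That is all for now. I will add more kernel panic cases here as I run into them.
References: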
https://blog.csdn.net/linuxvfast/article/details/116591523
https://blog.csdn.net/weixin_45030965/article/details/124960224