Environment
- Build host: amd64 + Ubuntu 22.04
- Android source: Android 15 GKI
- Kernel version: Linux 6.6 (the crash log below reports 6.6.58-android15-8)
- Android build system: Bazel
- Toolchain: gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-
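Steps for locating a Linux kernel crash
When the Linux kernel crashes it usually prints a stack trace. From that output you can tell the approximate cause of the crash, the system state at the time, and what the kernel was executing.
Steps for locating the fault from the kernel crash log:
- Identify the type of fault and where it occurred from the log
- Look up the symbol address in System.map
- Use the addr2line tool to map the address to a source location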
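The first step can be sketched as a pair of grep filters over a saved log. This is only a sketch: the temporary file stands in for a real dmesg or serial console capture, and the lines written into it are copied from the crash log shown later in this article so that the snippet runs on its own.

```shell
# Stand-in for a saved dmesg / serial console capture (assumption).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[ 6.974145][ T1] arm,isp e8100000.isp: Adding to iommu group 11
[ 6.980371][ T1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7.045464][ T1] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 7.083896][ T1] pc : readl+0x38/0x80
[ 7.087819][ T1] lr : readl+0x38/0x80
[ 7.177136][ T1] readl+0x38/0x80
[ 7.180708][ T1] isp_clk_gate_onoff+0x5c/0x204
EOF

# Fault cause plus the pc/lr lines.
SUMMARY=$(grep -E 'Unable to handle|Internal error|\] (pc|lr) :' "$LOG")
echo "$SUMMARY"

# Every symbol+offset/size entry (pc, lr and call-trace frames).
FRAMES=$(grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' "$LOG")
echo "$FRAMES"

rm -f "$LOG"
```

The first filter gives the cause of the crash, the second gives the symbol+offset entries that the later steps resolve to source lines.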
Example: locating the fault position of a Linux kernel crash
Finding the fault information in the log
[ 6.974145][ T1] arm,isp e8100000.isp: Adding to iommu group 11
[ 6.980371][ T1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 // this line shows a NULL pointer dereference
[ 6.989848][ T1] Mem abort info:
[ 6.993331][ T1] ESR = 0x0000000096000005
[ 6.997772][ T1] EC = 0x25: DABT (current EL), IL = 32 bits
[ 7.003775][ T1] SET = 0, FnV = 0
[ 7.007521][ T1] EA = 0, S1PTW = 0
[ 7.011355][ T1] FSC = 0x05: level 1 translation fault
[ 7.016923][ T1] Data abort info:
[ 7.020495][ T1] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 7.026672][ T1] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 7.032416][ T1] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 7.038419][ T1] [0000000000000000] user address but active_mm is swapper
[ 7.045464][ T1] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 7.052421][ T1] Modules linked in:
[ 7.056167][ T1] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.6.58-android15-8-maybe-dirty-4k-SE-SDK2P5 #1 1400000003000000474e55008fa9e0c15629191d
[ 7.069549][ T1] Hardware name: TI Davince Evaluation board (DT)
[ 7.076245][ T1] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7.083896][ T1] pc : readl+0x38/0x80
[ 7.087819][ T1] lr : readl+0x38/0x80
[ 7.091738][ T1] sp : ffffffc0828eb7f0
[ 7.095743][ T1] x29: ffffffc0828eb7f0 x28: 0000000000000000 x27: 0000000000000000
[ 7.103570][ T1] x26: 0000000000000000 x25: 0000000000000000 x24: ffffff8c97335d70
[ 7.111396][ T1] x23: ffffff8eded5d8a8 x22: 0000000000000001 x21: ffffff8c97335d94
[ 7.119220][ T1] x20: ffffffc080cc3e64 x19: 0000000000000000 x18: ffffffc0828c50a0
[ 7.127046][ T1] x17: ffffffc0826e7a40 x16: ffffffc0826e7a70 x15: 001f00003fffffff
[ 7.134872][ T1] x14: 0000000000000901 x13: 2000000000000000 x12: 0000000000000008
[ 7.142697][ T1] x11: 000000000000002b x10: 0000000000000200 x9 : 0000000000000400
[ 7.150522][ T1] x8 : 0000000000000007 x7 : 6e69616d6f642d72 x6 : 0000000000000004
[ 7.158348][ T1] x5 : 0000000000005dc8 x4 : ffffffc08181b2a8 x3 : ffffffc080cc3e64
[ 7.166173][ T1] x2 : ffffffc080cc400c x1 : 0000000000000000 x0 : 0000000000000020
[ 7.173998][ T1] Call trace:
[ 7.177136][ T1] readl+0x38/0x80 // isp_clk_gate_onoff -> readl hit the NULL pointer
[ 7.180708][ T1] isp_clk_gate_onoff+0x5c/0x204 // the NULL pointer dereference happened while isp_clk_gate_onoff() in the isp driver was executing
[ 7.185495][ T1] isp_platform_probe+0x3ac/0x9f8
[ 7.190369][ T1] platform_probe+0xc0/0xec
[ 7.194724][ T1] really_probe+0x190/0x374
[ 7.199076][ T1] __driver_probe_device+0xa0/0x12c
[ 7.204122][ T1] driver_probe_device+0x3c/0x218
[ 7.208996][ T1] __driver_attach+0x110/0x1ec
[ 7.213608][ T1] bus_for_each_dev+0x104/0x160
[ 7.218310][ T1] driver_attach+0x24/0x34
[ 7.222576][ T1] bus_add_driver+0x154/0x270
[ 7.227104][ T1] driver_register+0x68/0x104
[ 7.231630][ T1] __platform_driver_probe+0x50/0xc8
[ 7.236764][ T1] fw_module_init+0x30/0x78
[ 7.241118][ T1] do_one_initcall+0xdc/0x360
[ 7.245645][ T1] do_initcall_level+0xc8/0x19c
[ 7.250347][ T1] do_initcalls+0x70/0xc0
[ 7.254527][ T1] do_basic_setup+0x1c/0x28
[ 7.258880][ T1] kernel_init_freeable+0xd0/0x138
[ 7.263841][ T1] kernel_init+0x20/0x1ac
[ 7.268022][ T1] ret_from_fork+0x10/0x20
[ 7.272290][ T1] Code: aa1303e1 aa1e03e3 aa1e03f4 97e989cd (b9400268)
[ 7.279072][ T1] ---[ end trace 0000000000000000 ]---
[ 7.287048][ T1] Kernel panic - not syncing: Oops: Fatal exception
[ 7.293483][ T1] SMP: stopping secondary CPUs
[ 7.298099][ T1] Kernel Offset: disabled
[ 7.302277][ T1] CPU features: 0x000002,c0000000,70020143,1001720b
Cause of the fault:
[ 6.980371][ T1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7.045464][ T1] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
These two log lines show that the kernel crashed because of a NULL pointer dereference.
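Fault location:
[ 7.083896][ T1] pc : readl+0x38/0x80
This line identifies the operation that triggered the fault: the pc is inside readl(), so the fault was raised by a register read through a NULL address.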
Call stack at the time of the fault:
[ 7.173998][ T1] Call trace:
[ 7.177136][ T1] readl+0x38/0x80
[ 7.180708][ T1] isp_clk_gate_onoff+0x5c/0x204
[ 7.185495][ T1] isp_platform_probe+0x3ac/0x9f8
[ 7.190369][ T1] platform_probe+0xc0/0xec
[ 7.194724][ T1] really_probe+0x190/0x374
[ 7.199076][ T1] __driver_probe_device+0xa0/0x12c
[ 7.204122][ T1] driver_probe_device+0x3c/0x218
[ 7.208996][ T1] __driver_attach+0x110/0x1ec
[ 7.213608][ T1] bus_for_each_dev+0x104/0x160
[ 7.218310][ T1] driver_attach+0x24/0x34
[ 7.222576][ T1] bus_add_driver+0x154/0x270
[ 7.227104][ T1] driver_register+0x68/0x104
[ 7.231630][ T1] __platform_driver_probe+0x50/0xc8
[ 7.236764][ T1] fw_module_init+0x30/0x78
The call stack shows roughly which phase the fault occurred in. The log above shows that it happened during probe while the isp driver was being loaded. The entry "isp_clk_gate_onoff+0x5c/0x204" narrows the location further: the fault lies at offset 0x5c from the base address of the isp_clk_gate_onoff symbol, and 0x204 is the size of the isp_clk_gate_onoff function in bytes.
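As a small illustration of how a call-trace entry breaks down, the sketch below splits "symbol+offset/size" with plain shell parameter expansion. The variable names are only illustrative.

```shell
ENTRY='isp_clk_gate_onoff+0x5c/0x204'   # copied from the call trace above

SYMBOL=${ENTRY%%+*}   # text before '+': the function name
REST=${ENTRY#*+}      # text after '+': offset/size
OFFSET=${REST%%/*}    # offset of the reported location inside the function
SIZE=${REST#*/}       # total size of the function

# printf accepts hexadecimal constants for %d, which gives the decimal byte counts.
printf 'symbol=%s offset=%s (%d bytes) size=%s (%d bytes)\n' \
    "$SYMBOL" "$OFFSET" "$OFFSET" "$SIZE" "$SIZE"
```

An offset smaller than the size is a quick sanity check that the entry really lies inside the named function.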
Finding the base address in the System.map symbol table
Look up the address of the isp_clk_gate_onoff symbol in System.map.
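The lookup and the address calculation could look roughly like the sketch below. Note the assumptions: the one-line System.map is a stand-in, and the address ffffffc080cc48cc is a placeholder chosen so that base + 0x5c matches the address used in the addr2line command in this article. The real base must come from the System.map of the build that crashed.

```shell
# Stand-in System.map with a single PLACEHOLDER entry (not taken from a real build).
MAP=$(mktemp)
cat > "$MAP" <<'EOF'
ffffffc080cc48cc t isp_clk_gate_onoff
EOF

SYMBOL=isp_clk_gate_onoff
OFFSET=0x5c

# System.map lines are "<address> <type> <name>"; match the name column exactly.
BASE=$(awk -v sym="$SYMBOL" '$3 == sym { print $1; exit }' "$MAP")

# Kernel addresses do not fit the signed 64-bit range of shell arithmetic,
# so the offset is added to the low 32 bits only.
# Assumption: the addition does not carry into the upper 32 bits.
HI=${BASE%????????}
LO=${BASE#????????}
ADDR=$(printf '0x%s%08x' "$HI" "$((0x$LO + $OFFSET))")

echo "$SYMBOL base=0x$BASE address=$ADDR"
rm -f "$MAP"
```

With a real System.map, only the MAP variable changes; the awk match on the third column avoids picking up symbols that merely contain the name as a substring.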
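Determining the source location with addr2line
llvm-addr2line is used here to map the address to a source location. The reason for not using aarch64-none-linux-gnu-addr2line is explained in the "Problems encountered" section.
step1. Add llvm-addr2line to PATH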
export PATH=/data/yuxi/xx-builder/src/android-gki/prebuilts/clang/host/linux-x86/llvm-binutils-stable:$PATH
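/data/yuxi/xx-builder/src/android-gki is my local Android 15 source directory; the LLVM tools used by the Android build are available under that tree.
step2. Map the code address to a source location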
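A possible invocation for this step is sketched below. The vmlinux path is an assumption (it should be the unstripped vmlinux of the build that crashed), and the address is the one used in the addr2line command in the "Problems encountered" section. The guard only exists so that the snippet does not fail when the tool or the file is absent.

```shell
VMLINUX=vmlinux              # assumption: unstripped vmlinux of the crashing build
ADDR=0xffffffc080cc4928      # address used elsewhere in this article

# -e: input ELF file, -f: print function names,
# -i: also print inlined frames, -p: one line per frame
CMD="llvm-addr2line -f -i -p -e $VMLINUX $ADDR"

if command -v llvm-addr2line >/dev/null 2>&1 && [ -f "$VMLINUX" ]; then
    $CMD
else
    echo "would run: $CMD"
fi
```

The -i option is worth keeping for driver code, because small accessors are often inlined and the reported line would otherwise point only at the innermost function.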
Disassemble the fault location with objdump
The disassembly, together with the crash log, allows a deeper analysis of the problem.
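One way to get that disassembly is llvm-objdump restricted to the function of interest, sketched below. As before, the vmlinux path is an assumption and the guard only keeps the snippet harmless when the tool or file is missing.

```shell
VMLINUX=vmlinux              # assumption: unstripped vmlinux of the crashing build
SYMBOL=isp_clk_gate_onoff    # function taken from the call trace

# -d: disassemble, -l: annotate with source file and line numbers,
# --disassemble-symbols: limit the output to the named function
CMD="llvm-objdump -d -l --disassemble-symbols=$SYMBOL $VMLINUX"

if command -v llvm-objdump >/dev/null 2>&1 && [ -f "$VMLINUX" ]; then
    $CMD
else
    echo "would run: $CMD"
fi
```

In the output, the instruction at base + 0x5c and the branch just before it can then be compared with the register dump in the crash log.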
Problems encountered
1. aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
Command executed: aarch64-none-linux-gnu-addr2line -e vmlinux 0xffffffc080cc4928
Error output:
aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
aarch64-none-linux-gnu-addr2line: vmlinux: unable to initialize decompress status for section .debug_aranges
aarch64-none-linux-gnu-addr2line: vmlinux: file format not recognized
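Cause:
vmlinux is the statically linked executable produced by the Linux kernel build, normally in ELF format. From earlier Linux kernel experience, this file is the raw, uncompressed kernel image. The error message, however, indicates that the .debug_aranges debug section in this vmlinux is compressed (the kernel can be built with compressed debug sections), and the aarch64-none-linux-gnu- toolchain used here happens to be unable to decompress that format.
Solution:
Use the LLVM toolchain. It generally has better support for newer ELF features, and an LLVM toolchain is already available with the Android 15 source build.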
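To check whether the debug sections really are compressed, the section headers can be inspected with llvm-readelf, as in the sketch below. The helper name check_debug_compression is made up for illustration. Whether this particular build enables a kernel option such as CONFIG_DEBUG_INFO_COMPRESSED_ZLIB or CONFIG_DEBUG_INFO_COMPRESSED_ZSTD is an assumption that the output can confirm or rule out.

```shell
# Hypothetical helper: lists the .debug_* sections of a vmlinux.
# Returns 2 when llvm-readelf or the file is missing, otherwise grep's exit status.
check_debug_compression() {
    vmlinux=$1
    if ! command -v llvm-readelf >/dev/null 2>&1 || [ ! -f "$vmlinux" ]; then
        echo "llvm-readelf or $vmlinux not found" >&2
        return 2
    fi
    # In the section header table, a 'C' in the Flg column marks a compressed section.
    llvm-readelf -S "$vmlinux" | grep '\.debug_'
}

check_debug_compression vmlinux || echo "nothing to report for ./vmlinux"
```

If the .debug_* sections carry the C flag, a binutils build without support for that compression format would be expected to fail in the way shown above.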