Troubleshooting and Fixing Memory Errors on Linux

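Over the past few days the server has kept dropping offline, so it was time to investigate. A good first step is to look at the hardware. This is a server pieced together from parts bought on a second-hand marketplace: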

sudo lshw -short
H/W path           Device          Class          Description
=============================================================
                                   system         NF5270M3 (To be filled by O.E.M.)
/0                                 bus            NF5270M3
/0/0                               memory         64KiB BIOS
/0/4                               processor      Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
/0/4/5                             memory         384KiB L1 cache
/0/4/6                             memory         1536KiB L2 cache
/0/4/7                             memory         15MiB L3 cache
/0/6                               processor      Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
/0/6/9                             memory         384KiB L1 cache
/0/6/a                             memory         1536KiB L2 cache
/0/6/b                             memory         15MiB L3 cache
/0/2c                              memory         24GiB System Memory
/0/2c/0                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/1                            memory         DIMM Synchronous [empty]
/0/2c/2                            memory         DIMM Synchronous [empty]
/0/2c/3                            memory         DIMM Synchronous [empty]
/0/2c/4                            memory         DIMM Synchronous [empty]
/0/2c/5                            memory         DIMM Synchronous [empty]
/0/2c/6                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/7                            memory         DIMM Synchronous [empty]
/0/2c/8                            memory         DIMM Synchronous [empty]
/0/2c/9                            memory         DIMM Synchronous [empty]
/0/2c/a                            memory         DIMM Synchronous [empty]
/0/2c/b                            memory         DIMM Synchronous [empty]
/0/2c/c                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/d                            memory         DIMM Synchronous [empty]
/0/2c/e                            memory         DIMM Synchronous [empty]
/0/2c/f                            memory         8GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/2c/10                           memory         DIMM Synchronous [empty]
/0/2c/11                           memory         DIMM Synchronous [empty]
/0/2c/12                           memory         DIMM Synchronous [empty]
/0/2c/13                           memory         DIMM Synchronous [empty]
/0/100/3/0         /dev/nvme0      storage        LITEON CA3-8D128-HP
/0/100/3/0/0       hwmon0          disk           NVMe disk
/0/100/3/0/2       /dev/ng0n1      disk           NVMe disk
/0/100/3/0/1       /dev/nvme0n1    disk           128GB NVMe disk
/0/100/3/0/1/1                     volume         1074MiB Windows FAT volume
/0/100/3/0/1/2     /dev/nvme0n1p2  volume         2GiB EXT4 volume
/0/100/3/0/1/3     /dev/nvme0n1p3  volume         116GiB EFI partition
/0/100/1f.2/0      /dev/sda        disk           500GB WDC WD5000AAKX-0
/0/100/1f.2/0/1    /dev/sda1       volume         465GiB EXT4 volume
/0/100/1f.2/1      /dev/sdb        disk           500GB WDC WD5000AAKX-2
/0/100/1f.2/1/1    /dev/sdb1       volume         465GiB EXT4 volume

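After the network dropped, I plugged in an HDMI display to see what was on the screen and found messages mentioning "Memory", which suggested a faulty memory module.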

After rebooting, check the system log:

tail -200 /var/log/syslog
2025-04-04T18:23:54.720029+08:00 talos kernel: Memory failure: 0x46fab5: unhandlable page.
2025-04-04T18:23:55.230128+08:00 talos kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
2025-04-04T18:23:55.230140+08:00 talos kernel: {3}[Hardware Error]: It has been corrected by h/w and requires no further action
2025-04-04T18:23:55.230141+08:00 talos kernel: {3}[Hardware Error]: event severity: corrected
2025-04-04T18:23:55.230143+08:00 talos kernel: {3}[Hardware Error]: Error 0, type: corrected
2025-04-04T18:23:55.230144+08:00 talos kernel: {3}[Hardware Error]: fru_text: CorrectedErr
2025-04-04T18:23:55.230145+08:00 talos kernel: {3}[Hardware Error]: section_type: memory error
2025-04-04T18:23:55.230146+08:00 talos kernel: {3}[Hardware Error]: node:0 device:0 
2025-04-04T18:23:55.230147+08:00 talos kernel: {3}[Hardware Error]: error_type: 2, single-bit ECC
2025-04-04T18:24:01.695052+08:00 talos kernel: RAS: Soft-offlining pfn: 0x104e5c
2025-04-04T18:24:01.695076+08:00 talos kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
2025-04-04T18:24:01.695080+08:00 talos kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00050000010092
2025-04-04T18:24:01.695082+08:00 talos kernel: EDAC sbridge MC0: TSC 0 
2025-04-04T18:24:01.695084+08:00 talos kernel: EDAC sbridge MC0: ADDR 104e5c8c0 
2025-04-04T18:24:01.695086+08:00 talos kernel: EDAC sbridge MC0: MISC 40584e86 
2025-04-04T18:24:01.695088+08:00 talos kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1743762241 SOCKET 0 APIC 0
2025-04-04T18:24:01.695090+08:00 talos kernel: EDAC MC0: 20 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x104e5c offset:0x8c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
2025-04-04T18:24:01.695094+08:00 talos kernel: Memory failure: 0x104e5c: unhandlable page.

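The system log shows that the machine is hitting serious memory errors that originate at the hardware level: the firmware reports corrected single-bit ECC errors, the EDAC driver attributes them to CPU_SrcID#0_Ha#0_Chan#2_DIMM#0, and the kernel reports the affected pages as "unhandlable".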
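The ADDR, page, and pfn values in these log lines describe the same memory location. A minimal sketch of the arithmetic, assuming 4 KiB pages (a 12-bit page offset, as on typical x86_64 kernels); the function name `split_addr` is made up for illustration:

```shell
#!/bin/sh
# Split a physical address into page frame number and in-page offset.
# Assumption: 4 KiB pages, so the low 12 bits are the offset.
split_addr() {
    addr=$((0x${1#0x}))                 # accepts "104e5c8c0" or "0x104e5c8c0"
    printf 'page:0x%x offset:0x%x\n' $((addr >> 12)) $((addr & 0xfff))
}

# ADDR value taken from the "EDAC sbridge MC0: ADDR 104e5c8c0" line in the log.
split_addr 104e5c8c0
```

The result matches the `page:0x104e5c offset:0x8c0` fields of the EDAC line, and the same page number appears in the "Soft-offlining pfn" and "Memory failure" messages, so all three refer to one page.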

Check the detailed error log:

sudo dmesg | grep -i error
[19108.267949] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[19108.267972] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[19108.267976] {2}[Hardware Error]: event severity: corrected
[19108.267985] {2}[Hardware Error]:  Error 0, type: corrected
[19108.267992] {2}[Hardware Error]:  fru_text: CorrectedErr
[19108.267997] {2}[Hardware Error]:   section_type: memory error
[19108.268003] {2}[Hardware Error]:   node:0 device:0 
[19108.268005] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[19114.873932] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19114.874122] EDAC MC0: 16385 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f934 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19118.239275] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19118.239533] EDAC MC0: 25284 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fb35 offset:0x5c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19128.825566] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19128.825743] EDAC MC0: 16708 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab5 offset:0x5c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19133.700096] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19133.700127] EDAC MC0: 32750 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f834 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19135.870233] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19135.870309] EDAC MC0: 16687 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fa34 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19138.224432] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19138.224502] EDAC MC0: 15745 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46fcb4 offset:0xd80 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19140.213293] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19140.213328] EDAC MC0: 15575 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x10aac5 offset:0x1c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19141.210137] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19141.210164] EDAC MC0: 19211 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab4 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19141.906759] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19141.906780] EDAC MC0: 16437 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46f9b4 offset:0xbc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19143.127824] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19143.127876] EDAC MC0: 24609 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46f835 offset:0x680 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19145.175716] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19145.175754] EDAC MC0: 5555 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7f3ec offset:0x280 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19148.183616] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19148.183654] EDAC MC0: 4858 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1021ad offset:0x180 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19149.143580] mce: [Hardware Error]: Machine check events logged
[19149.143583] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19149.143619] EDAC MC0: 4223 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x7f3ee offset:0xec0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19149.143629] mce: [Hardware Error]: Machine check events logged
[19151.167012] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19151.167036] EDAC MC0: 4119 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7f3ec offset:0x280 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19152.151462] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19152.151502] EDAC MC0: 3976 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x46f835 offset:0x680 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:1 rank:1 )
[19153.175444] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19153.175485] EDAC MC0: 24245 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fc34 offset:0x9c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )
[19169.174851] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[19169.174898] EDAC MC0: 48 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x46fab5 offset:0x3c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1 )

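Judging from the hardware error log in the dmesg output, the system is experiencing severe ECC memory errors, concentrated on two locations: Channel 2, DIMM 0 and Channel 0, DIMM 0. All of them are corrected errors (CE) on socket 0.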
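Reading the DIMM labels off by eye gets tedious when the log is long. The following is a rough sketch that tallies the EDAC lines per DIMM label. It assumes the line layout seen in the output above (an error count immediately before `CE`, and the label immediately after `on`); the function name `tally_ce` is hypothetical.

```shell
#!/bin/sh
# Summarize corrected-error (CE) reports per DIMM label from dmesg-style input.
# Assumption: lines look like
#   "... EDAC MC0: <count> CE memory read error on <label> (...)"
tally_ce() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ CE / {
            for (i = 2; i < NF; i++) {
                if ($i == "CE") count = $(i - 1)           # errors in this report
                if ($i == "on") { label = $(i + 1); break } # DIMM label
            }
            events[label]++
            errors[label] += count
        }
        END {
            for (label in events)
                printf "%s events=%d errors=%d\n", label, events[label], errors[label]
        }
    ' | sort
}
```

It reads from standard input, for example `sudo dmesg | tally_ce`. Counting the EDAC lines in the excerpt above by label gives 11 reports for `CPU_SrcID#0_Ha#0_Chan#2_DIMM#0` and 6 for `CPU_SrcID#0_Ha#0_Chan#0_DIMM#0`.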

Memory slots and CPU information

sudo dmidecode -t memory | grep -A10 "Memory Device$" | egrep "Locator|Bank Locator|Size"
	Size: 8 GB
	Locator: Node0_Dimm0
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm1
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm2
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm3
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm4
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm5
	Bank Locator: Node0_Bank0
	Size: 8 GB
	Locator: Node0_Dimm6
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm7
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm8
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm9
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm10
	Bank Locator: Node0_Bank0
	Size: No Module Installed
	Locator: Node0_Dimm11
	Bank Locator: Node0_Bank0
	Size: 8 GB
	Locator: Node1_Dimm0
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm1
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm2
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm3
	Bank Locator: Node1_Bank0
	Size: 8 GB
	Locator: Node1_Dimm4
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm5
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm6
	Bank Locator: Node1_Bank0
	Size: No Module Installed
	Locator: Node1_Dimm7
	Bank Locator: Node1_Bank0

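From the listing above, the faulty module should be the one in the first slot of CPU0. Removing the memory module from that slot restored normal operation.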
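With twenty slots and only four modules, it can help to reduce the dmidecode listing to the populated slots before opening the chassis. A small sketch, assuming each memory device prints its `Size:` line before its `Locator:` line, as in the output above; `populated_slots` is a made-up helper name:

```shell
#!/bin/sh
# Print "<locator> <size>" for every slot that has a module installed.
# Assumption: within each Memory Device, "Size:" appears before "Locator:".
populated_slots() {
    awk '
        $1 == "Size:" {
            size = $2
            for (i = 3; i <= NF; i++) size = size " " $i   # e.g. "8 GB"
        }
        # "Bank Locator:" lines are skipped because their first field is "Bank".
        $1 == "Locator:" && size != "No Module Installed" { print $2, size }
    '
}
```

It reads dmidecode output from standard input, for example `sudo dmidecode -t memory | populated_slots`. For the listing above, the populated slots are Node0_Dimm0, Node0_Dimm6, Node1_Dimm0, and Node1_Dimm4, each with 8 GB.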

