Troubleshooting approach for ACFS file system failures on a 6-node Oracle cluster

We recently hit several ACFS file system failures on a 6-node cluster. This article walks through the analysis of two typical incidents.

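The current environment is a 6-node Oracle Grid Infrastructure cluster built on VMware virtual machines. ACFS provides the shared data directory /data for the application, which is deployed on each of the six cluster hosts. No Oracle database runs in the cluster.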

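The two incidents are summarized as follows. Around 12:00 on June 24, access to the ACFS file system failed. The cluster alert log, the CSSD process log, and the OSWatcher (OSW) logs show that OSW collected no monitoring data from the host between 11:55 and 12:40, and that during the same period the cluster's CSSD process reported lost network heartbeats to other nodes over the private interconnect. From this we can infer that the host was hung at the time. Around 9:00 on July 2, the ACFS file system became inaccessible, and OSW monitoring was not running. The cluster alert log shows that an application process was still using the /data directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful evidence available, the cause of this ACFS access failure cannot be determined for now.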

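These incidents show that Oracle Clusterware, as software layered on top of the operating system, is affected by the underlying OS and by the VMware virtualization layer beneath it. Because the layers log at different granularities, the analysis becomes considerably more complex, and much of the evidence cannot be traced downward to a root cause.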

The analysis of the June 24 incident follows:

1. Cluster alert log

2019-06-24 11:32:43.138:
[ctssd(3268)]CRS-2408:The clock on host node5 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2019-06-24 12:17:40.220:
[cssd(3148)]CRS-1612:Network communication with node node2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.560 seconds
2019-06-24 12:17:48.222:
[cssd(3148)]CRS-1611:Network communication with node node2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.560 seconds
2019-06-24 12:17:52.223:
[cssd(3148)]CRS-1610:Network communication with node node2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.560 seconds
2019-06-24 12:17:54.790:
[cssd(3148)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node3 node4 node5 node6 .
2019-06-24 12:18:38.016:
[cssd(3148)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 node3 node4 node5 node6 .
2019-06-24 12:33:48.943:
[cssd(3148)]CRS-1662:Member kill requested by node node6 for member number 5, group ocr_oanew-cluster
2019-06-24 12:33:48.959:
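The three CRS-1612/1611/1610 warnings above each state how many seconds remain before node2 is removed. As a rough cross-check, the sketch below adds that remaining time to each warning's timestamp. The function name and the regular expressions are illustrative assumptions written against the sample lines shown here, not an Oracle-provided parser. For these entries, all three projections land at about 12:17:54.78, which lines up with the CRS-1601 reconfiguration logged at 12:17:54.790.

```python
import re
from datetime import datetime, timedelta

# Sample entries copied from the alert log above (timestamp line, then message line).
ALERT_LOG = """\
2019-06-24 12:17:40.220:
[cssd(3148)]CRS-1612:Network communication with node node2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.560 seconds
2019-06-24 12:17:48.222:
[cssd(3148)]CRS-1611:Network communication with node node2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.560 seconds
2019-06-24 12:17:52.223:
[cssd(3148)]CRS-1610:Network communication with node node2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.560 seconds
"""

# A line that holds only a timestamp followed by a colon.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):\s*$")
# The network-heartbeat warning with node name, percentage and remaining seconds.
MSG_RE = re.compile(
    r"CRS-(?:1612|1611|1610):Network communication with node (\S+) \(\d+\) "
    r"missing for (\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in ([\d.]+) seconds"
)


def projected_evictions(text):
    """Return (node, percent, projected removal time) for each warning."""
    results = []
    current_ts = None
    for line in text.splitlines():
        ts_match = TS_RE.match(line)
        if ts_match:
            current_ts = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        msg_match = MSG_RE.search(line)
        if msg_match and current_ts is not None:
            node, pct, remaining = msg_match.groups()
            results.append((node, int(pct), current_ts + timedelta(seconds=float(remaining))))
    return results


for node, pct, eviction in projected_evictions(ALERT_LOG):
    print(f"{node}: {pct}% warning, projected removal at {eviction.time()}")
```

If the projections from different warnings disagreed by more than a few milliseconds, that would suggest heartbeats resumed in between, which is not the case here.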

2. OSW monitoring data

Partial output follows:

zzz ***Mon Jun 24 11:55:04 CST 2019
Tasks: 520 total,   1 running, 519 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.9%us,  1.4%sy,  0.1%ni, 96.5%id,  0.0%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:  24608192k total, 24400720k used,   207472k free,   450168k buffers
Swap: 16383992k total,   149316k used, 16234676k free,  3719180k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21449 root      20   0 14.9g 6.5g  16m S 34.6 27.6 311:40.89 java
25134 root      20   0  109m 1212  892 D  5.9  0.0   0:50.58 find
25763 root      10 -10     0    0    0 S  4.0  0.0   1:33.21 oks_comm
 2522 root      20   0  157m  19m 6044 S  2.0  0.1 296:44.37 Xorg
25569 root      30  10  238m  12m 5388 S  2.0  0.1   0:07.87 floaters
32100 oracle    20   0  4636 1268  660 S  2.0  0.0   0:00.03 pidstat
32110 oracle    20   0  4648 1284  660 S  2.0  0.0   0:00.03 pidstat
32125 oracle    20   0 15300 1556  932 R  2.0  0.0   0:00.02 top
32152 root      20   0 7407m  11m 7084 S  2.0  0.0   0:00.02 jstat
  106 root      20   0     0    0    0 S  1.0  0.0   6:46.23 kblockd/0
24801 oracle    20   0 1835m  38m  16m S  1.0  0.2   5:47.71 oraagent.bin
25759 root      10 -10     0    0    0 S  1.0  0.0   0:04.91 oks_comm
25760 root      10 -10     0    0    0 S  1.0  0.0   0:05.08 oks_comm
25761 root      10 -10     0    0    0 S  1.0  0.0   0:04.99 oks_comm
25762 root      10 -10     0    0    0 S  1.0  0.0   0:17.44 oks_comm
27667 root      20   0  815m  19m  10m S  1.0  0.1 110:42.34 octssd.bin
27731 root      RT   0  756m  90m  57m S  1.0  0.4 823:45.84 osysmond.bin
    1 root      20   0 19364 1152  920 S  0.0  0.0   0:01.55 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.50 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   1:37.36 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:29.71 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:03.78 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   2:56.86 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:40.01 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:03.14 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   1:45.06 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:30.71 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:04.09 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   1:39.74 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:15.30 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:05.59 watchdog/3
   19 root      RT   0     0    0    0 S  0.0  0.0   1:21.81 migration/4
   20 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/4
   21 root      20   0     0    0    0 S  0.0  0.0   0:24.62 ksoftirqd/4
   22 root      RT   0     0    0    0 S  0.0  0.0   0:02.89 watchdog/4
   23 root      RT   0     0    0    0 S  0.0  0.0   2:59.13 migration/5
   24 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/5
   25 root      20   0     0    0    0 S  0.0  0.0   0:29.33 ksoftirqd/5
   26 root      RT   0     0    0    0 S  0.0  0.0   0:03.09 watchdog/5
   27 root      RT   0     0    0    0 S  0.0  0.0   2:05.44 migration/6

zzz ***Mon Jun 24 12:40:18 CST 2019
top - 12:40:19 up 38 days, 21:55,  7 users,  load average: 389.66, 349.33, 237.0
Tasks: 479 total,   2 running, 476 sleeping,   0 stopped,   1 zombie
Cpu(s): 12.3%us,  7.6%sy,  0.6%ni, 79.2%id,  0.2%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  24608192k total, 13600176k used, 11008016k free,   450344k buffers
Swap: 16383992k total,   121744k used, 16262248k free,  3679644k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2079 root      20   0 1925m  60m 8280 R 195.2  0.3   0:01.97 java
25787 root      20   0     0    0    0 S 29.7  0.0   0:13.62 acfsvol1
 1785 root      30  10  233m 8240 5356 S 10.9  0.0   0:00.18 floaters
 1780 root      20   0 1434m  33m  15m S  2.0  0.1   0:00.11 orarootagent.bi
 1848 oracle    20   0  4660 1292  660 S  2.0  0.0   0:00.03 pidstat
 1784 oracle    20   0  4708 1344  660 S  1.0  0.0   0:00.02 pidstat
 2522 root      20   0  156m  17m 6044 S  1.0  0.1 296:44.88 Xorg
23104 root      20   0 1914m  34m  16m S  1.0  0.1 168:20.21 ohasd.bin
27384 oracle    RT   0 1346m 115m  54m S  1.0  0.5 390:41.65 ocssd.bin
27731 root      RT   0  756m  90m  57m S  1.0  0.4 823:47.37 osysmond.bin
    1 root      20   0 19364 1152  920 S  0.0  0.0   0:01.59 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.50 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   1:37.36 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:29.89 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:03.79 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   2:56.86 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:40.02 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:03.14 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   1:45.06 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:30.72 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:04.10 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   1:39.82 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:15.30 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:05.60 watchdog/3
   19 root      RT   0     0    0    0 S  0.0  0.0   1:21.81 migration/4
   20 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/4
   21 root      20   0     0    0    0 S  0.0  0.0   0:24.64 ksoftirqd/4
   22 root      RT   0     0    0    0 S  0.0  0.0   0:02.89 watchdog/4
   23 root      RT   0     0    0    0 S  0.0  0.0   2:59.15 migration/5
   24 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/5
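The gap between the 11:55:04 and 12:40:18 snapshots is the main evidence that the host was hung. A gap like this can be found automatically by scanning the "zzz ***" header lines that OSW writes before each snapshot. The sketch below is a minimal illustration: the function names are hypothetical, and the 30-second expected interval is an assumption about how OSW was started on this host, so adjust it to the actual snapshot interval.

```python
import re
from datetime import datetime

MONTHS = {name: num for num, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# OSW snapshot header, e.g. "zzz ***Mon Jun 24 11:55:04 CST 2019".
# The time zone token is matched but ignored.
ZZZ_RE = re.compile(
    r"^zzz \*\*\*\w{3} (\w{3})\s+(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) \S+ (\d{4})")


def snapshot_times(lines):
    """Extract snapshot timestamps from OSW output lines."""
    times = []
    for line in lines:
        m = ZZZ_RE.match(line)
        if m:
            mon, day, hh, mm, ss, year = m.groups()
            times.append(datetime(int(year), MONTHS[mon], int(day),
                                  int(hh), int(mm), int(ss)))
    return times


def find_gaps(times, expected_interval_s=30, tolerance=2.0):
    """Return (start, end, seconds) for gaps longer than tolerance * interval."""
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta > expected_interval_s * tolerance:
            gaps.append((prev, cur, delta))
    return gaps


sample = [
    "zzz ***Mon Jun 24 11:55:04 CST 2019",
    "Tasks: 520 total,   1 running, 519 sleeping,   0 stopped,   0 zombie",
    "zzz ***Mon Jun 24 12:40:18 CST 2019",
]
for start, end, seconds in find_gaps(snapshot_times(sample)):
    print(f"no samples between {start} and {end} ({seconds / 60:.1f} min)")
```

Running the same scan over the OSW archives of all six nodes would show whether the gap affected a single guest or several at once, which helps separate a guest-level hang from a problem in the virtualization layer.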

3. Node 1 CSSD process log

2019-06-24 12:17:46.238: [    CSSD][2716677888]clssnmSendingThread: sent 5 status msgs to all nodes
2019-06-24 12:17:46.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349866/3359336954
2019-06-24 12:17:47.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337454
2019-06-24 12:17:47.631: [    CSSD][2718254848]clssnmPollingThread: node node2 (2) at 75% heartbeat fatal, removal in 7.150 seconds
2019-06-24 12:17:47.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337954
2019-06-24 12:17:48.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349868/3359338454
2019-06-24 12:17:48.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349868/3359338954
2019-06-24 12:17:49.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349869/3359339454
2019-06-24 12:17:49.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349869/3359339954
2019-06-24 12:17:50.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349870/3359340454
2019-06-24 12:17:50.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349870/3359340954
2019-06-24 12:17:51.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349871/3359341454
2019-06-24 12:17:51.240: [    CSSD][2716677888]clssnmSendingThread: sending status msg to all nodes
2019-06-24 12:17:51.240: [    CSSD][2716677888]clssnmSendingThread: sent 5 status msgs to all nodes
2019-06-24 12:17:51.632: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349871/3359341954
2019-06-24 12:17:52.133: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349872/3359342454
2019-06-24 12:17:52.632: [    CSSD][2718254848]clssnmPollingThread: node node2 (2) at 90% heartbeat fatal, removal in 2.150 seconds, seedhbimpd 1
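In the excerpt above, node 1 keeps logging clssnmvDiskPing writes about every half second while the polling thread counts down node2's network heartbeat. One way to check this over a longer log is to measure the spacing between consecutive disk ping lines. The sketch below is an illustration only: the function name is hypothetical and the pattern is written against the line format shown here. Evenly spaced pings would be consistent with node 1's own CSSD running normally during the countdown, although they do not prove it.

```python
import re
from datetime import datetime

# Matches the timestamp of a clssnmvDiskPing line in the CSSD log format shown above.
PING_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*CSSD\]\[\d+\]clssnmvDiskPing:")


def disk_ping_intervals(lines):
    """Return the seconds between consecutive clssnmvDiskPing writes."""
    stamps = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for m in map(PING_RE.match, lines) if m
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


CSSD_SAMPLE = [
    "2019-06-24 12:17:46.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349866/3359336954",
    "2019-06-24 12:17:47.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337454",
    "2019-06-24 12:17:47.631: [    CSSD][2718254848]clssnmPollingThread: node node2 (2) at 75% heartbeat fatal, removal in 7.150 seconds",
    "2019-06-24 12:17:47.631: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349867/3359337954",
    "2019-06-24 12:17:48.132: [    CSSD][2727913216]clssnmvDiskPing: Writing with status 0x3, timestamp 1561349868/3359338454",
]

intervals = disk_ping_intervals(CSSD_SAMPLE)
print("disk ping intervals (s):", intervals)
print("longest interval (s):", max(intervals))
```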

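Analysis of the July 2 incident

Around 9:00 on July 2, the ACFS file system became inaccessible, and OSW monitoring was not running at the time. The cluster alert log shows that an application process was still using the /data directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful evidence available, the cause of this ACFS access failure cannot be determined for now.

1. Cluster alert log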

2019-07-02 08:49:03.484:

[ctssd(3257)]CRS-2408:The clock on host node1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

[client(17179)]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo

WARNING:Alert message too long

[client(17188)]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo

WARNING:Alert message too long

[client(17190)]CRS-10001:02-Jul-19 09:04 ACFS-9252: The following process IDs have open references on mount point '/data':

[client(17192)]CRS-10001:5822

[client(17194)]CRS-10001:02-Jul-19 09:04 ACFS-9253: Failed to unmount mount point '/data'.  Mount point likely in use.

[client(17196)]CRS-10001:02-Jul-19 09:04 ACFS-9254: Manual intervention is required.

[client(17219)]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo

WARNING:Alert message too long

[client(17225)]CRS-10001:02-Jul-19 09:04 ACFS-9153: Program '/app/weaver/jdk1.6.0_27/bin/java -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djava.system.class.loader=com.caucho.loader.SystemClassLoader -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Xmx6000m -Xms6000m -Xss256k -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -javaagent:wagent.jar -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremote -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl -Djava.awt.headless=true -Dresin.home=/app/weaver/Resin/ -Dresin.root=/app/weaver/Resin/ -Dcom.sun.management.jmxremo

WARNING:Alert message too long

[client(17227)]CRS-10001:02-Jul-19 09:04 ACFS-9252: The following process IDs have open references on mount point '/data':

[client(17229)]CRS-10001:5822
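The ACFS-9252 messages above report that PID 5822 held open references on '/data', which is why the unmount failed. On a live system the same information can be gathered by walking /proc, much as lsof or fuser would. The sketch below is a Linux-only illustration with a hypothetical function name. It inspects open file descriptors only (not working directories or memory maps) and silently skips processes it has no permission to read, so it should be run as root for a complete picture.

```python
import os


def pids_with_open_files(mount_point):
    """Return {pid: [paths]} for processes holding open files under mount_point."""
    root = os.path.realpath(mount_point).rstrip("/") + "/"
    holders = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target.startswith(root):
                holders.setdefault(int(entry), []).append(target)
    return holders


if os.path.isdir("/proc"):
    for pid, paths in sorted(pids_with_open_files("/data").items()):
        print(pid, len(paths), "open file(s), e.g.", paths[0])
```

Capturing this list at the moment of an incident, together with the process command lines, would show which application threads were pinning the mount point.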

2. Operating system log

Jul  2 08:59:49 node1 kernel: [<ffffffff81185d29>] do_sys_open+0x69/0x140
Jul  2 08:59:49 node1 kernel: [<ffffffff81185e40>] sys_open+0x20/0x30
Jul  2 08:59:49 node1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul  2 09:01:49 node1 kernel: INFO: task java:12227 blocked for more than 120 seconds.
Jul  2 09:01:49 node1 kernel:       Tainted: P           --------------- H  2.6.32-431.el6.x86_64 #1
Jul  2 09:01:49 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  2 09:01:49 node1 kernel: java          D 000000000000000c     0 12227   5821 0x00000000
Jul  2 09:01:49 node1 kernel: ffff88033e49d8c8 0000000000000082 0000000200001000 ffff88062d4742c0
Jul  2 09:01:49 node1 kernel: ffff8803432b6d70 ffff880635ced090 ffff88062d4316c0 ffff8800283968a8
Jul  2 09:01:49 node1 kernel: ffff8805c7f25af8 ffff88033e49dfd8 000000000000fbc8 ffff8805c7f25af8
Jul  2 09:01:49 node1 kernel: Call Trace:
Jul  2 09:01:49 node1 kernel: [<ffffffff8109b4ee>] ? prepare_to_wait_exclusive+0x4e/0x80
Jul  2 09:01:49 node1 kernel: [<ffffffffa052cdb5>] OfsWaitEvent+0x225/0x290 [oracleacfs]
Jul  2 09:01:49 node1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Jul  2 09:01:49 node1 kernel: [<ffffffff8152784d>] ? bictcp_cong_avoid+0x2d/0x390
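The hung-task report above names the blocked task (java, PID 12227), and its call trace includes OfsWaitEvent from the oracleacfs kernel module, which suggests the thread was waiting inside ACFS. When many such reports accumulate in /var/log/messages, it helps to extract the task, the PID, and the kernel modules seen in each trace. The sketch below does this for the syslog format shown here. The function name and regular expressions are illustrative assumptions, not part of any Oracle or kernel tooling.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <n> seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds")
# One stack frame, optionally followed by the owning module in brackets.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?:\? )?(?P<func>\S+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?")


def hung_tasks(lines):
    """Return one dict per hung-task report with the modules in its call trace."""
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "secs": int(hung.group("secs")),
                "modules": set(),
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and frame.group("module"):
            current["modules"].add(frame.group("module"))
    return reports


OS_LOG_SAMPLE = [
    "Jul  2 09:01:49 node1 kernel: INFO: task java:12227 blocked for more than 120 seconds.",
    "Jul  2 09:01:49 node1 kernel: Call Trace:",
    "Jul  2 09:01:49 node1 kernel: [<ffffffff8109b4ee>] ? prepare_to_wait_exclusive+0x4e/0x80",
    "Jul  2 09:01:49 node1 kernel: [<ffffffffa052cdb5>] OfsWaitEvent+0x225/0x290 [oracleacfs]",
    "Jul  2 09:01:49 node1 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20",
]

for report in hung_tasks(OS_LOG_SAMPLE):
    print(report)
```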

3. Cluster Health Monitor (CHM) logs

[oracle@node6 node6]$ cat 02-JUL-2019-09:20:20.txt|grep "spent too much time"

dm-1 ior: 0.000 iow: 1117.912 ios: 279 qlen: 304 wait: 7914;';3:Time=07-02-19 09.15.20, Disk dm-1 spent too much time (7914 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdb ior: 0.000 iow: 1654.062 ios: 152 qlen: 23 wait: 573;';3:Time=07-02-19 09.15.20, Disk sdb spent too much time (573 msecs) waiting for I/O (> 100 msecs)' type: SYS

sda ior: 0.000 iow: 11.182 ios: 1 qlen: 0 wait: 119;';3:Time=07-02-19 09.15.40, Disk sda spent too much time (119 msecs) waiting for I/O (> 100 msecs)' type: SWAP

sda3 ior: 0.000 iow: 11.182 ios: 1 qlen: 0 wait: 119;';3:Time=07-02-19 09.15.40, Disk sda3 spent too much time (119 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-0 ior: 0.000 iow: 11.182 ios: 2 qlen: 1 wait: 412;';3:Time=07-02-19 09.15.40, Disk dm-0 spent too much time (412 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdc ior: 192.196 iow: 1.996 ios: 7 qlen: 2 wait: 377;';3:Time=07-02-19 09.15.40, Disk sdc spent too much time (377 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdc ior: 106.347 iow: 2.101 ios: 14 qlen: 0 wait: 148;';3:Time=07-02-19 09.16.20, Disk sdc spent too much time (148 msecs) waiting for I/O (> 100 msecs)' type: SYS

sda ior: 0.000 iow: 13.605 ios: 3 qlen: 3 wait: 937;';3:Time=07-02-19 09.16.40, Disk sda spent too much time (937 msecs) waiting for I/O (> 100 msecs)' type: SWAP

sda3 ior: 0.000 iow: 13.605 ios: 3 qlen: 3 wait: 937;';3:Time=07-02-19 09.16.40, Disk sda3 spent too much time (937 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-1 ior: 0.000 iow: 24.811 ios: 6 qlen: 14 wait: 1565;';3:Time=07-02-19 09.16.40, Disk dm-1 spent too much time (1565 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-0 ior: 0.000 iow: 15.206 ios: 3 qlen: 4 wait: 838;';3:Time=07-02-19 09.16.40, Disk dm-0 spent too much time (838 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdc ior: 0.899 iow: 2.000 ios: 3 qlen: 1 wait: 382;';3:Time=07-02-19 09.16.40, Disk sdc spent too much time (382 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdb ior: 0.000 iow: 18.407 ios: 1 qlen: 3 wait: 770;';3:Time=07-02-19 09.16.40, Disk sdb spent too much time (770 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-1 ior: 0.000 iow: 737.072 ios: 184 qlen: 10 wait: 1060;';3:Time=07-02-19 09.16.55, Disk dm-1 spent too much time (1060 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdb ior: 0.000 iow: 1011.573 ios: 15 qlen: 0 wait: 1210;';3:Time=07-02-19 09.16.55, Disk sdb spent too much time (1210 msecs) waiting for I/O (> 100 msecs)' type: SYS

sda ior: 0.000 iow: 8.803 ios: 1 qlen: 0 wait: 3992;';3:Time=07-02-19 09.17.00, Disk sda spent too much time (3992 msecs) waiting for I/O (> 100 msecs)' type: SWAP

sda3 ior: 0.000 iow: 8.803 ios: 1 qlen: 0 wait: 3992;';3:Time=07-02-19 09.17.00, Disk sda3 spent too much time (3992 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-0 ior: 0.000 iow: 7.202 ios: 1 qlen: 0 wait: 4436;';3:Time=07-02-19 09.17.00, Disk dm-0 spent too much time (4436 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdc ior: 2.596 iow: 1.896 ios: 3 qlen: 1 wait: 370;';3:Time=07-02-19 09.17.40, Disk sdc spent too much time (370 msecs) waiting for I/O (> 100 msecs)' type: SYS

sda ior: 0.000 iow: 21.602 ios: 3 qlen: 1 wait: 1943;';3:Time=07-02-19 09.18.45, Disk sda spent too much time (1943 msecs) waiting for I/O (> 100 msecs)' type: SWAP

sda3 ior: 0.000 iow: 21.602 ios: 3 qlen: 1 wait: 1943;';3:Time=07-02-19 09.18.45, Disk sda3 spent too much time (1943 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-1 ior: 0.000 iow: 1968.174 ios: 492 qlen: 77 wait: 202;';3:Time=07-02-19 09.18.45, Disk dm-1 spent too much time (202 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-0 ior: 0.000 iow: 8.801 ios: 2 qlen: 2 wait: 4660;';3:Time=07-02-19 09.18.45, Disk dm-0 spent too much time (4660 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdc ior: 5.700 iow: 2.899 ios: 6 qlen: 0 wait: 1033;';3:Time=07-02-19 09.18.45, Disk sdc spent too much time (1033 msecs) waiting for I/O (> 100 msecs)' type: SYS

dm-1 ior: 0.000 iow: 274.506 ios: 68 qlen: 208 wait: 12512;';3:Time=07-02-19 09.20.05, Disk dm-1 spent too much time (12512 msecs) waiting for I/O (> 100 msecs)' type: SYS

sdb ior: 0.000 iow: 579.425 ios: 47 qlen: 39 wait: 2515;';3:Time=07-02-19 09.20.05, Disk sdb spent too much time (2515 msecs) waiting for I/O (> 100 msecs)' type: SYS
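The CHM output above flags every disk whose I/O wait exceeded 100 ms, and the values vary widely between samples. To see which device was worst and when, the lines can be reduced to the maximum wait per device. The sketch below parses the line format shown here. The function name is hypothetical and the field layout is inferred from this output rather than from CHM documentation. In the sample, dm-1 peaks at 12512 ms at 09.20.05.

```python
import re

# Matches the leading device statistics and the embedded "Time=..." stamp.
CHM_RE = re.compile(
    r"^(?P<dev>\S+) ior: (?P<ior>[\d.]+) iow: (?P<iow>[\d.]+) "
    r"ios: (?P<ios>\d+) qlen: (?P<qlen>\d+) wait: (?P<wait>\d+);"
    r".*?Time=(?P<time>\d{2}-\d{2}-\d{2} \d{2}\.\d{2}\.\d{2}),"
)


def worst_wait_per_device(lines):
    """Return {device: (max wait in ms, time of that sample)}."""
    worst = {}
    for line in lines:
        m = CHM_RE.match(line)
        if not m:
            continue
        dev, wait = m.group("dev"), int(m.group("wait"))
        if dev not in worst or wait > worst[dev][0]:
            worst[dev] = (wait, m.group("time"))
    return worst


CHM_SAMPLE = [
    "dm-1 ior: 0.000 iow: 1117.912 ios: 279 qlen: 304 wait: 7914;';3:Time=07-02-19 09.15.20, Disk dm-1 spent too much time (7914 msecs) waiting for I/O (> 100 msecs)' type: SYS",
    "sdb ior: 0.000 iow: 1654.062 ios: 152 qlen: 23 wait: 573;';3:Time=07-02-19 09.15.20, Disk sdb spent too much time (573 msecs) waiting for I/O (> 100 msecs)' type: SYS",
    "dm-1 ior: 0.000 iow: 24.811 ios: 6 qlen: 14 wait: 1565;';3:Time=07-02-19 09.16.40, Disk dm-1 spent too much time (1565 msecs) waiting for I/O (> 100 msecs)' type: SYS",
    "dm-1 ior: 0.000 iow: 274.506 ios: 68 qlen: 208 wait: 12512;';3:Time=07-02-19 09.20.05, Disk dm-1 spent too much time (12512 msecs) waiting for I/O (> 100 msecs)' type: SYS",
    "sdb ior: 0.000 iow: 579.425 ios: 47 qlen: 39 wait: 2515;';3:Time=07-02-19 09.20.05, Disk sdb spent too much time (2515 msecs) waiting for I/O (> 100 msecs)' type: SYS",
]

for dev, (wait, when) in sorted(worst_wait_per_device(CHM_SAMPLE).items()):
    print(f"{dev}: worst wait {wait} ms at {when}")
```

Note that these samples come from node6 after 09:15, while the hung-task message on node1 was logged at 09:01, so the two observations are related in time but not from the same host or minute.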

III. Summary and follow-up recommendations

3.1 Problem summary

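The current environment is a 6-node Oracle Grid Infrastructure cluster built on VMware virtual machines. ACFS provides the shared data directory /data for the application, which is deployed on each of the six cluster hosts. No Oracle database runs in the cluster.

The two incidents are summarized as follows. Around 12:00 on June 24, access to the ACFS file system failed. The cluster alert log, the CSSD process log, and the OSW logs show that OSW collected no monitoring data from the host between 11:55 and 12:40, and that during the same period the cluster's CSSD process reported lost network heartbeats to other nodes over the private interconnect. From this we can infer that the host was hung at the time. Around 9:00 on July 2, the ACFS file system became inaccessible, and OSW monitoring was not running. The cluster alert log shows that an application process was still using the /data directory, so it could not be unmounted, and the operating system log contains the message "INFO: task java:12227 blocked for more than 120 seconds." With no other useful evidence available, the cause of this ACFS access failure cannot be determined for now.

These incidents show that Oracle Clusterware, as software layered on top of the operating system, is affected by the underlying OS and by the VMware virtualization layer beneath it. Because the layers log at different granularities, the analysis becomes considerably more complex, and much of the evidence cannot be traced downward to a root cause.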

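3.2 Follow-up recommendations

Based on the incident history and the overall architecture, we recommend the following:

1. Strengthen monitoring of the Linux virtual hosts, for example by enabling OSW and Zabbix monitoring.

2. Ask the VMware administrators whether the Linux guests can be monitored from the VMware layer, and whether VMware itself and the underlying physical hosts can be monitored at a finer granularity.

3. The ASM instance's memory_max_target parameter is currently at its default of 1076M. We recommend raising it to 2048M to improve ASM instance performance.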
