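Test environment: three-node real-time primary/standby cluster
Version: --03134283938-20221019-172201-20018
Test 1
No confirm monitor is running in the system
Disable the network interface on node 3
Log in to node 1 and check the primary database status
The status shows that the archive was sent to node 2 successfully, but no messages can be received from node 3, and node 1 is suspended
The log reports the following errors: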
2024-06-06 00:47:38.481 [INFO] database P0000002319 T0000000000000002373 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2024-06-06 00:47:48.482 [ERROR] database P0000002319 T0000000000000002356 Can't connect to DM server on '192.168.100.102' port(5800) errno(115)
Re-enable the network interface on node 3
The primary database log shows the following:
2024-06-06 00:58:00.760 [INFO] database P0000002319 T0000000000000002356 mal_site_ctl_link_create startup from mal_site(0) to mal_site(2)!
2024-06-06 00:58:00.760 [INFO] database P0000002319 T0000000000000002356 mal_site_magic_gen site_magic[46500], src_site:0, dst_site:2
2024-06-06 00:58:00.761 [INFO] database P0000002319 T0000000000000002356 site[0] mal_site_ctl_port_set to site[2, IP: 192.168.100.102, port_num: 5800], socket handle = 12, site_magic = 46500
2024-06-06 00:58:00.761 [INFO] database P0000002319 T0000000000000002350 mal_site_port_get site_magic:46500, src_site:0, dst_site:2
2024-06-06 00:58:00.761 [INFO] database P0000002319 T0000000000000002349 mal_site_port_get site_magic:46500, src_site:0, dst_site:2
2024-06-06 00:58:00.768 [INFO] database P0000002319 T0000000000000002355 site[0] mal_site_data_port_set from site[2, IP: 192.168.100.102, port_num: 5800], socket handle = 14, site_magic = 46500
2024-06-06 00:58:00.769 [INFO] database P0000002319 T0000000000000002348 mal_site_port_get site_magic:46500, src_site:0, dst_site:2
2024-06-06 00:58:00.769 [INFO] database P0000002319 T0000000000000002351 mal_site_port_get site_magic:46500, src_site:0, dst_site:2
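However, the primary database status is still SUSPEND
After restarting the database (SHUTDOWN, then automatically brought back up by the watcher), the status returns to normal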
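To put a number on how long the primary sat in SUSPEND before the link to node 3 came back, the log excerpts from Test 1 can be parsed with a short script. This is only a sketch: the line layout (timestamp, level, source, P<pid>, T<tid>, message) is inferred from the excerpts shown here, and the helper names parse_line and seconds_between are made up for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts: timestamp, [LEVEL], source, P<pid>, T<tid>, message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z]+)\] (?P<source>\S+) "
    r"P(?P<pid>\d+) T(?P<tid>\d+) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S.%f")
    rec["pid"] = int(rec["pid"])
    rec["tid"] = int(rec["tid"])
    return rec


def seconds_between(lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next line containing second_marker."""
    start = None
    for rec in filter(None, map(parse_line, lines)):
        if start is None and first_marker in rec["msg"]:
            start = rec["ts"]
        elif start is not None and second_marker in rec["msg"]:
            return (rec["ts"] - start).total_seconds()
    return None


# Three lines copied from the Test 1 excerpts.
sample = [
    "2024-06-06 00:47:38.481 [INFO] database P0000002319 T0000000000000002373 Send archive log to remote instance failed, switch all ep to SUSPEND status success!",
    "2024-06-06 00:47:48.482 [ERROR] database P0000002319 T0000000000000002356 Can't connect to DM server on '192.168.100.102' port(5800) errno(115)",
    "2024-06-06 00:58:00.760 [INFO] database P0000002319 T0000000000000002356 mal_site_ctl_link_create startup from mal_site(0) to mal_site(2)!",
]

gap = seconds_between(sample, "SUSPEND status", "mal_site_ctl_link_create")
print(f"{gap:.3f}")  # prints 622.279
```

On these lines the gap is a little over ten minutes, and the notes above show the primary stayed in SUSPEND even after the link was re-created.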
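Test 2
Start the confirm monitor on node 2
Cut the network on node 3
Log in to the primary database and check its status
Although sending the archive to TEST3 fails, the primary database status is normal
The primary database log shows the following: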
2024-06-06 01:07:44.807 [ERROR] database P0000002774 T0000000000000002819 [mal recv for arch] mal receive from site(TEST3) failed, begin lsn:622386010, end lsn:622386010, code:-6021
2024-06-06 01:07:44.807 [ERROR] database P0000002774 T0000000000000002819 send realtime archive to instance[TEST3] failed, code = -6021, begin_lsn = 622386010, end_lsn = 622386010!
2024-06-06 01:07:44.811 [INFO] database P0000002774 T0000000000000002819 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 utsk_cmd_add, cmd info: cmd=217, dseq=1717631069, name_in=, begin_lsn=-1!
2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 utsk_set_global_dw_stat, begin, msg_dseq:1717631069
2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 set g_dw_stat from NONE to DW_FAILOVER success, g_dw_recover_stop is 0
2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 utsk_set_global_dw_stat, finished, msg_dseq:1717631069, set code:0
2024-06-06 01:07:47.269 [INFO] database P0000002774 T0000000000000002872 utsk_cmd_add, cmd info: cmd=214, dseq=1717631070, name_in=, begin_lsn=-1!
2024-06-06 01:07:47.269 [INFO] database P0000002774 T0000000000000002832 utsk_cmd_exec, cmd:214, sys_status:SUSPEND, dseq:1717631070
2024-06-06 01:07:47.270 [INFO] database P0000002774 T0000000000000002832 Change TEST3 arch status from VALID to INVALID
2024-06-06 01:07:47.270 [INFO] database P0000002774 T0000000000000002872 utsk_cmd_add, received sql exec cmd:1, dseq:1717631071, sql:ALTER DATABASE OPEN FORCE
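The log shows that the primary was suspended at 01:07:44.811 and received ALTER DATABASE OPEN FORCE at 01:07:47.270, so it returned to OPEN status almost immediately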
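The sequence in the Test 2 log can be condensed into a small timeline. The sketch below is illustrative only: the matched substrings are copied from the excerpt, while the labels and the timeline helper are names chosen here, not anything defined by the database.

```python
import re
from datetime import datetime

# Substrings copied from the Test 2 excerpt, mapped to labels chosen for this sketch.
EVENTS = [
    ("switch all ep to SUSPEND", "primary switched to SUSPEND"),
    ("to DW_FAILOVER", "g_dw_stat set to DW_FAILOVER"),
    ("arch status from VALID to INVALID", "TEST3 archive marked INVALID"),
    ("ALTER DATABASE OPEN FORCE", "OPEN FORCE command received"),
]
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def timeline(lines):
    """Return (seconds since the first matched event, label) pairs in log order."""
    out = []
    t0 = None
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue
        label = next((lab for key, lab in EVENTS if key in line), None)
        if label is None:
            continue  # line is not one of the events of interest
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if t0 is None:
            t0 = ts
        out.append((round((ts - t0).total_seconds(), 3), label))
    return out


# Lines copied from the Test 2 excerpt; the second one matches no event and is skipped.
sample = [
    "2024-06-06 01:07:44.811 [INFO] database P0000002774 T0000000000000002819 Send archive log to remote instance failed, switch all ep to SUSPEND status success!",
    "2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 utsk_cmd_add, cmd info: cmd=217, dseq=1717631069, name_in=, begin_lsn=-1!",
    "2024-06-06 01:07:46.268 [INFO] database P0000002774 T0000000000000002872 set g_dw_stat from NONE to DW_FAILOVER success, g_dw_recover_stop is 0",
    "2024-06-06 01:07:47.270 [INFO] database P0000002774 T0000000000000002832 Change TEST3 arch status from VALID to INVALID",
    "2024-06-06 01:07:47.270 [INFO] database P0000002774 T0000000000000002872 utsk_cmd_add, received sql exec cmd:1, dseq:1717631071, sql:ALTER DATABASE OPEN FORCE",
]

for offset, label in timeline(sample):
    print(f"+{offset:.3f}s  {label}")
```

For the sample lines the offsets come out as 0.000, 1.457, 2.459 and 2.459 seconds, which matches the roughly 2.5 second window between the SUSPEND message and the OPEN FORCE command in the excerpt.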
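Test 3
Start the confirm monitor on node 2
Cut the network on node 2
Log in to the primary database and check its status
After the network is restored, node 2 has also become a primary, and the cluster is split
Logging in to the monitor shows the following:
Once the cluster has split, the only option is to rebuild it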