核心架構原理
集群組成(典型三節點結構):
[Node1] PostgreSQL + Pacemaker + Corosync + pcsd
[Node2] PostgreSQL + Pacemaker + Corosync + pcsd
[Node3] PostgreSQL + Pacemaker + Corosync + pcsd
?????????? ??????????? ??????????? ?
??????? ← Corosync 多播/單播心跳通信 →
?????????? ??????????? ??????????? ?
??????? Pacemaker 資源調度器決定主從角色、服務狀態、VIP綁定位置
組件介紹
組件 | 作用 |
PostgreSQL 14 | 數據庫核心,負責數據存儲與復制。 |
Pacemaker | 集群資源管理器,負責數據庫服務、VIP 等資源的啟停與主備切換。 |
Corosync | 集群通信組件,提供節點間心跳和消息廣播機制。 |
pcsd | Web UI + CLI 集群管理組件,簡化 pcs 工具使用,提供節點認證與配置同步。 |
系統資源及組件規劃
節點名稱 | CPU/內存 | 網卡 | 硬盤 | IP地址 | OS | 節點角色 |
pgtest1 | 2c/2g | ens33 | 60G | 192.168.24.11 | Centos7.9 | PostgreSQL、Pacemaker、Corosync |
pgtest2 | 2c/2g | ens33 | 60G | 192.168.24.12 | Centos7.9 | |
pgtest3 | 2c/2g | ens33 | 60G | 192.168.24.13 | Centos7.9 | |
vip-master | 192.168.24.14 | |||||
vip-slave | 192.168.24.15 |
數據庫版本:
Postgresql14.6
PCS集群版本:
#rpm -qa|grep pacemaker
pacemaker-libs-1.1.23-1.el7_9.1.x86_64
pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
pacemaker-1.1.23-1.el7_9.1.x86_64
pacemaker-cli-1.1.23-1.el7_9.1.x86_64
#rpm -qa|grep pcs
pcsc-lite-libs-1.8.8-8.el7.x86_64
pcs-0.9.169-3.el7.centos.3.x86_64
# rpm -qa|grep corosync
corosync-2.4.5-7.el7_9.2.x86_64
corosynclib-2.4.5-7.el7_9.2.x86_64
一、安裝基礎軟件
1.1 配置hosts文件[all servers]
[root@pgtest1 ~]# cat >> /etc/hosts <<EOF
192.168.24.11 pgtest1
192.168.24.12 pgtest2
192.168.24.13 pgtest3
192.168.24.14 vip-master
192.168.24.15 vip-slave
EOF
1.2 關閉防火墻及selinux [all servers]
# systemctl disable firewalld
# systemctl stop firewalld
#vi /etc/selinux/config
selinux=disabled
# sestatus
SELinux status:? ? ? ? ? ? ? ? ?disabled
1.3 安裝PCS [all servers]
yum -y install libsmb*
yum install -y pacemaker pcs
yum install -y autoconf automake libtool
yum install -y docbook-style-xsl
yum install -y gcc-c++ glib2-devel
1.4 pacemaker resource-agents 更新 [all servers]
注:centos7.9 yum安裝集群軟件只支持pg9、pg10以下,需要安裝新版本resource-agents,如下:
resource-agents-4.12.0下載地址:? Release v4.12.0 · ClusterLabs/resource-agents · GitHub
# tar zxvf resource-agents-4.12.0.tar.gz
# cd resource-agents-4.12.0
# ./autogen.sh
# ./configure
# make && make install
確認支持PG12以上版本
[root@pgtest1 ~]# cat /usr/lib/ocf/resource.d/heartbeat/pgsql | grep ocf_version_cmp
??????? ocf_version_cmp "$version" "9.3"
??????? ocf_version_cmp "$version" "12"
??????? ocf_version_cmp "$version" "10"
??????? ocf_version_cmp "$version" "9.4"
??????? ocf_version_cmp "$OCF_RESKEY_crm_feature_set" "3.1.0"
??????????? ocf_version_cmp "$OCF_RESKEY_crm_feature_set" "3.2.0"
1.5 pcsd[all servers]
# echo "hacluster"| passwd --stdin hacluster???? #設置hacluster用戶密碼
1.6 安裝postgresql數據庫軟件 [all servers]
#使用腳本一鍵安裝,可忽略以下步驟。
--安裝依賴包:
yum install -y perl-ExtUtils-Embed?readline zlib-devel pam-devel libxml2-devel libxslt-devel openldap-devel python-devel gcc-c++?openssl-devel?cmake gcc* readline-devel zlib bison flex bison-devel flex-devel openssl openssl-devel
--創建用戶、組
groupadd -g 666 postgres
useradd -g postgres -u 666 postgres
echo "postgres"| passwd --stdin postgres
--創建目錄
mkdir -p /postgresql/{pg14,pgdata,arch,soft}
chown -R postgres. /postgresql
chmod -R 700 /postgresql
--上傳解壓安裝包
tar zxvf postgresql-14.6.tar.gz
--編譯安裝
[root@pgtest1 soft]# su - postgres
Last login: Fri May? 5 16:39:55 CST 2023 on pts/2
[postgres@pgtest1 ~]$ cd /postgresql/soft/
[postgres@pgtest1 soft]$ cd postgresql-14.6/
[postgres@pgtest1 postgresql-14.6]$ ls
aclocal.m4? config? config.log? config.status? configure? configure.ac? contrib? COPYRIGHT? doc? GNUmakefile? GNUmakefile.in? HISTORY? INSTALL? Makefile? README? src
[postgres@pgtest1 postgresql-14.6]$ ./configure --prefix=/postgresql/pg14 --with-pgport=5432
[postgres@pgtest1 postgresql-14.6]$ make world &&make install-world
1.7 service setup [all servers]
systemctl disable corosync
systemctl disable pacemaker
systemctl enable pcsd.service
systemctl start pcsd.service
--查看服務啟動狀態
systemctl status corosync pacemaker pcsd.service
1.8 cluster auth[any one host]
#pcs cluster auth pgtest1 pgtest2 pgtest3 -u hacluster -p "hacluster"
pgtest2: Authorized
pgtest3: Authorized
pgtest1: Authorized
?#在任意節點上啟用集群認證
1.9 設置數據庫集群[any one host]?
#### 配置集群節點 ####
(選擇master執行)
#pcs cluster setup --name pgcluster pgtest1 pgtest2 pgtest3
Destroying cluster on nodes: pgtest1, pgtest2, pgtest3...
pgtest1: Stopping Cluster (pacemaker)...
pgtest2: Stopping Cluster (pacemaker)...
pgtest3: Stopping Cluster (pacemaker)...
pgtest3: Successfully destroyed cluster
pgtest1: Successfully destroyed cluster
pgtest2: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'pgtest1', 'pgtest2', 'pgtest3'
pgtest1: successful distribution of the file 'pacemaker_remote authkey'
pgtest2: successful distribution of the file 'pacemaker_remote authkey'
pgtest3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
pgtest1: Succeeded
pgtest2: Succeeded
pgtest3: Succeeded
Synchronizing pcsd certificates on nodes pgtest1, pgtest2, pgtest3...
pgtest2: Success
pgtest3: Success
pgtest1: Success
Restarting pcsd on the nodes in order to reload the certificates...
pgtest2: Success
pgtest3: Success
pgtest1: Success
#### 啟動所有集群節點 ####
(選擇master執行)
# pcs cluster start --all
pgtest1: Starting Cluster (corosync)...
pgtest2: Starting Cluster (corosync)...
pgtest3: Starting Cluster (corosync)...
pgtest3: Starting Cluster (pacemaker)...
pgtest1: Starting Cluster (pacemaker)...
pgtest2: Starting Cluster (pacemaker)...
#### 檢查集群狀態 ####
# pcs status --full
Cluster name: pgcluster
WARNINGS:
No stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Mon Oct? 9 09:58:45 2023
Last change: Mon Oct? 9 09:55:47 2023 by hacluster via crmd on pgtest3
3 nodes configured
0 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
No resources
Node Attributes:
* Node pgtest1 (1):
* Node pgtest2 (2):
* Node pgtest3 (3):
Migration Summary:
* Node pgtest3 (3):
* Node pgtest1 (1):
* Node pgtest2 (2):
Fencing History:
PCSD Status:
? pgtest1: Online
? pgtest2: Online
? pgtest3: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
1.10 數據庫環境變量設置[all servers]
[root@pgtest1 soft]# su - postgres
Last login: Fri May? 5 17:18:28 CST 2023 on pts/3
[postgres@pgtest1 ~]$ cat .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
??????? . ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
export PGHOME=/postgresql/pg14
export PGDATA=/postgresql/pgdata
export PATH=$PGHOME/bin:$PATH
export LD_LIBRARY_PATH=$PGHOME/lib:$LD_LIBRARY_PATH
1.11 primary數據庫配置?[pgtest1]
#### pgtest1 ####
1).初始化數據庫
# su - postgres
$ initdb -D $PGDATA
2).配置主機訪問
修改 pg_hba.conf
sed -i '/^host[[:space:]]\+all[[:space:]]\+all[[:space:]]\+127\.0\.0\.1\/32[[:space:]]\+trust/i host??? all???????????? all???????????? 0.0.0.0/0?????????????? scram-sha-256'? /postgresql/pgdata/pg_hba.conf
sed -i '/^host[[:space:]]\+replication[[:space:]]\+all[[:space:]]\+127\.0\.0\.1\/32[[:space:]]\+trust/i host??? replication???? all???????????? 0.0.0.0/0?????????????? scram-sha-256'? /postgresql/pgdata/pg_hba.conf
3).配置數據庫參數
cat <<EOF >> /postgresql/pgdata/postgresql.conf
# Custom settings appended by script
listen_addresses = '*'
wal_keep_size = 10240? # 單位MB
log_destination = 'csvlog'
logging_collector = on
archive_mode = on
archive_command = 'cp %p /postgresql/arch/%f'
EOF
4).修改用戶默認密碼
[postgres@pgtest1 pgdata]$ pg_ctl start
[postgres@pgtest1 pgdata]$ psql -c "alter user postgres password 'postgres'";
1.12 創建secondary數據庫 [pgtest2]
[postgres@pgtest2 pgdata]$ pg_basebackup -h pgtest1 -U repl -D /postgresql/pgdata/ -Fp -Pv -Xs
#無需啟動
1.13 創建third數據庫 [pgtest3]
[postgres@pgtest3 pgdata]$ pg_basebackup -h pgtest1 -U repl -D /postgresql/pgdata/ -Fp -Pv -Xs
#無需啟動
1.14 停止primary數據庫 [pgtest1]
$ pg_ctl stop
waiting for server to shut down.... done
server stopped
二、配置pacemaker數據庫集群 [any one host] (一個節點配置即可,會自動同步到另外一個節點,master節點執行)
2.1 檢查集群狀態
[root@pgtest1 resource-agents-4.12.0]# pcs status
Cluster name: cluster_pg01
WARNINGS:
No stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pgtest1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 12:21:03 2025
Last change: Fri May? 9 12:14:30 2025 by hacluster via crmd on pgtest1
3 nodes configured
0 resource instances configured
Online: [ pgtest1 pgtest2 pgtest3 ]
No resources
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
2.2 創建集群文件
[root@pgtest1 resource-agents-4.12.0]# pcs cluster cib cib.xml
2.3 配置數據庫資源
-- property: cluster-name --
# pcs -f cib.xml property set cluster-name="pgcluster"
-- property: disable stonith, quorum --
# pcs -f cib.xml property set no-quorum-policy="ignore"
# pcs -f cib.xml property set stonith-enabled="false"
-- resource: vip-master --
# pcs -f cib.xml resource create vip-master ocf:heartbeat:IPaddr2 ip=192.168.24.14 cidr_netmask=24 nic=ens33 iflabel=master op monitor interval=5s
-- resource: vip-slave --
# pcs -f cib.xml resource create vip-slave ocf:heartbeat:IPaddr2 ip=192.168.24.15 cidr_netmask=24 nic=ens33 iflabel=slave op monitor interval=5s
-- resource: pgsql --
# pcs -f cib.xml resource create pgsql ocf:heartbeat:pgsql\
??? pgctl="/postgresql/pg14/bin/pg_ctl"\
??? psql="/postgresql/pg14/bin/psql"\
??? pgdata="/postgresql/pgdata"\
??? node_list="pgtest1 pgtest2 pgtest3"\
??? restore_command=""\
??? master_ip="192.168.24.14"\
??? repuser="postgres"\
??? rep_mode="sync"\
??? primary_conninfo_opt="password=postgres keepalives_idle=60 keepalives_interval=5 keepalives_count=5"\
??? op monitor interval="11s"\
??? op monitor interval="10s" role="Master"\
??? master master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true target-role='Started'
-- constraint: vip-master, pgsql on master node --
# pcs -f cib.xml constraint colocation add vip-master with master pgsql-master INFINITY
-- constraint: pgsql promote? node? MasterGroup --
# pcs -f cib.xml constraint order promote pgsql-master then start vip-master symmetrical=false score=INFINITY
-- constraint: pgsql demote? node? MasterGroup --
# pcs -f cib.xml constraint order demote pgsql-master then stop vip-master symmetrical=false score=0
-- constraint: vip-slave? sync standby 、sync standby on master --
# pcs -f cib.xml constraint location vip-slave rule score=200 pgsql-status eq HS:sync
# pcs -f cib.xml constraint location vip-slave rule score=100 pgsql-status eq PRI
# pcs -f cib.xml constraint location vip-slave rule score=-INFINITY not_defined pgsql-status
# pcs -f cib.xml constraint location vip-slave rule score=-INFINITY pgsql-status ne HS:sync and pgsql-status ne PRI
-- cluster: push cib file into cib --
# pcs cluster cib-push cib.xml
2.4 刷新集群狀態
# pcs resource refresh --full
Waiting for 1 reply from the CRMd. OK
2.5 查詢集群狀態
[root@pgtest1 ~]# pcs status --full???? #需要等待幾秒或者多刷新幾次
Cluster name: pgcluster
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 12:35:17 2025
Last change: Fri May? 9 12:35:11 2025 by root via crm_attribute on pgtest1
3 nodes configured
5 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest1
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest2
?Master/Slave Set: pgsql-master [pgsql]
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest2
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest3
???? pgsql????? (ocf::heartbeat:pgsql): Master pgtest1
???? Masters: [ pgtest1 ]
???? Slaves: [ pgtest2 pgtest3 ]
Node Attributes:
* Node pgtest1 (1):
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000050000A0
??? + pgsql-status????????????????????? : PRI
* Node pgtest2 (2):
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
* Node pgtest3 (3):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : STREAMING|ASYNC
??? + pgsql-status????????????????????? : HS:async
Migration Summary:
* Node pgtest2 (2):
* Node pgtest3 (3):
* Node pgtest1 (1):
Fencing History:
PCSD Status:
? pgtest1: Online
? pgtest3: Online
? pgtest2: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
--查看流復制狀態
[root@pgtest1 ~]# ps -ajxf|grep postgres
---查看VIP
#master vip (write and read vip)
[root@pgtest1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:b0:62:cd brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.11/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.14/24 brd 192.168.24.255 scope global secondary ens33:master
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::20c:29ff:feb0:62cd/64 scope link
#slave vip (read vip)
[postgres@pgtest2 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:c8:ee:12 brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.12/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.15/24 brd 192.168.24.255 scope global secondary ens33:slave
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::15bb:4008:354c:4f0f/64 scope link noprefixroute
?????? valid_lft forever preferred_lft forever
[postgres@pgtest3 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:c8:ee:12 brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.12/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.15/24 brd 192.168.24.255 scope global secondary ens33:slave
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::20c:29ff:fec8:ee12/64 scope link
?????? valid_lft forever preferred_lft forever
三、PostgreSQL故障演示
3.1 數據同步驗證
--master vip連接測試
[postgres@pgtest3 ~]$ psql -h 192.168.24.14 -p5432
Password for user postgres:
psql (14.6)
Type "help" for help.
postgres=# create table t1 (id int);
CREATE TABLE
postgres=# insert into t1 values(1);
INSERT 0 1
--slave vip連接測試
[postgres@pgtest3 ~]$ psql -h 192.168.24.15 -p5432
Password for user postgres:
psql (14.6)
Type "help" for help.
postgres=# select * from t1;
?id
----
? 1
(1 row)
postgres=# insert into t1 values(2);
ERROR:? cannot execute INSERT in a read-only transaction
3.2 模擬主節點故障
[postgres@pgtest1 ~]$ pg_ctl stop
waiting for server to shut down.... done
server stopped
---查看當前節點資源狀態
[root@pgtest1 ~]# crm_mon -Afr -1
Stack: corosync
Current DC: pgtest3 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 13:53:14 2025
Last change: Fri May? 9 13:52:49 2025 by root via crm_attribute on pgtest2
3 nodes configured
5 resource instances configured
Online: [ pgtest1 pgtest2 pgtest3 ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest2
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?Master/Slave Set: pgsql-master [pgsql]
???? Masters: [ pgtest2 ]
???? Slaves: [ pgtest3 ]
???? Stopped: [ pgtest1 ]
Node Attributes:
* Node pgtest1:
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : DISCONNECT
??? + pgsql-status????????????????????? : STOP
* Node pgtest2:
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000060000A0
??? + pgsql-status????????????????????? : PRI
* Node pgtest3:
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
Migration Summary:
* Node pgtest2:
* Node pgtest3:
* Node pgtest1:
?? pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May? 9 13:52:32 2025'
Failed Resource Actions:
* pgsql_start_0 on pgtest1 'unknown error' (1): call=55, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
??? last-rc-change='Fri May? 9 13:52:32 2025', queued=0ms, exec=234ms
---查看集群狀態
[root@pgtest1 ~]# pcs status --all
Cluster name: pgcluster
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 13:55:31 2025
Last change: Fri May? 9 13:52:49 2025 by root via crm_attribute on pgtest2
3 nodes configured
5 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest2
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?Master/Slave Set: pgsql-master [pgsql]
???? pgsql????? (ocf::heartbeat:pgsql): Master pgtest2
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest3
???? pgsql????? (ocf::heartbeat:pgsql): Stopped
???? Masters: [ pgtest2 ]
???? Slaves: [ pgtest3 ]
???? Stopped: [ pgtest1 ]
Node Attributes:
* Node pgtest1 (1):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : DISCONNECT
??? + pgsql-status????????????????????? : STOP
* Node pgtest2 (2):
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000060000A0
??? + pgsql-status????????????????????? : PRI
* Node pgtest3 (3):
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
Migration Summary:
* Node pgtest2 (2):
* Node pgtest3 (3):
* Node pgtest1 (1):
?? pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May? 9 13:52:32 2025'
Failed Resource Actions:
* pgsql_start_0 on pgtest1 'unknown error' (1): call=55, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
??? last-rc-change='Fri May? 9 13:52:32 2025', queued=0ms, exec=234ms
Fencing History:
PCSD Status:
? pgtest1: Online
? pgtest2: Online
? pgtest3: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
#此時集群master已自動切換至pgtest2。
--vip發生了漂移
#Master-vip漂移到pgtest2,slave-vip漂移到pgtest3。
[root@pgtest2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:c8:ee:12 brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.12/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.14/24 brd 192.168.24.255 scope global secondary ens33:master
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::20c:29ff:fec8:ee12/64 scope link
?????? valid_lft forever preferred_lft forever
[postgres@pgtest3 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:50:60:1a brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.13/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.15/24 brd 192.168.24.255 scope global secondary ens33:slave
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::20c:29ff:fe50:601a/64 scope link
?????? valid_lft forever preferred_lft forever
---恢復原master節點
注意原master節點掛掉之后,使用 crm_mon -Arf -1命令看到末尾有"You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start."字樣,master宕機啟動時,需要刪除臨時鎖文件方可進行集群角色轉換。即執行rm -rf /var/lib/pgsql/tmp/PGSQL.lock
[root@pgtest1 ~]# ls /var/lib/pgsql/tmp/PGSQL.lock
/var/lib/pgsql/tmp/PGSQL.lock
[postgres@pgtest1 ~]$ rm -rf /var/lib/pgsql/tmp/PGSQL.lock
拉起數據庫不需要執行pg_ctl start,只需重啟服務,軟件會自動把數據庫拉起。
[root@pgtest1 tmp]# systemctl restart corosync pacemaker pcsd
---查看集群狀態
[root@pgtest1 tmp]# pcs status --full
Cluster name: pgcluster
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 14:00:05 2025
Last change: Fri May? 9 13:59:09 2025 by root via crm_attribute on pgtest2
3 nodes configured
5 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest2
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?Master/Slave Set: pgsql-master [pgsql]
???? pgsql????? (ocf::heartbeat:pgsql): Master pgtest2
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest3
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest1
???? Masters: [ pgtest2 ]
???? Slaves: [ pgtest1 pgtest3 ]
Node Attributes:
* Node pgtest1 (1):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : STREAMING|ASYNC
??? + pgsql-status????????????????????? : HS:async
* Node pgtest2 (2):
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000060000A0
??? + pgsql-status????????????????????? : PRI
* Node pgtest3 (3):
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
Migration Summary:
* Node pgtest2 (2):
* Node pgtest3 (3):
* Node pgtest1 (1):
Fencing History:
PCSD Status:
? pgtest1: Online
? pgtest2: Online
? pgtest3: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
#pgtest1重新加入集群后變成了slave節點。
3.3 二次模擬主節點故障
[postgres@pgtest2 ~]$ pg_ctl stop
waiting for server to shut down.... done
server stopped
--查看資源狀態
Stack: corosync
Current DC: pgtest3 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 14:02:47 2025
Last change: Fri May? 9 14:02:43 2025 by root via crm_attribute on pgtest3
3 nodes configured
5 resource instances configured
Online: [ pgtest1 pgtest2 pgtest3 ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest1
?Master/Slave Set: pgsql-master [pgsql]
???? Masters: [ pgtest3 ]
???? Slaves: [ pgtest1 ]
???? Stopped: [ pgtest2 ]
Node Attributes:
* Node pgtest1:
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
* Node pgtest2:
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : DISCONNECT
??? + pgsql-status????????????????????? : STOP
* Node pgtest3:
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000070000A0
??? + pgsql-status????????????????????? : PRI
Migration Summary:
* Node pgtest2:
?? pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May? 9 14:02:26 2025'
* Node pgtest3:
* Node pgtest1:
Failed Resource Actions:
* pgsql_start_0 on pgtest2 'unknown error' (1): call=73, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
??? last-rc-change='Fri May? 9 14:02:25 2025', queued=0ms, exec=262ms
--查看集群狀態
[root@pgtest2 ~]# pcs status --full
Cluster name: pgcluster
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 14:02:32 2025
Last change: Fri May? 9 14:02:29 2025 by root via crm_attribute on pgtest3
3 nodes configured
5 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?Master/Slave Set: pgsql-master [pgsql]
???? pgsql????? (ocf::heartbeat:pgsql): Master pgtest3
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest1
???? pgsql????? (ocf::heartbeat:pgsql): Stopped
???? Masters: [ pgtest3 ]
???? Slaves: [ pgtest1 ]
???? Stopped: [ pgtest2 ]
Node Attributes:
* Node pgtest1 (1):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : DISCONNECT
??? + pgsql-status????????????????????? : HS:alone
* Node pgtest2 (2):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : DISCONNECT
??? + pgsql-status????????????????????? : STOP
* Node pgtest3 (3):
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000070000A0
??? + pgsql-status????????????????????? : PRI
Migration Summary:
* Node pgtest2 (2):
?? pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May? 9 14:02:26 2025'
* Node pgtest3 (3):
* Node pgtest1 (1):
Failed Resource Actions:
* pgsql_start_0 on pgtest2 'unknown error' (1): call=73, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
??? last-rc-change='Fri May? 9 14:02:25 2025', queued=0ms, exec=262ms
Fencing History:
PCSD Status:
? pgtest2: Online
? pgtest1: Online
? pgtest3: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
集群master已經切換到pgtest3上。
--vip發生漂移
#Master-vip漂移到pgtest3,slave-vip漂移到pgtest1。
[root@pgtest1 tmp]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:b0:62:cd brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.11/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.15/24 brd 192.168.24.255 scope global secondary ens33:slave
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::a2cc:510c:af40:af87/64 scope link noprefixroute
?????? valid_lft forever preferred_lft forever
[postgres@pgtest3 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
??? link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
??? inet 127.0.0.1/8 scope host lo
?????? valid_lft forever preferred_lft forever
??? inet6 ::1/128 scope host
?????? valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
??? link/ether 00:0c:29:50:60:1a brd ff:ff:ff:ff:ff:ff
??? inet 192.168.24.13/24 brd 192.168.24.255 scope global noprefixroute ens33
?????? valid_lft forever preferred_lft forever
??? inet 192.168.24.14/24 brd 192.168.24.255 scope global secondary ens33:master
?????? valid_lft forever preferred_lft forever
??? inet6 fe80::6d33:e581:1ad:cc2e/64 scope link noprefixroute
?????? valid_lft forever preferred_lft forever
---恢復原master節點
注意原master節點掛掉之后,使用 crm_mon -Arf -1命令看到末尾有"You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start."字樣,master宕機啟動時,需要刪除臨時鎖文件方可進行集群角色轉換。即執行rm -rf /var/lib/pgsql/tmp/PGSQL.lock
[root@pgtest2 ~]# ls /var/lib/pgsql/tmp/PGSQL.lock
/var/lib/pgsql/tmp/PGSQL.lock
[root@pgtest2 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
拉起數據庫不需要執行pg_ctl start,只需重啟服務,軟件會自動把數據庫拉起。
[root@pgtest2 tmp]# systemctl restart corosync pacemaker pcsd
---查看集群狀態
[root@pgtest2 tmp]# pcs status --full
Cluster name: pgcluster
Stack: corosync
Current DC: pgtest3 (3) (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri May? 9 14:06:50 2025
Last change: Fri May? 9 14:05:22 2025 by root via crm_attribute on pgtest3
3 nodes configured
5 resource instances configured
Online: [ pgtest1 (1) pgtest2 (2) pgtest3 (3) ]
Full list of resources:
?vip-master???? (ocf::heartbeat:IPaddr2):?????? Started pgtest3
?vip-slave????? (ocf::heartbeat:IPaddr2):?????? Started pgtest1
?Master/Slave Set: pgsql-master [pgsql]
???? pgsql?? ???(ocf::heartbeat:pgsql): Slave pgtest2
???? pgsql????? (ocf::heartbeat:pgsql): Master pgtest3
???? pgsql????? (ocf::heartbeat:pgsql): Slave pgtest1
???? Masters: [ pgtest3 ]
???? Slaves: [ pgtest1 pgtest2 ]
Node Attributes:
* Node pgtest1 (1):
??? + master-pgsql????????????????????? : 100
??? + pgsql-data-status???????????????? : STREAMING|SYNC
??? + pgsql-status????????????????????? : HS:sync
* Node pgtest2 (2):
??? + master-pgsql????????????????????? : -INFINITY
??? + pgsql-data-status???????????????? : STREAMING|ASYNC
??? + pgsql-status????????????????????? : HS:async
* Node pgtest3 (3):
??? + master-pgsql????????????????????? : 1000
??? + pgsql-data-status???????????????? : LATEST
??? + pgsql-master-baseline???????????? : 00000000070000A0
??? + pgsql-status????????????????????? : PRI
Migration Summary:
* Node pgtest2 (2):
* Node pgtest3 (3):
* Node pgtest1 (1):
Fencing History:
PCSD Status:
? pgtest1: Online
? pgtest3: Online
? pgtest2: Online
Daemon Status:
? corosync: active/disabled
? pacemaker: active/disabled
? pcsd: active/enabled
pgtest2重新加入集群后,變成了slave節點。
依此操作,執行多次mster切換,master節點會在pgtest1、pgtest2、pgtest3循環切換。