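On March 30, 2024, MogDB released version 5.0.6. It introduces an interesting small feature that protects a standby from being promoted to primary by a switchover or failover command, so users can control whether a given standby may be promoted. The feature grew out of requirements observed in real user scenarios.
Along with this feature, MogDB introduces a new parameter, protect_standby. It is a boolean that takes on or off.
Here we run some hands-on tests to show how the feature actually behaves.
First, prepare the environment
1. Create an alias for convenience: alias c="cm_ctl query -Cvid"
2. The cluster needs at least one primary and two standbys, that is, three or more nodes.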
[root@mogdb114 506]# ptk ls
cluster_name | id | addr | user | data_dir | db_version | create_time | comment
----------------+------+--------------------------------+------+---------------------+------------------------------+---------------------+----------
cluster_26000 | 6001 | 172.20.22.114:26000 (cm:15300) | omm | /data/mogdb5.0/data | MogDB 5.0.6 (build 8b0a6ca8) | 2024-04-01T16:57:25 |
| 6002 | 172.20.22.115:26000 (cm:15300) | omm | /data/mogdb5.0/data | | |
| 6003 | 172.20.22.117:26000 (cm:15300) | omm | /data/mogdb5.0/data | | |
[root@mogdb114 506]#
[omm@mogdb114 ~]$ cm_ctl show
[ Network Connect State ]
Network timeout: 6s
Current CMServer time: 2024-04-01 17:42:13
Network stat('Y' means connected, otherwise 'N'):
| \ | Y | Y |
| Y | \ | Y |
| Y | Y | \ |
[ Node Disk HB State ]
Node disk hb timeout: 200s
Current CMServer time: 2024-04-01 17:42:14
Node disk hb stat('Y' means connected, otherwise 'N'):
| N | N | N |
[ FloatIp Network State ]
node instance base_ip float_ip_name float_ip
---------------------------------------------------------------
1 mogdb114 6001 172.20.22.114 VIP_az240917 172.20.22.180
[omm@mogdb114 ~]$
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Primary Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb114 ~]$
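The Datanode State row in the output above packs one instance into each `|`-separated segment. Below is a minimal Python sketch for pulling the role of each instance out of such a row. The field layout is inferred only from the sample output in this post, not from a documented format, and `parse_datanode_line` is a hypothetical helper name.

```python
def parse_datanode_line(line):
    """Parse one Datanode State row printed by `cm_ctl query -Cvid`.

    Assumed layout of each segment (inferred from the sample output only):
    node_id hostname ip instance_id data_dir flag role... status
    """
    instances = []
    for segment in line.split("|"):
        tokens = segment.split()
        if len(tokens) < 8:
            continue  # skip empty or unexpected segments
        instances.append({
            "node": int(tokens[0]),
            "host": tokens[1],
            "ip": tokens[2],
            "instance": int(tokens[3]),
            "data_dir": tokens[4],
            "role": " ".join(tokens[6:-1]),  # the role may span several words
            "status": tokens[-1],            # "Normal" in the sample
        })
    return instances


sample = (
    "1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Primary Normal | "
    "2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Standby Normal | "
    "3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal"
)
for inst in parse_datanode_line(sample):
    print(inst["host"], inst["role"], inst["status"])
```

Joining every token between the flag column and the final status word keeps the helper usable when the role is printed as more than one word.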
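OK! With the environment ready, we can start the verification.
Verifying switchover before enabling the parameter
Before using the new 5.0.6 feature, we first do a manual switchover to see how it behaves.
For example, here we switch the primary over to node 2.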
[omm@mogdb114 ~]$ cm_ctl switchover -n 2 -D $PGDATA
..Killed
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Standby Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Primary Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb114 ~]$
Next, let's look at the switchover in detail:
### 2024-04-01 17:50:49.956: sleep 1
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 17:49:30.495 ---- 2024-04-01 17:50:50.987: 76
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 17:50:51.020981+08 | mogdb114 | 760
(1 row)
now
-------------------------------
2024-04-01 17:50:51.021683+08
(1 row)
### 2024-04-01 17:50:51.025: sleep 1
failed to connect 172.20.22.180:26000.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 17:49:30.495 >>>> 2024-04-01 17:50:52.291: 77
failed to connect 172.20.22.180:26000.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 17:49:30.495 >>>> 2024-04-01 17:50:53.352: 77
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 17:49:30.495 ---- 2024-04-01 17:50:54.384: 77
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 17:50:54.418988+08 | mogdb115 | 770
(1 row)
now
-------------------------------
2024-04-01 17:50:54.419806+08
(1 row)
The log above shows that the primary moved from node 114 to node 115, as expected.
Can the primary also be switched to node 3? Of course:
[omm@mogdb114 ~]$ cm_ctl switchover -n 3 -D $PGDATA
......
cm_ctl: switchover successfully.
[omm@mogdb114 ~]$
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Standby Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Primary Normal
[omm@mogdb114 ~]$
Again, we can observe the effect of the switchover.
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 17:51:40.596 ---- 2024-04-01 17:51:58.970: 18
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 17:51:59.002625+08 | mogdb115 | 180
(1 row)
now
-------------------------------
2024-04-01 17:51:59.003296+08
(1 row)
### 2024-04-01 17:51:59.006: sleep 1
failed to connect 172.20.22.180:26000.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 17:51:40.596 >>>> 2024-04-01 17:52:00.026: 19
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 17:51:40.596 ---- 2024-04-01 17:52:08.072: 19
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 17:52:08.125014+08 | mogdb117 | 190
(1 row)
now
-------------------------------
2024-04-01 17:52:08.125707+08
(1 row)
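As expected, and just as in the previous test, the primary was switched over to node 3, that is, node 117.
Verifying failover before enabling the parameter
First, let's look at the current state of the database cluster: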
[omm@mogdb115 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Standby Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Primary Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb115 ~]$ ps -fu omm
UID PID PPID C STIME TTY TIME CMD
omm 5451 1 0 16:57 ? 00:01:10 /data/mogdb5.0/app/5.0.5/bin/om_monitor -L /data/mogdb5.0/log/cm/om_monitor
omm 13162 1 2 17:57 ? 00:01:46 /data/mogdb5.0/app/5.0.5/bin/mogdb -D /data/mogdb5.0/data -M pending
omm 26464 30588 0 18:58 ? 00:00:00 arping -D -f -w 1 -I ens192 172.20.22.180
omm 26465 28012 0 18:58 pts/0 00:00:00 ps -fu omm
omm 28012 28011 0 17:46 pts/0 00:00:00 -bash
omm 30588 5451 17 17:48 ? 00:12:05 /data/mogdb5.0/app/5.0.5/bin/cm_agent
omm 30608 1 13 17:48 ? 00:09:23 /data/mogdb5.0/app/5.0.5/bin/cm_server
omm 30632 1 0 17:48 ? 00:00:00 mogdb fenced UDF master process
The primary is currently on node 115. We forcibly kill the primary process on 115 to simulate a failover scenario.
[omm@mogdb115 ~]$ kill -9 13162
[omm@mogdb115 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Primary Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb115 ~]$
We can see that node 114 successfully took over as the primary. So how did the simulated insert workload fare during this time?
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 18:58:31.448 ---- 2024-04-01 18:58:54.326: 22
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 18:58:54.361219+08 | mogdb115 | 220
(1 row)
now
-------------------------------
2024-04-01 18:58:54.361889+08
(1 row)
### 2024-04-01 18:58:54.365: sleep 1
failed to connect 172.20.22.180:26000.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 18:58:31.448 >>>> 2024-04-01 18:58:55.448: 23
failed to connect 172.20.22.180:26000.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 18:58:31.448 >>>> 2024-04-01 18:58:56.468: 23
gsql: FATAL: can not accept connection in pending mode.
### failed to connect mogdb. sleep 1,tctest.log.insert tctest.log.ustore 2024-04-01 18:58:31.448 >>>> 2024-04-01 18:58:57.521: 23
>>>>>>>>>>>>>>> tctest_insert,log=tctest.log.insert 2024-04-01 18:58:31.448 ---- 2024-04-01 18:59:05.569: 23
INSERT 0 10
now | get_hostname | tctest_insert
-------------------------------+--------------+---------------
2024-04-01 18:59:05.604845+08 | mogdb114 | 230
(1 row)
now
-------------------------------
2024-04-01 18:59:05.605569+08
(1 row)
As the test above shows, CM performed the failover and promoted 114 to primary, while node 115 became a standby.
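The three client logs so far also give a rough measure of how long the test client was blocked during each role change. The sketch below subtracts the timestamp of the `sleep 1` line that follows the last successful insert from the timestamp on the header of the next successful round. Reading the log fields this way is an assumption based on the output shown here, and `gap_seconds` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def gap_seconds(last_ok, next_ok):
    """Seconds between the end of the last good round and the next good round."""
    delta = datetime.strptime(next_ok, FMT) - datetime.strptime(last_ok, FMT)
    return delta.total_seconds()


# Timestamps copied from the three client logs in this post.
windows = {
    "switchover 114 -> 115": ("2024-04-01 17:50:51.025", "2024-04-01 17:50:54.384"),
    "switchover 115 -> 117": ("2024-04-01 17:51:59.006", "2024-04-01 17:52:08.072"),
    "failover 115 -> 114": ("2024-04-01 18:58:54.365", "2024-04-01 18:59:05.569"),
}
for name, (last_ok, next_ok) in windows.items():
    print(f"{name}: {gap_seconds(last_ok, next_ok):.3f} s")
```

This works out to roughly 3.4 s, 9.1 s and 11.2 s. The figures include the client's own one-second sleeps and retry timing, so they are best read as client-side observations from a single run rather than as a measurement of CM itself.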
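Testing with the feature enabled
Before verifying the feature, we first need to set the relevant parameter on node 115 and confirm that it has taken effect: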
[omm@mogdb115 ~]$ gs_guc set -D $PGDATA -c "protect_standby=on"
The gs_guc run with the following arguments: [gs_guc -D /data/mogdb5.0/data -c protect_standby=on set ].
expected instance path: [/data/mogdb5.0/data/postgresql.conf]
gs_guc set: protect_standby=on: [/data/mogdb5.0/data/postgresql.conf]
Total instances: 1. Failed instances: 0.
Success to perform gs_guc!
[omm@mogdb115 ~]$ gs_ctl reload
[2024-04-01 19:04:37.767][4895][][gs_ctl]: gs_ctl reload ,datadir is /data/mogdb5.0/data
server signaled
[omm@mogdb115 ~]$ gsql -r
gsql ((MogDB 5.0.6 build 8b0a6ca8) compiled at 2024-03-27 11:05:29 commit 0 last mr 1804 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
MogDB=# show protect_standby ;
protect_standby
-----------------
on
(1 row)
MogDB=#
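The gs_guc output above shows that the setting is written to postgresql.conf. When gsql is not at hand, a quick offline check of that file could look like the following sketch. It assumes the file uses PostgreSQL-style `name = value` lines where the last uncommented entry wins, which I have not verified for MogDB specifically; the sample text and the helper name `guc_value` are hypothetical.

```python
def guc_value(conf_text, name):
    """Return the last uncommented value assigned to `name`, or None.

    Simplified: only handles `name = value` lines and treats every `#`
    as the start of a comment, so quoted values containing `#` are not
    supported.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().lower() == name.lower():
            value = val.strip().strip("'\"")
    return value


# Hypothetical excerpt, not copied from the test cluster.
sample_conf = """
port = 26000
#protect_standby = off
protect_standby = on   # set with gs_guc
"""
print(guc_value(sample_conf, "protect_standby"))
```

The file only tells you what has been configured. `show protect_standby` in gsql, as used above, remains the way to confirm the value the running instance actually sees after the reload.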
So what does the cluster state look like now?
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Primary Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Protect Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb114 ~]$
We can see that the database on node 115 has changed from Standby Normal to Protect Standby Normal.
Next, we test switchover and failover in turn.
[omm@mogdb114 ~]$ cm_ctl switchover -n 2 -D $PGDATA
.
cm_ctl: can not do switchover at current role(not standby),You can execute "cm_ctl query -v" and check
[omm@mogdb114 ~]$
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Primary Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Protect Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Standby Normal
[omm@mogdb114 ~]$
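As you can see, attempting a switchover to the protected node now fails: cm_ctl reports that the target is not in the standby role and refuses the operation.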
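A script that drives switchovers can apply the same rule before calling cm_ctl at all. The sketch below models only what the error message suggests, namely that a target whose role is anything other than plain `Standby` is rejected. The `cm_ctl switchover -n ... -D ...` form is the one used in this post, while `switchover_command` is a hypothetical helper and not part of any MogDB tool.

```python
def switchover_command(node_id, role, data_dir):
    """Build the cm_ctl argument list, refusing targets that are not plain standbys."""
    if role != "Standby":
        raise ValueError(
            f"node {node_id} has role {role!r}; only a plain Standby can be a switchover target"
        )
    return ["cm_ctl", "switchover", "-n", str(node_id), "-D", data_dir]


# Roles as shown in the cluster state above: node 2 is protected, node 3 is not.
for node_id, role in ((2, "Protect Standby"), (3, "Standby")):
    try:
        print(" ".join(switchover_command(node_id, role, "/data/mogdb5.0/data")))
    except ValueError as exc:
        print("skipped:", exc)
```

Checking the role first saves a round trip to CM and makes the reason for the refusal explicit in the script's own output.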
What about failover? Without this feature enabled, killing the primary process on 114 might lead CM to promote 115 to primary. Now for the moment of truth.
[omm@mogdb114 ~]$ ps -fu omm
UID PID PPID C STIME TTY TIME CMD
omm 7693 17884 0 19:07 ? 00:00:00 arping -D -f -w 1 -I ens192 172.20.22.180
omm 7702 27470 0 19:07 pts/2 00:00:00 ps -fu omm
omm 17884 25302 15 17:48 ? 00:12:32 /data/mogdb5.0/app/5.0.5/bin/cm_agent
omm 17904 1 14 17:48 ? 00:11:47 /data/mogdb5.0/app/5.0.5/bin/cm_server
omm 17930 1 0 17:48 ? 00:00:00 mogdb fenced UDF master process
omm 25302 1 1 16:57 ? 00:01:17 /data/mogdb5.0/app/5.0.5/bin/om_monitor -L /data/mogdb5.0/log/cm/om_monitor
omm 27470 27469 0 17:02 pts/2 00:00:00 -bash
omm 30626 1 6 18:58 ? 00:00:34 /data/mogdb5.0/app/5.0.5/bin/mogdb -D /data/mogdb5.0/data -M pending
[omm@mogdb114 ~]$ kill -9 30626
[omm@mogdb114 ~]$
[omm@mogdb114 ~]$ c
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------
1 mogdb114 172.20.22.114 1 /data/mogdb5.0/cm/cm_server Primary
2 mogdb115 172.20.22.115 2 /data/mogdb5.0/cm/cm_server Standby
3 mogdb117 172.20.22.117 3 /data/mogdb5.0/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state | node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 mogdb114 172.20.22.114 6001 /data/mogdb5.0/data P Standby Normal | 2 mogdb115 172.20.22.115 6002 /data/mogdb5.0/data S Protect Standby Normal | 3 mogdb117 172.20.22.117 6003 /data/mogdb5.0/data S Primary Normal
[omm@mogdb114 ~]$
After we kill the primary process on node 114, the cluster fails over to node 117, and the primary no longer moves to 115. This matches our expectation!
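The behavior observed here can be summarized as a filter over the promotion candidates. The following sketch is only a model of that observable result, assuming that a healthy, unprotected standby is eligible and a protected one is not. It is not CM's actual election logic, which also weighs factors this post does not cover, and `promotion_candidates` is a hypothetical name.

```python
def promotion_candidates(instances, failed_instance):
    """Return instance IDs that may be promoted after `failed_instance` is lost."""
    return [
        inst["instance"]
        for inst in instances
        if inst["instance"] != failed_instance
        and inst["role"] == "Standby"      # "Protect Standby" is filtered out
        and inst["status"] == "Normal"
    ]


# Cluster state just before the primary on 6001 (node 114) was killed.
protected = [
    {"instance": 6001, "role": "Primary", "status": "Normal"},
    {"instance": 6002, "role": "Protect Standby", "status": "Normal"},
    {"instance": 6003, "role": "Standby", "status": "Normal"},
]
# The same cluster with protect_standby left off on 6002.
unprotected = [dict(inst) for inst in protected]
unprotected[1]["role"] = "Standby"

print(promotion_candidates(protected, failed_instance=6001))    # [6003]
print(promotion_candidates(unprotected, failed_instance=6001))  # [6002, 6003]
```

With the parameter on, 6003 is the only candidate left, which is consistent with the failover landing on node 117 in the test.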
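So what are the practical use cases for this feature?
In some scenarios, users need to make sure that a particular standby is never touched by the cluster manager and always stays in the standby role, while manual intervention remains possible when necessary. It also lets users steer promotion toward specific nodes instead of leaving the choice entirely to the cluster manager.
So, do you find this small feature useful?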
References:
MogDB 5.0.6 new features overview https://docs.mogdb.io/zh/mogdb/v5.0/5.0.6
This article was published to multiple platforms via mdnice.