Performing a Primary/Standby Switchover

Scenarios

While the openGauss database is running, the database administrator may need to manually perform a primary/standby switchover on a database node. For example, after a primary/standby failover, you may need to restore the original primary/standby roles, or a hardware fault may require a manual switchover. A cascaded standby server cannot be switched directly to a primary server. You must first perform a switchover or failover to change the cascaded standby server to a standby server, and then switch it to a primary server.

NOTE:
The primary/standby switchover is a maintenance operation. Ensure that the openGauss database is in the normal state, and perform the switchover only after all services are complete.
When the ultimate RTO is enabled, cascaded standby servers are not supported. A standby server cannot accept connections while the ultimate RTO is enabled, so a cascaded standby server cannot synchronize data from it.

Procedure

Log in to any database node as the OS user omm and run the following command to check the primary/standby status:
```
gs_om -t status --detail
```
Log in to the standby node to be switched to the primary node as the OS user omm and run the following command:
```
gs_ctl switchover -D /home/omm/cluster/dn1/
```
/home/omm/cluster/dn1/ is the data directory of the standby database node.
NOTICE: For the same database, you cannot start a new primary/standby switchover until the previous switchover has completed. If a switchover is performed while a thread on the primary node is processing services, the thread cannot stop and a switchover timeout is reported. The switchover is in fact still running in the background and completes after the thread finishes processing and stops. For example, while the primary node is deleting a large partitioned table, it may fail to respond to the switchover request.
After the switchover is successful, run the following command to record the information about the current primary and standby nodes:
```
gs_om -t refreshconf
```
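The three steps above can be chained so that the switchover runs only when the status output reports a normal cluster. The following is a minimal sketch, not part of the openGauss tooling: the helper names `cluster_is_normal` and `guarded_switchover` are hypothetical, and the check assumes the status output contains a `cluster_state   : Normal` line as in the sample output in this section.

```shell
# Hypothetical helper: reads "gs_om -t status --detail" output on stdin and
# succeeds only if a cluster_state line reports Normal.
cluster_is_normal() {
    awk -F':' '
        /^cluster_state/ {
            gsub(/[ \t]/, "", $2)          # strip padding around the value
            if ($2 == "Normal") ok = 1
        }
        END { exit ok ? 0 : 1 }
    '
}

# Hypothetical wrapper: run it on the standby node that is to become primary.
# $1 is that node's data directory, for example /home/omm/cluster/dn1/.
guarded_switchover() {
    data_dir=$1
    if gs_om -t status --detail | cluster_is_normal; then
        # Switch over, then record the new primary/standby information.
        gs_ctl switchover -D "$data_dir" && gs_om -t refreshconf
    else
        echo "cluster_state is not Normal; switchover skipped" >&2
        return 1
    fi
}
```

A call such as `guarded_switchover /home/omm/cluster/dn1/` would then replace the manual sequence. The sketch does not handle the switchover timeout described in the notice above; a timeout still needs to be checked by querying the status again.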

Examples

The following example switches the standby database instance to the primary database instance.

Query the database status.

gs_om -t status --detail

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node             node_ip         port      instance                            state
--------------------------------------------------------------------------------------------------
1  pekpopgsci00235  10.244.62.204    5432      6001 /home/wuqw/cluster/dn1/   P Primary Normal
2  pekpopgsci00238  10.244.61.81     5432      6002 /home/wuqw/cluster/dn1/   S Standby Normal

gs_om -t status --detail
[  CMServer State   ]

node      node_ip         instance                                 state
--------------------------------------------------------------------------
1  host40 10.243.40.20    1    /usr1/cm_gauss/cluster/cm/cm_server Primary
2  host39 10.243.39.8     2    /usr1/cm_gauss/cluster/cm/cm_server Standby
3  host15 10.243.15.65    3    /usr1/cm_gauss/cluster/cm/cm_server Standby

[    ETCD State     ]

node      node_ip         instance                         state
------------------------------------------------------------------------
1  host40 10.243.40.20    7001 /usr1/cm_gauss/cluster/etcd StateFollower
2  host39 10.243.39.8     7002 /usr1/cm_gauss/cluster/etcd StateFollower
3  host15 10.243.15.65    7003 /usr1/cm_gauss/cluster/etcd StateLeader

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[  Datanode State   ]

node      node_ip         instance                        state            | node      node_ip         instance                        state            | node      node_ip         instance                        state
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  host40 10.243.40.20    6001 /usr1/cm_gauss/cluster/dn1 P Primary Normal | 2  host39 10.243.39.8     6002 /usr1/cm_gauss/cluster/dn1 S Standby Normal | 3  host15 10.243.15.65    6003 /usr1/cm_gauss/cluster/dn1 S Standby Normal
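To pick the standby node for the next step, the node name, data directory, and role can be extracted from the `[  Datanode State   ]` section. The following is a rough sketch with a hypothetical helper name, `list_instances`. It assumes the two layouts shown above (one instance per row, or several instances per row separated by `|`) and a single-word state such as `Normal` at the end of each entry; other states may need different parsing.

```shell
# Hypothetical helper: reads "gs_om -t status --detail" output on stdin and
# prints "node data_directory role" for every datanode instance.
list_instances() {
    awk '
        # A bracketed header starts a new section; remember whether it is
        # the Datanode State section.
        /^[ \t]*\[/ { in_dn = ($0 ~ /Datanode State/); next }

        # Instance rows begin with the node index.
        in_dn && /^[ \t]*[0-9]/ {
            n = split($0, chunk, "|")          # several instances may share a row
            for (i = 1; i <= n; i++) {
                m = split(chunk[i], f, " ")
                # Assumed entry layout: ... data_directory, role letter, role, state
                if (m >= 4) print f[2], f[m - 3], f[m - 1]
            }
        }
    '
}
```

With the first sample output above, `gs_om -t status --detail | list_instances` would print `pekpopgsci00235 /home/wuqw/cluster/dn1/ Primary` and `pekpopgsci00238 /home/wuqw/cluster/dn1/ Standby`.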

Log in to the standby node and perform a primary/standby switchover. In addition, after a cascaded standby node is switched over, the cascaded standby server becomes a standby server, and the original standby server becomes a cascaded standby server.

gs_ctl switchover -D /home/wuqw/cluster/dn1/
[2020-06-17 14:28:01.730][24438][][gs_ctl]: gs_ctl switchover ,datadir is -D "/home/wuqw/cluster/dn1"
[2020-06-17 14:28:01.730][24438][][gs_ctl]: switchover term (1)
[2020-06-17 14:28:01.768][24438][][gs_ctl]: waiting for server to switchover............
[2020-06-17 14:28:11.175][24438][][gs_ctl]: done
[2020-06-17 14:28:11.175][24438][][gs_ctl]: switchover completed (/home/wuqw/cluster/dn1)

Save the information about the primary and standby nodes in the database.

gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.

Troubleshooting

If a switchover fails, troubleshoot the problem according to the log information. For details, see Log Reference.

Exception Handling

Exception handling rules are as follows:

A switchover takes a long time under high service loads. In this case, no further operation is required.
When standby nodes are being built, a primary node can be demoted to a standby node only after sending logs to one of the standby nodes. As a result, the primary/standby switchover takes a long time. In this case, no further operation is required. However, you are not advised to perform a primary/standby switchover during the build process.
During a switchover, network faults or high disk usage may cause the primary and standby instances to be disconnected, or two primary nodes to exist in a single pair. In this case, perform the following steps:
WARNING: If two primary nodes appear, perform the following steps to restore the normal primary/standby state. Otherwise, data loss may occur.

Run the following command to query the current instance status of the database:
```
gs_om -t status --detail
```
The query result shows that the status of two instances is Primary, which is abnormal.
Determine which node should act as the standby node, and run the following command on that node to stop the service:
```
gs_ctl stop -D /home/omm/cluster/dn1/
```
Run the following command to start the standby node in standby mode:
```
gs_ctl start -D /home/omm/cluster/dn1/ -M standby
```
Save the information about the primary and standby nodes in the database.
```
gs_om -t refreshconf
```
Check the database status and ensure that the instance status is restored.
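The dual-primary check, both before and after the recovery steps, can be automated by counting the instances whose role is Primary in the `[  Datanode State   ]` section. The following is a minimal sketch with a hypothetical helper name, `count_primaries`. It assumes that each datanode entry reports its role as a separate word, as in the sample status output in the Examples section, and it skips the CMServer section, which also uses the word Primary.

```shell
# Hypothetical helper: reads "gs_om -t status --detail" output on stdin and
# prints the number of datanode instances whose role is Primary.
count_primaries() {
    awk '
        # Track which bracketed section the current line belongs to.
        /^[ \t]*\[/ { in_dn = ($0 ~ /Datanode State/); next }

        # Inside the Datanode State section, count every standalone "Primary" word.
        in_dn { for (i = 1; i <= NF; i++) if ($i == "Primary") n++ }

        END { print n + 0 }
    '
}

# Example use (requires the cluster tools, so it is left commented out):
#   primaries=$(gs_om -t status --detail | count_primaries)
#   [ "$primaries" -eq 1 ] || echo "unexpected number of primaries: $primaries" >&2
```

A result of 1 matches the healthy state; a result of 2 corresponds to the abnormal state that the steps above are meant to repair.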