Performing a Primary/Standby Switchover

Scenarios

The database administrator may need to manually perform a primary/standby switchover on an openGauss database node that is currently running. For example, after a primary/standby failover, you may need to restore the original primary/standby roles, or a hardware fault may require a manual switchover. A cascaded standby node cannot be directly promoted to primary. You must first perform a switchover or failover to change the cascaded standby node to a standby node, and then promote it to primary.

NOTE:

  • The primary/standby switchover is a maintenance operation. Ensure that openGauss is in the normal state and perform the switchover only after all services are complete.
  • Cascaded standby nodes are not supported when ultimate RTO is enabled. With ultimate RTO enabled, the cascaded standby node cannot connect to the standby node, so data cannot be synchronized to it.
  • After a cascaded standby node is switched over, the synchronous_standby_names parameter of the primary node is not adjusted automatically. You may therefore need to adjust synchronous_standby_names on the primary node manually; otherwise, write services on the primary node may be blocked. See the sketch after this note.
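
For example, you can check and, if necessary, reload the parameter on the primary node. The following is a minimal sketch: the port, data directory, and target value are taken from the examples in this section, and the exact gs_guc options may differ in your environment.

    # Check the current value on the primary node (assumed port 5432).
    gsql -d postgres -p 5432 -c "SHOW synchronous_standby_names;"

    # Reload the adjusted value on the primary node. The value '*' is only an
    # illustration; use the standby names required by your deployment.
    gs_guc reload -Z datanode -D /home/omm/cluster/dn1/ -c "synchronous_standby_names = '*'"
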

Procedure

  1. Log in to any database node as the OS user omm and run the following command to check the primary/standby status:

    gs_om -t status --detail
    
  2. Log in to the standby node to be switched to the primary node as the OS user omm and run the following command:

    gs_ctl switchover -D /home/omm/cluster/dn1/
    

    /home/omm/cluster/dn1/ is the data directory of the standby database node.

    NOTICE: For the same database, you cannot start a new primary/standby switchover until the previous switchover has completed. If a switchover is performed while a thread on the primary node is still processing services, the thread cannot stop and a switchover timeout is reported. The switchover actually continues in the background and completes once the thread finishes processing and stops. For example, when a thread is deleting a large partitioned table, the primary instance may fail to respond to the switchover request. A way to check the switchover progress is sketched below.
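
    If you are unsure whether a previous switchover is still in progress, you can query the local HA state of the instance from its data directory. This is a minimal sketch using the example data directory above; the fields reported (such as local_role and db_state) may vary by version.

    gs_ctl query -D /home/omm/cluster/dn1/
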

    If the primary node is faulty, run the following command on the standby node:

    gs_ctl failover -D /home/omm/cluster/dn1/
    
  3. After the switchover or failover is successful, run the following command to record information about the current primary and standby nodes:

    gs_om -t refreshconf
    

Examples

Run the following command to switch the standby database instance to the primary database instance:

  1. Query the database status.

    gs_om -t status --detail
    
    [   Cluster State   ]
    
    cluster_state   : Normal
    redistributing  : No
    current_az      : AZ_ALL
    
    [  Datanode State   ]
    
        node             node_ip         port      instance                            state
    --------------------------------------------------------------------------------------------------
    1  pekpopgsci00235  10.244.62.204    5432      6001 /home/wuqw/cluster/dn1/   P Primary Normal
    2  pekpopgsci00238  10.244.61.81     5432      6002 /home/wuqw/cluster/dn1/   S Standby Normal
    
  2. Log in to the standby node and perform a primary/standby switchover. Note that if the switchover is performed on a cascaded standby node, the cascaded standby node becomes a standby node and the original standby node becomes a cascaded standby node.

    gs_ctl switchover -D /home/wuqw/cluster/dn1/
    [2020-06-17 14:28:01.730][24438][][gs_ctl]: gs_ctl switchover ,datadir is -D "/home/wuqw/cluster/dn1"
    [2020-06-17 14:28:01.730][24438][][gs_ctl]: switchover term (1)
    [2020-06-17 14:28:01.768][24438][][gs_ctl]: waiting for server to switchover............
    [2020-06-17 14:28:11.175][24438][][gs_ctl]: done
    [2020-06-17 14:28:11.175][24438][][gs_ctl]: switchover completed (/home/wuqw/cluster/dn1)
    
  3. Save the information about the primary and standby nodes in the database.

    gs_om -t refreshconf
    Generating dynamic configuration file for all nodes.
    Successfully generated dynamic configuration file.
    

Troubleshooting

If a switchover fails, troubleshoot the problem according to the log information. For details, see Log Reference.

Exception Handling

Exception handling rules are as follows:

  • A switchover takes a long time under high service loads. In this case, no further operation is required.

  • When standby nodes are being built, a primary node can be demoted to a standby node only after sending logs to one of the standby nodes. As a result, the primary/standby switchover takes a long time. In this case, no further operation is required. However, you are not advised to perform a primary/standby switchover during the build process.

  • During a switchover, network faults or high disk usage may cause the primary and standby instances to become disconnected, or leave two primary nodes in a single pair. In this case, perform the following steps:

    WARNING: After two primary nodes appear, perform the following steps to restore the normal primary/standby state. Otherwise, data loss may occur.

  1. Run the following command to query the current instance status of the database:

    gs_om -t status --detail
    

    The query result shows that two instances are in the Primary state, which is abnormal.

  2. Determine which node should serve as the standby node and run the following command on that node to stop its services:

    gs_ctl stop -D /home/omm/cluster/dn1/
    
  3. Run the following command to start the standby node in standby mode:

    gs_ctl start -D /home/omm/cluster/dn1/ -M standby
    
  4. Save the information about the primary and standby nodes in the database.

    gs_om -t refreshconf
    
  5. Check the database status and ensure that the instance status is restored.
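
    For example, run the status query from step 1 again and confirm that exactly one instance is displayed as Primary and the other as Standby:

    gs_om -t status --detail
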
