gs_sdr

Background

openGauss 3.1.0 and later versions provide the gs_sdr tool to implement cross-region remote disaster recovery (DR) without using additional storage media. The tool provides functions such as streaming DR establishment, DR failover, planned primary/standby switchover, DR removal, DR status monitoring, and displaying the help information and version number.

Prerequisites

You have logged in to the OS as OS user omm; the gs_sdr command must be run by user omm.

Syntax

  • Establishing a DR Relationship

    gs_sdr -t start -m [primary|disaster_standby] [-U DR_USERNAME] [-W DR_PASSWORD] [-X XMLFILE] [--json JSONFILE] [--time-out=SECS] [-l LOGFILE]
    
  • Promoting DR Instance to Primary

    gs_sdr -t failover [-l LOGFILE] 
    
  • Planned Primary/Standby Switchover

    gs_sdr -t switchover -m [primary|disaster_standby] [--time-out=SECS] [-l LOGFILE]
    
  • DR Removal

    gs_sdr -t stop [-X XMLFILE] [--json JSONFILE] [-l LOGFILE]
    
  • Monitoring DR Status

    gs_sdr -t query [-l LOGFILE]
    

Parameter Description

gs_sdr has the following types of parameters:

  • Common parameters

    • -t

      Specifies the type of the gs_sdr command.

      Value range: start, failover, switchover, stop, or query.

    • -l

      Specifies a log file and its storage path.

      Default value: $GAUSSLOG/om/gs_sdr-YYYY-MM-DD_hhmmss.log

    • -?, --help

      Displays the help information.

    • -V, --version

      Displays version information.

  • Parameters for establishing a DR relationship:

    • -m

      Expected role of the cluster in the DR relationship.

      Value range: primary or disaster_standby.

    • -U

      Name of the DR user with the streaming replication permission.

    • -W

      Password of the DR user.

      NOTE:

      1. Before the DR relationship is established, you must create a DR user on the primary cluster for DR authentication. The primary and standby clusters must use the same DR username and password. After a DR relationship is established, the user password cannot be changed; to change the username or password, remove the DR relationship, modify them, and then establish the DR relationship again. The DR user password cannot contain spaces or any of the following characters: |;&$<>`'"{}()[]~*?!\n
      2. If the -U and -W parameters are not specified on the command line, they can be entered in interactive mode during the establishment.
    • -X

      XML file used during cluster installation. DR information can be configured in the XML file for DR establishment; that is, three parameters (localStreamIpmap1, remoteStreamIpmap1, and remotedataPortBase) are added to the file.

      The following shows how to configure the new parameters. The values are examples, and each parameter is preceded by a comment.

      <!-- Information about the node deployment on each server -->
      <DEVICELIST>
      <DEVICE sn="pekpomdev00038">
      <!-- Number of primary DNs that need to be deployed on the current host -->
      <PARAM name="dataNum" value="1"/>
      <!-- Base port number of the primary DN -->
      <PARAM name="dataPortBase" value="26000"/>
      <!-- Mapping between the SSH reliable channel IP address and the streaming replication IP address of each DN shard node in the local cluster -->
      <PARAM name="localStreamIpmap1" value="(10.244.44.216,172.31.12.58),(10.244.45.120,172.31.0.91)"/>
      <!-- Mapping between the SSH reliable channel IP address and the streaming replication IP address of each DN shard node in the peer cluster -->
      <PARAM name="remoteStreamIpmap1" value="(10.244.45.144,172.31.2.200),(10.244.45.40,172.31.0.38),(10.244.46.138,172.31.11.145),(10.244.48.60,172.31.9.37),(10.244.47.240,172.31.11.125)"/>
      <!-- Port number of the primary DN in the peer cluster -->
      <PARAM name="remotedataPortBase" value="26000"/>
      </DEVICE>
      </DEVICELIST>
      
    • --json

      JSON file containing DR information.

      The following shows how to configure the JSON file. The values are examples.

      {
          "remoteClusterConf": {
              "port": 26000,
              "shards": [[
                  {"ip": "10.244.45.144", "dataIp": "172.31.2.200"},
                  {"ip": "10.244.45.40", "dataIp": "172.31.0.38"},
                  {"ip": "10.244.46.138", "dataIp": "172.31.11.145"},
                  {"ip": "10.244.48.60", "dataIp": "172.31.9.37"},
                  {"ip": "10.244.47.240", "dataIp": "172.31.11.125"}
              ]]
          },
          "localClusterConf": {
              "port": 26000,
              "shards": [[
                  {"ip": "10.244.44.216", "dataIp": "172.31.12.58"},
                  {"ip": "10.244.45.120", "dataIp": "172.31.0.91"}
              ]]
          }
      }
      Parameter description:
      # remoteClusterConf: DN shard information of the peer cluster. port indicates the port of the primary DN in the peer cluster, and each entry such as {"ip": "10.244.45.144", "dataIp": "172.31.2.200"} indicates the mapping between the SSH reliable channel IP address and the streaming replication IP address of a DN shard node in the peer cluster.
      # localClusterConf: DN shard information of the local cluster. port indicates the port of the primary DN in the local cluster, and each entry such as {"ip": "10.244.44.216", "dataIp": "172.31.12.58"} indicates the mapping between the SSH reliable channel IP address and the streaming replication IP address of a DN shard node in the local cluster.
      

      NOTE:

      Either -X or --json can be used to configure DR information. If both parameters are specified in the command, the JSON file prevails.

    • --time-out=SECS

      Specifies the timeout period for the primary cluster to wait for the connection to the standby cluster. If the connection times out, the OM script exits automatically. Unit: s

      Value range: a positive integer. The recommended value is 1200.

      Default value: 1200
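
For illustration, the --json DR configuration can also be generated programmatically. The following Python sketch builds and round-trip-checks such a file; the file name streaming_dr_conf.json is a placeholder, and the IP addresses and ports are the example values from this section, to be replaced with those of your environment:

```python
import json

# DN shard topology of the local and peer clusters.
# "ip" is the SSH reliable channel IP address; "dataIp" is the
# streaming replication IP address of the same node.
dr_conf = {
    "remoteClusterConf": {
        "port": 26000,  # port of the primary DN in the peer cluster
        "shards": [[
            {"ip": "10.244.45.144", "dataIp": "172.31.2.200"},
            {"ip": "10.244.45.40", "dataIp": "172.31.0.38"},
        ]],
    },
    "localClusterConf": {
        "port": 26000,  # port of the primary DN in the local cluster
        "shards": [[
            {"ip": "10.244.44.216", "dataIp": "172.31.12.58"},
            {"ip": "10.244.45.120", "dataIp": "172.31.0.91"},
        ]],
    },
}

# Write the file that will be passed via --json.
with open("streaming_dr_conf.json", "w") as f:
    json.dump(dr_conf, f)

# Round-trip check: the file must parse and contain both cluster sections.
with open("streaming_dr_conf.json") as f:
    loaded = json.load(f)
assert {"remoteClusterConf", "localClusterConf"} <= loaded.keys()
```

The generated file can then be supplied to the tool, for example as gs_sdr -t start -m primary --json streaming_dr_conf.json.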

  • Parameters for promoting a DR cluster to primary:

    None.

  • Parameters for removing the DR relationship:

    • -X

      XML file used during cluster installation, with DR information configured; that is, the file is extended with the three parameters localStreamIpmap1, remoteStreamIpmap1, and remotedataPortBase.

    • --json

      JSON file containing local and peer DR information.

      NOTE:

      For details about how to configure -X and --json, see the parameters for establishing a DR relationship in this section.

  • Parameters for querying the DR status:

    None.

    The DR status query result is described as follows:

| Item | Meaning | Value | Description | Remarks |
|------|---------|-------|-------------|---------|
| hadr_cluster_stat | Database instance status in streaming DR | normal | The database instance does not participate in streaming DR. | - |
| | | full_backup | Full data replication in the primary database instance is in progress. | This status is available only for the primary database instance in streaming DR. |
| | | archive | Streaming log replication in the primary database instance is in progress. | This status is available only for the primary database instance in streaming DR. |
| | | backup_fail | Full data replication in the primary database instance fails. | This status is available only for the primary database instance in streaming DR. |
| | | archive_fail | Streaming log replication in the primary database instance fails. | This status is available only for the primary database instance in streaming DR. |
| | | switchover | Planned primary/standby switchover is in progress. | This status is available for both the primary and standby database instances in streaming DR. |
| | | restore | Full data restoration in the DR database instance is in progress. | This status is available only for DR database instances in streaming DR. |
| | | restore_fail | Full data restoration in the DR database instance fails. | This status is available only for DR database instances in streaming DR. |
| | | recovery | Streaming log replication in the DR database instance is in progress. | This status is available only for DR database instances in streaming DR. |
| | | recovery_fail | Streaming log replication in the DR database instance fails. | This status is available only for DR database instances in streaming DR. |
| | | promote | The DR database instance is being promoted to primary. | This status is available only for DR database instances in streaming DR. |
| | | promote_fail | The DR database instance fails to be promoted to primary. | This status is available only for DR database instances in streaming DR. |
| hadr_switchover_stat | Progress of the planned switchover between the primary and standby database instances in streaming DR | Percentage | Switchover progress. | - |
| hadr_failover_stat | Progress of promoting a streaming DR database instance to primary | Percentage | Failover progress. | - |
| RTO | Time required for data restoration when a disaster occurs | Null | Streaming DR is interrupted due to database instance shutdown or network exceptions. | This item can be queried only on the primary database instance in streaming DR. |
| | | Not null | Time required for data restoration, in seconds. | |
| RPO | Duration of data loss in the database instance when a disaster occurs | Null | Streaming DR is interrupted due to database instance shutdown or network exceptions. | This item can be queried only on the primary database instance in streaming DR. |
| | | Not null | Duration in which data of the database instance may be lost, in seconds. | |
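
The hadr_cluster_stat values above fall into steady, in-progress, and failed groups. The following Python sketch encodes that grouping for monitoring scripts; the state names come from the table, while the category labels are an illustrative interpretation, not tool output:

```python
# States from the hadr_cluster_stat table, grouped by outcome.
FAILED_STATES = {"backup_fail", "archive_fail", "restore_fail",
                 "recovery_fail", "promote_fail"}
IN_PROGRESS_STATES = {"full_backup", "archive", "switchover",
                      "restore", "recovery", "promote"}

def classify(stat: str) -> str:
    """Classify a hadr_cluster_stat value reported by gs_sdr -t query."""
    if stat == "normal":
        return "not participating in streaming DR"
    if stat in FAILED_STATES:
        return "failed"
    if stat in IN_PROGRESS_STATES:
        return "in progress"
    return "unknown"
```

A monitoring job could, for example, raise an alarm whenever classify() returns "failed".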

Examples

  • Establish a DR relationship in a primary cluster.

    gs_sdr -t start -m primary -X /opt/install_streaming_primary_cluster.xml --time-out=1200 -U 'hadr_user' -W 'opengauss@123'
    --------------------------------------------------------------------------------
    Streaming disaster recovery start 2b9bc268d8a111ecb679fa163e2f2d28
    --------------------------------------------------------------------------------
    Start create streaming disaster relationship ...
    Got step:[-1] for action:[start].
    Start first step of streaming start.
    Start common config step of streaming start.
    Start generate hadr key files.
    Streaming key files already exist.
    Finished generate and distribute hadr key files.
    Start encrypt hadr user info.
    Successfully encrypt hadr user info.
    Start save hadr user info into database.
    Successfully save hadr user info into database.
    Start update pg_hba config.
    Successfully update pg_hba config.
    Start second step of streaming start.
    Successfully check cluster status is: Normal
    Successfully check instance status.
    Successfully check cm_ctl is available.
    Successfully check cluster is not under upgrade opts.
    Start checking disaster recovery user.
    Successfully check disaster recovery user.
    Start prepare secure files.
    Start copy hadr user key files.
    Successfully copy secure files.
    Start fourth step of streaming start.
    Starting reload wal_keep_segments value: 16384.
    Successfully reload wal_keep_segments value: 16384.
    Start fifth step of streaming start.
    Successfully set [/omm/CMServer/backup_open][0].
    Start sixth step of streaming start.
    Start seventh step of streaming start.
    Start eighth step of streaming start.
    Waiting main standby connection..
    Main standby already connected.
    Successfully check cluster status is: Normal
    Start ninth step of streaming start.
    Starting reload wal_keep_segments value: {'6001': '128'}.
    Successfully reload wal_keep_segments value: {'6001': '128'}.
    Successfully removed step file.
    Successfully do streaming disaster recovery start.
    
  • Establish a DR relationship in a standby cluster.

    gs_sdr -t start -m disaster_standby -X /opt/install_streaming_standby_cluster.xml --time-out=1200 -U 'hadr_user' -W 'opengauss@123'
    --------------------------------------------------------------------------------
    Streaming disaster recovery start e34ec1e4d8a111ecb617fa163e77e94a
    --------------------------------------------------------------------------------
    Start create streaming disaster relationship ...
    Got step:[-1] for action:[start].
    Start first step of streaming start.
    Start common config step of streaming start.
    Start update pg_hba config.
    Successfully update pg_hba config.
    Start second step of streaming start.
    Successfully check cluster status is: Normal
    Successfully check instance status.
    Successfully check cm_ctl is available.
    Successfully check cluster is not under upgrade opts.
    Start build key files from remote cluster.
    Start copy hadr user key files.
    Successfully build and distribute key files to all nodes.
    Start fourth step of streaming start.
    Start fifth step of streaming start.
    Successfully set [/omm/CMServer/backup_open][2].
    Stopping the cluster by node.
    Successfully stopped the cluster by node for streaming cluster.
    Start sixth step of streaming start.
    Start seventh step of streaming start.
    Start eighth step of streaming start.
    Starting the cluster.
    Successfully started primary instance. Please wait for standby instances.
    Waiting cluster normal...
    Successfully started standby instances.
    Successfully check cluster status is: Normal
    Start ninth step of streaming start.
    Successfully removed step file.
    Successfully do streaming disaster recovery start.
    
  • Demote a primary cluster to standby as planned.

    gs_sdr -t switchover -m disaster_standby
    --------------------------------------------------------------------------------
    Streaming disaster recovery switchover 6897d15ed8a411ec82acfa163e2f2d28
    --------------------------------------------------------------------------------
    Start streaming disaster switchover ...
    Streaming disaster cluster switchover...
    Successfully check cluster status is: Normal
    Parse cluster conf from file.
    Successfully parse cluster conf from file.
    Successfully check cluster is not under upgrade opts.
    Got step:[-1] for action:[switchover].
    Stopping the cluster.
    Successfully stopped the cluster.
    Starting the cluster.
    Successfully started primary instance. Please wait for standby instances.
    Waiting cluster normal...
    Successfully started standby instances.
    Start checking truncation, please wait...
    Stopping the cluster.
    Successfully stopped the cluster.
    Starting the cluster.
    Successfully started primary instance. Please wait for standby instances.
    Waiting cluster normal...
    Successfully started standby instances.
    .
    The cluster status is Normal.
    Successfully removed step file.
    Successfully do streaming disaster recovery switchover.
    
  • Promote a standby cluster to primary as planned.

    gs_sdr -t switchover -m primary
    --------------------------------------------------------------------------------
    Streaming disaster recovery switchover 20542bbcd8a511ecbbdbfa163e77e94a
    --------------------------------------------------------------------------------
    Start streaming disaster switchover ...
    Streaming disaster cluster switchover...
    Waiting for cluster and instances normal...
    Successfully check cluster status is: Normal
    Parse cluster conf from file.
    Successfully parse cluster conf from file.
    Successfully check cluster is not under upgrade opts.
    Waiting for switchover barrier...
    Got step:[-1] for action:[switchover].
    Stopping the cluster by node.
    Successfully stopped the cluster by node for streaming cluster.
    Starting the cluster.
    Successfully started primary instance. Please wait for standby instances.
    Waiting cluster normal...
    Successfully started standby instances.
    Successfully check cluster status is: Normal
    Successfully removed step file.
    Successfully do streaming disaster recovery switchover.
    
  • Promote a DR cluster to primary.

    gs_sdr -t failover
    --------------------------------------------------------------------------------
    Streaming disaster recovery failover 65535214d8a611ecb804fa163e2f2d28
    --------------------------------------------------------------------------------
    Start streaming disaster failover ...
    Got step:[-1] for action:[failover].
    Successfully check cluster status is: Normal
    Successfully check cluster is not under upgrade opts.
    Parse cluster conf from file.
    Successfully parse cluster conf from file.
    Got step:[-1] for action:[failover].
    Starting drop all node replication slots
    Finished drop all node replication slots
    Stopping the cluster by node.
    Successfully stopped the cluster by node for streaming cluster.
    Start remove replconninfo for instance:6001
    Start remove replconninfo for instance:6002
    Start remove replconninfo for instance:6003
    Start remove replconninfo for instance:6005
    Start remove replconninfo for instance:6004
    Successfully removed replconninfo for instance:6001
    Successfully removed replconninfo for instance:6004
    Successfully removed replconninfo for instance:6003
    Successfully removed replconninfo for instance:6002
    Successfully removed replconninfo for instance:6005
    Start remove pg_hba config.
    Finished remove pg_hba config.
    Starting the cluster.
    Successfully started primary instance. Please wait for standby instances.
    Waiting cluster normal...
    Successfully started standby instances.
    Successfully check cluster status is: Normal
    Try to clean hadr user info.
    Successfully clean hadr user info from database.
    Successfully removed step file.
    Successfully do streaming disaster recovery failover.
    
  • Remove the DR relationship from a primary cluster.

    gs_sdr -t stop -X /opt/install_streaming_standby_cluster.xml
    --------------------------------------------------------------------------------
    Streaming disaster recovery stop dae8539ed8a611ecade9fa163e77e94a
    --------------------------------------------------------------------------------
    Start remove streaming disaster relationship ...
    Got step:[-1] for action:[stop].
    Start first step of streaming stop.
    Start second step of streaming start.
    Successfully check cluster status is: Normal
    Check cluster type succeed.
    Successfully check cluster is not under upgrade opts.
    Start third step of streaming stop.
    Start remove replconninfo for instance:6001
    Start remove replconninfo for instance:6002
    Successfully removed replconninfo for instance:6001
    Successfully removed replconninfo for instance:6002
    Start remove cluster file.
    Finished remove cluster file.
    Start fourth step of streaming stop.
    Start remove pg_hba config.
    Finished remove pg_hba config.
    Start fifth step of streaming start.
    Starting drop all node replication slots
    Finished drop all node replication slots
    Start sixth step of streaming stop.
    Successfully check cluster status is: Normal
    Try to clean hadr user info.
    Successfully clean hadr user info from database.
    Successfully removed step file.
    Successfully do streaming disaster recovery stop.
    
  • Query the DR status.

    gs_sdr -t query
    --------------------------------------------------------------------------------
    Streaming disaster recovery query 1201b062d8a411eca83efa163e2f2d28
    --------------------------------------------------------------------------------
    Start streaming disaster query ...
    Successfully check cluster is not under upgrade opts.
    Start check archive.
    Start check recovery.
    Start check RPO & RTO.
    Successfully execute streaming disaster recovery query, result:
    {'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}
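
The result line printed by gs_sdr -t query is formatted like a Python dict literal, so it can be parsed with the standard library. A minimal sketch, assuming the output format shown above:

```python
import ast

# Last line of the example query output above.
result_line = ("{'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', "
               "'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}")

# literal_eval safely parses the dict literal without executing code.
result = ast.literal_eval(result_line)
assert result["hadr_cluster_stat"] == "archive"  # log replication in progress
# RPO/RTO of '0' means no measurable replication lag between the clusters.
print(result["RPO"], result["RTO"])  # prints: 0 0
```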
    