Routine Maintenance

To ensure data security in GaussDB Kernel and prevent accidents, such as data loss and illegal data access, read this section carefully.

Routine Maintenance Check Items

Checking openGauss Status

openGauss provides tools to check database and instance status, ensuring that databases and instances are running properly to provide data services.

  • Check instance status.

    gs_check -U omm -i CheckClusterState
    
  • Check parameters.

    postgres=# SHOW parameter_name;
    
  • Modify parameters.

    gs_guc reload  -D /gaussdb/data/dbnode -c "paraname=value"
    

Checking Lock Information

The lock mechanism is an important method to ensure data consistency. Information check helps learn database transactions and database running status.

  • Query lock information in the database.

    postgres=# SELECT * FROM pg_locks;
    
  • Query the status of threads waiting to acquire locks.

    postgres=# SELECT * FROM pg_thread_wait_status WHERE wait_status = 'acquire lock';
    
  • Query the status of events waiting to acquire locks.

    postgres=# SELECT node_name, thread_name, tid, wait_status, query_id FROM pgxc_thread_wait_status WHERE wait_status = 'acquire lock';
    
  • Kill a system process.

    Search for a system process that is running and run the following command to end the process:

    ps ux
    kill -9 pid
    

Collecting Event Statistics

Long-time running of SQL statements will occupy a lot of system resources. You can check event occurrence time and occupied memory to learn about database running status.

  • Query the time points about an event.

    Run the following command to query the thread start time, transaction start time, SQL start time, and status change time of the event:

    postgres=# SELECT backend_start,xact_start,query_start,state_change FROM pg_stat_activity;
    
  • Query the number of sessions on the current server.

    postgres=# SELECT count(*) FROM pg_stat_activity;
    
  • Collect system-level statistics.

    Run the following command to query information about the session that uses the maximum memory:

    postgres=# SELECT * FROM pv_session_memory_detail() ORDER BY usedsize desc limit 10;
    

Checking Objects

Tables, indexes, partitions, and constraints are key storage objects of a database. A database administrator needs to routinely maintain key information and these objects.

  • View table details.

    postgres=# \d+ table_name 
    
  • Query table statistics.

    postgres=# SELECT * FROM pg_statistic;
    
  • View index details.

    postgres=# \d+ index_name
    
  • Query partitioned table information.

    postgres=# SELECT * FROM pg_partition;
    
  • Collect statistics.

    Run the ANALYZE statement to collect related statistics on the database.

    Run the VACUUM statement to reclaim space and update statistics.

  • Query constraint information.

    postgres=# SELECT * FROM pg_constraint;
    

Checking an SQL Report

Run the EXPLAIN statement to view execution plans.

Backing Up Data

Never forget to back up data. During the routine work, the backup execution and backup data validity need to be checked to ensure data security and encryption security.

  • Export a specified user.

    gs_dump dbname -p port -f out.sql -U user_name -W password
    
  • Export a schema.

    gs_dump dbname -p port -n schema_name -f out.sql
    
  • Export a table.

    gs_dump dbname -p port -t table_name -f out.sql
    

Checking Basic Information

Basic information includes versions, components, and patches. Periodic database information checks and records are important for database life cycle management.

  • Check version information.

    postgres=# SELECT version();
    
  • Check table size and database size.

    postgres=# SELECT pg_table_size('table_name');
    postgres=# SELECT pg_database_size('database_name');
    

Checking OS Parameters

Check Method

Use the gs_checkos tool provided by openGauss to check the OS status.

Prerequisites

  • The hardware and network are working properly.
  • The trust relationship of user root among the hosts is normal.
  • Only user root is authorized to run the gs_checkos command.

Procedure

  1. Log in to a server as user root.

  2. Run the following command to check OS parameters of servers where the openGauss nodes are deployed:

    gs_checkos -i A
    

    Check the OS parameters to ensure that openGauss has passed the pre-installation check and can efficiently operate after it is installed. For details about the check items, see “Server Tools > gs_checkos” in the openGauss Tool Reference.

Examples

Before running the gs_checkos command, execute pre-processing scripts by running gs_preinstall to prepare the environment. The following uses parameter A as an example:

gs_checkos -i A
Checking items:
    A1. [ OS version status ]                                   : Normal
    A2. [ Kernel version status ]                               : Normal
    A3. [ Unicode status ]                                      : Normal
    A4. [ Time zone status ]                                    : Normal
    A5. [ Swap memory status ]                                  : Normal
    A6. [ System control parameters status ]                    : Normal
    A7. [ File system configuration status ]                    : Normal
    A8. [ Disk configuration status ]                           : Normal
    A9. [ Pre-read block size status ]                          : Normal
    A10.[ IO scheduler status ]                                 : Normal
    A11.[ Network card configuration status ]                   : Normal
    A12.[ Time consistency status ]                             : Warning
    A13.[ Firewall service status ]                             : Normal
    A14.[ THP service status ]                                  : Normal
Total numbers:14. Abnormal numbers:0. Warning number:1.

The following uses parameter B as an example:

gs_checkos -i B
Setting items:
    B1. [ Set system control parameters ]                       : Normal
    B2. [ Set file system configuration value ]                 : Normal
    B3. [ Set pre-read block size value ]                       : Normal
    B4. [ Set IO scheduler value ]                              : Normal
    B5. [ Set network card configuration value ]                : Normal
    B6. [ Set THP service ]                                     : Normal
    B7. [ Set RemoveIPC value ]                                 : Normal
    B8. [ Set Session Process ]                                 : Normal
Total numbers:6. Abnormal numbers:0. Warning number:0.

Exception Handling

If you use the gs_checkos tool to check the OS and the command output shows Abnormal, run the following command to view detailed error information:

gs_checkos -i A --detail

The Abnormal state cannot be ignored because the OS in this state affects cluster installation. The Warning state does not affect cluster installation and thereby can be ignored.

  • If the check result for OS version status (A1) is Abnormal, replace OSs out of the mixed programming scope with those within the scope.

  • If the check result for kernel version status (A2) is Warning, the platform kernel versions in the cluster are inconsistent.

  • If the check result for Unicode status (A3) is Abnormal, set the same character set for all the hosts. You can add export LANG=unicode to the /etc/profile file.

    vim /etc/profile
    
  • If the check result for time zone status (A4) is Abnormal, set the same time zone for all the hosts. You can copy the time zone file in the /usr/share/zoneinfo/ directory as the /etc/localtime file.

    cp /usr/share/zoneinfo/$primary time zone/$secondary time zone /etc/localtime
    
  • If the check result for swap memory status (A5) is Abnormal, a possible cause is that the swap memory is larger than the physical memory. You can troubleshoot this issue by reducing the swap memory or increasing the physical memory.

  • If the check result for system control parameter status (A6) is Abnormal, troubleshoot this issue in either of the following two ways:

    • Run the following command:

      gs_checkos -i B1
      
    • Modify the /etc/sysctl.conf file based on the error message and run sysctl -p to make it take effect.

      vim /etc/sysctl.conf
      
  • If the check result for file system configuration status (A7) is Abnormal, run the following command to troubleshoot this issue:

    gs_checkos -i B2
    
  • If the check result for disk configuration status (A8) is Abnormal, set the disk mounting format to rw,noatime,inode64,allocsize=16m.

    Run the man mount command to mount the XFS parameter:

    rw,noatime,inode64,allocsize=16m
    

    You can also set the XFS parameter in the /etc/fstab file. For example:

    /dev/data /data xfs rw,noatime,inode64,allocsize=16m 0 0
    
  • If the check result for pre-read block size status (A9) is Abnormal, run the following command to troubleshoot this issue:

    gs_checkos -i B3
    
  • If the check result for I/O scheduling status (A10) is Abnormal, run the following command to troubleshoot this issue:

    gs_checkos -i B4
    
  • If the check result for NIC configuration status (A11) is Warning, run the following command to troubleshoot this issue:

    gs_checkos -i B5
    
  • If the check result for time consistency status (A12) is Abnormal, verify that the NTP service has been installed and started and has synchronized time from the NTP clock.

  • If the check result for firewall status (A13) is Abnormal, disable the firewall. Run the following commands:

    • SUSE:

      SuSEfirewall2 stop
      
    • RedHat7:

      systemctl disable firewalld
      
    • RedHat6:

      service iptables stop
      
  • If the check result for THP service status (A14) is Abnormal, run the following command to troubleshoot this issue:

    gs_checkos -i B6
    

Checking openGauss Health Status

Check Method

Use the gs_check tool provided by openGauss to check the openGauss health status.

Precautions

  • Only user root is authorized to check new nodes added during cluster scale-out. In other cases, the check can be performed only by user omm.
  • Parameter -i or -e must be set. -i specifies a single item to be checked, and -e specifies an inspection scenario where multiple items will be checked.
  • If -i is not set to a root item or no such items are contained in the check item list of the scenario specified by -e, you do not need to enter the name or password of user root.
  • You can run --skip-root-items to skip root items.
  • Check the consistency between the new node and existing nodes. Run the gs_check command on an existing node and specify the --hosts parameter. The IP address of the new node needs to be written into the hosts file.

Procedure

Method 1:

  1. Log in as the OS user omm to the primary node of the database.

  2. Run the following command to check the openGauss database status:

    gs_check -i CheckClusterState
    

    In the command, -i indicates the check item and is case-sensitive. The format is -i CheckClusterState, -i CheckCPU or -i CheckClusterState,CheckCPU.

    Checkable items are listed in “Server Tools > gs_check > openGauss status checks” in the openGauss Tool Reference. You can create a check item as needed.

Method 2:

  1. Log in as the OS user omm to the primary node of the database.

  2. Run the following command to check the openGauss database health status:

    gs_check -e inspect
    

    In the command, -e indicates the inspection scenario and is case-sensitive. The format is -e inspect or -e upgrade.

    The inspection scenarios include inspect (routine inspection), upgrade (inspection before upgrade), Install (install inspection ), binary_upgrade (inspection before in-place upgrade), slow_node (node inspection), longtime (time-consuming inspection) and health (health inspection). You can create an inspection scenario as needed.

The openGauss inspection is performed to check openGauss status during openGauss running or to check the environment and conditions before critical operations, such as upgrade or scale-out. For details about the inspection items and scenarios, see “Server Tools > gs_check > openGauss status checks” in the openGauss Tool Reference.

Examples

Check result of a single item:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU
Parsing the check items config file successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:1 Nodes:3

Checking...               [=========================] 1/1
Start to analysis the check result
CheckCPU....................................OK
The item run on 3 nodes.  success: 3

Analysis the check result successfully
Success. All check items run completed. Total:1  Success:1  Failed:0
For more information please refer to /opt/huawei/wisequery/script/gspylib/inspection/output/CheckReport_201902193704661604.tar.gz

Local execution result:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU -L

2017-12-29 17:09:29 [NAM] CheckCPU
2017-12-29 17:09:29 [STD] Check the CPU usage of the host. If the value of idle is greater than 30% and the value of iowait is less than 30%, this item passes the check. Otherwise, this item fails the check.
2017-12-29 17:09:29 [RST] OK

2017-12-29 17:09:29 [RAW]
Linux 4.4.21-69-default (lfgp000700749)  12/29/17  _x86_64_

17:09:24        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:09:25        all      0.25      0.00      0.25      0.00      0.00     99.50
17:09:26        all      0.25      0.00      0.13      0.00      0.00     99.62
17:09:27        all      0.25      0.00      0.25      0.13      0.00     99.37
17:09:28        all      0.38      0.00      0.25      0.00      0.13     99.25
17:09:29        all      1.00      0.00      0.88      0.00      0.00     98.12
Average:        all      0.43      0.00      0.35      0.03      0.03     99.17

Check result of a scenario:

[perfadm@SIA1000131072 Check]$ gs_check -e inspect
Parsing the check items config file successfully
The below items require root privileges to execute:[CheckBlockdev CheckIOrequestqueue CheckIOConfigure CheckCheckMultiQueue CheckFirewall CheckSshdService CheckSshdConfig CheckCrondService CheckNoCheckSum CheckSctpSeProcMemory CheckBootItems CheckFilehandle CheckNICModel CheckDropCache]
Please enter root privileges user[root]:root
Please enter password for user[root]:
Please enter password for user[root] on the node[10.244.57.240]:
Check root password connection successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:59 Nodes:2

Checking...               [                         ] 21/59
Checking...               [=========================] 59/59
Start to analysis the check result
CheckClusterState...........................OK
The item run on 2 nodes.  success: 2

CheckDBParams...............................OK
The item run on 1 nodes.  success: 1

CheckDebugSwitch............................OK
The item run on 2 nodes.  success: 2

CheckDirPermissions.........................OK
The item run on 2 nodes.  success: 2

CheckReadonlyMode...........................OK
The item run on 1 nodes.  success: 1

CheckEnvProfile.............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
GAUSSHOME        /usr1/gaussdb/app
LD_LIBRARY_PATH  /usr1/gaussdb/app/lib
PATH             /usr1/gaussdb/app/bin


CheckBlockdev...............................OK
The item run on 2 nodes.  success: 2

CheckCurConnCount...........................OK
The item run on 1 nodes.  success: 1

CheckCursorNum..............................OK
The item run on 1 nodes.  success: 1

CheckPgxcgroup..............................OK
The item run on 1 nodes.  success: 1

CheckDiskFormat.............................OK
The item run on 2 nodes.  success: 2

CheckSpaceUsage.............................OK
The item run on 2 nodes.  success: 2

CheckInodeUsage.............................OK
The item run on 2 nodes.  success: 2

CheckSwapMemory.............................OK
The item run on 2 nodes.  success: 2

CheckLogicalBlock...........................OK
The item run on 2 nodes.  success: 2

CheckIOrequestqueue.....................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
On device (vdb) 'IO Request' RealValue '256' ExpectedValue '32768'
On device (vda) 'IO Request' RealValue '256' ExpectedValue '32768'

CheckMaxAsyIOrequests.......................OK
The item run on 2 nodes.  success: 2

CheckIOConfigure............................OK
The item run on 2 nodes.  success: 2

CheckMTU....................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
1500

CheckPing...................................OK
The item run on 2 nodes.  success: 2

CheckRXTX...................................NG
The item run on 2 nodes.  ng: 2
The ng[host240,host157] value:
NetWork[eth0]
RX: 256
TX: 256


CheckNetWorkDrop............................OK
The item run on 2 nodes.  success: 2

CheckMultiQueue.............................OK
The item run on 2 nodes.  success: 2

CheckEncoding...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
LANG=en_US.UTF-8

CheckFirewall...............................OK
The item run on 2 nodes.  success: 2

CheckKernelVer..............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
3.10.0-957.el7.x86_64

CheckMaxHandle..............................OK
The item run on 2 nodes.  success: 2

CheckNTPD...................................OK
host240: NTPD service is running, 2020-06-02 17:00:28
host157: NTPD service is running, 2020-06-02 17:00:06


CheckOSVer..................................OK
host240: The current OS is centos 7.6 64bit.
host157: The current OS is centos 7.6 64bit.


CheckSysParams..........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
Warning reason: variable 'net.ipv4.tcp_retries1' RealValue '3' ExpectedValue '5'.
Warning reason: variable 'net.ipv4.tcp_syn_retries' RealValue '6' ExpectedValue '5'.
Warning reason: variable 'net.sctp.path_max_retrans' RealValue '5' ExpectedValue '10'.
Warning reason: variable 'net.sctp.max_init_retransmits' RealValue '8' ExpectedValue '10'.


CheckTHP....................................OK
The item run on 2 nodes.  success: 2

CheckTimeZone...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
+0800

CheckCPU....................................OK
The item run on 2 nodes.  success: 2

CheckSshdService............................OK
The item run on 2 nodes.  success: 2

CheckSshdConfig.........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:

Warning reason: UseDNS parameter is not set; expected: no

CheckCrondService...........................OK
The item run on 2 nodes.  success: 2

CheckStack..................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
8192

CheckNoCheckSum.............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
Nochecksum value is N,Check items pass.

CheckSysPortRange...........................OK
The item run on 2 nodes.  success: 2

CheckMemInfo................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
totalMem: 31.260929107666016G

CheckHyperThread............................OK
The item run on 2 nodes.  success: 2

CheckTableSpace.............................OK
The item run on 1 nodes.  success: 1

CheckSctpService............................OK
The item run on 2 nodes.  success: 2

CheckSysadminUser...........................OK
The item run on 1 nodes.  success: 1

CheckGUCConsistent..........................OK
All DN instance guc value is consistent.

CheckMaxProcMemory..........................OK
The item run on 1 nodes.  success: 1

CheckBootItems..............................OK
The item run on 2 nodes.  success: 2

CheckHashIndex..............................OK
The item run on 1 nodes.  success: 1

CheckPgxcRedistb............................OK
The item run on 1 nodes.  success: 1

CheckNodeGroupName..........................OK
The item run on 1 nodes.  success: 1

CheckTDDate.................................OK
The item run on 1 nodes.  success: 1

CheckDilateSysTab...........................OK
The item run on 1 nodes.  success: 1

CheckKeyProAdj..............................OK
The item run on 2 nodes.  success: 2

CheckProStartTime.......................WARNING
host157:
STARTED COMMAND
Tue Jun  2 16:57:18 2020 /usr1/dmuser/dmserver/metricdb1/server/bin/gaussdb --single_node -D /usr1/dmuser/dmb1/data -p 22204
Mon Jun  1 16:15:15 2020 /usr1/gaussdb/app/bin/gaussdb -D /usr1/gaussdb/data/dn1 -M standby


CheckFilehandle.............................OK
The item run on 2 nodes.  success: 2

CheckRouting................................OK
The item run on 2 nodes.  success: 2

CheckNICModel...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
version: 1.0.0
model: Red Hat, Inc. Virtio network device


CheckDropCache..........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
No DropCache process is running

CheckMpprcFile..............................NG
The item run on 2 nodes.  ng: 2
The ng[host240,host157] value:
There is no mpprc file

Analysis the check result successfully
Failed. All check items run completed. Total:59   Success:52   Warning:5   NG:2
For more information please refer to /usr1/gaussdb/tool/script/gspylib/inspection/output/CheckReport_inspect611.tar.gz

Exception Handling

Troubleshoot exceptions detected in the inspection by following instructions in this section.

Table 1 Check of openGauss running status

Check Item

Abnormal Status

Solution

CheckClusterState (Checks the openGauss status.)

openGauss or openGauss instances are not started.

Run the following command to start openGauss and instances:

gs_om -t start

The status of openGauss or openGauss instances is abnormal.

Check the status of hosts and instances. Troubleshoot this issue based on the status information.

gs_check -i CheckClusterState

CheckDBParams (Checks database parameters.)

Database parameters have incorrect values.

Use the gs_guc tool to set the parameters to specified values.

CheckDebugSwitch (Checks debug logs.)

The log level is incorrect.

Use the gs_guc tool to set log_min_messages to specified content.

CheckDirPermissions (Checks directory permissions.)

The permission for a directory is incorrect.

Change the directory permission to a specified value (750 or 700).

chmod 750 DIR

CheckReadonlyMode (Checks the read-only mode.)

The read-only mode is enabled.

Verify that the usage of the disk where database nodes are located does not exceed the threshold (60% by default) and no other O&M operations are performed.

gs_check -i CheckDataDiskUsage
ps ux

Use the gs_guc tool to disable the read-only mode of openGauss.

gs_guc reload -N all -I all -c 'default_transaction_read_only = off' 

CheckEnvProfile (Checks environment variables.)

Environment variables are inconsistent.

Update the environment variable information.

CheckBlockdev (Checks pre-read blocks.)

The size of a pre-read block is not 16384 KB.

Use the gs_checkos tool to set the size of the pre-read block to 16384 KB and write the setting into the auto-startup file.

gs_checkos -i B3

CheckCursorNum (Checks the number of cursors.)

The number of cursors fails to be checked.

Check whether the database is properly connected and whether the openGauss status is normal.

CheckPgxcgroup (Checks the data redistribution status.)

There are pgxc_group tables that have not been redistributed.

Proceed with the redistribution.

gs_expand, gs_dropnode

CheckDiskFormat (Checks disk configurations.)

Disk configurations are inconsistent between nodes.

Configure disk specifications to be consistent between nodes.

CheckSpaceUsage (Checks the disk space usage.)

Disk space is insufficient.

Clear or expand the disk for the directory.

CheckInodeUsage (Checks the disk index usage.)

Disk indexes are insufficient.

Clear or expand the disk for the directory.

CheckSwapMemory (Checks the swap memory.)

The swap memory is greater than the physical memory.

Reduce or disable the swap memory.

CheckLogicalBlock (Checks logical blocks.)

The size of a logical block is not 512 KB.

Use the gs_checkos tool to set the size of the logical block to 512 KB and write the setting into the auto-startup file.

gs_checkos -i B4

CheckIOrequestqueue (Checks I/O requests.)

The requested I/O is not 32768.

Use the gs_checkos tool to set the requested I/O to 32768 and write the setting into the auto-startup file.

gs_checkos -i B4

CheckCurConnCount (Checks the number of current connections.)

The number of current connections exceeds 90% of the allowed maximum number of connections.

Break idle primary database node connections.

CheckMaxAsyIOrequests (Checks the maximum number of asynchronous requests.)

The maximum number of asynchronous requests is less than 104857600 or (Number of database instances on the current node x 1048576).

Use the gs_checkos tool to set the maximum number of asynchronous requests to the larger one between 104857600 and (Number of database instances on the current node x 1048576).

gs_checkos -i B4

CheckMTU (Checks MTU values.)

MTU values are inconsistent between nodes.

Set the MTU value on each node to 1500 or 8192.

ifconfig eth* MTU 1500

CheckIOConfigure (Checks I/O configurations.)

The I/O mode is not deadline.

Use the gs_checkos tool to set the I/O mode to deadline and write the setting into the auto-startup file.

gs_checkos -i B4

CheckRXTX (Checks the RX/TX value.)

The NIC RX/TX value is not 4096.

Use the checkos tool to set the NIC RX/TX value to 4096 for openGauss.

gs_checkos -i B5

CheckPing (Checks whether the network connection is normal.)

There are openGauss IP addresses that cannot be pinged.

Check the network settings, network status, and firewall status between the abnormal IP addresses.

CheckNetWorkDrop (Checks the network packet loss rate.)

The network packet loss rate is greater than 1%.

Check the network load and status between the corresponding IP addresses.

CheckMultiQueue (Checks the NIC multi-queue function.)

Multiqueue is not enabled for the NIC, and NIC interruptions are not bound to different CPU cores.

Enable multiqueue for the NIC, and bind NIC interruptions to different CPU cores.

CheckEncoding (Checks the encoding format.)

Encoding formats are inconsistent between nodes.

Write the same encoding format into /etc/profile for each node.

echo "export LANG=XXX" >> /etc/profile

CheckActQryCount (Checks the archiving mode.)

The archiving mode is enabled, and the archiving directory is not under the primary database node directory.

Disable archiving mode or set the archiving directory to be under the primary database node directory.

CheckFirewall (Checks the firewall.)

The firewall is enabled.

Disable the firewall.

redHat(CentOS)7.x: systemctl disable firewalld.service
redHat(CentOS)6.x: service iptables down
SuSE: SuSEfirewall2 down

CheckKernelVer (Checks kernel versions.)

Kernel versions are inconsistent between nodes.

CheckMaxHandle (Checks the maximum number of file handles.)

The maximum number of handles is less than 1000000.

Set the soft and hard limits in the 91-nofile.conf or 90-nofile.conf file to 1000000.

gs_checkos -i B2

CheckNTPD (Checks the time synchronization service.)

The NTPD service is disabled or the time difference is greater than 1 minute.

Enable the NTPD service and set the time to be consistent.

CheckSysParams (Checks OS parameters.)

OS parameter settings do not meet requirements.

Use the gs_checkos tool or manually set parameters to values meeting requirements.

gs_checkos -i B1
vim /etc/sysctl.conf

CheckTHP (Checks the THP service.)

The THP service is disabled.

Use the gs_checkos to enable the THP service.

gs_checkos -i B6

CheckTimeZone (Checks time zones.)

Time zones are inconsistent between nodes.

Set time zones to be consistent between nodes.

cp /usr/share/zoneinfo/$primary time zone/$secondary time zone /etc/localtime

CheckCPU (Checks the CPU.)

CPU usage is high or I/O waiting time is too long.

Upgrade CPUs or improve disk performance.

CheckSshdService (Checks the SSHD service.)

The SSHD service is disabled.

Enable the SSHD service and write the setting into the auto-startup file.

service sshd start
echo "server sshd start" >> initFile

CheckSshdConfig (Checks SSHD configurations.)

The SSHD service is incorrectly configured.

Reconfigure the SSHD service.

PasswordAuthentication=no;
MaxStartups=1000;
UseDNS=yes;
ClientAliveInterval=10800/ClientAliveInterval=0

Restart the service.

server sshd start

CheckCrondService (Checks the Crond service.)

The Crond service is disabled.

Install and enable the Crond service.

CheckStack (Checks the stack size.)

The stack size is less than 3072.

Use the gs_checkos tool to set the stack size to 3072 and restart the processes with a smaller stack size.

gs_checkos -i B2

CheckNoCheckSum (Checks the NoCheckSum parameter.)

NoCheckSum is incorrectly set or its value is inconsistent on each node.

Set NoCheckSum to a consistent value on each node. If redHat6.4 or redHat6.5 with the NIC binding mode bond0 exists, set NoCheckSum to Y. In other cases, set it to N.

echo Y > /sys/module/sctp/parameters/no_checksums

CheckSysPortRange (Checks OS port configurations.)

OS IP ports are not within the required port range or openGauss ports are within the OS IP port range.

Set the OS IP ports within 26000 to 65535 and set the openGauss ports beyond the OS IP port range.

vim /etc/sysctl.conf

CheckMemInfo (Checks the memory information.)

Memory sizes are inconsistent between nodes.

Use physical memory of the same specifications between nodes.

CheckHyperThread (Checks the hyper-threading.)

The CPU hyper-threading is disabled.

Enable the CPU hyper-threading.

CheckTableSpace (Checks tablespaces.)

The tablespace path is nested with the openGauss path or nested with the path of another tablespace.

Migrate tablespace data to the tablespace with a valid path.

CheckSctpService (Checks the SCTP service.)

The SCTP service is disabled.

Install and enable the SCTP service.

modprobe sctp

Checking Database Performance

Check Method

Use the gs_checkperf tool provided by openGauss to check hardware performance.

Prerequisites

  • openGauss is running properly.
  • Services are running properly on the database.

Procedure

  1. Log in as the OS user omm to the primary node of the database.

  2. Run the following command to check the openGauss database performance:

    gs_checkperf
    

For details about performance statistical items, see “Server Tools > gs_checkperf > Performance Check Items” in the openGauss Tool Reference.

Examples

Simple performance statistical result is displayed on the screen as follows:

gs_checkperf -i pmk -U omm
Cluster statistics information:
    Host CPU busy time ratio                     :    1.43       %
    MPPDB CPU time % in busy time                :    1.88       %
    Shared Buffer Hit ratio                      :    99.96      %
    In-memory sort ratio                         :    100.00     %
    Physical Reads                               :    4
    Physical Writes                              :    25
    DB size                                      :    70         MB
    Total Physical writes                        :    25
    Active SQL count                             :    2
    Session count                                :    3

Exception Handling

After you use the gs_checkperf tool to check the cluster performance, if the performance is abnormal, troubleshoot the issue by following instructions in this section.

Table 1 Cluster-level performance status

Abnormal Status

Solution

High CPU usage of hosts

1. Add high-performance CPUs, or replace current CPUs with them.

2. Run the top command to check which system processes cause high CPU usage, and run the kill command to stop unused processes.

top

High CPU usage of GaussDB Kernel

1. Add high-performance CPUs, or replace current CPUs with them.

2. Run the top command to check which database processes cause high CPU usage, and run the kill command to stop unused processes.

top

3. Use the gs_expand tool to add new hosts to lower the CPU usage.

Low hit ratio of the shared memory

1. Expand the memory.

2. Run the following command to check the OS configuration file /etc/sysctl.conf and increase the value of kernel.shmmax.

vim /etc/sysctl.conf

Low in-memory sort ratio

Expand the memory.

High I/O and disk usage

1. Replace current disks with high-performance ones.

2. Adjust the data layout to evenly distribute I/O requests to all the physical disks.

3. Run VACUUM FULL for the entire database.

vacuum full;

4. Clean up the disk space.

5. Reduce the number of concurrent connections.

Transaction statistics

Query the pg_stat_activity system catalog and disconnect unnecessary connections. (Log in to the database and run the postgres=# \d+ pg_stat_activity; command.)

Table 2 Node-level performance status

Abnormal Status

Solution

High CPU usage

1. Add high-performance CPUs, or replace current CPUs with them.

2. Run the top command to check which system processes cause high CPU usage, and run the kill command to stop unused processes.

top

High memory usage

Expand or clean up the memory.

High I/O usage

1. Replace current disks with high-performance ones.

2. Clean up the disk space.

3. Use memory read/write to replace as much disk I/O as possible, putting frequently accessed files or data in the memory.

Table 3 Session/process-level performance status

Abnormal Status

Solution

High CPU, memory, and I/O usage

Check which processes cause high CPU, memory, or I/O usage. If they are unnecessary processes, kill them; otherwise, analyze the specific cause of high usage. For example, if SQL statement execution occupies much memory, check whether the SQL statements need optimization.

Table 4 SSD performance status

Abnormal Status

Solution

SSD read/write fault

Run the following command to check whether SSD is faulty. If yes, analyze the specific cause.

gs_checkperf -i SSD -U omm

Checking and Deleting Logs

You are advised to check OS logs and database run logs monthly for monitoring system status and troubleshooting, and to delete database run logs monthly for saving disk space.

Checking OS Logs

You are advised to monthly check OS logs to detect and prevent potential OS problems.

Procedure

Run the following command to check OS log files:

vim /var/log/messages

(Pay attention to words like kernel, error, and fatal in logs generated within the last month and handle the problems based on the alarm information.)

Checking openGauss Run Logs

A database can still run when errors occur during the execution of some operations. However, data may be inconsistent before and after the error occurrences. Therefore, you are advised to monthly check openGauss run logs to detect potential problems in time.

Prerequisites

  • The host used for collecting logs is running properly, and the network connection is normal. Database installation users trust each other.
  • An OS tool (for example, gstack) that the log collection tool requires has been installed. If it is not installed, an error message is displayed, and this collection item is skipped.

Procedure

  1. Log in as the OS user omm to the primary node of the database.

  2. Run the following command to collect database logs:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59"
    

    In the command, 20160616 01:01 indicates the start time of the log and 20160616 23:59 indicates the end time of the log.

  3. Based on command output in 2, access the related log collection directory, decompress collected database logs, and check these logs.

    Assume that collected logs are stored in /opt/gaussdb/tmp/gaussdba_mppdb/collector_20160726_105158.tar.gz.

    tar -xvzf /opt/gaussdb/tmp/gaussdba_mppdb/collector_20160726_105158.tar.gz 
    cd /opt/gaussdb/tmp/gaussdba_mppdb/collector_20160726_105158
    

Examples

  • Run the gs_collector command together with parameters --begin-time and --end-time:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59"
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected  files 
    All results are stored in /tmp/gaussdba_mppdb/collector_20160616_175615.tar.gz.
    
  • Run the gs_collector command together with parameters --begin-time, --end-time, and -h:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59" -h plat2
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected  files
    All results are stored in /tmp/gaussdba_mppdb/collector_20160616_190225.tar.gz.
    
  • Run the gs_collector command together with parameters --begin-time, --end-time, and -f:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59" -f /opt/software/gaussdb/output
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected  files
    All results are stored in /opt/software/gaussdb/output/collector_20160616_190511.tar.gz.
    
  • Run the gs_collector command together with parameters --begin-time, --end-time, and --keyword:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59" --keyword="os"
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected files.
    All results are stored in /tmp/gaussdba_mppdb/collector_20160616_190836.tar.gz.
    
  • Run the gs_collector command together with parameters --begin-time, --end-time, and -o:

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59" -o /opt/software/gaussdb/output
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected files.
    All results are stored in /opt/software/gaussdb/output/collector_20160726_113711.tar.gz.
    
  • Run the gs_collector command together with parameters --begin-time, --end-time, and -l (the file name extension must be .log):

    gs_collector --begin-time="20160616 01:01" --end-time="20160616 23:59" -l /opt/software/gaussdb/logfile.log
    

    If information similar to the following is displayed, the logs have been archived:

    Successfully collected files.
    All results are stored in /opt/software/gaussdb/output/collector_20160726_113711.tar.gz.
    

Cleaning Run Logs

A large number of run logs will be generated during database running and occupy huge disk space. You are advised to delete expired run logs and retain logs generated within one month.

Procedure

  1. Log in as the OS user omm to any host in the GaussDB Kernel cluster.

  2. Clean logs.

    a. Back up logs generated over one month ago to other disks.

    b. Access the directory where logs are stored.

    cd $GAUSSLOG
    

    c. Access the corresponding sub-directory and run the following command to delete logs generated one month ago:

    rm log name
    

    The naming convention of a log file is postgresql-year-month-day_HHMMSS.

Checking Time Consistency

Database transaction consistency is guaranteed by a logical clock and is not affected by OS time. However, OS time inconsistency will lead to problems, such as abnormal backend O&M and monitoring functions. Therefore, you are advised to monthly check time consistency among nodes.

Procedure

  1. Log in as the OS user omm to any host in the GaussDB Kernel cluster.

  2. Create a configuration file for recording each cluster node. (You can specify the mpphosts file directory randomly. It is recommended that the file be stored in the /tmp directory.)

    vim /tmp/mpphosts
    

    Add the host name of each node.

    plat1
    plat2
    plat3
    
  3. Save the configuration file.

    :wq!
    
  4. Run the following command and write the time on each node into the /tmp/sys_ctl-os1.log file:

    for ihost in `cat /tmp/mpphosts`; do ssh -n -q $ihost "hostname;date"; done > /tmp/sys_ctl-os1.log
    
  5. Check time consistency between the nodes based on the command output. The time difference should not exceed 30s.

    cat /tmp/sys_ctl-os1.log
    plat1
    Thu Feb  9 16:46:38 CST 2017
    plat2
    Thu Feb  9 16:46:49 CST 2017
    plat3
    Thu Feb  9 16:46:14 CST 2017
    

Checking the Number of Application Connections

If the number of connections between applications and the database exceeds the maximum value, new connections cannot be established. You are advised to daily check the number of connections, release idle connections in time, or increase the allowed maximum number of connections.

Procedure

  1. Log in as the OS user omm to the primary node of the database.

  2. Run the following command to connect to the database:

    gsql -d postgres -p 8000
    

    postgres is the name of the database to be connected, and 8000 is the port number of the database primary node.

    If information similar to the following is displayed, the connection succeeds:

    gsql ((openGauss 1.0 build 290d125f) compiled at 2020-05-08 02:59:43 commit 2143 last mr 131
    Non-SSL connection (SSL connection is recommended when requiring high-security)
    Type "help" for help.
    
    postgres=# 
    
  3. Run the following SQL statement to check the number of connections:

    postgres=# SELECT count(*) FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
    

    Information similar to the following is displayed. 2 indicates that two applications are connected to the database.

    count
    -------
         2
    (1 row)
    
  4. View the allowed maximum connections.

    postgres=# SHOW max_connections;
    

    Information similar to the following is displayed. 200 indicates the currently allowed maximum number of connections.

     max_connections 
    -----------------
     200
    (1 row)
    

Exception Handling

If the number of connections in the command output is close to the value of max_connections of the database, delete existing connections or change the upper limit based on site requirements.

  1. Run the following SQL statement to view information about connections whose state is set to idle, and state_change column is not updated for a long time.

    postgres=# SELECT * FROM pg_stat_activity where state='idle' order by state_change;
    

    Information similar to the following is displayed:

     datid | datname  |       pid       | usesysid | usename  | application_name |  client_addr  
     | client_hostname | client_port |         backend_start         | xact_start |          quer
    y_start          |         state_change          | waiting | enqueue | state | resource_pool 
    |                    query                     
    -------+----------+-----------------+----------+----------+------------------+---------------
    -+-----------------+-------------+-------------------------------+------------+--------------
    -----------------+-------------------------------+---------+---------+-------+---------------
    +----------------------------------------------
     13626 | postgres | 140390162233104 |       10 | gaussdba |                  |               
     |                 |          -1 | 2016-07-15 14:08:59.474118+08 |            | 2016-07-15 14
    :09:04.496769+08 | 2016-07-15 14:09:04.496975+08 | f       |         | idle  | default_pool  
    | select count(group_name) from pgxc_group;
     13626 | postgres | 140390132872976 |       10 | gaussdba | cn_5002          | 10.180.123.163
     |                 |       48614 | 2016-07-15 14:11:16.014871+08 |            | 2016-07-15 14
    :21:17.346045+08 | 2016-07-15 14:21:17.346095+08 | f       |         | idle  | default_pool  
    | SET SESSION AUTHORIZATION DEFAULT;RESET ALL;
    (2 rows)
    
  2. Release idle connections.

    Check each connection and release them after obtaining approval from the users of the connections. Run the following SQL command to release a connection using pid obtained in the previous step:

    postgres=# SELECT pg_terminate_backend(140390132872976);
    

    Information similar to the following is displayed:

    postgres=# SELECT pg_terminate_backend(140390132872976);
     pg_terminate_backend 
    ----------------------
     t
    (1 row)
    

    If no connections can be released, go to the next step.

  3. Increase the maximum number of connections.

    gs_guc set -D /gaussdb/data/dbnode -c "max_connections= 800"
    

    800 is the new maximum value.

  4. Restart database services to make the new settings take effect.

    NOTE: The restart results in operation interruption. Properly plan the restart to avoid affecting users.

    gs_om -t stop && gs_om -t start
    

Routinely Maintaining Tables

To ensure proper database running, after insert and delete operations, you need to routinely run VACUUM FULL and ANALYZE as appropriate for customer scenarios and update statistics to obtain better performance.

You need to routinely run VACUUM, VACUUM FULL, and ANALYZE to maintain tables, because:

  • VACUUM FULL can be used to reclaim disk space occupied by updated or deleted data and combine small-size data files.
  • VACUUM can be used to maintain a visualized mapping for each table to track pages that contain arrays visible to other active transactions. A common index scan uses the mapping to obtain the corresponding arrays and check whether the arrays are visible to the current transaction. If the arrays cannot be obtained, capture a batch of arrays to check the visibility. Therefore, updating the visualized mapping of a table can accelerate unique index scans.
  • Running VACUUM can avoid original data loss caused by duplicate transaction IDs when the number of executed transactions exceeds the database threshold.
  • ANALYZE can be used to collect statistics on tables in databases. The statistics are stored in the system catalog PG_STATISTIC. Then the query optimizer uses the statistics to work out the most efficient execution plan.

Procedure

  1. Run the VACUUM or VACUUM FULL command to reclaim disk space.

    • VACUUM:

      Run VACUUM for a table.

      postgres=# VACUUM customer;
      
      VACUUM
      

      This statement can be concurrently executed with database operation commands, including SELECT, INSERT, UPDATE, and DELETE; excluding ALTER TABLE.

      Run VACUUM for the table partition.

      postgres=# VACUUM customer_par PARTITION ( P1 );
      
      VACUUM
      
    • VACUUM FULL:

      postgres=# VACUUM FULL customer;
      
      VACUUM
      

      During the command running, exclusive locks need to be added to the table and all other database operations need to be suspended.

  2. Run ANALYZE to update statistics.

    postgres=# ANALYZE customer;
    
    ANALYZE
    

    Run ANALYZE VERBOSE to update statistics and display table information.

    postgres=# ANALYZE VERBOSE customer;
    
    ANALYZE
    

    You can run VACUUM ANALYZE at the same time to optimize the query.

    postgres=# VACUUM ANALYZE customer;
    
    VACUUM
    

    NOTE: VACUUM and ANALYZE cause a substantial increase in I/O traffic, which may affect other active sessions. Therefore, you are advised to set the cost-based vacuum delay feature by specifying the vacuum_cost_delay parameter. For details, see “GUC Parameters > Resource Consumption > Cost-based Vacuum Delay” in the Developer Guide.

  3. Delete a table.

    postgres=# DROP TABLE customer;
    postgres=# DROP TABLE customer_par;
    postgres=# DROP TABLE part;
    

    If the following information is displayed, the tables have been deleted:

    DROP TABLE
    

Maintenance Suggestions

  • Routinely run VACUUM FULL for large tables. If the database performance deteriorates, run VACUUM FULL for the entire database. If the database performance is stable, you are advised to run VACUUM FULL monthly.
  • Routinely run VACUUM FULL on system catalogs, especially PG_ATTRIBUTE.
  • Enable automatic vacuum processes (AUTOVACUUM) in the system. The processes automatically run the VACUUM and ANALYZE statements to reclaim the record space marked as the deleted state and update statistics in the table.

Routinely Recreating an Index

Background

When data deletion is repeatedly performed in the database, index keys will be deleted from the index pages, resulting in index bloat. Recreating an index routinely improves query efficiency.

The database supports B-tree indexes. Recreating a B-tree index routinely helps improve query efficiency.

  • If a large amount of data is deleted, index keys on the index pages will be deleted. As a result, the number of index pages reduces and index bloat occurs. Recreating an index helps reclaim wasted space.
  • In a newly created index, pages with adjacent logical structures tend to have adjacent physical structures. Therefore, a new index achieves a higher access speed than an index that has been updated for multiple times.

Methods

Use either of the following two methods to recreate an index:

  • Run the DROP INDEX statement to delete the index and run the CREATE INDEX statement to create an index.

    When you delete an index, a temporary exclusive lock is added in the parent table to block related read/write operations. During index creation, the write operation is locked, whereas the read operation is not locked and can use only sequential scans.

  • Run REINDEX to recreate an index.

    • When you run the REINDEX TABLE statement to recreate an index, an exclusive lock is added to block related read/write operations.
    • When you run the REINDEX INTERNAL TABLE statement to recreate an index for a desc table (such as column-store cudesc table), an exclusive lock is added to block related read/write operations on the table.

Procedure

Assume the ordinary index areaS_idx exists in the area_id column of the imported table areaS. Use either of the following two methods to recreate an index:

  • Run the DROP INDEX statement to delete the index and run the CREATE INDEX statement to create an index.

    1. Delete the index.

      postgres=# DROP INDEX areaS_idx;
      DROP INDEX
      
    2. Create an index

      postgres=# CREATE INDEX areaS_idx ON areaS (area_id);
      CREATE INDEX
      
  • Run REINDEX to recreate an index.

    • Run REINDEX TABLE to recreate an index.

      postgres=# REINDEX TABLE areaS;
      REINDEX
      
    • Run REINDEX INTERNAL TABLE to recreate an index for a desc table (such as column-store cudesc table).

      postgres=# REINDEX INTERNAL TABLE areaS;
      REINDEX
      

NOTE: Before you recreate an index, you can increase the values of maintenance_work_mem and psort_work_mem to accelerate the index recreation.

Data Security Maintenance Suggestions

To ensure data security in GaussDB Kernel and prevent accidents, such as data loss and illegal data access, read this section carefully.

Preventing Data Loss

You are advised to plan routine physical backup and store backup files in a reliable medium. If a serious error occurs in the system, you can use the backup files to restore the system to the state at the backup point.

Preventing Illegal Data Access

  • You are advised to manage database users based on their permission hierarchies. A database administrator creates users and grants permissions to the users based on service requirements to ensure users properly access the database.
  • You are advised to deploy GaussDB Kernel servers and clients (or applications developed based on the client library) in trusted internal networks. If the servers and clients must be deployed in an untrusted network, enable SSL encryption before services are started to ensure data transmission security. Note that enabling the SSL encryption function compromises database performance.

Preventing System Logs from Leaking Personal Data

  • Delete personal data before sending debug logs to others for analysis.

    NOTE: The log level log_min_messages is set to DEBUGx (x indicates the debug level and the value ranges from 1 to 5). The information recorded in debug logs may contain personal data.

  • Delete personal data before sending system logs to others for analysis. If the execution of a SQL statement fails, the error SQL statement will be recorded in a system log by default. SQL statements may contain personal data.

  • Set log_min_error_statement to PANIC to prevent error SQL statements from being recorded in system logs. However, once the function is disabled, it is difficult to locate fault causes if faults occur.

Feedback
编组 3备份
    openGauss 2024-05-05 00:44:49
    cancel