因redo日志过多强制停机导致集群重启超时问题

一、问题现象

openGauss集群启动超时退出，报错信息如下。

cm_ctl: start cluster.
cm_ctl: start nodeid: 1
cm_ctl: start nodeid: 2
cm_ctl: start nodeid: 3
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
cm_ctl: start cluster failed in (601)s!

HINT: Maybe the cluster is continually being started in the background.
You can wait for a while and check whether the cluster starts, or increase the value of parameter "-t", e.g -t 600.

二、定位方法

查看集群状态，所有节点都处于Standby Starting状态。资源池化下可查看PGDATA目录下的DBstart.log和dms_control.log日志判断dn是否已经启动，查看dms_control.log日志，有以下信息。

Starting dn...
start dn in /usr1/dbuser/install/dss_home success.

查看$GAUSSLOG/pg_log/dn_xxx日志，出现以下信息。

2024-10-11 17:46:12.560 6708f1c0.6311 [unknown] 281458809548704 dn_6001 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  [REDO_LOG_TRACE]ProcessPendingRecords:replayedLsn:30602196424, blockcnt:268435455, WorkerCount:4, readEndLSN:30603794696
2024-10-11 17:46:54.131 6708f1c0.6311 [unknown] 281458809548704 dn_6001 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  [REDO_LOG_TRACE]ProcessPendingRecords:replayedLsn:30602196424, blockcnt:402653183, WorkerCount:4, readEndLSN:30603794696
2024-10-11 17:47:08.195 [unknown] [unknown] localhost 281459307032480 0[0:0#0]  0 [DMS] WARNING:  [SS reform] wait recovery to apply done for 600000000 us.
2024-10-11 17:47:34.682 6708f1c0.6311 [unknown] 281458809548704 dn_6001 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  [REDO_LOG_TRACE]ProcessPendingRecords:replayedLsn:30602196424, blockcnt:536870911, WorkerCount:4, readEndLSN:30603794696
2024-10-11 17:47:39.673 6708f1c0.6311 [unknown] 281458809548704 dn_6001 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  RedoSpeedDiag: 151146675 us, redoBytes:136,preEndPtr:30602196424, readPtr:30602196424, endPtr:30602196560, speed:0 KB/s, totalTime:151146675
2024-10-11 17:47:45.312 6708f1c0.6319 [unknown] 281458464436128 dn_6001 0 dn_6001_6002_6003 00000  0 [DBL_WRT] LOG:  [single flush] [first version] Reset DW file: file_head[dwn 147, start 0], writer pos is 0

说明此时当前节点正在进行redo回放，只需等待集群恢复正常。

三、问题根因

redo日志过多，当前节点正在进行日志回放。

四、解决方法

等待一段时间后，查看$GAUSSLOG/pg_log/dn_xxx日志是否已经恢复完成，然后使用cm_ctl query -Cvidpw命令确认集群状态恢复正常。