因redo日志过多导致的failover恢复时间过长问题

一、问题现象

failover恢复后,集群较长时间处于以下状态。

[  Datanode State   ]

node            node_ip         instance                              state
-----------------------------------------------------------------------------------------
1  openGaussxxx xxx.xxx.xxx.xxx    6001 25400  /.../install/data/dn P Standby Starting
2  openGaussxxx xxx.xxx.xxx.xxx    6002 25400  /.../install/data/dn S Down    Unknown
3  openGaussxxx xxx.xxx.xxx.xxx    6003 25400  /.../install/data/dn S Standby Promoting

定位分析

查看$GAUSSLOG/pg_log/dn_xxx日志,升主节点出现以下信息。

2024-10-12 09:17:16.123 [unknown] [unknown] localhost 281450988154784 0[0:0#0]  0 [BACKEND] LOG:  CheckpointInProgress: ckpt_done=1, ckpt_started=1
2024-10-12 09:17:16.123 [unknown] [unknown] localhost 281450988154784 0[0:0#0]  0 [DMS] LOG:  [SS reform][SS failover] backends exit successfully
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451140722592 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:9, buf_if start from:74304 to:82559, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451240927136 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:0, buf_if start from:0 to:8255, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451122962336 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:10, buf_if start from:82560 to:90815, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451019153312 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:15, buf_if start from:123840 to:132095, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451209469856 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:4, buf_if start from:33024 to:41279, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451093077920 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:12, buf_if start from:99072 to:107327, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451186401184 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:6, buf_if start from:49536 to:57791, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451064766368 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:13, buf_if start from:107328 to:115583, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451041697696 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:14, buf_if start from:115584 to:123839, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451238764448 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:1, buf_if start from:8256 to:16511, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451219431328 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:3, buf_if start from:24768 to:33023, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451232538528 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:2, buf_if start from:16512 to:24767, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451108806560 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:11, buf_if start from:90816 to:99071, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451196886944 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:5, buf_if start from:41280 to:49535, max_buf_id:132095.
2024-10-12 09:17:16.127 [unknown] [unknown] localhost 281451164381088 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:7, buf_if start from:57792 to:66047, max_buf_id:132095.
2024-10-12 09:17:16.128 [unknown] [unknown] localhost 281451149701024 0[0:0#0]  0 [DMS] LOG:  [SS reform] rebuild page: success.rebuild buf thread_index:8, buf_if start from:66048 to:74303, max_buf_id:132095.
2024-10-12 09:17:16.137 [unknown] [unknown] localhost 281450988154784 0[0:0#0]  0 [DMS] LOG:  [SS reform][SS failover] SSFailoverPromoteNotify:send signal to PM to initialize startup thread when DB alive
2024-10-12 09:17:16.138 [unknown] [unknown] localhost 281450988154784 0[0:0#0]  0 [DMS] LOG:  [SS reform][SS failover] wait startup thread start.
...

等待一段时间(具体时间由数据库回放速度决定)后,备节点dn日志出现以下信息,且数据库其他节点状态恢复为Normal。

[SS reform][SS failover] backends exit successfully

三、问题根因

出现这一问题现象是由于redo日志过多,进而造成集群恢复时间过长。

四、解决方法

等待一段时间后,查看$GAUSSLOG/pg_log/dn_xxx日志,日志回放完成,此时集群状态恢复正常。

意见反馈
编组 3备份
    openGauss 2024-12-23 00:51:56
    取消