因数据库启动首轮Reform失败导致节点退出的问题

一、问题现象

在资源池化场景中，当一个稳定集群出现异常，例如有节点重启、节点被踢出集群、节点加入集群等，DMS需要处理这一情况，将集群从不稳定状态转变为稳定状态，这一过程在资源池化架构下称之为reform。其中，某一数据库节点启动时触发的reform称为该节点的first reform（或启动轮reform）。在reform期间，如果出现新的节点状态变化，例如有节点在reform期间起停，则会导致本轮reform失败，DMS会基于当前的集群状态重新组织新的reform，用于将集群变为新的稳定状态。在目前DMS的设计中，数据库节点的启动轮reform对于该节点是不允许失败的，如果某个节点的启动轮reform失败，DMS会促使该节点退出，然后由CM重新拉起该节点，并由DMS触发新的一轮reform，将集群变为新的稳定状态。

二、定位方法

以三节点集群为例，记为0节点、1节点、2节点，其中0节点为主节点，执行kill -9 xxx强行停止0节点，在0节点状态为Starting时，执行kill -9 xxx强行停止1节点或2节点。会发现0节点再次重启。

查看相应时间段的pg_log日志（目录：$GAUSSLOG/pg_log/dn_600X），可以发现如下告警信息 WARNING: [SS reform][db sync wait] Node:0 first-round reform failed, shutdown now，数据库节点发现first-round reform失败后，就会向postmaster线程发送SIGTERM信号，然后走数据库退出流程。

2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  [SS reform][db sync wait] Node:0 first-round reform failed, shutdown now
2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  received fast shutdown request
2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  send to startup shutdown request

2024-09-19 17:14:16.557 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [DBL_WRT] LOG:  Double write exit
2024-09-19 17:14:16.557 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [DBL_WRT] LOG:  Double write exit
2024-09-19 17:14:16.558 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  database system is shut down
2024-09-19 17:14:16.568  [postmaster][reaper][281466178961424] LOG: checkpoint thread exit and nowait for sharestorage
2024-09-19 17:14:16.568 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All obsArch and obsBarrier exit.
2024-09-19 17:14:16.581 66ebeae8.6534 [unknown] 281449383309184 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  auditor thread exit, id : 0
2024-09-19 17:14:16.581 66ebeae8.6534 [unknown] 281449383309184 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  auditor shutting down
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All Audit threads exit.
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All obsArch and obsBarrier exit.
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  page workers all exit or not open parallel redo
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  all build walsenders exited
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [SS] notify dms auxiliary thread exit
2024-09-19 17:14:16.597 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [SS] dms auxiliary thread exit
2024-09-19 17:14:16.597 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  DMS uninit worker threads, DRC, errdesc and DL
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Shutting down MOT Engine
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: All background tasks stopped
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Checkpoint Manager
2024-09-19 17:14:37.778 [MOT] <TID:3224506/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker0 - Exiting
2024-09-19 17:14:37.778 [MOT] <TID:3224510/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker1 - Exiting
2024-09-19 17:14:37.778 [MOT] <TID:3224520/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker2 - Exiting
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Recovery Manager
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Redo Log Handler
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: All tables cleared
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Key surrogate manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Table manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Session manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <Memory>             Kernel allocation MM Layer Pre-Shutdown Report report:
  NUMA Interleaved memory usage: Current = 1 MB, Peak = 1 MB

查看相应时间段的DMS日志（目录：$GAUSSLOG/pg_log/DMS/run），DMS频繁报错2节点通信异常，最后打印日志INFO>[mes]: disconnect node 2.，可以看出是由于2节点断开连接，本轮reform失败，由于0节点是first-round reform，因此0节点退出。

UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_send_pipe_nolock priority=4, inst_id=2, channel_id=31 [mes_tcp.c:420]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=4, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=4, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|ERROR>[mes] epoll_ctl event Del failed [fd=667],errno=9. [mes_recv.c:220]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_recv_pipe_nolock priority=5, inst_id=2, channel_id=31 [mes_tcp.c:198]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_recv_pipe priority=5, inst_id=2, channel_id=31 [mes_tcp.c:183]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=5, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=5, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|ERROR>[mes] epoll_ctl event Del failed [fd=668],errno=9. [mes_recv.c:220]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_recv_pipe_nolock priority=6, inst_id=2, channel_id=31 [mes_tcp.c:198]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_recv_pipe priority=6, inst_id=2, channel_id=31 [mes_tcp.c:183]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=6, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=6, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes]: disconnect node 2. [mes_func.c:1954]
UTC+8 2024-09-19 17:14:37.358|DMS|3140211|INFO>[mes] mes_uninit success [mes_func.c:1504]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[mes] start timer, sleep_time:100000 [cm_timer.c:107]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[LOG] file '/usr1/ertao/openGauss/cluster/log/gaussdb/ertao/pg_log/DMS/run/dms.rlog' is added [cm_log.c:899]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[LOG] file '/usr1/ertao/openGauss/cluster/log/gaussdb/ertao/pg_log/DMS/debug/dms.dlog' is added [cm_log.c:899]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[DMS] dms_init start [dms_process.c:1387]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[DMS] dms_set_global_dms start [dms_process.c:1339]

0节点退出后，cm会检测出0节点退出，自动拉起0节点，集群最后会恢复稳定状态。

[xxxxxxxx@xxxxxxxxx111 dn_6001]$ cm_ctl query -Cv
[  CMServer State   ]

node            instance state
--------------------------------
1  xxxxxxxxx111 1        Standby
2  xxxxxxxxx135 2        Standby
3  xxxxxxxxx137 3        Primary


[ Defined Resource State ]

node            res_name instance  state
------------------------------------------
1  xxxxxxxxx111 dms_res  6001      OnLine
2  xxxxxxxxx135 dms_res  6002      OnLine
3  xxxxxxxxx137 dms_res  6003      OnLine
1  xxxxxxxxx111 dss      20001     OnLine
2  xxxxxxxxx135 dss      20002     OnLine
3  xxxxxxxxx137 dss      20003     OnLine

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node            instance state            | node            instance state            | node            instance state
------------------------------------------------------------------------------------------------------------------------------------
1  xxxxxxxxx111 6001     P Primary Normal | 2  xxxxxxxxx135 6002     S Standby Normal | 3  xxxxxxxxx137 6003     S Standby Normal

三、问题根因

在目前DMS的设计中，数据库节点的启动轮reform对于该节点是不允许失败的，如果某节点的启动轮reform失败，DMS会控制该节点退出，然后由CM重新拉起该节点，并由DMS触发新的一轮reform，将集群变为新的稳定状态。在此过程中，用户不需要手动介入，只需要等待reform结束，即cm_ctl query -Cv查询结果，各节点状态都为Normal。

四、解决方案

该问题为reform特殊场景，不需要手动接入，只需要等待reform结束即可。