Node Exit Caused by First-Round Reform Failure at Database Startup

1. Symptom

In the resource-pooling scenario, when a stable cluster encounters a change such as a node restarting, being evicted from the cluster, or joining the cluster, DMS must handle the event and bring the cluster from an unstable state back to a stable one. In the resource-pooling architecture this process is called reform. The reform triggered when a database node starts up is called that node's first reform (startup-round reform). If a further node state change occurs while a reform is in progress, for example a node starting or stopping during the reform, the current round of reform fails, and DMS organizes a new reform based on the current cluster state to bring the cluster to a new stable state. In the current DMS design, a database node's startup-round reform is not allowed to fail for that node: if it does fail, DMS forces the node to exit, CM then restarts the node, and DMS triggers a new round of reform to bring the cluster to a new stable state.

2. Diagnosis

Take a three-node cluster as an example, with nodes 0, 1, and 2, where node 0 is the primary. Run kill -9 xxx to force-stop node 0; then, while node 0 is in the Starting state, run kill -9 xxx to force-stop node 1 or node 2. You will observe that node 0 restarts once more.

Check the pg_log logs for the corresponding time window (directory: $GAUSSLOG/pg_log/dn_600X). You will find the warning WARNING: [SS reform][db sync wait] Node:0 first-round reform failed, shutdown now. Once a database node detects that its first-round reform has failed, it sends SIGTERM to the postmaster thread and goes through the normal database shutdown flow.

2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  [SS reform][db sync wait] Node:0 first-round reform failed, shutdown now
2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  received fast shutdown request
2024-09-19 17:14:12.592 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  send to startup shutdown request
2024-09-19 17:14:16.557 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [DBL_WRT] LOG:  Double write exit
2024-09-19 17:14:16.557 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [DBL_WRT] LOG:  Double write exit
2024-09-19 17:14:16.558 66ebeb64.6536 [unknown] 281429379690368 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  database system is shut down
2024-09-19 17:14:16.568  [postmaster][reaper][281466178961424] LOG: checkpoint thread exit and nowait for sharestorage
2024-09-19 17:14:16.568 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All obsArch and obsBarrier exit.
2024-09-19 17:14:16.581 66ebeae8.6534 [unknown] 281449383309184 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  auditor thread exit, id : 0
2024-09-19 17:14:16.581 66ebeae8.6534 [unknown] 281449383309184 dn_6001 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  auditor shutting down
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All Audit threads exit.
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  All obsArch and obsBarrier exit.
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  page workers all exit or not open parallel redo
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  all build walsenders exited
2024-09-19 17:14:16.595 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [SS] notify dms auxiliary thread exit
2024-09-19 17:14:16.597 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [SS] dms auxiliary thread exit
2024-09-19 17:14:16.597 66ebeab1.1 [unknown] 281466178961424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  DMS uninit worker threads, DRC, errdesc and DL
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Shutting down MOT Engine
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: All background tasks stopped
2024-09-19 17:14:37.776 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Checkpoint Manager
2024-09-19 17:14:37.778 [MOT] <TID:3224506/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker0 - Exiting
2024-09-19 17:14:37.778 [MOT] <TID:3224510/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker1 - Exiting
2024-09-19 17:14:37.778 [MOT] <TID:3224520/-----> <SID:-----/-----> [INFO]     <Checkpoint>         CheckpointWorker2 - Exiting
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Recovery Manager
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Destroying the Redo Log Handler
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: All tables cleared
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Key surrogate manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Table manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <System>             Shutdown: Session manager destroyed
2024-09-19 17:14:37.782 [MOT] <TID:3140211/04092> <SID:-----/-----> [INFO]     <Memory>             Kernel allocation MM Layer Pre-Shutdown Report report:
  NUMA Interleaved memory usage: Current = 1 MB, Peak = 1 MB
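The warning can be located quickly with grep. The sketch below runs against a temporary file seeded with the log line from the excerpt above, so it is self-contained; on a real node, point grep at the actual $GAUSSLOG/pg_log/dn_600X directory instead.

```shell
# Demo on a sample log line copied from the excerpt above; in a real cluster,
# replace "$logdir" with your $GAUSSLOG/pg_log/dn_600X directory.
logdir=$(mktemp -d)
printf '%s\n' 'WARNING:  [SS reform][db sync wait] Node:0 first-round reform failed, shutdown now' \
    > "$logdir/postgresql.log"

# The same command, pointed at the real log directory, locates the warning:
grep -rn "first-round reform failed" "$logdir"
```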

Check the DMS logs for the same time window (directory: $GAUSSLOG/pg_log/DMS/run). DMS repeatedly reports communication errors with node 2 and finally prints INFO>[mes]: disconnect node 2. This shows that node 2 disconnected, causing the current round of reform to fail; since this was node 0's first-round reform, node 0 exits.

UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_send_pipe_nolock priority=4, inst_id=2, channel_id=31 [mes_tcp.c:420]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=4, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=4, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|ERROR>[mes] epoll_ctl event Del failed [fd=667],errno=9. [mes_recv.c:220]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_recv_pipe_nolock priority=5, inst_id=2, channel_id=31 [mes_tcp.c:198]
UTC+8 2024-09-19 17:14:37.353|DMS|3140211|INFO>[mes] mes_close_recv_pipe priority=5, inst_id=2, channel_id=31 [mes_tcp.c:183]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=5, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=5, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|ERROR>[mes] epoll_ctl event Del failed [fd=668],errno=9. [mes_recv.c:220]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_recv_pipe_nolock priority=6, inst_id=2, channel_id=31 [mes_tcp.c:198]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_recv_pipe priority=6, inst_id=2, channel_id=31 [mes_tcp.c:183]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_send_pipe priority=6, inst_id=2, channel_id=31 [mes_tcp.c:405]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes] mes_close_pipe:inst_id=2,channel_id=31, prio=6, recv pipe closed [mes_tcp.c:429]
UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes]: disconnect node 2. [mes_func.c:1954]
UTC+8 2024-09-19 17:14:37.358|DMS|3140211|INFO>[mes] mes_uninit success [mes_func.c:1504]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[mes] start timer, sleep_time:100000 [cm_timer.c:107]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[LOG] file '/usr1/ertao/openGauss/cluster/log/gaussdb/ertao/pg_log/DMS/run/dms.rlog' is added [cm_log.c:899]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[LOG] file '/usr1/ertao/openGauss/cluster/log/gaussdb/ertao/pg_log/DMS/debug/dms.dlog' is added [cm_log.c:899]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[DMS] dms_init start [dms_process.c:1387]
UTC+8 2024-09-19 17:14:42.459|DMS|3452914|INFO>[DMS] dms_set_global_dms start [dms_process.c:1339]
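The disconnect event can likewise be filtered out of the DMS run log. Another self-contained sketch, seeded with a sample line from the excerpt above; on a real node, grep the files under $GAUSSLOG/pg_log/DMS/run instead.

```shell
# Demo on a sample dms.rlog line; on a real node, point grep at
# $GAUSSLOG/pg_log/DMS/run/ instead of the temp file.
rlog=$(mktemp)
printf '%s\n' 'UTC+8 2024-09-19 17:14:37.354|DMS|3140211|INFO>[mes]: disconnect node 2. [mes_func.c:1954]' \
    > "$rlog"

# Filter for loss of connection to a peer node:
grep -n "disconnect node" "$rlog"
```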

After node 0 exits, CM detects the exit and automatically restarts node 0; the cluster eventually returns to a stable state.

[xxxxxxxx@xxxxxxxxx111 dn_6001]$ cm_ctl query -Cv
[  CMServer State   ]

node            instance state
--------------------------------
1  xxxxxxxxx111 1        Standby
2  xxxxxxxxx135 2        Standby
3  xxxxxxxxx137 3        Primary


[ Defined Resource State ]

node            res_name instance  state
------------------------------------------
1  xxxxxxxxx111 dms_res  6001      OnLine
2  xxxxxxxxx135 dms_res  6002      OnLine
3  xxxxxxxxx137 dms_res  6003      OnLine
1  xxxxxxxxx111 dss      20001     OnLine
2  xxxxxxxxx135 dss      20002     OnLine
3  xxxxxxxxx137 dss      20003     OnLine

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node            instance state            | node            instance state            | node            instance state
------------------------------------------------------------------------------------------------------------------------------------
1  xxxxxxxxx111 6001     P Primary Normal | 2  xxxxxxxxx135 6002     S Standby Normal | 3  xxxxxxxxx137 6003     S Standby Normal
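Recovery can be confirmed by polling cm_ctl until cluster_state reads Normal. A minimal sketch; the parsing is demonstrated on a sample string so the snippet is self-contained, and the polling interval in the comment is arbitrary.

```shell
# Extract the cluster_state value from "cm_ctl query -Cv" style output.
parse_state() {
    awk -F: '/cluster_state/ {gsub(/ /, "", $2); print $2}'
}

# On a real node you would loop until the cluster is stable again, e.g.:
#   until cm_ctl query -Cv | parse_state | grep -qx Normal; do sleep 10; done
# Demonstrated here on a sample line matching the output above:
state=$(printf 'cluster_state   : Normal\n' | parse_state)
echo "$state"
```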

3. Root Cause

In the current DMS design, a database node's startup-round reform is not allowed to fail for that node. If a node's startup-round reform fails, DMS makes the node exit, CM restarts it, and DMS triggers a new round of reform to bring the cluster to a new stable state. No manual intervention is needed during this process; simply wait for the reform to finish, i.e., until cm_ctl query -Cv shows every node in the Normal state.

4. Solution

This is expected behavior in a special reform scenario. No manual intervention is required; simply wait for the reform to complete.

openGauss 2024-12-23 00:51:56