因资源实例状态异常导致节点被clean重启的问题

一、问题现象

如果观察到某资源实例（比如dss，dms等）突然OffLine，然后间隔一段时间后，查询集群状态发现节点已经被踢出，数据库状态进入Manually stopped状态，如下所示：

[ Defined Resource State ]

node            node_ip         res_name instance  state  
----------------------------------------------------------
1  openGauss111 NodeIP_1    dms_res  6001      OnLine 
2  openGauss135 NodeIP_2    dms_res  6002      OnLine 
3  openGauss137 NodeIP_3    dms_res  6003      OnLine 
1  openGauss111 NodeIP_1    dss      20001     OffLine
2  openGauss135 NodeIP_2    dss      20002     OnLine 
3  openGauss137 NodeIP_3    dss      20003     OnLine 



[ Defined Resource State ]

node            node_ip         res_name instance  state  
----------------------------------------------------------
1  openGauss111 NodeIP_1    dms_res  6001      OffLine
2  openGauss135 NodeIP_2    dms_res  6002      OnLine 
3  openGauss137 NodeIP_3    dms_res  6003      OnLine 
1  openGauss111 NodeIP_1    dss      20001     OnLine
2  openGauss135 NodeIP_2    dss      20002     OnLine 
3  openGauss137 NodeIP_3    dss      20003     OnLine 



cm_ctl query -Cvipdw

[  CMServer State   ]

node            node_ip         instance                                                state
-----------------------------------------------------------------------------------------------
1  openGauss111 NodeIP_1    1    /usr2/test_0925_install/omm/openGauss/cm/cm_server Standby
2  openGauss135 NodeIP_2    2    /usr2/test_0925_install/omm/openGauss/cm/cm_server Primary
3  openGauss137 NodeIP_3    3    /usr2/test_0925_install/omm/openGauss/cm/cm_server Standby


[ Defined Resource State ]

node            node_ip         res_name instance  state  
----------------------------------------------------------
1  openGauss111 NodeIP_1    dms_res  6001      Deleted
2  openGauss135 NodeIP_2    dms_res  6002      OnLine 
3  openGauss137 NodeIP_3    dms_res  6003      OnLine 
1  openGauss111 NodeIP_1    dss      20001     Deleted
2  openGauss135 NodeIP_2    dss      20002     OnLine 
3  openGauss137 NodeIP_3    dss      20003     OnLine 

[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node            node_ip         instance                                              state
---------------------------------------------------------------------------------------------------------
1  openGauss111 NodeIP_1    6001 20128  /usr2/test_0925_install/omm/openGauss/dn1 P Down    Manually stopped
2  openGauss135 NodeIP_2    6002 20128  /usr2/test_0925_install/omm/openGauss/dn1 S Primary Normal
3  openGauss137 NodeIP_3    6003 20128  /usr2/test_0925_install/omm/openGauss/dn1 S Standby Normal

二、定位方法

首先进入cm_server主节点的$GAUSSLOG/cm/cm_server，打开key-event日志，搜索kick out关键字，找到最后一次该节点被踢出的记录，并记录该次踢出时间。
```
2024-10-10 15:43:46.137 tid=3356184 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(1) kick out.]
```

通过上一步记录的时间，寻找该节点被踢出的原因。

2024-10-10 15:43:46.137 tid=3356184 MaxClusterAb LOG: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(1) kick out.]
2024-10-10 15:43:46.137 tid=3356184 MaxClusterAb LOG: node(1) stat (unavail).
2024-10-10 15:43:46.137 tid=3356184 MaxClusterAb LOG: kick out result: node(1) res inst manual stop or report timeout.

如果观察到'res inst manual stop or report timeout'字样的描述则说明是有资源实例不正常所导致的节点被踢出。

搜索restart关键字，观察有没有类似如下字样的日志：

2024-10-10 15:43:42.442 tid=2498489 StartAndStop LOG: cur inst(20001) isreg stat=(4), and reg failed, restart failed.
2024-10-10 15:43:42.442 tid=2498489 StartAndStop LOG: res(dss) inst(20001) has been restart (4) times, restart more than (5) time will manually stop.
2024-10-10 15:43:43.792 tid=2498489 StartAndStop LOG: cur inst(20001) isreg stat=(4), and reg failed, restart failed.
2024-10-10 15:43:43.792 tid=2498489 StartAndStop LOG: res(dss) inst(20001) has been restart (5) times, restart more than (5) time will manually stop.
2024-10-10 15:43:45.117 tid=2498489 StartAndStop LOG: res(dss) inst(20001) is offline, but restart times (5) >= limit (5), can't do restart again, will do manually stop.

或者如下字样：

2024-10-09 11:31:57.208 tid=1143779 StartAndStop LOG: res(dms_res) inst(6001) has been restart (2) times, restart more than (5) time will manually stop.

则证明CM所监控的资源实例出现故障导致节点被踢出集群。

需要通过对应资源实例的日志对故障进行进一步排查。

三、根因分析

通常情况下，cm_agent在重启周期30s内，重启资源实例超过5次都失败，则认为资源实例状态异常，此时会调用脚本将资源实例clean掉，并将该节点踢出集群。

四、解决方案

首先排查$GAUSSHOEM/bin下的dss_conctrl.sh和dms_conctrl.sh有无异常。
进一步观察$DSS_HOME下的dss日志或$GAUSSLOG下的pg_log日志，进一步排查资源实例故障的原因。