Cluster Startup Failure Caused by Missing Files or Directories

I. Symptom

An openGauss resource pooling cluster fails to start or restart and reports the following error:

    cm_ctl: start cluster failed in (601)s!

    HINT: Maybe the cluster is continually being started in the background.
    You can wait for a while and check whether the cluster starts, or increase the value of parameter "-t", e.g -t 600.

The message reports that starting the cluster failed because the 10-minute timeout was exceeded.

Querying the cluster status with cm_ctl query -Cvipd shows that all nodes are normal under CMServer State, while under Datanode State some nodes are in the Down Manually stopped state and cluster_state is Degraded.

II. Diagnosis

  1. Log in to the faulty node, go to the $GAUSSLOG/cm/cm_agent directory, and open the most recent cm_agent log for that node. It contains errors such as the following:

    2024-10-10 09:26:02.270 tid=2015951  LOG: gaussdb state file "/.../.../dn1/gaussdb.state" is not exist, could not get the build infomation: No such file or directory
    
    2024-10-10 09:26:02.746 tid=2015996 DiskUsageCheck ERROR: [GetDiskUsageForPath][line:908] GetDiskUsageForPath /.../.../dn1 disk usage failed! errno:2 err:No such file or directory.
    
    2024-10-10 09:26:02.880 tid=2015948  ERROR: [get_connection: 1526]: fail to read pid file (/.../.../dn1/postmaster.pid).
    2024-10-10 09:26:02.880 tid=2015948  ERROR: failed to connect to datanode:/.../...//dn1
    

    The absolute path /.../.../dn1 is $PGDATA. These errors indicate that the $PGDATA directory does not exist.

  2. Running cd $PGDATA fails with the following output, which confirms that the $PGDATA directory does not exist.

    [  ~]$ cd $PGDATA
    -bash: cd: /.../.../dn1: No such file or directory
    
  3. If the cm_agent log contains no clear error and the $PGDATA directory can be entered, check the DBstart.log log. It contains errors such as the following:

    2024-10-10 09:55:35.343 67073417.1 [unknown] 281471761580048 [unknown] 0 dn_6001_6002 58P01  0 [BACKEND] LOG:  could not open configuration file "/.../.../dn1/pg_hba.conf": No such file or directory
    
    2024-10-10 09:55:35.543 67073417.1 [unknown] 281471761580048 [unknown] 0 dn_6001_6002 58P01  0 [BACKEND] LOG:  could not open configuration file "/.../.../dn1/pg_hba.conf": No such file or directory
    
    2024-10-10 09:55:35.743 67073417.1 [unknown] 281471761580048 [unknown] 0 dn_6001_6002 58P01  0 [BACKEND] LOG:  could not open configuration file "/.../.../dn1/pg_hba.conf": No such file or directory
    
    2024-10-10 09:55:35.743 67073417.1 [unknown] 281471761580048 [unknown] 0 dn_6001_6002 42809  0 [BACKEND] FATAL:  could not load pg_hba.conf
    

    This database startup log states explicitly that the pg_hba.conf file does not exist.
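Steps 1 and 3 both come down to spotting "No such file or directory" entries in a log. The following is a minimal sketch of automating that search, assuming the missing path appears in double quotes as in the excerpts above. The function name missing_paths is hypothetical and is not part of openGauss. Entries that name the path without quotes (such as the DiskUsageCheck line) are not captured.

```shell
# Hypothetical helper: list the distinct double-quoted paths that the given
# log files report as missing ("No such file or directory").
# Usage: missing_paths <log-file>...
missing_paths() {
    grep -h 'No such file or directory' "$@" \
        | grep -o '"/[^"]*"' \
        | tr -d '"' \
        | sort -u
}
```

Pass it whichever cm_agent or DBstart.log files are being inspected. Each path is printed once, however many times the log repeats the error.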

III. Root Cause

System configuration files or directories are missing, which causes the cluster to fail to start or restart.
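To confirm which items are missing on a node, a small check along the following lines may help. It is only a sketch: the function name check_datadir is hypothetical, and the default file list is an assumption. Only pg_hba.conf comes from the log excerpt above, so pass the file names that matter for your deployment.

```shell
# Hypothetical helper: report which expected entries are absent from a data directory.
# Usage: check_datadir <data-dir> [file ...]
# Prints one line per missing item; returns 0 if nothing is missing, 1 otherwise.
check_datadir() {
    dir=$1
    shift
    # Assumed default list: pg_hba.conf appears in the startup log above;
    # postgresql.conf is an added assumption.
    [ "$#" -gt 0 ] || set -- pg_hba.conf postgresql.conf
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    rc=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing file: $dir/$f"
            rc=1
        fi
    done
    return $rc
}
```

For example, check_datadir "$PGDATA" reports a missing directory in the situation from step 2 and a missing pg_hba.conf in the situation from step 3.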

IV. Solution

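If a node in a normal state still exists, copy the missing directory or files from that healthy node to the faulty node. If no healthy node is available, uninstall the cluster, then reinstall and start it.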
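The copy from a healthy node could be done with scp, for instance. The sketch below only prints the command so that it can be reviewed before anything is overwritten. The function name copy_cmd, the host name, and the paths in the usage note are hypothetical, and the sketch assumes paths without whitespace.

```shell
# Hypothetical helper: print (do not run) the scp command that would pull one
# file from a healthy node into the local data directory.
# -p asks scp to preserve modification times and modes.
# Usage: copy_cmd <healthy-host> <remote-data-dir> <local-data-dir> <file>
copy_cmd() {
    printf 'scp -p %s:%s/%s %s/%s\n' "$1" "$2" "$4" "$3" "$4"
}
```

Calling copy_cmd node2 /data/dn1 /data/dn1 pg_hba.conf prints the command for a host named node2. Run the printed command as the database OS user, check the copied file's contents and ownership, and then retry the cluster start.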

openGauss 2025-04-26 10:07:32