因重建集群备机失败导致备份恢复失败的问题
一、问题表现:
行存压缩表,备份恢复后,重建集群备机失败:
- 连接数据库并建立普通表并插入数据:
drop table if exists t_alter_rowcompress_0059 cascade;
create table t_alter_rowcompress_0059(columnone, integer,columntwo char(50), columnthree varchar(50), columnfour integer, columnfive char(50), columnsix varchar(50), columnseven char(50), columneight char(50), columnnine varchar(50), columnten varchar(50), columneleven char(50), columntwelve char(50), columnthirteen varchar(50), columnfourteen char(50), columnfifteem varchar(50));
insert into t_alter_rowcompress_0059 values(generate_series(0, 1000), 'test',substring(md5(random()::text), 1, 16), 2, substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16), substring(md5(random()::text), 1, 16));
- 修改普通表为压缩表:
alter table t_alter_rowcompress_0059 set (compresstype=2, compress_chunk_size=1024, compress_prealloc_chunks=6, compress_byte_convert=true, compress_diff_convert=true, compress_level=6); checkpoint;
- 新建备份目录并初始化:
gs_probackup init -B /home/zcz_ct/f_aler_rowcompress_0059
- 在备份路径内初始化一个新的备份实例:
gs_probackup add-instance -B /home/zcz_ct/f_aler_rowcompress_0059 -D /usr1/zczct_install/omm/openGauss/dn1 --instance=pro1
- 进行全量备份:
gs_probackup backup -B /home/zcz_ct/f_aler_rowcompress_0059 -b full --stream --instance=pro1 --skip-block-validation --no-validate -p 24512 -d postgres
- 删除表,停止数据库的主节点:
drop table if exists t_alter_rowcompress_0059;
- 执行restore恢复流程:
gs_probackup restore -B /home/zcz_ct/f_aler_rowcompress_0059 --instance=pro1 --incremental-mode=checksum -D /usr1/zczct_install/omm/openGauss/dn1;
- 启动数据库主节点,在备节点执行重建备机:
gs_ctl build -D /usr1/zczct_install/omm/openGauss/dn1
二、定位方法:
首先通过备机恢复的打印可以分析出,在读取页面的某个block时出现了错误,最后导致整个页面的checksum计算错误校验不通过而失败。如下所示:
[2024-08-08 16:24:29.120][501509][dn_6001_6002 6003][gs rewind]: path: base/16384/47443_compress, type: O, action: 1 Begin fetching files
[2024-08-08 16:25:29.427][501509][dn 6001 6002 6003][gs rewind]: unexpected result while fetching remote files: ERROR: can not read actual block 0, error code: 18446744073709551613,CONTEXT: referenced column: jsongs rewind receive FATAL, it will exit
[2024-08-08 16:25:29.428][501509][dn_6001 6002 6003][gs_ctl]: inc build failed.
[2024-08-08 16:25:29.428][501509][dn 6001 6002 6003][gs ctl]: Get repl auth mode is and repl uuid is
[2024-08-08 16:25:29.434][501509][dn 6001 6002 6003][gs rewind]: last build uncompleted, change to full build.gs rewind receive FATAL, it will exit
[2024-08-08 16:25:29.434][501509][dn 6001 6002 6003][gs ctl]: inc build failed.
[2024-08-08 16:25:29.435][501509][dn 6001 6002 6003][gs ctl]: Get repl auth mode is and repl uuid is
[2024-08-08 16:25:29.441][501509][dn_6001_6002_6003][gs_rewind]: last build uncompleted, change to full build.gs rewind receive FATAL, it will exit.palled plinq ou :[110s6]
[2024-08-08 16:25:29.441][501509][dn 6001 6002 6003][gs ctl]: inc build failed, change to full build.
[2024-08-08 16:25:29.441][501509][dn_6001_6002 6003][gs_ctl]: set gaussdb state file when auto build build:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(FULL_BUILD).
[2024-08-08 16:25:29.442][501509][dn 6001 6002 6003][gs ctl]: Get repl auth mode is and repl uuid is
[2024-08-08 16:25:29.447][501509][dn 6001 6002 6003][gs ctl]: connected to server success, build started.
[2024-08-08 16:25:29.521][501509][dn 6001 6002 6003][gs ctl]: clear old target dir success
[2024-08-08 16:25:29.521][501509][dn 6001 6002 6003][gs ctl]: create build tag file success
[2024-08-08 16:25:29.521][501509][dn_6001_60026003][gs_ctl]: create build tag file again successcr
[2024-08-08 16:25:29.521][501509][dn 6001 6002 6003][gs ctl]: get system identifier success
[2024-08-08 16:25:29.521][501509][dn 6001 6002 6003][gs ctl]: create backup label success
[2024-08-08 16:25:29.590][501509][dn_6001_6002 6003][gs_ctl]: xlog start point: 0/24004690
[2024-08-08 16:25:29.590][501509][dn 6001 6002 6003][gs_ctl]: begin build tablespace listae
[2024-08-08 16:25:29.590][501509][dn 6001 6002 6003][gs ctl]: finish build tablespace list
[2024-08-08 16:25:29.590][501509][dn_6001_6002_6003][gs_ctl]: begin get xlog by xlogstreamar
[2024-08-08 16:25:29.590][501509][dn 6001 6002 6003][gs ctl]: starting background WAL receiver
[2024-08-08 16:25:29.590][501509][dn 6001 6002 6003][gs_ctl]: starting walreceiver
[2024-08-08 16:25:29.592][501509][dn 6001 6002 6003][gs_ctl]: begin receive tar filese
[2024-08-08 16:25:29.598][501509][dn 6001 6002 6003][gs ctl]: check identify system success
[2024-08-08 16:25:29.598][501509][dn 6001 6002 6003][gs ctl]: send START_REPLICATION 0/24000000 success
Begin Receiving files
Progress: [=== ==] 100% (148468/148468KB). (1/1)tablespaces. Receive files
Finish Receiving files
[2024-08-08 16:26:29.858][501509][dn_6001_6002_6003][gs_ctl]: could not get WAL end position from server: FATAL: checksum error
因为重建备机的过程不涉及对表的修改,由此可以推断出是压缩表在备份过程中损坏了;通过向上分析,先使用pagehack工具分析备份恢复的文件。发现恢复后的文件已经损坏:
-rW- 1 zcz ct zcz ct 4 Aug 30 15:24 PG VERSION
$ pagehack -f 16445_compressa
free(): double free detected in tcache 2
Aborted (core dumped)
$ pagehack -f 16448_compress
free(): double free detected in tcache 2
Aborted (core dumped)
通过GDB跟踪备份恢复过程的写文件操作:
Breakpoint 5, CfsReadCompressedPage (dst=0xffffffffb208 "", destLen=8192, extent offset blkno=0, cfsReadStruct=0xffffffffb180, blockNum=0) at ofs_tools.cpp:70
70 in ofs tools.cpp
(qdb)p cfsExtentAddress
$2 = (CfsExtentAddress *) 0xfffff7fd000c
(gdb)p *cfsExtentAddress
$3 = {checksum = 1050652, nchunks = 8 '\b', allocated_chunks = 8 '\b', Chunknos = 0xfffff7fd0012}
(gdb)p *cfsExtentAddress.chunknos
$4 = 1
(gdb)p *cfsExtentAddress.chunknos+1
$5 = 2
(gdb) p *cfsExtentAddress.(chunknos+1)
A syntax error in expression, near (chunknos+1)
(gdb) p *(cfsExtentAddress.chunknos )+1
$6 = 2
(gdb) p *(cfsExtentAddress.chunknos+7)
$9 = 8
(gdb) p *(cfsExtentAddress.chunknos+8 )
$10 = 0
通过跟踪restore阶段的gdb发现是在向新的文件里写数据时;计算压缩页面的size错误;而导致的checksum计算值不同;通过分析CalRealWriteSize;发现在处理压缩与非压缩页面时都是返回一个固定值导致的;
三、问题根因:
备份恢复过程中,重写压缩文件的流程出错;因为压缩标记的原因,导致部分数据块返回的size大小不正确;也导致后面重建备机过程中的校验checksum不正确;
四、解决方案:
需要在GetSizeOfCprsHeadData函数对每个页面的block进行准确的计算即可;所以删除掉针对带COMP_ASIGNMENT标记页面的特殊处理;还是通过计算来向新的页面写入正确数据,就可以保证checksum值的准确性;
意见反馈