Symptoms:
- OS: CentOS 7
- First, SMART errors were reported and one disk could not be read or written
- Then, files in users' home directories became inaccessible
- Finally, after a reboot, the machine would no longer boot into the system
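Before anything else, SMART status can confirm which disk is failing; a minimal sketch using smartmontools (the device name /dev/sdb is an assumption):
# overall health verdict (PASSED/FAILED)
sudo smartctl -H /dev/sdb
# full SMART attributes and error log
sudo smartctl -a /dev/sdb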
Recovery
Preparation
- On another machine, install CentOS 7 onto an SSD in a USB enclosure. When creating the volume group, give it a name different from the one on the failed machine; otherwise the volume group will have to be renamed after boot (see the sketch below this list)
- Plug the USB SSD into a USB port on the failed machine
- Reboot, choose USB boot, and enter the CentOS system
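If both installs had ended up with the same VG name (e.g. centos), the clash could still be resolved after boot by renaming one VG through its UUID. A minimal sketch; the UUID shown is a placeholder:
# list VG names alongside their UUIDs
sudo vgs -o vg_name,vg_uuid
# rename the rescue system's VG by UUID (placeholder UUID)
sudo vgrename Zvlifi-Ep3t-e0Ng-U42h-o0ye-KHu1-nl7Ns4 vg01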
Inspecting the system
Boot into the rescue system and list the block devices:
[rheo@gen ~]$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 192M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 26G 0 part
├─vg01-root 253:0 0 20G 0 lvm /
├─vg01-swap 253:1 0 2G 0 lvm [SWAP]
└─vg01-home 253:6 0 4G 0 lvm /home
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 200M 0 part
├─sdc2 8:34 0 1G 0 part
└─sdc3 8:35 0 1.8T 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm
├─centos-swap 253:4 0 15.5G 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm
The layout is as follows:
- /dev/sda is the USB SSD holding the current rescue system
- /dev/sdb is a Samsung 870 EVO 1TB SSD
- /dev/sdc is the disk that held the original system's root
- /dev/sdd is a standalone disk with an ext4 partition
- /dev/sde and /dev/sdf hold part of the /home logical volume in the centos volume group
Try mounting the old system:
[rheo@gen ~]$ sudo mount /dev/centos/root /mnt/lv_root/
mount: /dev/mapper/centos-root: can't read superblock
[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ sudo mkdir /mnt/lv_opt
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
mount: /dev/mapper/centos-opt: can't read superblock
[rheo@gen ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 4.0K 32G 1% /dev/shm
tmpfs 32G 20M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/mapper/vg01-root 20G 4.1G 16G 21% /
/dev/sda2 1.1G 173M 874M 17% /boot
/dev/sda1 192M 12M 181M 6% /boot/efi
/dev/mapper/vg01-home 4.0G 145M 3.9G 4% /home
tmpfs 6.3G 40K 6.3G 1% /run/user/1000
/dev/mapper/centos-home 5.6T 5.0T 611G 90% /mnt/lv_home
As shown above, the old root and opt can no longer be mounted, while home still mounts.
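Before reaching for repair tools, one low-risk option for a filesystem that fails like this is a read-only mount that skips log recovery; a sketch (XFS requires ro together with norecovery):
sudo mount -o ro,norecovery /dev/centos/root /mnt/lv_root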
Plug a 4 TB disk into the machine:
[rheo@gen ~]$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 192M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 26G 0 part
├─vg01-root 253:0 0 20G 0 lvm /
├─vg01-swap 253:1 0 2G 0 lvm [SWAP]
└─vg01-home 253:6 0 4G 0 lvm /home
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 200M 0 part
├─sdc2 8:34 0 1G 0 part
└─sdc3 8:35 0 1.8T 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
├─centos-swap 253:4 0 15.5G 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
sdg 8:96 0 3.7T 0 disk
Partition /dev/sdg:
[rheo@gen ~]$ sudo fdisk /dev/sdg
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x23a98ff1.
WARNING: The size of this disk is 4.0 TB (4000787030016 bytes).
DOS partition table format can not be used on drives for volumes
larger than (2199023255040 bytes) for 512-byte sectors. Use parted(1) and GUID
partition table format (GPT).
The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.
Command (m for help): p
Disk /dev/sdg: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x23a98ff1
Device Boot Start End Blocks Id System
Command (m for help): G
Building a new GPT disklabel (GUID: 9CADE58B-0615-4431-8512-3A95FF6A9A72)
Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-7814037134, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-7814037134, default 7814037134): +2T
Created partition 1
Command (m for help): n
Partition number (2-128, default 2):
First sector (4294969344-7814037134, default 4294969344):
Last sector, +sectors or +size{K,M,G,T,P} (4294969344-7814037134, default 7814037134):
Created partition 2
Command (m for help): p
Disk /dev/sdg: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: 9CADE58B-0615-4431-8512-3A95FF6A9A72
# Start End Size Type Name
1 2048 4294969343 2T Linux filesyste
2 4294969344 7814037134 1.7T Linux filesyste
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
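As the warning above says, parted is the usual tool for GPT on disks larger than 2 TB. A non-interactive sketch that produces an equivalent layout:
# create a GPT label and two partitions (first 2 TiB, then the rest)
sudo parted -s /dev/sdg mklabel gpt
sudo parted -s -a optimal /dev/sdg mkpart part1 0% 2TiB
sudo parted -s -a optimal /dev/sdg mkpart part2 2TiB 100%
sudo parted -s /dev/sdg print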
List the devices:
[rheo@gen ~]$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 192M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 26G 0 part
├─vg01-root 253:0 0 20G 0 lvm /
├─vg01-swap 253:1 0 2G 0 lvm [SWAP]
└─vg01-home 253:6 0 4G 0 lvm /home
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 200M 0 part
├─sdc2 8:34 0 1G 0 part
└─sdc3 8:35 0 1.8T 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
├─centos-swap 253:4 0 15.5G 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm /mnt/lv_home
sdg 8:96 0 3.7T 0 disk
├─sdg1 8:97 0 2T 0 part
└─sdg2 8:98 0 1.7T 0 part
Unmount the old /home filesystem:
[rheo@gen ~]$ sudo umount /dev/centos/home
After a reboot:
[rheo@gen ~]$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 192M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 26G 0 part
├─vg01-root 253:0 0 20G 0 lvm /
├─vg01-swap 253:1 0 2G 0 lvm [SWAP]
└─vg01-home 253:6 0 4G 0 lvm /home
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 200M 0 part
├─sdc2 8:34 0 1G 0 part
└─sdc3 8:35 0 1.8T 0 part
├─centos-root 253:2 0 422.6G 0 lvm
├─centos-home 253:3 0 5.6T 0 lvm
├─centos-swap 253:4 0 15.5G 0 lvm
└─centos-opt 253:5 0 379.5G 0 lvm
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part
sde 8:64 0 3.7T 0 disk
├─sde1 8:65 0 2T 0 part
└─sde2 8:66 0 1.7T 0 part
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm
sdg 8:96 0 1.8T 0 disk
└─sdg1 8:97 0 1.8T 0 part
└─centos-home 253:3 0 5.6T 0 lvm
/dev/sdg has become /dev/sde. Create an XFS filesystem on the new data partition:
[rheo@gen ~]$ sudo mkfs.xfs /dev/sde2
meta-data=/dev/sde2 isize=512 agcount=4, agsize=109970869 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=439883473, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=214786, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
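Since kernel names like sdg can reshuffle on every boot, stable identifiers are safer when a slip would mean writing to the wrong disk; a sketch:
# persistent, hardware-derived names for disks and partitions
ls -l /dev/disk/by-id/
# filesystem UUID, usable in /etc/fstab instead of /dev/sdX names
sudo blkid /dev/sde2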
Create a mount point and mount /dev/sde2:
[rheo@gen ~]$ sudo mkdir /mnt/sde2
[rheo@gen ~]$ sudo mount /dev/sde2 /mnt/sde2/
Back up the source (damaged) disk with dd:
[rheo@gen ~]$ sudo dd if=/dev/sdh1 of=/mnt/sde2/sdc.img conv=noerror
[sudo] password for rheo:
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 5.48293 s, 16.2 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 9.45846 s, 9.4 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 13.4746 s, 6.6 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 17.4715 s, 5.1 MB/s
Some data could not be read, but the backup was left to run to completion, which took quite a while, roughly 8 hours. The data was then restored onto a new 4 TB SSD (/dev/sdh), and that SSD was backed up in turn:
[rheo@gen ~]$ sudo dd if=/dev/sdh1 of=/run/media/rheo/cce2b7fa-4187-4c2a-9390-5634b47988b0/sdc1.img conv=noerror
1953523087+0 records in
1953523087+0 records out
1000203820544 bytes (1.0 TB) copied, 7827.84 s, 128 MB/s
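A caveat on these dd runs: with conv=noerror alone, dd skips the blocks it cannot read without writing anything in their place, so all data after an error shifts within the image. Adding sync pads each failed block with zeros and keeps offsets intact; a sketch:
# noerror: continue after read errors; sync: zero-fill the failed blocks
sudo dd if=/dev/sdh1 of=/mnt/sde2/sdc.img bs=64K conv=noerror,sync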
That copy ran fast. Hoping at least part of the data was readable, the next attempt was to back up the metadata:
[rheo@gen ~]$ sudo xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump
Metadata CRC error detected at xfs_agf block 0x2b1201808/0x1000
xfs_metadump: cannot init perag data (-74). Continuing anyway.
...
xfs_metadump: cannot read superblock for ag 5
Metadata CRC error detected at xfs_agf block 0x7d00008/0x1000
Metadata CRC error detected at xfs_agi block 0x7d00010/0x1000
Metadata CRC error detected at xfs_agfl block 0x7d00018/0x1000
/sbin/xfs_metadump: line 33: 12441 Segmentation fault (core dumped) xfs_db$DBOPTS -i -p xfs_metadump -c "metadump$OPTS $2" $1
The metadata backup could not complete, presumably because the data is incomplete. Running ddrescue against /dev/sdh1 (the dd-restored copy) surfaced no read errors, because dd had already masked the unreadable areas during the copy.
[rheo@gen hd-8T]$ sudo ddrescue -f -n /dev/centos/root /mnt/hd-8T/centos-root.rescue.img root-rescue.log
GNU ddrescue 1.27
Press Ctrl-C to interrupt
ipos: 453769 MB, non-trimmed: 0 B, current rate: 112 MB/s
opos: 453769 MB, non-scraped: 0 B, average rate: 176 MB/s
non-tried: 0 B, bad-sector: 0 B, error rate: 0 B/s
rescued: 453769 MB, bad areas: 0, run time: 42m 44s
pct rescued: 100.00%, read errors: 0, remaining time: n/a
time since last successful read: n/a
Copying non-tried blocks... Pass 1 (forwards)
Finished
Create a loop device from centos-root.rescue.img:
[rheo@gen hd-8T]$ sudo losetup --find --show /mnt/hd-8T/centos-root.rescue.img
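losetup --find --show prints the loop device it allocated (e.g. /dev/loop0); that device can then be mounted read-only for inspection and detached afterwards. A sketch, with the device name assumed:
sudo mount -o ro /dev/loop0 /mnt/lv_root
# ... inspect the contents ...
sudo umount /mnt/lv_root
sudo losetup -d /dev/loop0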
Scanning the transferred new disk with UFS Explorer:
- under /home/, only 56 MB of files were found
- under /opt/, 485 GB of files were found
- under /, 17 GB of files were found
Plug in the originally failing SSD and read it out with ddrescue:
[rheo@gen ~]$ sudo ddrescue -d -f -r3 /dev/sdb1 /mnt/hd-8T/ssd.img ssd-rescue.log
GNU ddrescue 1.27
Press Ctrl-C to interrupt
ipos: 1000 GB, non-trimmed: 15532 kB, current rate: 272 MB/s
opos: 1000 GB, non-scraped: 0 B, average rate: 140 MB/s
non-tried: 2667 MB, bad-sector: 0 B, error rate: 0 B/s
rescued: 997521 MB, bad areas: 0, run time: 1h 58m 31s
pct rescued: 99.73%, read errors: 237, remaining time: 11s
time since last successful read: 0s
Copying non-tried blocks... Pass 1 (forwards)
ipos: 98697 kB, non-trimmed: 30277 kB, current rate: 18677 kB/s
opos: 98697 kB, non-scraped: 0 B, average rate: 139 MB/s
non-tried: 1338 MB, bad-sector: 0 B, error rate: 196 kB/s
rescued: 998834 MB, bad areas: 0, run time: 1h 59m 30s
pct rescued: 99.86%, read errors: 462, remaining time: 55s
time since last successful read: 0s
Copying non-tried blocks... Pass 2 (backwards)
ipos: 941692 MB, non-trimmed: 416874 kB, current rate: 8454 kB/s
opos: 941692 MB, non-scraped: 0 B, average rate: 115 MB/s
non-tried: 0 B, bad-sector: 0 B, error rate: 262 kB/s
rescued: 999786 MB, bad areas: 0, run time: 2h 24m 42s
pct rescued: 99.95%, read errors: 6361, remaining time: 10m
time since last successful read: 0s
Copying non-tried blocks... Pass 5 (forwards)
ipos: 941692 MB, non-trimmed: 0 B, current rate: 81920 B/s
opos: 941692 MB, non-scraped: 332089 kB, average rate: 101 MB/s
non-tried: 0 B, bad-sector: 1391 kB, error rate: 1024 B/s
rescued: 999870 MB, bad areas: 2717, run time: 2h 43m 33s
pct rescued: 99.96%, read errors: 9079, remaining time: 51m
time since last successful read: 0s
Trimming failed blocks... (forwards)
ipos: 90021 kB, non-trimmed: 0 B, current rate: 512 B/s
opos: 90021 kB, non-scraped: 331689 kB, average rate: 101 MB/s
non-tried: 0 B, bad-sector: 1593 kB, error rate: 3072 B/s
rescued: 999870 MB, bad areas: 2743, run time: 2h 44m 39s
pct rescued: 99.96%, read errors: 9473, remaining time: 1d 11h 13m
time since last successful read: 0s
Scraping failed blocks... (forwards)
Quite a few read errors, but the data was read through to the end regardless.
Upgraded xfsprogs to 5.0; xfs_repair still could not get past the superblock problem.
Using the img file produced by ddrescue, the data was restored onto a partition of another disk. Then /etc/lvm/lvm.conf was edited to filter out the broken disk /dev/sdb1, and the volume groups were rescanned as follows.
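The lvm.conf change is a device filter in the devices section; a sketch that rejects the broken disk and accepts everything else (adjust the pattern to the actual device names):
# /etc/lvm/lvm.conf
devices {
    filter = [ "r|^/dev/sdb.*|", "a|.*|" ]
}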
[rheo@gen ~]$ sudo vgchange -an centos
WARNING: Device mismatch detected for centos/root which is accessing /dev/sdb1 instead of /dev/sdh1.
WARNING: Device mismatch detected for centos/home which is accessing /dev/sdb1 instead of /dev/sdh1.
WARNING: Device mismatch detected for centos/opt which is accessing /dev/sdb1 instead of /dev/sdh1.
0 logical volume(s) in volume group "centos" now active
[rheo@gen ~]$ sudo pvscan
Error reading device /dev/sdb at 0 length 512.
Error reading device /dev/sdb at 0 length 4.
Error reading device /dev/sdb at 4096 length 4.
Error reading device /dev/sdb1 at 0 length 4.
Error reading device /dev/sdb1 at 4096 length 4.
PV /dev/sda3 VG vg01 lvm2 [25.90 GiB / 0 free]
PV /dev/sdc3 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sdf1 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sdg1 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sdh1 VG centos lvm2 [<931.51 GiB / 0 free]
Total: 5 [6.39 TiB] / in use: 5 [6.39 TiB] / in no VG: 0 [0 ]
[rheo@gen ~]$ sudo vgscan
Reading volume groups from cache.
Found volume group "vg01" using metadata type lvm2
Found volume group "centos" using metadata type lvm2
[rheo@gen ~]$ sudo lvscan
ACTIVE '/dev/vg01/root' [19.97 GiB] inherit
ACTIVE '/dev/vg01/home' [3.99 GiB] inherit
ACTIVE '/dev/vg01/swap' [<1.94 GiB] inherit
inactive '/dev/centos/root' [<422.61 GiB] inherit
inactive '/dev/centos/home' [<5.57 TiB] inherit
inactive '/dev/centos/swap' [15.50 GiB] inherit
inactive '/dev/centos/opt' [<379.46 GiB] inherit
[rheo@gen ~]$ sudo vgchange -ay centos
4 logical volume(s) in volume group "centos" now active
[rheo@gen ~]$ sudo vgscan
Reading volume groups from cache.
Found volume group "vg01" using metadata type lvm2
Found volume group "centos" using metadata type lvm2
[rheo@gen ~]$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
centos 4 4 0 wz--n- <6.37t 0
vg01 1 3 0 wz--n- 25.90g 0
[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ ls /mnt/lv_home/
amin autossh azhang fhuang rdu swang sxu wzeng xhuang xye yliu ywang yyang yzhou zdu zliu
[rheo@gen ~]$ sudo umount /mnt/lv_home
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
[rheo@gen ~]$ ls /mnt/lv_opt/
ansys_inc code_saturne lammps recoverX SALOME-9.3.0-CO7-SRC WindowsImageBackup
aster google recoverB rh SALOME-9.3.0-CO7-SRC.tgz zoom
[rheo@gen ~]$ sudo umount /mnt/lv_opt
[rheo@gen ~]$ sudo mount /dev/centos/root /mnt/lv_root/
[rheo@gen ~]$ ls /mnt/lv_root
bin dev home lib64 mnt proc recoverA root sbin srv tmp var
boot etc lib media opt public rheoData run share sys usr
[rheo@gen ~]$ sudo umount /dev/centos/root
Now it looks like all the data is still there. Time to make backups, the data first. Use xfs_metadump to back up the metadata:
[root@gen hd-8T]$ su
[root@gen hd-8T]# xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump |tee /mnt/hd-8T/home-backup.log
^Z
[1]+ Stopped xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump | tee /mnt/hd-8T/home-backup.log
[root@gen hd-8T]# bg
[1]+ xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump | tee /mnt/hd-8T/home-backup.log &
[root@gen hd-8T]# exit
exit
As an additional check, run xfs_repair -n:
[rheo@gen mnt]$ sudo xfs_repair -n /dev/centos/opt
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 08:24:14: scanning filesystem freespace - 16 of 16 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 08:24:14: scanning agi unlinked lists - 16 of 16 allocation groups done
- process known inodes and perform inode discovery...
- agno = 15
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 08:24:37: process known inodes and inode discovery - 711488 of 711488 inodes done
- process newly discovered inodes...
- 08:24:37: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:24:37: setting up duplicate extent list - 16 of 16 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 14
- agno = 12
- agno = 7
- agno = 6
- agno = 11
- agno = 9
- agno = 8
- agno = 13
- agno = 10
- agno = 4
- agno = 15
- agno = 5
- 08:24:37: check for inodes claiming duplicate blocks - 711488 of 711488 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
- 08:24:50: verify and correct link counts - 16 of 16 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
[rheo@gen mnt]$ sudo xfs_repair -n /dev/centos/root
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 08:25:37: scanning filesystem freespace - 34 of 34 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 08:25:37: scanning agi unlinked lists - 34 of 34 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 30
- agno = 15
- agno = 16
- agno = 31
- agno = 32
- agno = 17
- agno = 33
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 08:25:53: process known inodes and inode discovery - 463040 of 463040 inodes done
- process newly discovered inodes...
- 08:25:53: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:25:53: setting up duplicate extent list - 34 of 34 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 5
- agno = 2
- agno = 13
- agno = 16
- agno = 1
- agno = 24
- agno = 28
- agno = 31
- agno = 10
- agno = 32
- agno = 11
- agno = 12
- agno = 3
- agno = 14
- agno = 15
- agno = 18
- agno = 19
- agno = 4
- agno = 17
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 6
- agno = 25
- agno = 26
- agno = 27
- agno = 7
- agno = 29
- agno = 30
- agno = 8
- agno = 33
- agno = 9
- 08:25:54: check for inodes claiming duplicate blocks - 463040 of 463040 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
- 08:26:10: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
Looks clean. Check the sizes of the backup files:
[rheo@gen hd-8T]$ ls -lah
total 981G
drwxr-xr-x 2 root root 172 May 17 11:34 .
drwxr-xr-x. 9 root root 113 May 16 18:42 ..
-rw-r--r-- 1 root root 48G May 17 10:37 centos-home.metadump
-rw-r--r-- 1 root root 444M May 17 11:35 centos-opt.metadump
-rw-r--r-- 1 root root 296M May 17 11:32 centos-root.metadump
-rw-r--r-- 1 root root 0 May 17 07:31 home-backup.log
-rw-r--r-- 1 root root 0 May 17 11:34 opt-backup.log
-rw-r--r-- 1 root root 0 May 17 11:24 root-backup.log
-rw-r--r-- 1 root root 932G May 16 22:35 ssd.img
Restore the metadata from the metadump into a new image file:
[rheo@gen hd-8T]$ sudo xfs_mdrestore /mnt/hd-4T-part2/centos-home.metadump /mnt/hd-8T/centos-home.img
After restoring the img from the metadump and mounting it via a loop device, the names come out garbled.
[rheo@gen hd-8T]$ sudo losetup --find --show ./centos-home.img
/dev/loop1
[rheo@gen hd-8T]$ sudo mount /dev/loop1 /mnt/lv_home/
[rheo@gen hd-8T]$ ls /mnt/lv_home/ -lah
total 68K
drwxr-xr-x 18 root root 276 Nov 7 2022 .
drwxr-xr-x. 9 root root 113 May 16 18:42 ..
drwxr-x--- 7 1015 rheo 214 Feb 15 21:06 amin
drwx------ 6 1004 1005 140 Jun 8 2019 Gp?oqCG
drwxr-xr-x 27 1003 rheo 4.0K May 6 10:34 n?han?
drwxr-xr-x 55 rheo rheo 8.0K May 10 02:15 rdu
drwxr-x--- 23 1012 rheo 4.0K May 9 22:55 R?ual1
drwxr-xr-x 10 1011 rheo 4.0K Jun 9 2022 sxu
drwxr-xr-x 10 1005 rheo 4.0K Nov 25 2021 ?wan`
drwxr-xr-x 6 1006 rheo 156 Jun 12 2019 ?wan`
drwxr-xr-x 18 1002 rheo 4.0K Sep 14 2022 xye
drwx------ 8 1014 rheo 232 Sep 2 2022 ?yan`
drwx------ 16 1010 rheo 4.0K Apr 6 20:44 yliu
drwxr-xr-x 35 1007 rheo 8.0K May 10 02:12 zdu
drwx------ 7 1013 rheo 216 Sep 2 2022 ?zen`
drwxr-xr-x 13 1001 rheo 278 Apr 19 2022 ?zhor
drwxr-xr-x 30 1008 rheo 8.0K Mar 30 15:24 zliu
drwxr-x--- 28 1009 rheo 4.0K May 9 14:42 z?uao?
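The question marks here are very likely not corruption: by default xfs_metadump obfuscates most file and directory names in the dump (note how each mangled name has the same length as a real user name above). To keep the real names, take the dump with obfuscation disabled; a sketch:
# -o: do not obfuscate names; the dump still carries metadata only, no file contents
sudo xfs_metadump -o /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump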
Mounting the other restored image showed the same problem:
[rheo@gen hd-4T-part2]$ sudo mount /dev/loop2 /mnt/lv_opt/
[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
3^OogiC/ nGpzR_KspN6ry7jG8CU UryNGb0UROC0B^O^SCN@/ [P^A<
4bonN2aujzfJamy^KR^W^S"/ N^Ammre/ xYjaldF^DVf^O]/
h9c^Ow0^RD/ rh/ XYpe^A�Qw^D/
Hwp^Ou4cF/ ^Astet/ zoom/
Unmounted the loop device and ran xfs_repair on it; the problem remained afterwards:
[rheo@gen hd-4T-part2]$ sudo xfs_repair /dev/loop0
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 12:42:57: scanning filesystem freespace - 16 of 16 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 12:42:57: scanning agi unlinked lists - 16 of 16 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 15
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 12:42:58: process known inodes and inode discovery - 711488 of 711488 inodes done
- process newly discovered inodes...
- 12:42:58: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 12:42:58: setting up duplicate extent list - 16 of 16 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 12
- agno = 3
- agno = 5
- agno = 7
- agno = 6
- agno = 8
- agno = 9
- agno = 10
- agno = 4
- agno = 11
- agno = 15
- agno = 13
- agno = 14
- 12:42:59: check for inodes claiming duplicate blocks - 711488 of 711488 inodes done
Phase 5 - rebuild AG headers and trees...
- 12:42:59: rebuild AG headers and trees - 16 of 16 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 12:42:59: verify and correct link counts - 16 of 16 allocation groups done
done
[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
3?ogiC h9c?w0?D nGpzR_KspN6ry7jG8CU?[P?< rh UryNGb0UROC0B??CN@ XYpe??Qw?
4bonN2aujzfJamy?R??" Hwp?u4cF N?mmre ?stet xYjaldF?Vf?] zoom
Unmount and try to remount:
[rheo@gen hd-4T-part2]$ sudo mount /dev/loop0 /mnt/lv_opt/
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
Mounting the original LV directly, however, showed no problem:
[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
ansys_inc code_saturne lammps recoverX SALOME-9.3.0-CO7-SRC WindowsImageBackup
aster google recoverB rh SALOME-9.3.0-CO7-SRC.tgz zoom
The next morning, 2023-05-20, work on the system resumed:
[rheo@gen ~]$ sudo pvscan
[sudo] password for rheo:
PV /dev/sda3 VG vg01 lvm2 [25.90 GiB / 0 free]
PV /dev/sdb3 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sde1 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sdf1 VG centos lvm2 [<1.82 TiB / 0 free]
PV /dev/sdc1 VG centos lvm2 [<931.51 GiB / 0 free]
Total: 5 [6.39 TiB] / in use: 5 [6.39 TiB] / in no VG: 0 [0 ]
[rheo@gen ~]$ sudo vgscan
Reading volume groups from cache.
Found volume group "vg01" using metadata type lvm2
Found volume group "centos" using metadata type lvm2
[rheo@gen ~]$ sudo lvscan
ACTIVE '/dev/vg01/root' [19.97 GiB] inherit
ACTIVE '/dev/vg01/home' [3.99 GiB] inherit
ACTIVE '/dev/vg01/swap' [<1.94 GiB] inherit
ACTIVE '/dev/centos/root' [<422.61 GiB] inherit
ACTIVE '/dev/centos/home' [<5.57 TiB] inherit
ACTIVE '/dev/centos/swap' [15.50 GiB] inherit
ACTIVE '/dev/centos/opt' [<379.46 GiB] inherit
Try repairing the filesystems. For now this is only a dry run; no actual repairs are made:
[rheo@gen ~]$ sudo xfs_repair -n /dev/centos/root
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
- scan filesystem freespace and inode maps...
agi unlinked bucket 10 is 647882 in ag 3 (inode=101311178)
sb_ifree 1631, counted 1654
sb_fdblocks 14323967, counted 14336708
- 08:11:59: scanning filesystem freespace - 34 of 34 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 08:11:59: scanning agi unlinked lists - 34 of 34 allocation groups done
- process known inodes and perform inode discovery...
- agno = 15
- agno = 30
- agno = 0
- agno = 31
- agno = 16
- agno = 17
- agno = 32
- agno = 18
- agno = 33
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 08:12:11: process known inodes and inode discovery - 463168 of 463168 inodes done
- process newly discovered inodes...
- 08:12:11: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:12:11: setting up duplicate extent list - 34 of 34 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 3
- agno = 1
- agno = 9
- agno = 5
- agno = 8
- agno = 11
- agno = 12
- agno = 20
- agno = 17
- agno = 22
- agno = 27
- agno = 6
- agno = 15
- agno = 18
- agno = 14
- agno = 16
- agno = 19
- agno = 4
- agno = 10
- agno = 21
- agno = 23
- agno = 25
- agno = 13
- agno = 24
- agno = 26
- agno = 7
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- agno = 33
- agno = 32
- 08:12:11: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
disconnected inode 101311178, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 101311178 nlinks from 0 to 1
- 08:12:24: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
[rheo@gen ~]$ sudo xfs_repair -n /dev/centos/opt
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
- scan filesystem freespace and inode maps...
sb_icount 711488, counted 711296
sb_ifree 476, counted 539
sb_fdblocks 41964358, counted 43819941
- 08:14:28: scanning filesystem freespace - 16 of 16 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 08:14:28: scanning agi unlinked lists - 16 of 16 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 15
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
imap claims a free inode 884571138 is in use, would correct imap and clear inode
- agno = 14
- 08:15:05: process known inodes and inode discovery - 711296 of 711488 inodes done
- process newly discovered inodes...
- 08:15:05: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:15:05: setting up duplicate extent list - 16 of 16 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 12
- agno = 15
- agno = 4
- agno = 5
- agno = 8
- agno = 10
- agno = 1
- agno = 11
- agno = 13
- agno = 9
- agno = 3
- agno = 7
- agno = 14
- agno = 6
entry "f534542424.gz" at block 2 offset 4040 in directory inode 884553803 references free inode 884571138
would clear inode number in entry at offset 4040...
- 08:15:06: check for inodes claiming duplicate blocks - 711296 of 711488 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
entry "f534542424.gz" in directory inode 884553803 points to free inode 884571138, would junk entry
bad hash table for directory inode 884553803 (no data entry): would rebuild
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
- 08:15:29: verify and correct link counts - 16 of 16 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
An actual repair attempt failed:
[rheo@gen ~]$ sudo xfs_repair /dev/centos/opt
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
Mount home and opt (to replay their logs, as the error suggests), then unmount them:
[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ sudo umount /mnt/lv_home
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
[rheo@gen ~]$ sudo umount /mnt/lv_opt
umount: /mnt/lv_opt: not mounted
Try repairing the opt filesystem again:
[rheo@gen ~]$ sudo xfs_repair /dev/centos/opt
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 08:17:24: scanning filesystem freespace - 16 of 16 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 08:17:24: scanning agi unlinked lists - 16 of 16 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 15
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
imap claims a free inode 884571138 is in use, correcting imap and clearing inode
cleared inode 884571138
- agno = 14
- 08:18:02: process known inodes and inode discovery - 711296 of 711296 inodes done
- process newly discovered inodes...
- 08:18:02: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:18:02: setting up duplicate extent list - 16 of 16 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 1
- agno = 0
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 8
- agno = 7
- agno = 10
- agno = 9
- agno = 14
- agno = 12
- agno = 11
- agno = 13
- agno = 15
entry "f534542424.gz" at block 2 offset 4040 in directory inode 884553803 references free inode 884571138
clearing inode number in entry at offset 4040...
- 08:18:02: check for inodes claiming duplicate blocks - 711296 of 711296 inodes done
Phase 5 - rebuild AG headers and trees...
- 08:18:02: rebuild AG headers and trees - 16 of 16 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
bad hash table for directory inode 884553803 (no data entry): rebuilding
rebuilding directory inode 884553803
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 08:18:26: verify and correct link counts - 16 of 16 allocation groups done
done
Try repairing the root filesystem:
[rheo@gen ~]$ sudo xfs_repair /dev/centos/root
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
No luck!!!
With ordinary repair out of reach, the last resort was the -L option; root had already been backed up with ddrescue beforehand:
[rheo@gen hd-4T-part2]$ sudo xfs_repair -L /dev/centos/root
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
agi unlinked bucket 10 is 647882 in ag 3 (inode=101311178)
sb_ifree 1631, counted 1654
sb_fdblocks 14323967, counted 14336708
- 07:03:56: scanning filesystem freespace - 34 of 34 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 07:03:56: scanning agi unlinked lists - 34 of 34 allocation groups done
- process known inodes and perform inode discovery...
- agno = 15
- agno = 0
- agno = 30
- agno = 16
- agno = 31
- agno = 32
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 33
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 07:04:08: process known inodes and inode discovery - 463168 of 463168 inodes done
- process newly discovered inodes...
- 07:04:08: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 07:04:08: setting up duplicate extent list - 34 of 34 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 3
- agno = 2
- agno = 1
- agno = 11
- agno = 16
- agno = 19
- agno = 24
- agno = 27
- agno = 30
- agno = 10
- agno = 33
- agno = 5
- agno = 13
- agno = 4
- agno = 14
- agno = 18
- agno = 6
- agno = 20
- agno = 22
- agno = 21
- agno = 15
- agno = 23
- agno = 7
- agno = 25
- agno = 17
- agno = 26
- agno = 28
- agno = 8
- agno = 29
- agno = 9
- agno = 31
- agno = 32
- agno = 12
- 07:04:08: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
Phase 5 - rebuild AG headers and trees...
- 07:04:08: rebuild AG headers and trees - 34 of 34 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
disconnected inode 101311178, moving to lost+found
Phase 7 - verify and correct link counts...
- 07:04:21: verify and correct link counts - 34 of 34 allocation groups done
Maximum metadata LSN (1109:24056) is ahead of log (1:8).
Format log to cycle 1112.
done
Dry-run again with -n:
[rheo@gen hd-4T-part2]$ sudo xfs_repair -n /dev/centos/root
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- 07:05:22: scanning filesystem freespace - 34 of 34 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 07:05:22: scanning agi unlinked lists - 34 of 34 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 15
- agno = 30
- agno = 31
- agno = 16
- agno = 32
- agno = 17
- agno = 18
- agno = 33
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- 07:05:34: process known inodes and inode discovery - 463168 of 463168 inodes done
- process newly discovered inodes...
- 07:05:34: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 07:05:34: setting up duplicate extent list - 34 of 34 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 4
- agno = 2
- agno = 11
- agno = 18
- agno = 6
- agno = 30
- agno = 9
- agno = 33
- agno = 10
- agno = 13
- agno = 12
- agno = 3
- agno = 8
- agno = 16
- agno = 19
- agno = 5
- agno = 15
- agno = 14
- agno = 21
- agno = 20
- agno = 23
- agno = 22
- agno = 24
- agno = 25
- agno = 17
- agno = 26
- agno = 28
- agno = 27
- agno = 29
- agno = 31
- agno = 7
- agno = 1
- agno = 32
- 07:05:35: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
- 07:05:47: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
No errors reported now; try mounting:
[rheo@gen hd-4T-part2]$ sudo mount /dev/centos/root /mnt/lv_root/
[rheo@gen hd-4T-part2]$ ls /mnt/lv_root/
bin dev home lib64 media opt public rheoData run share sys usr
boot etc lib lost+found mnt proc recoverA root sbin srv tmp var
The directory tree is intact. Reboot.
After the reboot, the mail below turned up. It looks like problems had been brewing for quite a while; perhaps some data was already unreadable back then?
[rheo@gen hd-4T-part2]$ mail
U145 [email protected] Tue Apr 25 19:25 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U148 [email protected] Tue May 2 21:01 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
U154 [email protected] Tue May 2 21:03 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U156 [email protected] Wed May 10 02:44 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
U159 [email protected] Wed May 10 02:46 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U160 [email protected] Wed May 10 02:46 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U161 [email protected] Wed May 10 02:47 2706/202886 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U162 [email protected] Wed May 10 02:47 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U163 [email protected] Wed May 10 02:54 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
U164 [email protected] Wed May 10 02:54 5537/538747 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
U166 [email protected] Wed May 10 02:55 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U167 [email protected] Wed May 10 02:55 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U168 [email protected] Wed May 10 02:56 2706/202886 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U169 [email protected] Wed May 10 02:56 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U170 root Thu May 11 02:53 19/635 "Health"
U171 [email protected] Fri May 19 21:32 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
U174 [email protected] Fri May 19 21:34 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
U176 [email protected] Fri May 19 22:46 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
>N182 [email protected] Fri May 19 22:52 2587/189500 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
N183 [email protected] Fri May 19 23:03 2440/176390 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
N184 [email protected] Fri May 19 23:03 5536/538737 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
N186 [email protected] Fri May 19 23:05 2590/190979 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
N187 [email protected] Fri May 19 23:06 2500/183551 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
N188 [email protected] Fri May 19 23:07 2705/202876 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
N189 [email protected] Fri May 19 23:08 2587/189500 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
N190 root Fri May 19 23:33 18/618 "FailedOpenDevice"
N191 [email protected] Sun May 21 07:13 2440/176390 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
N192 [email protected] Sun May 21 07:13 5536/538737 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
Check for other errors:
[root@gen ~]# dmesg | egrep rror
[ 1.531000] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 12.182562] ACPI Error: No handler for Region [SYSI] (ffff9bfbf4ac3bd0) [IPMI] (20130517/evregion-162)
[ 12.182572] ACPI Error: Region IPMI (ID=7) has no handler (20130517/exfldio-305)
[ 12.182579] ACPI Error: Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff9c0374d98618), AE_NOT_EXIST (20130517/psparse-536)
[ 12.182595] ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff9c0374d98578), AE_NOT_EXIST (20130517/psparse-536)
Taking this for an ACPI error, the handler module was loaded per the article below:
# help: https://www.suse.com/ja-jp/support/kb/doc/?id=000017865
[root@nd2 home]# modprobe acpi_ipmi
Booted back into the original on-disk system with the new disks installed, and listed the block devices again:
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
└─sda1 8:1 0 1.8T 0 part
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
├─centos-home 253:5 0 5.6T 0 lvm
└─centos-opt 253:6 0 379.5G 0 lvm /opt
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ ├─centos-root 253:0 0 422.6G 0 lvm /
│ ├─centos-home 253:5 0 5.6T 0 lvm
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdg 8:96 0 465.8G 0 disk
├─sdg1 8:97 0 192M 0 part
├─sdg2 8:98 0 1G 0 part
└─sdg3 8:99 0 26G 0 part
├─vg01-root 253:2 0 20G 0 lvm
├─vg01-home 253:3 0 4G 0 lvm
└─vg01-swap 253:4 0 2G 0 lvm
sdh 8:112 0 7.3T 0 disk
Create a new physical volume and extend the volume group onto it:
[root@gen ~]# pvcreate /dev/sdh
Physical volume "/dev/sdh" successfully created.
[root@gen ~]# vg
vgcfgbackup vgck vgdb vgextend vgmerge vgremove vgscan
vgcfgrestore vgconvert vgdisplay vgimport vgmknodes vgrename vgsplit
vgchange vgcreate vgexport vgimportclone vgreduce vgs
[root@gen ~]# vgextend centos /dev/sdh
Volume group "centos" successfully extended
[root@gen ~]# vgs
VG #PV #LV #SN Attr VSize VFree
centos 5 4 0 wz--n- 13.64t <7.28t
vg01 1 3 0 wz--n- 25.90g 0
Move /home off the disk that holds the ddrescue copy:
[root@gen ~]# pvmove -n home /dev/sdc1
/dev/sdc1: Moved: 0.01%
/dev/sdc1: Moved: 0.89%
/dev/sdc1: Moved: 1.79%
...
/dev/sdc1: Moved: 100.00%
Checking from another terminal:
[root@gen ~]# vgs
VG #PV #LV #SN Attr VSize VFree
centos 5 4 0 wz--n- 13.64t 7.00t
vg01 1 3 0 wz--n- 25.90g 0
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ ├─centos-root 253:0 0 422.6G 0 lvm /
│ ├─centos-opt 253:6 0 379.5G 0 lvm /opt
│ └─centos-pvmove0 253:7 0 279.5G 0 lvm
│ └─centos-home 253:5 0 5.6T 0 lvm
└─sdc2 8:34 0 1.7T 0 part
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
...
sdh 8:112 0 7.3T 0 disk
└─centos-pvmove0 253:7 0 279.5G 0 lvm
└─centos-home 253:5 0 5.6T 0 lvm
As shown, pvmove creates a temporary centos-pvmove0 volume to hold the data while it is in transit.
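Progress can also be watched from another terminal through lvs; a sketch:
# Cpy%Sync (copy_percent) reports how much of the move has completed
sudo lvs -a -o name,copy_percent,devices centos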
After it finishes:
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ ├─centos-root 253:0 0 422.6G 0 lvm /
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
...
sdh 8:112 0 7.3T 0 disk
└─centos-home 253:5 0 5.6T 0 lvm
Continue by moving the part of home that shared a disk with root:
[root@gen ~]# pvmove -n home /dev/sdb3 /dev/sdh
Check:
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
├─centos-opt 253:6 0 379.5G 0 lvm /opt
└─centos-pvmove0 253:7 0 1.7T 0 lvm
└─centos-home 253:5 0 5.6T 0 lvm
...
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
...
sdh 8:112 0 7.3T 0 disk
├─centos-home 253:5 0 5.6T 0 lvm
└─centos-pvmove0 253:7 0 1.7T 0 lvm
└─centos-home 253:5 0 5.6T 0 lvm
While this was still running:
[root@gen ~]# lvmdiskscan
/dev/sda1 [ <1.82 TiB]
/dev/sdb1 [ 200.00 MiB]
/dev/sdb2 [ 1.00 GiB]
/dev/sdb3 [ <1.82 TiB] LVM physical volume
/dev/sdc1 [ 2.00 TiB] LVM physical volume
/dev/sdc2 [ <1.64 TiB]
/dev/sdd1 [ <7.28 TiB]
/dev/sde1 [ <1.82 TiB] LVM physical volume
/dev/sdf1 [ <1.82 TiB] LVM physical volume
/dev/sdg1 [ <192.00 MiB]
/dev/sdg2 [ 1.03 GiB]
/dev/sdg3 [ <25.94 GiB] LVM physical volume
/dev/sdh [ <7.28 TiB] LVM physical volume
0 disks
7 partitions
1 LVM physical volume whole disk
5 LVM physical volumes
With /home moved off /dev/sdb3:
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
└─centos-opt 253:6 0 379.5G 0 lvm /opt
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ ├─centos-root 253:0 0 422.6G 0 lvm /
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─centos-home 253:5 0 5.6T 0 lvm
...
sdh 8:112 0 7.3T 0 disk
└─centos-home 253:5 0 5.6T 0 lvm
Split the data off into its own volume group, dusg:
[root@gen ~]# vgsplit centos dusg /dev/sde1 /dev/sdf1 /dev/sdh
New volume group "dusg" successfully split from "centos"
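After the split, the home LV lives at /dev/dusg/home rather than /dev/centos/home, so the new VG must be activated and any /etc/fstab entry updated accordingly. A sketch; the fstab line and mount options are assumptions:
sudo vgchange -ay dusg
# /etc/fstab should now reference the new path, e.g.:
# /dev/mapper/dusg-home  /home  xfs  defaults  0 0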
Move the root LV back onto /dev/sdb3:
[root@gen ~]# pvmove -n root /dev/sdc1 /dev/sdb3 &
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
├─centos-pvmove0 253:5 0 372.6G 0 lvm
│ └─centos-root 253:0 0 422.6G 0 lvm /
└─centos-opt 253:6 0 379.5G 0 lvm /opt
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ ├─centos-pvmove0 253:5 0 372.6G 0 lvm
│ │ └─centos-root 253:0 0 422.6G 0 lvm /
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
...
Continue by moving opt:
[root@gen ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
├─centos-pvmove0 253:5 0 279.5G 0 lvm
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─centos-opt 253:6 0 379.5G 0 lvm /opt
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
│ └─centos-pvmove0 253:5 0 279.5G 0 lvm
│ └─centos-opt 253:6 0 379.5G 0 lvm /opt
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
...
After completion:
[rheo@gen ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
└─sda1 8:1 0 1.8T 0 part /rheoData
sdb 8:16 0 1.8T 0 disk
├─sdb1 8:17 0 200M 0 part /boot/efi
├─sdb2 8:18 0 1G 0 part /boot
└─sdb3 8:19 0 1.8T 0 part
├─centos-root 253:0 0 422.6G 0 lvm /
├─centos-swap 253:1 0 15.5G 0 lvm [SWAP]
└─centos-opt 253:6 0 379.5G 0 lvm /opt
sdc 8:32 0 3.7T 0 disk
├─sdc1 8:33 0 2T 0 part
└─sdc2 8:34 0 1.7T 0 part /mnt/hd-4T-part2
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part /mnt/hd-8T
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
└─dusg-home 253:7 0 5.6T 0 lvm /home
sdf 8:80 0 1.8T 0 disk
└─sdf1 8:81 0 1.8T 0 part
└─dusg-home 253:7 0 5.6T 0 lvm /home
sdg 8:96 0 465.8G 0 disk
├─sdg1 8:97 0 192M 0 part
├─sdg2 8:98 0 1G 0 part
└─sdg3 8:99 0 26G 0 part
├─vg01-root 253:2 0 20G 0 lvm
├─vg01-home 253:3 0 4G 0 lvm
└─vg01-swap 253:4 0 2G 0 lvm
sdh 8:112 0 7.3T 0 disk
└─dusg-home 253:7 0 5.6T 0 lvm /home
Remove /dev/sdc1 from the centos volume group:
[root@gen ~]# vgreduce centos /dev/sdc1
Removed "/dev/sdc1" from volume group "centos"
[root@gen rheo]# pvscan
PV /dev/sdg3 VG vg01 lvm2 [25.90 GiB / 0 free]
PV /dev/sde1 VG dusg lvm2 [<1.82 TiB / 0 free]
PV /dev/sdf1 VG dusg lvm2 [<1.82 TiB / 0 free]
PV /dev/sdh VG dusg lvm2 [<7.28 TiB / <5.35 TiB free]
PV /dev/sdb3 VG centos lvm2 [<1.82 TiB / <1.02 TiB free]
PV /dev/sdc1 lvm2 [2.00 TiB]
Total: 6 [<14.76 TiB] / in use: 5 [<12.76 TiB] / in no VG: 1 [2.00 TiB]
Remove the /dev/sdc1 physical volume:
[root@gen rheo]# pvremove /dev/sdc1
Labels on physical volume "/dev/sdc1" successfully wiped.
[root@gen rheo]# pvscan
PV /dev/sdg3 VG vg01 lvm2 [25.90 GiB / 0 free]
PV /dev/sde1 VG dusg lvm2 [<1.82 TiB / 0 free]
PV /dev/sdf1 VG dusg lvm2 [<1.82 TiB / 0 free]
PV /dev/sdh VG dusg lvm2 [<7.28 TiB / <5.35 TiB free]
PV /dev/sdb3 VG centos lvm2 [<1.82 TiB / <1.02 TiB free]
Total: 5 [<12.76 TiB] / in use: 5 [<12.76 TiB] / in no VG: 0 [0 ]
At this point the main work is done: the data volume group now sits on its own physical volumes, cleanly separated from the system volumes.
Summary
- Keep the system and the data on different physical volumes
- Keep backups, of the system as well as the data
- The failure was most likely the SSD's fault, although other SSDs have served here for a long time without such problems. To be safe, avoid SSDs for critical systems or data, or at the very least keep a backup
Backup tools
- ddrescue copies as much of the source as possible, is reasonably fast, and retries failed reads to recover as much data as it can. It was the key tool in this recovery; a dd-style tool was needed in the first place because the disk itself had failed, not the XFS filesystem. dd is slower, and on a read error it does not retry but substitutes zeros instead
- xfs_metadump backs up only the metadata, essentially the file names without their contents; after xfs_mdrestore you can see the file and directory names but none of the data
Recovery steps
- Identify which disk is failing
- Boot from another system, image the failing disk to a file with ddrescue, then write that image onto a new disk with dd (see the sketch after this list)
- Edit /etc/lvm/lvm.conf to filter out the old disk, because the old and new disks now carry identical UUIDs
- Back up the data
- Repair the filesystem with xfs_repair: dry-run first with xfs_repair -n, then run xfs_repair for real; if that fails, the last resort is xfs_repair -L, which may corrupt data
- Once repaired, verify that the filesystem mounts and the data reads back; if so, remove the old disk so that duplicate UUIDs do not resurface after a reboot
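A condensed sketch of that imaging step (device names and paths are placeholders):
# image the failing disk; -d direct I/O, -r3 retry bad areas; the mapfile makes the run resumable
sudo ddrescue -d -r3 /dev/sdb1 /backup/disk.img /backup/disk.map
# write the image onto the replacement disk
sudo dd if=/backup/disk.img of=/dev/sdX1 bs=4M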
Other issues
Also ran into a case where a mismatch between LVM versions left the system unable to boot.