System Failure and Recovery Log

The failure:

  • OS: CentOS 7
  • First, SMART errors were reported; one disk could not be read or written
  • Then, files under the user home directories became inaccessible
  • Finally, after a reboot, the system would no longer boot
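
For reference, a disk's SMART status can be confirmed with smartctl from smartmontools; a minimal sketch, with the device name as a placeholder:

sudo smartctl -H /dev/sdX    # overall health self-assessment
sudo smartctl -a /dev/sdX    # full SMART attributes and error log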

Recovery

Preparation

  1. On another machine, install CentOS 7 onto an SSD in a USB enclosure. When creating the volume group, give it a name different from the one on the original machine; otherwise the volume group will have to be renamed after boot (see the sketch after this list)
  2. Plug the USB SSD into a USB port on the failed machine
  3. Reboot, select USB boot, and enter the CentOS system
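
If the names do clash, the duplicate volume group can still be renamed by its UUID, since the name alone is ambiguous. A sketch (the UUID and new name below are placeholders; take the real UUID from the vgs output):

sudo vgs -o vg_name,vg_uuid                        # find the UUID of the duplicate VG
sudo vgrename AbCdEf-XXXX-XXXX-XXXX vg01rescue     # rename by UUID; name is hypothetical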

Inspecting the system

After booting into the rescue system, list the block devices and volumes

[rheo@gen ~]$ sudo lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0 465.8G  0 disk 
├─sda1            8:1    0   192M  0 part /boot/efi
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0    26G  0 part 
  ├─vg01-root   253:0    0    20G  0 lvm  /
  ├─vg01-swap   253:1    0     2G  0 lvm  [SWAP]
  └─vg01-home   253:6    0     4G  0 lvm  /home
sdb               8:16   0 931.5G  0 disk 
└─sdb1            8:17   0 931.5G  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdc               8:32   0   1.8T  0 disk 
├─sdc1            8:33   0   200M  0 part 
├─sdc2            8:34   0     1G  0 part 
└─sdc3            8:35   0   1.8T  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  
  ├─centos-swap 253:4    0  15.5G  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdd               8:48   0   1.8T  0 disk 
└─sdd1            8:49   0   1.8T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  

The layout is as follows:

  • /dev/sda is the USB SSD holding the current (rescue) system
  • /dev/sdb is a Samsung 870 EVO 1TB SSD
  • /dev/sdc is the disk holding the original system's root
  • /dev/sdd is a standalone disk with an ext4 partition
  • /dev/sde and /dev/sdf carry part of the /home volume in the centos volume group

Try mounting the old system

[rheo@gen ~]$ sudo mount /dev/centos/root /mnt/lv_root/
mount: /dev/mapper/centos-root: can't read superblock
[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ sudo mkdir /mnt/lv_opt
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
mount: /dev/mapper/centos-opt: can't read superblock
[rheo@gen ~]$ df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  32G     0   32G   0% /dev
tmpfs                     32G  4.0K   32G   1% /dev/shm
tmpfs                     32G   20M   32G   1% /run
tmpfs                     32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg01-root     20G  4.1G   16G  21% /
/dev/sda2                1.1G  173M  874M  17% /boot
/dev/sda1                192M   12M  181M   6% /boot/efi
/dev/mapper/vg01-home    4.0G  145M  3.9G   4% /home
tmpfs                    6.3G   40K  6.3G   1% /run/user/1000
/dev/mapper/centos-home  5.6T  5.0T  611G  90% /mnt/lv_home

As shown, the old root and opt cannot be mounted, while home still can.

Attach a 4 TB disk to the machine

[rheo@gen ~]$ sudo lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0 465.8G  0 disk 
├─sda1            8:1    0   192M  0 part /boot/efi
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0    26G  0 part 
  ├─vg01-root   253:0    0    20G  0 lvm  /
  ├─vg01-swap   253:1    0     2G  0 lvm  [SWAP]
  └─vg01-home   253:6    0     4G  0 lvm  /home
sdc               8:32   0   1.8T  0 disk 
├─sdc1            8:33   0   200M  0 part 
├─sdc2            8:34   0     1G  0 part 
└─sdc3            8:35   0   1.8T  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
  ├─centos-swap 253:4    0  15.5G  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdd               8:48   0   1.8T  0 disk 
└─sdd1            8:49   0   1.8T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
sdg               8:96   0   3.7T  0 disk 

Partition /dev/sdg

[rheo@gen ~]$ sudo fdisk /dev/sdg
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x23a98ff1.

WARNING: The size of this disk is 4.0 TB (4000787030016 bytes).
DOS partition table format can not be used on drives for volumes
larger than (2199023255040 bytes) for 512-byte sectors. Use parted(1) and GUID 
partition table format (GPT).

The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.

Command (m for help): p

Disk /dev/sdg: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x23a98ff1

   Device Boot      Start         End      Blocks   Id  System

Command (m for help): G
Building a new GPT disklabel (GUID: 9CADE58B-0615-4431-8512-3A95FF6A9A72)

Command (m for help): n
Partition number (1-128, default 1): 
First sector (2048-7814037134, default 2048): 
Last sector, +sectors or +size{K,M,G,T,P} (2048-7814037134, default 7814037134): +2T
Created partition 1

Command (m for help): n
Partition number (2-128, default 2): 
First sector (4294969344-7814037134, default 4294969344): 
Last sector, +sectors or +size{K,M,G,T,P} (4294969344-7814037134, default 7814037134): 
Created partition 2

Command (m for help): p

Disk /dev/sdg: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: 9CADE58B-0615-4431-8512-3A95FF6A9A72

#         Start          End    Size  Type            Name
 1         2048   4294969343      2T  Linux filesyste 
 2   4294969344   7814037134    1.7T  Linux filesyste 

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
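
As the fdisk warning notes, parted is the usual tool for GPT on disks this size; the same two-partition layout could also have been created with something like this sketch:

sudo parted -s /dev/sdg mklabel gpt
sudo parted -s /dev/sdg mkpart primary 1MiB 2TiB    # first partition, 2 TiB
sudo parted -s /dev/sdg mkpart primary 2TiB 100%    # second partition, the rest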

List the devices

[rheo@gen ~]$ sudo lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0 465.8G  0 disk 
├─sda1            8:1    0   192M  0 part /boot/efi
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0    26G  0 part 
  ├─vg01-root   253:0    0    20G  0 lvm  /
  ├─vg01-swap   253:1    0     2G  0 lvm  [SWAP]
  └─vg01-home   253:6    0     4G  0 lvm  /home
sdc               8:32   0   1.8T  0 disk 
├─sdc1            8:33   0   200M  0 part 
├─sdc2            8:34   0     1G  0 part 
└─sdc3            8:35   0   1.8T  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
  ├─centos-swap 253:4    0  15.5G  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdd               8:48   0   1.8T  0 disk 
└─sdd1            8:49   0   1.8T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  /mnt/lv_home
sdg               8:96   0   3.7T  0 disk 
├─sdg1            8:97   0     2T  0 part 
└─sdg2            8:98   0   1.7T  0 part 

Unmount the old /home filesystem

[rheo@gen ~]$ sudo umount /dev/centos/home

After a reboot

[rheo@gen ~]$ sudo lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0 465.8G  0 disk 
├─sda1            8:1    0   192M  0 part /boot/efi
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0    26G  0 part 
  ├─vg01-root   253:0    0    20G  0 lvm  /
  ├─vg01-swap   253:1    0     2G  0 lvm  [SWAP]
  └─vg01-home   253:6    0     4G  0 lvm  /home
sdb               8:16   0 931.5G  0 disk 
└─sdb1            8:17   0 931.5G  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdc               8:32   0   1.8T  0 disk 
├─sdc1            8:33   0   200M  0 part 
├─sdc2            8:34   0     1G  0 part 
└─sdc3            8:35   0   1.8T  0 part 
  ├─centos-root 253:2    0 422.6G  0 lvm  
  ├─centos-home 253:3    0   5.6T  0 lvm  
  ├─centos-swap 253:4    0  15.5G  0 lvm  
  └─centos-opt  253:5    0 379.5G  0 lvm  
sdd               8:48   0   1.8T  0 disk 
└─sdd1            8:49   0   1.8T  0 part 
sde               8:64   0   3.7T  0 disk 
├─sde1            8:65   0     2T  0 part 
└─sde2            8:66   0   1.7T  0 part 
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  
sdg               8:96   0   1.8T  0 disk 
└─sdg1            8:97   0   1.8T  0 part 
  └─centos-home 253:3    0   5.6T  0 lvm  

/dev/sdg has become /dev/sde; create an XFS filesystem

[rheo@gen ~]$ sudo mkfs.xfs /dev/sde2
meta-data=/dev/sde2              isize=512    agcount=4, agsize=109970869 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=439883473, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=214786, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Create a mount point and mount /dev/sde2

[rheo@gen ~]$ sudo mkdir /mnt/sde2
[rheo@gen ~]$ sudo mount /dev/sde2 /mnt/sde2/

Back up the (damaged) source disk with dd

[rheo@gen ~]$ sudo dd if=/dev/sdh1 of=/mnt/sde2/sdc.img conv=noerror
[sudo] password for rheo: 
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 5.48293 s, 16.2 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 9.45846 s, 9.4 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 13.4746 s, 6.6 MB/s
dd: error reading ‘/dev/sdh1’: Input/output error
173440+0 records in
173440+0 records out
88801280 bytes (89 MB) copied, 17.4715 s, 5.1 MB/s

Some of the data could not be read, but the backup was left to run to completion, which took about 8 hours. The recovered image was then written onto a new 4 TB SSD (/dev/sdh); that restore step was not captured, but a sketch follows. The new SSD was then itself imaged again as a further backup:
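
A minimal sketch (device names shifted between reboots, so treat them as placeholders; note that conv=noerror alone skips unreadable blocks so later data shifts in the output, while conv=noerror,sync zero-pads failed reads to keep offsets aligned):

# safer imaging variant: zero-pad failed reads so offsets stay aligned
sudo dd if=/dev/sdX1 of=/mnt/sde2/sdc.img bs=64K conv=noerror,sync
# write the image back onto the new SSD (hypothetical device names)
sudo dd if=/mnt/sde2/sdc.img of=/dev/sdY1 bs=64K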

[rheo@gen ~]$ sudo dd if=/dev/sdh1 of=/run/media/rheo/cce2b7fa-4187-4c2a-9390-5634b47988b0/sdc1.img conv=noerror
1953523087+0 records in
1953523087+0 records out
1000203820544 bytes (1.0 TB) copied, 7827.84 s, 128 MB/s

That went fast. Figuring part of the data might be readable, try backing up the metadata

[rheo@gen ~]$ sudo xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump
Metadata CRC error detected at xfs_agf block 0x2b1201808/0x1000
xfs_metadump: cannot init perag data (-74). Continuing anyway.
... 
xfs_metadump: cannot read superblock for ag 5
Metadata CRC error detected at xfs_agf block 0x7d00008/0x1000
Metadata CRC error detected at xfs_agi block 0x7d00010/0x1000
Metadata CRC error detected at xfs_agfl block 0x7d00018/0x1000
/sbin/xfs_metadump: line 33: 12441 Segmentation fault      (core dumped) xfs_db$DBOPTS -i -p xfs_metadump -c "metadump$OPTS $2" $1

The metadata dump would not complete, presumably because the data is incomplete. Reading the dd-restored /dev/sdh1 with ddrescue turned up no error information at all, because dd had already replaced the unreadable areas with zeros.

[rheo@gen hd-8T]$ sudo ddrescue -f -n /dev/centos/root /mnt/hd-8T/centos-root.rescue.img root-rescue.log
GNU ddrescue 1.27
Press Ctrl-C to interrupt
     ipos:  453769 MB, non-trimmed:        0 B,  current rate:    112 MB/s
     opos:  453769 MB, non-scraped:        0 B,  average rate:    176 MB/s
non-tried:        0 B,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:  453769 MB,   bad areas:        0,        run time:     42m 44s
pct rescued:  100.00%, read errors:        0,  remaining time:         n/a
                              time since last successful read:         n/a
Copying non-tried blocks... Pass 1 (forwards)
Finished                                    

Create a loop device from centos-root.rescue.img

[rheo@gen hd-8T]$ sudo losetup --find --show /mnt/hd-8T/centos-root.rescue.img
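
The image can then be mounted read-only without replaying the log, so the rescue copy stays pristine; a sketch, assuming losetup returned /dev/loop0:

sudo mount -o ro,norecovery /dev/loop0 /mnt/lv_root   # XFS: skip log replay, never write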

Scanning the transferred new disk with UFS Explorer:

  • under /home/, only 56 MB of files were found
  • under /opt/, 485 GB of files were found
  • under /, 17 GB of files were found

Attach the original failing SSD and read it out with ddrescue

[rheo@gen ~]$ sudo ddrescue -d -f -r3 /dev/sdb1 /mnt/hd-8T/ssd.img ssd-rescue.log
GNU ddrescue 1.27
Press Ctrl-C to interrupt
     ipos:    1000 GB, non-trimmed:   15532 kB,  current rate:    272 MB/s
     opos:    1000 GB, non-scraped:        0 B,  average rate:    140 MB/s
non-tried:    2667 MB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:  997521 MB,   bad areas:        0,        run time:  1h 58m 31s
pct rescued:   99.73%, read errors:      237,  remaining time:         11s
                              time since last successful read:          0s
Copying non-tried blocks... Pass 1 (forwards)
     ipos:   98697 kB, non-trimmed:   30277 kB,  current rate:  18677 kB/s
     opos:   98697 kB, non-scraped:        0 B,  average rate:    139 MB/s
non-tried:    1338 MB,  bad-sector:        0 B,    error rate:    196 kB/s
  rescued:  998834 MB,   bad areas:        0,        run time:  1h 59m 30s
pct rescued:   99.86%, read errors:      462,  remaining time:         55s
                              time since last successful read:          0s
Copying non-tried blocks... Pass 2 (backwards)
     ipos:  941692 MB, non-trimmed:  416874 kB,  current rate:   8454 kB/s
     opos:  941692 MB, non-scraped:        0 B,  average rate:    115 MB/s
non-tried:        0 B,  bad-sector:        0 B,    error rate:    262 kB/s
  rescued:  999786 MB,   bad areas:        0,        run time:  2h 24m 42s
pct rescued:   99.95%, read errors:     6361,  remaining time:         10m
                              time since last successful read:          0s
Copying non-tried blocks... Pass 5 (forwards) 
     ipos:  941692 MB, non-trimmed:        0 B,  current rate:   81920 B/s
     opos:  941692 MB, non-scraped:  332089 kB,  average rate:    101 MB/s
non-tried:        0 B,  bad-sector:    1391 kB,    error rate:    1024 B/s
  rescued:  999870 MB,   bad areas:     2717,        run time:  2h 43m 33s
pct rescued:   99.96%, read errors:     9079,  remaining time:         51m
                              time since last successful read:          0s
Trimming failed blocks... (forwards)         
     ipos:   90021 kB, non-trimmed:        0 B,  current rate:     512 B/s
     opos:   90021 kB, non-scraped:  331689 kB,  average rate:    101 MB/s
non-tried:        0 B,  bad-sector:    1593 kB,    error rate:    3072 B/s
  rescued:  999870 MB,   bad areas:     2743,        run time:  2h 44m 39s
pct rescued:   99.96%, read errors:     9473,  remaining time:  1d 11h 13m
                              time since last successful read:          0s
Scraping failed blocks... (forwards)

There were quite a few errors, but the data was read through to the end regardless.

Upgraded xfsprogs to 5.0; xfs_repair still could not fix the superblock problem.

Using the img file obtained with ddrescue, restore it onto a partition of another disk, then edit /etc/lvm/lvm.conf to filter out the failed disk /dev/sdb1, and rescan the volume groups as shown below.
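
The filter stanza might look like this (a sketch; it assumes the failed PV is /dev/sdb1):

# /etc/lvm/lvm.conf, inside the devices { } section:
# reject the failed disk so LVM binds the VG to the rescued copy instead
filter = [ "r|^/dev/sdb.*|", "a|.*|" ]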

[rheo@gen ~]$ sudo vgchange -an centos
  WARNING: Device mismatch detected for centos/root which is accessing /dev/sdb1 instead of /dev/sdh1.
  WARNING: Device mismatch detected for centos/home which is accessing /dev/sdb1 instead of /dev/sdh1.
  WARNING: Device mismatch detected for centos/opt which is accessing /dev/sdb1 instead of /dev/sdh1.
  0 logical volume(s) in volume group "centos" now active
[rheo@gen ~]$ sudo pvscan
  Error reading device /dev/sdb at 0 length 512.
  Error reading device /dev/sdb at 0 length 4.
  Error reading device /dev/sdb at 4096 length 4.
  Error reading device /dev/sdb1 at 0 length 4.
  Error reading device /dev/sdb1 at 4096 length 4.
  PV /dev/sda3   VG vg01            lvm2 [25.90 GiB / 0    free]
  PV /dev/sdc3   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdf1   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdg1   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdh1   VG centos          lvm2 [<931.51 GiB / 0    free]
  Total: 5 [6.39 TiB] / in use: 5 [6.39 TiB] / in no VG: 0 [0   ]
[rheo@gen ~]$ sudo vgscan
  Reading volume groups from cache.
  Found volume group "vg01" using metadata type lvm2
  Found volume group "centos" using metadata type lvm2
[rheo@gen ~]$ sudo lvscan
  ACTIVE            '/dev/vg01/root' [19.97 GiB] inherit
  ACTIVE            '/dev/vg01/home' [3.99 GiB] inherit
  ACTIVE            '/dev/vg01/swap' [<1.94 GiB] inherit
  inactive          '/dev/centos/root' [<422.61 GiB] inherit
  inactive          '/dev/centos/home' [<5.57 TiB] inherit
  inactive          '/dev/centos/swap' [15.50 GiB] inherit
  inactive          '/dev/centos/opt' [<379.46 GiB] inherit
[rheo@gen ~]$ sudo vgchange -ay centos
  4 logical volume(s) in volume group "centos" now active
[rheo@gen ~]$ sudo vgscan
  Reading volume groups from cache.
  Found volume group "vg01" using metadata type lvm2
  Found volume group "centos" using metadata type lvm2
[rheo@gen ~]$ sudo vgs
  VG     #PV #LV #SN Attr   VSize  VFree
  centos   4   4   0 wz--n- <6.37t    0 
  vg01     1   3   0 wz--n- 25.90g    0 
[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ ls /mnt/lv_home/
amin  autossh  azhang  fhuang  rdu  swang  sxu  wzeng  xhuang  xye  yliu  ywang  yyang  yzhou  zdu  zliu
[rheo@gen ~]$ sudo umount /mnt/lv_home 
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
[rheo@gen ~]$ ls /mnt/lv_opt/
ansys_inc  code_saturne  lammps    recoverX  SALOME-9.3.0-CO7-SRC      WindowsImageBackup
aster      google        recoverB  rh        SALOME-9.3.0-CO7-SRC.tgz  zoom
[rheo@gen ~]$ sudo umount /mnt/lv_opt
[rheo@gen ~]$ sudo mount /dev/centos/root /mnt/lv_root/
[rheo@gen ~]$ ls /mnt/lv_root
bin   dev  home  lib64  mnt  proc    recoverA  root  sbin   srv  tmp  var
boot  etc  lib   media  opt  public  rheoData  run   share  sys  usr
[rheo@gen ~]$ sudo umount /dev/centos/root

Now it looks like all the data is still there. Getting the data backed up is the most urgent thing, so do that first; dump the metadata with xfs_metadump

[rheo@gen hd-8T]$ su
[root@gen hd-8T]# xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump |tee /mnt/hd-8T/home-backup.log
^Z
[1]+  Stopped                 xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump | tee /mnt/hd-8T/home-backup.log
[root@gen hd-8T]# bg
[1]+ xfs_metadump /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump | tee /mnt/hd-8T/home-backup.log &
[root@gen hd-8T]# exit
exit
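
Incidentally, xfs_metadump can report its own progress with -g, which avoids the suspend-and-background dance above; a sketch:

xfs_metadump -g /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump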

As an additional check, run xfs_repair -n

[rheo@gen mnt]$ sudo xfs_repair -n /dev/centos/opt 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 08:24:14: scanning filesystem freespace - 16 of 16 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 08:24:14: scanning agi unlinked lists - 16 of 16 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 08:24:37: process known inodes and inode discovery - 711488 of 711488 inodes done
        - process newly discovered inodes...
        - 08:24:37: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:24:37: setting up duplicate extent list - 16 of 16 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 14
        - agno = 12
        - agno = 7
        - agno = 6
        - agno = 11
        - agno = 9
        - agno = 8
        - agno = 13
        - agno = 10
        - agno = 4
        - agno = 15
        - agno = 5
        - 08:24:37: check for inodes claiming duplicate blocks - 711488 of 711488 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 08:24:50: verify and correct link counts - 16 of 16 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
[rheo@gen mnt]$ sudo xfs_repair -n /dev/centos/root 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 08:25:37: scanning filesystem freespace - 34 of 34 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 08:25:37: scanning agi unlinked lists - 34 of 34 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 30
        - agno = 15
        - agno = 16
        - agno = 31
        - agno = 32
        - agno = 17
        - agno = 33
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 08:25:53: process known inodes and inode discovery - 463040 of 463040 inodes done
        - process newly discovered inodes...
        - 08:25:53: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:25:53: setting up duplicate extent list - 34 of 34 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 5
        - agno = 2
        - agno = 13
        - agno = 16
        - agno = 1
        - agno = 24
        - agno = 28
        - agno = 31
        - agno = 10
        - agno = 32
        - agno = 11
        - agno = 12
        - agno = 3
        - agno = 14
        - agno = 15
        - agno = 18
        - agno = 19
        - agno = 4
        - agno = 17
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 6
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 7
        - agno = 29
        - agno = 30
        - agno = 8
        - agno = 33
        - agno = 9
        - 08:25:54: check for inodes claiming duplicate blocks - 463040 of 463040 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 08:26:10: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

Looks clean. Check the backup file sizes

[rheo@gen hd-8T]$ ls -lah
total 981G
drwxr-xr-x  2 root root  172 May 17 11:34 .
drwxr-xr-x. 9 root root  113 May 16 18:42 ..
-rw-r--r--  1 root root  48G May 17 10:37 centos-home.metadump
-rw-r--r--  1 root root 444M May 17 11:35 centos-opt.metadump
-rw-r--r--  1 root root 296M May 17 11:32 centos-root.metadump
-rw-r--r--  1 root root    0 May 17 07:31 home-backup.log
-rw-r--r--  1 root root    0 May 17 11:34 opt-backup.log
-rw-r--r--  1 root root    0 May 17 11:24 root-backup.log
-rw-r--r--  1 root root 932G May 16 22:35 ssd.img

Use xfs_mdrestore to restore the metadata from the metadump into a new image file

[rheo@gen hd-8T]$ sudo xfs_mdrestore /mnt/hd-4T-part2/centos-home.metadump /mnt/hd-8T/centos-home.img

After restoring the img from the metadump and mounting it through a loop device, the file names come out garbled.

[rheo@gen hd-8T]$ sudo losetup --find --show ./centos-home.img 
/dev/loop1
[rheo@gen hd-8T]$ sudo mount /dev/loop1 /mnt/lv_home/
[rheo@gen hd-8T]$ ls /mnt/lv_home/ -lah
total 68K
drwxr-xr-x  18 root root  276 Nov  7  2022 .
drwxr-xr-x.  9 root root  113 May 16 18:42 ..
drwxr-x---   7 1015 rheo  214 Feb 15 21:06 amin
drwx------   6 1004 1005  140 Jun  8  2019 Gp?oqCG
drwxr-xr-x  27 1003 rheo 4.0K May  6 10:34 n?han?
drwxr-xr-x  55 rheo rheo 8.0K May 10 02:15 rdu
drwxr-x---  23 1012 rheo 4.0K May  9 22:55 R?ual1
drwxr-xr-x  10 1011 rheo 4.0K Jun  9  2022 sxu
drwxr-xr-x  10 1005 rheo 4.0K Nov 25  2021 ?wan`
drwxr-xr-x   6 1006 rheo  156 Jun 12  2019 ?wan`
drwxr-xr-x  18 1002 rheo 4.0K Sep 14  2022 xye
drwx------   8 1014 rheo  232 Sep  2  2022 ?yan`
drwx------  16 1010 rheo 4.0K Apr  6 20:44 yliu
drwxr-xr-x  35 1007 rheo 8.0K May 10 02:12 zdu
drwx------   7 1013 rheo  216 Sep  2  2022 ?zen`
drwxr-xr-x  13 1001 rheo  278 Apr 19  2022 ?zhor
drwxr-xr-x  30 1008 rheo 8.0K Mar 30 15:24 zliu
drwxr-x---  28 1009 rheo 4.0K May  9 14:42 z?uao?
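
The mangled names here are very likely not corruption but xfs_metadump's default name obfuscation, which scrambles most file names in the dump for privacy; the -o flag disables it. A sketch:

sudo xfs_metadump -o /dev/mapper/centos-home /mnt/hd-8T/centos-home.metadump   # keep real names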

Mounting the other volumes shows the same problem

[rheo@gen hd-4T-part2]$ sudo mount /dev/loop2 /mnt/lv_opt/
[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
3^OogiC/                 nGpzR_KspN6ry7jG8CU      UryNGb0UROC0B^O^SCN@/    [P^A<
4bonN2aujzfJamy^KR^W^S"/ N^Ammre/                 xYjaldF^DVf^O]/          
h9c^Ow0^RD/              rh/                      XYpe^A�Qw^D/             
Hwp^Ou4cF/               ^Astet/                  zoom/                    

Unmount the loop device and run xfs_repair on it; the problem persists

[rheo@gen hd-4T-part2]$ sudo xfs_repair  /dev/loop0
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 12:42:57: scanning filesystem freespace - 16 of 16 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 12:42:57: scanning agi unlinked lists - 16 of 16 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 15
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 12:42:58: process known inodes and inode discovery - 711488 of 711488 inodes done
        - process newly discovered inodes...
        - 12:42:58: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 12:42:58: setting up duplicate extent list - 16 of 16 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 12
        - agno = 3
        - agno = 5
        - agno = 7
        - agno = 6
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 4
        - agno = 11
        - agno = 15
        - agno = 13
        - agno = 14
        - 12:42:59: check for inodes claiming duplicate blocks - 711488 of 711488 inodes done
Phase 5 - rebuild AG headers and trees...
        - 12:42:59: rebuild AG headers and trees - 16 of 16 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 12:42:59: verify and correct link counts - 16 of 16 allocation groups done
done

[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
3?ogiC                h9c?w0?D  nGpzR_KspN6ry7jG8CU?[P?<  rh     UryNGb0UROC0B??CN@  XYpe??Qw?
4bonN2aujzfJamy?R??"  Hwp?u4cF  N?mmre                    ?stet  xYjaldF?Vf?]        zoom

Unmount and try remounting

[rheo@gen hd-4T-part2]$ sudo mount /dev/loop0 /mnt/lv_opt/
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Yet mounting the original LV directly shows no problem

[rheo@gen hd-4T-part2]$ ls /mnt/lv_opt/
ansys_inc  code_saturne  lammps    recoverX  SALOME-9.3.0-CO7-SRC      WindowsImageBackup
aster      google        recoverB  rh        SALOME-9.3.0-CO7-SRC.tgz  zoom

The next morning, 2023-05-20, start reworking the system

[rheo@gen ~]$ sudo pvscan
[sudo] password for rheo: 
  PV /dev/sda3   VG vg01            lvm2 [25.90 GiB / 0    free]
  PV /dev/sdb3   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sde1   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdf1   VG centos          lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdc1   VG centos          lvm2 [<931.51 GiB / 0    free]
  Total: 5 [6.39 TiB] / in use: 5 [6.39 TiB] / in no VG: 0 [0   ]

[rheo@gen ~]$ sudo vgscan
  Reading volume groups from cache.
  Found volume group "vg01" using metadata type lvm2
  Found volume group "centos" using metadata type lvm2

[rheo@gen ~]$ sudo lvscan
  ACTIVE            '/dev/vg01/root' [19.97 GiB] inherit
  ACTIVE            '/dev/vg01/home' [3.99 GiB] inherit
  ACTIVE            '/dev/vg01/swap' [<1.94 GiB] inherit
  ACTIVE            '/dev/centos/root' [<422.61 GiB] inherit
  ACTIVE            '/dev/centos/home' [<5.57 TiB] inherit
  ACTIVE            '/dev/centos/swap' [15.50 GiB] inherit
  ACTIVE            '/dev/centos/opt' [<379.46 GiB] inherit

Try repairing the filesystems; for now only a dry run, making no actual changes

[rheo@gen ~]$ sudo xfs_repair -n /dev/centos/root
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agi unlinked bucket 10 is 647882 in ag 3 (inode=101311178)
sb_ifree 1631, counted 1654
sb_fdblocks 14323967, counted 14336708
        - 08:11:59: scanning filesystem freespace - 34 of 34 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 08:11:59: scanning agi unlinked lists - 34 of 34 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 30
        - agno = 0
        - agno = 31
        - agno = 16
        - agno = 17
        - agno = 32
        - agno = 18
        - agno = 33
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 08:12:11: process known inodes and inode discovery - 463168 of 463168 inodes done
        - process newly discovered inodes...
        - 08:12:11: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:12:11: setting up duplicate extent list - 34 of 34 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
        - agno = 9
        - agno = 5
        - agno = 8
        - agno = 11
        - agno = 12
        - agno = 20
        - agno = 17
        - agno = 22
        - agno = 27
        - agno = 6
        - agno = 15
        - agno = 18
        - agno = 14
        - agno = 16
        - agno = 19
        - agno = 4
        - agno = 10
        - agno = 21
        - agno = 23
        - agno = 25
        - agno = 13
        - agno = 24
        - agno = 26
        - agno = 7
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - agno = 33
        - agno = 32
        - 08:12:11: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 101311178, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 101311178 nlinks from 0 to 1
        - 08:12:24: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
[rheo@gen ~]$ sudo xfs_repair -n /dev/centos/opt
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_icount 711488, counted 711296
sb_ifree 476, counted 539
sb_fdblocks 41964358, counted 43819941
        - 08:14:28: scanning filesystem freespace - 16 of 16 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 08:14:28: scanning agi unlinked lists - 16 of 16 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 15
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
imap claims a free inode 884571138 is in use, would correct imap and clear inode
        - agno = 14
        - 08:15:05: process known inodes and inode discovery - 711296 of 711488 inodes done
        - process newly discovered inodes...
        - 08:15:05: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:15:05: setting up duplicate extent list - 16 of 16 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 12
        - agno = 15
        - agno = 4
        - agno = 5
        - agno = 8
        - agno = 10
        - agno = 1
        - agno = 11
        - agno = 13
        - agno = 9
        - agno = 3
        - agno = 7
        - agno = 14
        - agno = 6
entry "f534542424.gz" at block 2 offset 4040 in directory inode 884553803 references free inode 884571138
    would clear inode number in entry at offset 4040...
        - 08:15:06: check for inodes claiming duplicate blocks - 711296 of 711488 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
entry "f534542424.gz" in directory inode 884553803 points to free inode 884571138, would junk entry
bad hash table for directory inode 884553803 (no data entry): would rebuild
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 08:15:29: verify and correct link counts - 16 of 16 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

Attempt an actual repair, which turns out not to work as-is

[rheo@gen ~]$ sudo xfs_repair /dev/centos/opt
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Mount home and opt, which replays the XFS log as the error message suggests, then unmount them

[rheo@gen ~]$ sudo mount /dev/centos/home /mnt/lv_home/
[rheo@gen ~]$ sudo umount /mnt/lv_home 
[rheo@gen ~]$ sudo mount /dev/centos/opt /mnt/lv_opt/
[rheo@gen ~]$ sudo umount /mnt/lv_opt 
umount: /mnt/lv_opt: not mounted

Try repairing the opt filesystem again

[rheo@gen ~]$ sudo xfs_repair /dev/centos/opt
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 08:17:24: scanning filesystem freespace - 16 of 16 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 08:17:24: scanning agi unlinked lists - 16 of 16 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 15
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
imap claims a free inode 884571138 is in use, correcting imap and clearing inode
cleared inode 884571138
        - agno = 14
        - 08:18:02: process known inodes and inode discovery - 711296 of 711296 inodes done
        - process newly discovered inodes...
        - 08:18:02: process newly discovered inodes - 16 of 16 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:18:02: setting up duplicate extent list - 16 of 16 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 8
        - agno = 7
        - agno = 10
        - agno = 9
        - agno = 14
        - agno = 12
        - agno = 11
        - agno = 13
        - agno = 15
entry "f534542424.gz" at block 2 offset 4040 in directory inode 884553803 references free inode 884571138
    clearing inode number in entry at offset 4040...
        - 08:18:02: check for inodes claiming duplicate blocks - 711296 of 711296 inodes done
Phase 5 - rebuild AG headers and trees...
        - 08:18:02: rebuild AG headers and trees - 16 of 16 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
bad hash table for directory inode 884553803 (no data entry): rebuilding
rebuilding directory inode 884553803
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 08:18:26: verify and correct link counts - 16 of 16 allocation groups done
done

Try repairing the root filesystem

[rheo@gen ~]$ sudo xfs_repair /dev/centos/root 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Still no luck!!!

With ordinary repair hopeless, the last resort is the -L option; before trying it, root had already been backed up with ddrescue

[rheo@gen hd-4T-part2]$ sudo xfs_repair -L /dev/centos/root 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
agi unlinked bucket 10 is 647882 in ag 3 (inode=101311178)
sb_ifree 1631, counted 1654
sb_fdblocks 14323967, counted 14336708
        - 07:03:56: scanning filesystem freespace - 34 of 34 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 07:03:56: scanning agi unlinked lists - 34 of 34 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 0
        - agno = 30
        - agno = 16
        - agno = 31
        - agno = 32
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 33
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 07:04:08: process known inodes and inode discovery - 463168 of 463168 inodes done
        - process newly discovered inodes...
        - 07:04:08: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 07:04:08: setting up duplicate extent list - 34 of 34 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
        - agno = 11
        - agno = 16
        - agno = 19
        - agno = 24
        - agno = 27
        - agno = 30
        - agno = 10
        - agno = 33
        - agno = 5
        - agno = 13
        - agno = 4
        - agno = 14
        - agno = 18
        - agno = 6
        - agno = 20
        - agno = 22
        - agno = 21
        - agno = 15
        - agno = 23
        - agno = 7
        - agno = 25
        - agno = 17
        - agno = 26
        - agno = 28
        - agno = 8
        - agno = 29
        - agno = 9
        - agno = 31
        - agno = 32
        - agno = 12
        - 07:04:08: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
Phase 5 - rebuild AG headers and trees...
        - 07:04:08: rebuild AG headers and trees - 34 of 34 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 101311178, moving to lost+found
Phase 7 - verify and correct link counts...
        - 07:04:21: verify and correct link counts - 34 of 34 allocation groups done
Maximum metadata LSN (1109:24056) is ahead of log (1:8).
Format log to cycle 1112.
done

Dry-run again with -n

[rheo@gen hd-4T-part2]$ sudo xfs_repair -n /dev/centos/root 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 07:05:22: scanning filesystem freespace - 34 of 34 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 07:05:22: scanning agi unlinked lists - 34 of 34 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 15
        - agno = 30
        - agno = 31
        - agno = 16
        - agno = 32
        - agno = 17
        - agno = 18
        - agno = 33
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 07:05:34: process known inodes and inode discovery - 463168 of 463168 inodes done
        - process newly discovered inodes...
        - 07:05:34: process newly discovered inodes - 34 of 34 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 07:05:34: setting up duplicate extent list - 34 of 34 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 2
        - agno = 11
        - agno = 18
        - agno = 6
        - agno = 30
        - agno = 9
        - agno = 33
        - agno = 10
        - agno = 13
        - agno = 12
        - agno = 3
        - agno = 8
        - agno = 16
        - agno = 19
        - agno = 5
        - agno = 15
        - agno = 14
        - agno = 21
        - agno = 20
        - agno = 23
        - agno = 22
        - agno = 24
        - agno = 25
        - agno = 17
        - agno = 26
        - agno = 28
        - agno = 27
        - agno = 29
        - agno = 31
        - agno = 7
        - agno = 1
        - agno = 32
        - 07:05:35: check for inodes claiming duplicate blocks - 463168 of 463168 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 07:05:47: verify and correct link counts - 34 of 34 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

Seemingly no errors left; try mounting

[rheo@gen hd-4T-part2]$ sudo mount /dev/centos/root /mnt/lv_root/
[rheo@gen hd-4T-part2]$ ls /mnt/lv_root/
bin   dev  home  lib64       media  opt   public    rheoData  run   share  sys  usr
boot  etc  lib   lost+found  mnt    proc  recoverA  root      sbin  srv    tmp  var

The directory tree is intact, so reboot.
After the reboot, the mail below was waiting; it suggests problems had already existed for quite a while, with some data unreadable?

[rheo@gen hd-4T-part2]$ mail
 U145 [email protected]  Tue Apr 25 19:25 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U148 [email protected]  Tue May  2 21:01 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 U154 [email protected]  Tue May  2 21:03 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U156 [email protected]  Wed May 10 02:44 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 U159 [email protected]  Wed May 10 02:46 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U160 [email protected]  Wed May 10 02:46 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U161 [email protected]  Wed May 10 02:47 2706/202886 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U162 [email protected]  Wed May 10 02:47 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U163 [email protected]  Wed May 10 02:54 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 U164 [email protected]  Wed May 10 02:54 5537/538747 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 U166 [email protected]  Wed May 10 02:55 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U167 [email protected]  Wed May 10 02:55 2501/183561 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U168 [email protected]  Wed May 10 02:56 2706/202886 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U169 [email protected]  Wed May 10 02:56 2588/189510 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U170 root                  Thu May 11 02:53  19/635   "Health"
 U171 [email protected]  Fri May 19 21:32 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 U174 [email protected]  Fri May 19 21:34 2591/190989 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 U176 [email protected]  Fri May 19 22:46 2441/176400 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
>N182 [email protected]  Fri May 19 22:52 2587/189500 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 N183 [email protected]  Fri May 19 23:03 2440/176390 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 N184 [email protected]  Fri May 19 23:03 5536/538737 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 N186 [email protected]  Fri May 19 23:05 2590/190979 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 N187 [email protected]  Fri May 19 23:06 2500/183551 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 N188 [email protected]  Fri May 19 23:07 2705/202876 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 N189 [email protected]  Fri May 19 23:08 2587/189500 "[abrt] : BUG: unable to handle kernel paging request at 0000000000007980"
 N190 root                  Fri May 19 23:33  18/618   "FailedOpenDevice"
 N191 [email protected]  Sun May 21 07:13 2440/176390 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"
 N192 [email protected]  Sun May 21 07:13 5536/538737 "[abrt] : BUG: unable to handle kernel paging request at 000000000001e250"

Check for other errors

[root@gen ~]# dmesg | egrep rror
[    1.531000] ERST: Error Record Serialization Table (ERST) support is initialized.
[   12.182562] ACPI Error: No handler for Region [SYSI] (ffff9bfbf4ac3bd0) [IPMI] (20130517/evregion-162)
[   12.182572] ACPI Error: Region IPMI (ID=7) has no handler (20130517/exfldio-305)
[   12.182579] ACPI Error: Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff9c0374d98618), AE_NOT_EXIST (20130517/psparse-536)
[   12.182595] ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff9c0374d98578), AE_NOT_EXIST (20130517/psparse-536)

Assuming this was an ACPI error:

# help: https://www.suse.com/ja-jp/support/kb/doc/?id=000017865
[root@nd2 home]# modprobe acpi_ipmi
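
To make the module load persist across reboots, the usual systemd mechanism would be something like this sketch:

echo acpi_ipmi > /etc/modules-load.d/acpi_ipmi.conf   # load acpi_ipmi at boot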

Boot back into the original system, attach the new disk, and survey the block devices again

[root@gen ~]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0   1.8T  0 disk 
└─sda1            8:1    0   1.8T  0 part 
sdb               8:16   0   1.8T  0 disk 
├─sdb1            8:17   0   200M  0 part /boot/efi
├─sdb2            8:18   0     1G  0 part /boot
└─sdb3            8:19   0   1.8T  0 part 
  ├─centos-root 253:0    0 422.6G  0 lvm  /
  ├─centos-swap 253:1    0  15.5G  0 lvm  [SWAP]
  ├─centos-home 253:5    0   5.6T  0 lvm  
  └─centos-opt  253:6    0 379.5G  0 lvm  /opt
sdc               8:32   0   3.7T  0 disk 
├─sdc1            8:33   0     2T  0 part 
│ ├─centos-root 253:0    0 422.6G  0 lvm  /
│ ├─centos-home 253:5    0   5.6T  0 lvm  
│ └─centos-opt  253:6    0 379.5G  0 lvm  /opt
└─sdc2            8:34   0   1.7T  0 part /mnt/hd-4T-part2
sdd               8:48   0   7.3T  0 disk 
└─sdd1            8:49   0   7.3T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
sdg               8:96   0 465.8G  0 disk 
├─sdg1            8:97   0   192M  0 part 
├─sdg2            8:98   0     1G  0 part 
└─sdg3            8:99   0    26G  0 part 
  ├─vg01-root   253:2    0    20G  0 lvm  
  ├─vg01-home   253:3    0     4G  0 lvm  
  └─vg01-swap   253:4    0     2G  0 lvm  
sdh               8:112  0   7.3T  0 disk 

Create a new physical volume and extend the volume group onto it

[root@gen ~]# pvcreate /dev/sdh
  Physical volume "/dev/sdh" successfully created.
[root@gen ~]# vg
vgcfgbackup    vgck           vgdb           vgextend       vgmerge        vgremove       vgscan
vgcfgrestore   vgconvert      vgdisplay      vgimport       vgmknodes      vgrename       vgsplit
vgchange       vgcreate       vgexport       vgimportclone  vgreduce       vgs            
[root@gen ~]# vgextend centos /dev/sdh
  Volume group "centos" successfully extended
[root@gen ~]# vgs
  VG     #PV #LV #SN Attr   VSize  VFree 
  centos   5   4   0 wz--n- 13.64t <7.28t
  vg01     1   3   0 wz--n- 25.90g     0 

Move /home off the ddrescue-restored disk

[root@gen ~]# pvmove -n home /dev/sdc1
  /dev/sdc1: Moved: 0.01%
  /dev/sdc1: Moved: 0.89%
  /dev/sdc1: Moved: 1.79%
   ...
  /dev/sdc1: Moved: 100.00%

Check from another terminal

[root@gen ~]# vgs
  VG     #PV #LV #SN Attr   VSize  VFree
  centos   5   4   0 wz--n- 13.64t 7.00t
  vg01     1   3   0 wz--n- 25.90g    0 
[root@gen ~]# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdc                  8:32   0   3.7T  0 disk 
├─sdc1               8:33   0     2T  0 part 
│ ├─centos-root    253:0    0 422.6G  0 lvm  /
│ ├─centos-opt     253:6    0 379.5G  0 lvm  /opt
│ └─centos-pvmove0 253:7    0 279.5G  0 lvm  
│   └─centos-home  253:5    0   5.6T  0 lvm  
└─sdc2               8:34   0   1.7T  0 part 
sdd                  8:48   0   7.3T  0 disk 
└─sdd1               8:49   0   7.3T  0 part 
sde                  8:64   0   1.8T  0 disk 
└─sde1               8:65   0   1.8T  0 part 
  └─centos-home    253:5    0   5.6T  0 lvm  
sdf                  8:80   0   1.8T  0 disk 
└─sdf1               8:81   0   1.8T  0 part 
  └─centos-home    253:5    0   5.6T  0 lvm  
...
sdh                  8:112  0   7.3T  0 disk 
└─centos-pvmove0   253:7    0 279.5G  0 lvm  
  └─centos-home    253:5    0   5.6T  0 lvm  

As shown, while pvmove runs it creates a temporary centos-pvmove0 volume to stage the data being moved.
After it finishes:

[root@gen ~]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdc               8:32   0   3.7T  0 disk 
├─sdc1            8:33   0     2T  0 part 
│ ├─centos-root 253:0    0 422.6G  0 lvm  /
│ └─centos-opt  253:6    0 379.5G  0 lvm  /opt
└─sdc2            8:34   0   1.7T  0 part /mnt/hd-4T-part2
sdd               8:48   0   7.3T  0 disk 
└─sdd1            8:49   0   7.3T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
...
sdh               8:112  0   7.3T  0 disk 
└─centos-home   253:5    0   5.6T  0 lvm  

Next, move the part of home that still shares a disk with root

[root@gen ~]# pvmove -n home /dev/sdb3 /dev/sdh

Check

[root@gen ~]# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdb                  8:16   0   1.8T  0 disk 
├─sdb1               8:17   0   200M  0 part /boot/efi
├─sdb2               8:18   0     1G  0 part /boot
└─sdb3               8:19   0   1.8T  0 part 
  ├─centos-root    253:0    0 422.6G  0 lvm  /
  ├─centos-swap    253:1    0  15.5G  0 lvm  [SWAP]
  ├─centos-opt     253:6    0 379.5G  0 lvm  /opt
  └─centos-pvmove0 253:7    0   1.7T  0 lvm  
    └─centos-home  253:5    0   5.6T  0 lvm  
...
sde                  8:64   0   1.8T  0 disk 
└─sde1               8:65   0   1.8T  0 part 
  └─centos-home    253:5    0   5.6T  0 lvm  
sdf                  8:80   0   1.8T  0 disk 
└─sdf1               8:81   0   1.8T  0 part 
  └─centos-home    253:5    0   5.6T  0 lvm  
...
sdh                  8:112  0   7.3T  0 disk 
├─centos-home      253:5    0   5.6T  0 lvm  
└─centos-pvmove0   253:7    0   1.7T  0 lvm  
  └─centos-home    253:5    0   5.6T  0 lvm  

While the move was still running:

[root@gen ~]# lvmdiskscan
  /dev/sda1 [      <1.82 TiB] 
  /dev/sdb1 [     200.00 MiB] 
  /dev/sdb2 [       1.00 GiB] 
  /dev/sdb3 [      <1.82 TiB] LVM physical volume
  /dev/sdc1 [       2.00 TiB] LVM physical volume
  /dev/sdc2 [      <1.64 TiB] 
  /dev/sdd1 [      <7.28 TiB] 
  /dev/sde1 [      <1.82 TiB] LVM physical volume
  /dev/sdf1 [      <1.82 TiB] LVM physical volume
  /dev/sdg1 [    <192.00 MiB] 
  /dev/sdg2 [       1.03 GiB] 
  /dev/sdg3 [     <25.94 GiB] LVM physical volume
  /dev/sdh  [      <7.28 TiB] LVM physical volume
  0 disks
  7 partitions
  1 LVM physical volume whole disk
  5 LVM physical volumes

After /home has been moved off /dev/sdb3

[root@gen ~]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdb               8:16   0   1.8T  0 disk 
├─sdb1            8:17   0   200M  0 part /boot/efi
├─sdb2            8:18   0     1G  0 part /boot
└─sdb3            8:19   0   1.8T  0 part 
  ├─centos-root 253:0    0 422.6G  0 lvm  /
  ├─centos-swap 253:1    0  15.5G  0 lvm  [SWAP]
  └─centos-opt  253:6    0 379.5G  0 lvm  /opt
sdc               8:32   0   3.7T  0 disk 
├─sdc1            8:33   0     2T  0 part 
│ ├─centos-root 253:0    0 422.6G  0 lvm  /
│ └─centos-opt  253:6    0 379.5G  0 lvm  /opt
└─sdc2            8:34   0   1.7T  0 part /mnt/hd-4T-part2
sdd               8:48   0   7.3T  0 disk 
└─sdd1            8:49   0   7.3T  0 part 
sde               8:64   0   1.8T  0 disk 
└─sde1            8:65   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
sdf               8:80   0   1.8T  0 disk 
└─sdf1            8:81   0   1.8T  0 part 
  └─centos-home 253:5    0   5.6T  0 lvm  
...
sdh               8:112  0   7.3T  0 disk 
└─centos-home   253:5    0   5.6T  0 lvm  

Split the data off into its own volume group, dusg

[root@gen ~]# vgsplit centos dusg /dev/sde1 /dev/sdf1 /dev/sdh
  New volume group "dusg" successfully split from "centos"
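
After the split, the LVs in the new group are activated and mounted again; a sketch (the mount point matches the later lsblk output):

sudo vgchange -ay dusg
sudo mount /dev/dusg/home /home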

Move the root LV back onto /dev/sdb3

[root@gen ~]# pvmove -n root /dev/sdc1 /dev/sdb3 &
[root@gen ~]# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdb                  8:16   0   1.8T  0 disk 
├─sdb1               8:17   0   200M  0 part /boot/efi
├─sdb2               8:18   0     1G  0 part /boot
└─sdb3               8:19   0   1.8T  0 part 
  ├─centos-root    253:0    0 422.6G  0 lvm  /
  ├─centos-swap    253:1    0  15.5G  0 lvm  [SWAP]
  ├─centos-pvmove0 253:5    0 372.6G  0 lvm  
  │ └─centos-root  253:0    0 422.6G  0 lvm  /
  └─centos-opt     253:6    0 379.5G  0 lvm  /opt
sdc                  8:32   0   3.7T  0 disk 
├─sdc1               8:33   0     2T  0 part 
│ ├─centos-pvmove0 253:5    0 372.6G  0 lvm  
│ │ └─centos-root  253:0    0 422.6G  0 lvm  /
│ └─centos-opt     253:6    0 379.5G  0 lvm  /opt
└─sdc2               8:34   0   1.7T  0 part /mnt/hd-4T-part2
...

Continue by moving opt the same way

[root@gen ~]# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb                  8:16   0   1.8T  0 disk 
├─sdb1               8:17   0   200M  0 part /boot/efi
├─sdb2               8:18   0     1G  0 part /boot
└─sdb3               8:19   0   1.8T  0 part 
  ├─centos-root    253:0    0 422.6G  0 lvm  /
  ├─centos-swap    253:1    0  15.5G  0 lvm  [SWAP]
  ├─centos-pvmove0 253:5    0 279.5G  0 lvm  
  │ └─centos-opt   253:6    0 379.5G  0 lvm  /opt
  └─centos-opt     253:6    0 379.5G  0 lvm  /opt
sdc                  8:32   0   3.7T  0 disk 
├─sdc1               8:33   0     2T  0 part 
│ └─centos-pvmove0 253:5    0 279.5G  0 lvm  
│   └─centos-opt   253:6    0 379.5G  0 lvm  /opt
└─sdc2               8:34   0   1.7T  0 part /mnt/hd-4T-part2
...

After completion

[rheo@gen ~]$ lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                 8:0    0   1.8T  0 disk 
└─sda1              8:1    0   1.8T  0 part /rheoData
sdb                 8:16   0   1.8T  0 disk 
├─sdb1              8:17   0   200M  0 part /boot/efi
├─sdb2              8:18   0     1G  0 part /boot
└─sdb3              8:19   0   1.8T  0 part 
  ├─centos-root   253:0    0 422.6G  0 lvm  /
  ├─centos-swap   253:1    0  15.5G  0 lvm  [SWAP]
  └─centos-opt    253:6    0 379.5G  0 lvm  /opt
sdc                 8:32   0   3.7T  0 disk 
├─sdc1              8:33   0     2T  0 part 
└─sdc2              8:34   0   1.7T  0 part /mnt/hd-4T-part2
sdd                 8:48   0   7.3T  0 disk 
└─sdd1              8:49   0   7.3T  0 part /mnt/hd-8T
sde                 8:64   0   1.8T  0 disk 
└─sde1              8:65   0   1.8T  0 part 
  └─dusg-home 253:7    0   5.6T  0 lvm  /home
sdf                 8:80   0   1.8T  0 disk 
└─sdf1              8:81   0   1.8T  0 part 
  └─dusg-home 253:7    0   5.6T  0 lvm  /home
sdg                 8:96   0 465.8G  0 disk 
├─sdg1              8:97   0   192M  0 part 
├─sdg2              8:98   0     1G  0 part 
└─sdg3              8:99   0    26G  0 part 
  ├─vg01-root     253:2    0    20G  0 lvm  
  ├─vg01-home     253:3    0     4G  0 lvm  
  └─vg01-swap     253:4    0     2G  0 lvm  
sdh                 8:112  0   7.3T  0 disk 
└─dusg-home   253:7    0   5.6T  0 lvm  /home

Remove /dev/sdc1 from the centos volume group

[root@gen ~]# vgreduce centos /dev/sdc1
  Removed "/dev/sdc1" from volume group "centos"

[root@gen rheo]# pvscan
  PV /dev/sdg3   VG vg01            lvm2 [25.90 GiB / 0    free]
  PV /dev/sde1   VG dusg        lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdf1   VG dusg        lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdh    VG dusg        lvm2 [<7.28 TiB / <5.35 TiB free]
  PV /dev/sdb3   VG centos          lvm2 [<1.82 TiB / <1.02 TiB free]
  PV /dev/sdc1                      lvm2 [2.00 TiB]
  Total: 6 [<14.76 TiB] / in use: 5 [<12.76 TiB] / in no VG: 1 [2.00 TiB]

Remove the /dev/sdc1 physical volume

[root@gen rheo]# pvremove /dev/sdc1
  Labels on physical volume "/dev/sdc1" successfully wiped.
[root@gen rheo]# pvscan
  PV /dev/sdg3   VG vg01            lvm2 [25.90 GiB / 0    free]
  PV /dev/sde1   VG dusg        lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdf1   VG dusg        lvm2 [<1.82 TiB / 0    free]
  PV /dev/sdh    VG dusg        lvm2 [<7.28 TiB / <5.35 TiB free]
  PV /dev/sdb3   VG centos          lvm2 [<1.82 TiB / <1.02 TiB free]
  Total: 5 [<12.76 TiB] / in use: 5 [<12.76 TiB] / in no VG: 0 [0   ]

At this point the main reorganization is done: the data volume group now lives on its own physical volumes, fully separated from the system's volumes.

Summary

  1. Keep the system and the data on separate physical volumes
  2. Keep backups as current as possible, of both the system and the data
  3. The failure was most likely the SSD's fault, although other SSDs here have served just as long without any such problem. To play it safe, avoid SSDs for critical systems or data, or at the very least keep a backup

Backup tools

  1. ddrescue copies as much of the source as it can and is fast; on read errors it retries multiple times to recover as much data as possible. It is the main tool behind the recovery described here. A dd-style tool was the right approach because the disk itself had failed, not the XFS filesystem. dd is slow, and its weakness is that it does not retry failed reads; it simply leaves zeros in place of the data.
  2. xfs_metadump backs up the metadata only, in effect the file names without their contents; after restoring with xfs_mdrestore the file and directory names are visible, but no file contents.

Recovery steps

  1. Identify which disk has failed
  2. Boot from another system, back up the data to an image file with ddrescue, and write that image onto a new disk with dd (see the sketch after this list)
  3. Edit /etc/lvm/lvm.conf to filter out the old disk, because the old and new disks now carry identical UUIDs
  4. Back up the data
  5. Repair the filesystem with xfs_repair: dry-run with xfs_repair -n first, then run xfs_repair for real; if that fails, the last resort is xfs_repair -L, which can lose data
  6. After the repair, check that the filesystem mounts and the data reads back correctly; if so, remove the old disk so the duplicate UUIDs cannot cause trouble after the next reboot.
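
A condensed sketch of steps 2 through 5 (all device names and paths are placeholders):

sudo ddrescue -d -f -r3 /dev/sdX1 /mnt/backup/disk.img disk.map   # image the failing disk, retrying bad reads
sudo ddrescue -f /mnt/backup/disk.img /dev/sdY1 restore.map       # write the image onto the new disk
# /etc/lvm/lvm.conf: filter = [ "r|^/dev/sdX.*|", "a|.*|" ]       # hide the old disk from LVM
sudo xfs_repair -n /dev/centos/root                               # dry run first
sudo xfs_repair /dev/centos/root                                  # real repair; -L only as a last resort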

Other issues

Also ran into a case where mismatched LVM metadata versions left the system unable to boot.
