2020-01-26 10:47:30

by Jaco Kroon

Subject: e2fsck fails with unable to set superblock

Hi,

I've got an 85TB ext4 filesystem which I'm unable to fsck.  The only
cases of the same error I could find were, from what I can tell, due to
an SD card "swallowing" writes (i.e. the card goes into a read-only mode
but doesn't report write failures).

crowsnest ~ # e2fsck -f /dev/lvm/home

e2fsck 1.45.4 (23-Sep-2019)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
/dev/lvm/home: recovering journal
e2fsck: unable to set superblock flags on /dev/lvm/home


/dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****

/dev/lvm/home: ********** WARNING: Filesystem still has errors **********

I have also (using dumpe2fs) obtained the locations of the backup super
blocks and tried the same against a few other superblocks using -b.  -y
(as per a suggestion from at least one post) makes absolutely no
difference; our understanding is that it simply answers yes to all
questions, so we didn't expect it to have any impact, but decided it was
worth a try anyway.

Looking at the code for the "unable to set superblock" error, it appears
to come from e2fsck/unix.c, specifically this:

1765     if (ext2fs_has_feature_journal_needs_recovery(sb)) {
1766         if (ctx->options & E2F_OPT_READONLY) {
...
1771         } else {
1772             if (ctx->flags & E2F_FLAG_RESTARTED) {
1773                 /*
1774                  * Whoops, we attempted to run the
1775                  * journal twice.  This should never
1776                  * happen, unless the hardware or
1777                  * device driver is being bogus.
1778                  */
1779                 com_err(ctx->program_name, 0,
1780                     _("unable to set superblock flags "
1781                       "on %s\n"), ctx->device_name);
1782                 fatal_error(ctx, 0);
1783             }

That comment has me somewhat confused.  I'm assuming the implication
there is that e2fsck tried to update the superblock, but after reading
it back, it's either unchanged or still wrong (in line with the
description of the SD card issue I found online).  None of our arrays
are showing R/O in /proc/mdstat.  We did pick this out during kernel
bootup (we downgraded back to 5.1.15, which we're on currently, after
experiencing major performance issues on 5.3.6, and the subsequent 5.4.8
didn't seem to fix those; the 4.14.13 kernel that was used previously is
known to cause ext4 corruption of the kind we saw on the other
filesystems):

[ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
group 404160 overlaps superblock
[ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!

I created a dumpe2fs file as well:

crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
dumpe2fs 1.45.4 (23-Sep-2019)
dumpe2fs: Block bitmap checksum does not match bitmap while trying to
read '/dev/lvm/home' bitmaps

Available at https://downloads.uls.co.za/85T/dump2fs_home.txt.xz (1.2GB,
md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)

A strace of e2fsck -y -f /dev/lvm/home at
https://downloads.uls.co.za/85T/fsck.strace.txt (13MB,
md5:60aa91b0c47dd2837260218eb774152d)

crowsnest ~ # tune2fs -l /dev/lvm/home
tune2fs 1.45.4 (23-Sep-2019)
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          522a9faf-7992-4888-93d5-7fe49a9762d6
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr filetype meta_bg extent
64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              356515840
Block count:              22817013760
Reserved block count:     0
Free blocks:              6874204745
Free inodes:              202183498
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         512
Inode blocks per group:   32
RAID stride:              128
RAID stripe width:        1024
First meta block group:   2048
Flex block group size:    16
Filesystem created:       Thu Jul 26 12:19:07 2018
Last mount time:          Sat Jan 18 18:58:50 2020
Last write time:          Sun Jan 26 11:38:56 2020
Mount count:              2
Maximum mount count:      -1
Last checked:             Wed Oct 30 17:37:27 2019
Check interval:           0 (<none>)
Lifetime writes:          976 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      876a7d14-bce8-4bef-9569-82e7d573b7aa
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xfbd895e9

Infrastructure:  3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
10TB disks (100TB usable total).  These are combined into a single VG
using LVM, and then carved up into a number of LVs, the largest of which
is this 85TB chunk.  We have tried in the past to carve this into
smaller LVs but failed.  So we're aware that this is very large and not
ideal.

We did experience an assembly issue on one of the underlying RAID6 PVs;
that has been resolved, and the disk that was giving issues has been
scrubbed and rebuilt.  From what we can tell based on the other
filesystems, this did not affect data integrity, but we can't make that
statement with 100% certainty.  As such we are expecting some data loss
here, but it would be good if we can recover at least some of this data.

Other filesystems which also reside on the same PV that was affected by
the RAID6 problem either received a clean bill of health or were
successfully repaired by e2fsck (the system did crash, however; it's
unclear whether the RAID6 assembly problem was the cause or merely
another consequence, and as a result, whether the corruption on the
repaired filesystem was caused by the kernel or by the RAID).

I'm continuing to work through the e2fsck code to try and figure this
out, but am hopeful that someone could perhaps provide some much-needed
insight and pointers.

Kind Regards,
Jaco


2020-01-26 20:44:37

by Jaco Kroon

Subject: Re: e2fsck fails with unable to set superblock

Hi,

So working through the dumpe2fs file, the group mentioned by dmesg
contains this:

Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
  Group descriptor at 13243514880
  Block bitmap at 0 (bg #0 + 0), csum 0x00000000
  Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
  Inode table at 0-31 (bg #0 + 0)
  0 free blocks, 0 free inodes, 0 directories
  Free blocks: 13243514880-13243547647
  Free inodes: 206929921-206930432

Based on that it's quite simple to see that during the array
reconstruction we apparently wiped a bunch of data blocks with all
zeroes.  This is obviously bad.  During reconstruction we had to zero
one of the disks before we could get the array to reassemble; what I'm
wondering now is whether that process was a good choice, and whether the
right disk was zeroed.  Obviously this implies major data loss (at least
4TB, probably more assuming that directory structures may well have been
destroyed as well; maybe less if some of those blocks weren't in use).

I'm hoping that it's possible to recreate these group descriptors (there
are a few of them) to at least point to the correct locations on disk,
and to then attempt a cleanup with e2fsck.  Again, data loss here is to
be expected, but if we can at least limit it, that would be great.

There are unfortunately a large number of groups affected (128 runs of
64 consecutive block groups).

32768 blocks/group => 128 * 64 * 32768 blocks => ~268M blocks, at
4KB/block => ~1TB of data lost.  However, this is extremely conservative,
seeing that this could include directory structures with a cascading effect.
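
Just to double-check that arithmetic (the per-group and per-block sizes
are from the tune2fs output above), a throwaway snippet:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* 128 runs x 64 groups x 32768 blocks/group x 4KiB/block */
	uint64_t blocks = 128ULL * 64 * 32768;
	printf("%llu blocks = %.2f TiB\n", (unsigned long long)blocks,
	       blocks * 4096.0 / (1ULL << 40));
	return 0;
}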

Based on the patterns of the first 64 group descriptors (GDs), it looks
like it should be possible to reconstruct the 8192 affected GDs, or
alternatively to possibly "uninit" them
(https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization). 
I'm inclined to reason that it's probably safer to repair the following
fields in the GDs (a rough sketch follows the lists below):

bg_block_bitmap_{lo,hi}
bg_inode_bitmap_{lo,hi}
bg_inode_table_{lo,hi}

I'm not sure about:

bg_flags (I'm guessing the safest is to leave this zeroed).
bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).

The following should (as far as my understanding goes) then be "fixable"
by e2fsck:

bg_free_blocks_count_{lo,hi}
bg_free_inodes_count_{lo,hi}
bg_used_dirs_count_{lo,hi}
bg_block_bitmap_csum_{lo,hi}
bg_inode_bitmap_csum_{lo,hi}
bg_itable_unused_{lo,hi}
bg_checksum
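
To make that concrete, below is a minimal sketch of how I picture a
single 64-byte descriptor and the filling of the three location fields.
The layout follows my reading of the wiki page rather than the actual
e2fsprogs headers, and all fields are little-endian on disk, so this
assumes a little-endian host:

#include <stdint.h>
#include <string.h>

/* Sketch of a 64-byte on-disk group descriptor, per my reading of the
 * layout doc (not copied from ext2_fs.h; verify before trusting it). */
struct gd64 {
	uint32_t bg_block_bitmap_lo;
	uint32_t bg_inode_bitmap_lo;
	uint32_t bg_inode_table_lo;
	uint16_t bg_free_blocks_count_lo;
	uint16_t bg_free_inodes_count_lo;
	uint16_t bg_used_dirs_count_lo;
	uint16_t bg_flags;
	uint32_t bg_exclude_bitmap_lo;
	uint16_t bg_block_bitmap_csum_lo;
	uint16_t bg_inode_bitmap_csum_lo;
	uint16_t bg_itable_unused_lo;
	uint16_t bg_checksum;
	uint32_t bg_block_bitmap_hi;
	uint32_t bg_inode_bitmap_hi;
	uint32_t bg_inode_table_hi;
	uint16_t bg_free_blocks_count_hi;
	uint16_t bg_free_inodes_count_hi;
	uint16_t bg_used_dirs_count_hi;
	uint16_t bg_itable_unused_hi;
	uint32_t bg_exclude_bitmap_hi;
	uint16_t bg_block_bitmap_csum_hi;
	uint16_t bg_inode_bitmap_csum_hi;
	uint32_t bg_reserved;
} __attribute__((packed));

/* Fill only the location fields; the counts, flags and checksums are
 * left at zero in the hope that e2fsck can regenerate them (see above). */
void gd_set_locations(struct gd64 *gd, uint64_t block_bitmap,
                      uint64_t inode_bitmap, uint64_t inode_table)
{
	memset(gd, 0, sizeof(*gd));
	gd->bg_block_bitmap_lo = (uint32_t)block_bitmap;
	gd->bg_block_bitmap_hi = (uint32_t)(block_bitmap >> 32);
	gd->bg_inode_bitmap_lo = (uint32_t)inode_bitmap;
	gd->bg_inode_bitmap_hi = (uint32_t)(inode_bitmap >> 32);
	gd->bg_inode_table_lo  = (uint32_t)inode_table;
	gd->bg_inode_table_hi  = (uint32_t)(inode_table >> 32);
}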

And of course, it seems tracking down the GDs on disk will be tricky.
Some groups have the GD in the group itself, and a bunch of others don't
(nor does dumpe2fs say where exactly they are).  There are 2048 blocks
of GDs (131072 or 2^17 GDs) with every superblock backup; however, from
group 2^17 onwards there are additional groups simply stating "Group
descriptor at ${first_block_of_group}", so it's unclear how to track
down the GD for a given block group.
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
does not describe this particularly well either, and there seems to be
some confusion w.r.t. how the flex_bg and meta_bg features interact here.
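
For what it's worth, here is a sketch of how I currently read the
placement of a group's descriptor block in the meta_bg region (groups
>= 2^17 on this filesystem).  The sparse_super test and the +1 offset
when the holding group also carries a superblock backup are assumptions
from the layout doc, not verified against the e2fsprogs code:

#include <stdint.h>

/* Nonzero if group g carries a backup superblock under sparse_super
 * (group 0, 1, and powers of 3, 5 and 7) -- assumption from the layout doc. */
int has_super(uint64_t g)
{
	if (g == 0 || g == 1)
		return 1;
	for (uint64_t p = 3; p <= g; p *= 3) if (p == g) return 1;
	for (uint64_t p = 5; p <= g; p *= 5) if (p == g) return 1;
	for (uint64_t p = 7; p <= g; p *= 7) if (p == g) return 1;
	return 0;
}

/* Physical block holding the descriptor for group g in the meta_bg
 * region (g >= 2^17 here).  copy 0 = primary (first group of the
 * meta_bg), 1 = second group, 2 = last group of the meta_bg.
 * Constants taken from the tune2fs output above. */
uint64_t gd_block_for_group(uint64_t g, int copy)
{
	const uint64_t blocks_per_group = 32768;
	const uint64_t gds_per_block = 4096 / 64;    /* 64 GDs per 4KiB block */
	uint64_t base_group = (g / gds_per_block) * gds_per_block;
	uint64_t holder = base_group +
	                  (copy == 0 ? 0 : copy == 1 ? 1 : gds_per_block - 1);
	uint64_t blk = holder * blocks_per_group;
	if (has_super(holder))
		blk++;        /* the GD block sits after the superblock copy */
	return blk;
}

/* The descriptor for group g then sits at byte offset 64 * (g % 64)
 * within that block. */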

I do have an LVM snapshot of the affected LV currently, so happy to try
things.

Kind Regards,
Jaco

On 2020/01/26 12:21, Jaco Kroon wrote:

> Hi,
>
> I've got an 85TB ext4 filesystem which I'm unable to fsck.  The only
> cases of same error I could find was from what I can find due to an SD
> card "swallowing" writes (ie, the card goes into a read-only mode but
> doesn't report write failure).
>
> crowsnest ~ # e2fsck -f /dev/lvm/home
>
> e2fsck 1.45.4 (23-Sep-2019)
> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
> e2fsck: Group descriptors look bad... trying backup blocks...
> /dev/lvm/home: recovering journal
> e2fsck: unable to set superblock flags on /dev/lvm/home
>
>
> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>
> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>
> I have also (using dumpe2fs) obtained the location of the backup super
> blocks and tried same against a few other superblocks using -b.  -y (as
> per suggestion from at least one post) make absolutely no difference,
> our understanding is that this simply answers yes to all questions, so
> we didn't expect this to have impact but decided it was worth a try anyway.
>
> Looking at the code for the unable to set superblock error it looks like
> the code is in e2fsck/unix.c, specifically this:
>
> 1765     if (ext2fs_has_feature_journal_needs_recovery(sb)) {
> 1766         if (ctx->options & E2F_OPT_READONLY) {
> ...
> 1771         } else {
> 1772             if (ctx->flags & E2F_FLAG_RESTARTED) {
> 1773                 /*
> 1774                  * Whoops, we attempted to run the
> 1775                  * journal twice.  This should never
> 1776                  * happen, unless the hardware or
> 1777                  * device driver is being bogus.
> 1778                  */
> 1779                 com_err(ctx->program_name, 0,
> 1780                     _("unable to set superblock flags "
> 1781                       "on %s\n"), ctx->device_name);
> 1782                 fatal_error(ctx, 0);
> 1783             }
>
> That comment has me somewhat confused.  I'm assuming the implication
> there is that e2fsck tried to update the superblock, but after reading
> it back, it's either unchanged or still wrong (In line with the
> description of the SD card I found online).  None of our arrays are
> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
> (we downgraded back to 5.1.15, which we're on currently, after
> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
> didn't seem to fix those, and the 4.14.13 kernel that was used
> previously is known to cause ext4 corruption of the kind we saw on the
> other filesystems):
>
> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
> group 404160 overlaps superblock
> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>
> I created a dumpe2fs file as well:
>
> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
> dumpe2fs 1.45.4 (23-Sep-2019)
> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
> read '/dev/lvm/home' bitmaps
>
> Available at https://downloads.uls.co.za/85T/dump2fs_home.txt.xz (1.2GB,
> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>
> A strace of e2fsck -y -f /dev/lvm/home at
> https://downloads.uls.co.za/85T/fsck.strace.txt (13MB,
> md5:60aa91b0c47dd2837260218eb774152d)
>
> crowsnest ~ # tune2fs -l /dev/lvm/home
> tune2fs 1.45.4 (23-Sep-2019)
> Filesystem volume name:   <none>
> Last mounted on:          /home
> Filesystem UUID:          522a9faf-7992-4888-93d5-7fe49a9762d6
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr filetype meta_bg extent
> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
> metadata_csum
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              356515840
> Block count:              22817013760
> Reserved block count:     0
> Free blocks:              6874204745
> Free inodes:              202183498
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Group descriptor size:    64
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         512
> Inode blocks per group:   32
> RAID stride:              128
> RAID stripe width:        1024
> First meta block group:   2048
> Flex block group size:    16
> Filesystem created:       Thu Jul 26 12:19:07 2018
> Last mount time:          Sat Jan 18 18:58:50 2020
> Last write time:          Sun Jan 26 11:38:56 2020
> Mount count:              2
> Maximum mount count:      -1
> Last checked:             Wed Oct 30 17:37:27 2019
> Check interval:           0 (<none>)
> Lifetime writes:          976 TB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     32
> Desired extra isize:      32
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed:      876a7d14-bce8-4bef-9569-82e7d573b7aa
> Journal backup:           inode blocks
> Checksum type:            crc32c
> Checksum:                 0xfbd895e9
>
> Infrastructure:  3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
> 10TB disks (100TB usable total).  These are combined into a single VG
> using LVM, and then carved up into a number of LVs, the largest of which
> is this 85TB chunk.  We have tried in the past to carve this into
> smaller LVs but failed.  So we're aware that this is very large and not
> ideal.
>
> We did experience an assembly issue on one of  the underlying RAID6 PVs,
> those have been resolved, and the disk that was giving issues has been
> scrubbed and rebuilt.  rom what we can tell based on other file systems,
> this did not affect data integrity but we can't make that statement with
> 100% certainty, as such we are expecting some data loss here but it
> would be better if we can recover at least some of this data.
>
> Other filesystems which also resides on the same PV that was affected by
> the RAID6 problem either received a clean bill of health, or were
> successfully repaired by e2fsck (the system did crash however, it's
> unclear whether the RAID6 assembly problem was the cause or merely
> another consequence, and as a result, whether the corruption on the
> repaired filesystem was a consequence of the kernel or the RAID).
>
> I'm continuing onwards with e2fsck code to try and figure this out, am
> hopeful though that someone could perhaps provide some much needed
> insight and pointers for me.
>
> Kind Regards,
> Jaco
>

2020-01-26 21:46:28

by Andreas Dilger

Subject: Re: e2fsck fails with unable to set superblock

There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.

Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (e.g. if all the backups were also corrupted for some reason).

Cheers, Andreas

> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]> wrote:
>
> Hi,
>
> So working through the dumpe2fs file, the group mentioned by dmesg
> contains this:
>
> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
> Group descriptor at 13243514880
> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
> Inode table at 0-31 (bg #0 + 0)
> 0 free blocks, 0 free inodes, 0 directories
> Free blocks: 13243514880-13243547647
> Free inodes: 206929921-206930432
>
> Based on that it's quite simple to see that during the array
> reconstruction we apparently wiped a bunch of data blocks with all
> zeroes. This is obviously bad. During reconstruction we had to zero one
> of the disks before we could get the array to reassemble. What I'm
> wondering is whether this process was a good choice now, and whether the
> right disk was zeroed. Obviously this implies major data loss (at least
> 4TB, probably more assuming that directory structures may well have been
> destroyed as well, maybe less if some of those blocks weren't in use).
>
> I'm hoping that it's possible to recreate these group descriptors (there
> are a few of them) to at least point to the correct locations on disk,
> and to then attempt a cleanup with e2fsck. Again, data loss here is to
> be expected, but if we can limit it at least that would be great.
>
> There are unfortunately a large bunch of groups affected (128 cases of
> 64 consecutive group blocks).
>
> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
> 4KB/block => 1TB of data lost. However, this is extremely conservative
> seeing that this could include directory structures with cascading effect.
>
> Based on the patterns of the first 64 group descriptors (GDs) it looks
> like it should be possible to reconstruct the 8192 affected GDs, or
> alternatively possibly "uninit" them
> (https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization).
> I'm inclined to reason that it's probably safer to repair in the GDs the
> following fields:
>
> bg_block_bitmap_{lo,hi}
> bg_inode_bitmap_{lo,hi}
> bg_inode_table_{lo,hi}
>
> I'm not sure about:
>
> bg_flags (I'm guessing the safest is to leave this zeroed).
> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>
> The following should (as far as my understanding goes) then be "fixable"
> by e2fsck:
>
> bg_free_blocks_count_{lo,hi}
> bg_free_inodes_count_{lo,hi}
> bg_used_dirs_count_{lo,hi}
> bg_block_bitmap_csum_{lo,hi}
> bg_inode_bitmap_csum_{lo,hi}
> bg_itable_unused_{lo,hi}
> bg_checksum
>
> And of course, tracking down the GD on disk will be tricky it seems. It
> seems some blocks have the GD in the block, and a bunch of others don't
> (nor does dumpe2fs say where exactly they are). There is 2048 blocks of
> GDs (131072 or 2^17 GDs) with every superblock backup, however, rom
> group 2^17 onwards there are additional groups simply stating "Group
> descriptor at ${frist_block_of_group}", so it's unclear how to track
> down the GD for a given block group.
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
> does not describe this particularly well either, and there seems to be
> confusion w.r.t. flex_bg and meta_bg features and this.
>
> I do have an LVM snapshot of the affected LV currently, so happy to try
> things.
>
> Kind Regards,
> Jaco
>
>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>
>> Hi,
>>
>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>> cases of same error I could find was from what I can find due to an SD
>> card "swallowing" writes (ie, the card goes into a read-only mode but
>> doesn't report write failure).
>>
>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>
>> e2fsck 1.45.4 (23-Sep-2019)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> /dev/lvm/home: recovering journal
>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>
>>
>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>
>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>
>> I have also (using dumpe2fs) obtained the location of the backup super
>> blocks and tried same against a few other superblocks using -b. -y (as
>> per suggestion from at least one post) make absolutely no difference,
>> our understanding is that this simply answers yes to all questions, so
>> we didn't expect this to have impact but decided it was worth a try anyway.
>>
>> Looking at the code for the unable to set superblock error it looks like
>> the code is in e2fsck/unix.c, specifically this:
>>
>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>> ...
>> 1771 } else {
>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>> 1773 /*
>> 1774 * Whoops, we attempted to run the
>> 1775 * journal twice. This should never
>> 1776 * happen, unless the hardware or
>> 1777 * device driver is being bogus.
>> 1778 */
>> 1779 com_err(ctx->program_name, 0,
>> 1780 _("unable to set superblock flags "
>> 1781 "on %s\n"), ctx->device_name);
>> 1782 fatal_error(ctx, 0);
>> 1783 }
>>
>> That comment has me somewhat confused. I'm assuming the implication
>> there is that e2fsck tried to update the superblock, but after reading
>> it back, it's either unchanged or still wrong (In line with the
>> description of the SD card I found online). None of our arrays are
>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>> (we downgraded back to 5.1.15, which we're on currently, after
>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>> didn't seem to fix those, and the 4.14.13 kernel that was used
>> previously is known to cause ext4 corruption of the kind we saw on the
>> other filesystems):
>>
>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>> group 404160 overlaps superblock
>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>
>> I created a dumpe2fs file as well:
>>
>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>> dumpe2fs 1.45.4 (23-Sep-2019)
>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>> read '/dev/lvm/home' bitmaps
>>
>> Available at https://downloads.uls.co.za/85T/dump2fs_home.txt.xz (1.2GB,
>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>
>> A strace of e2fsck -y -f /dev/lvm/home at
>> https://downloads.uls.co.za/85T/fsck.strace.txt (13MB,
>> md5:60aa91b0c47dd2837260218eb774152d)
>>
>> crowsnest ~ # tune2fs -l /dev/lvm/home
>> tune2fs 1.45.4 (23-Sep-2019)
>> Filesystem volume name: <none>
>> Last mounted on: /home
>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>> metadata_csum
>> Filesystem flags: signed_directory_hash
>> Default mount options: user_xattr acl
>> Filesystem state: clean
>> Errors behavior: Continue
>> Filesystem OS type: Linux
>> Inode count: 356515840
>> Block count: 22817013760
>> Reserved block count: 0
>> Free blocks: 6874204745
>> Free inodes: 202183498
>> First block: 0
>> Block size: 4096
>> Fragment size: 4096
>> Group descriptor size: 64
>> Blocks per group: 32768
>> Fragments per group: 32768
>> Inodes per group: 512
>> Inode blocks per group: 32
>> RAID stride: 128
>> RAID stripe width: 1024
>> First meta block group: 2048
>> Flex block group size: 16
>> Filesystem created: Thu Jul 26 12:19:07 2018
>> Last mount time: Sat Jan 18 18:58:50 2020
>> Last write time: Sun Jan 26 11:38:56 2020
>> Mount count: 2
>> Maximum mount count: -1
>> Last checked: Wed Oct 30 17:37:27 2019
>> Check interval: 0 (<none>)
>> Lifetime writes: 976 TB
>> Reserved blocks uid: 0 (user root)
>> Reserved blocks gid: 0 (group root)
>> First inode: 11
>> Inode size: 256
>> Required extra isize: 32
>> Desired extra isize: 32
>> Journal inode: 8
>> Default directory hash: half_md4
>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>> Journal backup: inode blocks
>> Checksum type: crc32c
>> Checksum: 0xfbd895e9
>>
>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>> 10TB disks (100TB usable total). These are combined into a single VG
>> using LVM, and then carved up into a number of LVs, the largest of which
>> is this 85TB chunk. We have tried in the past to carve this into
>> smaller LVs but failed. So we're aware that this is very large and not
>> ideal.
>>
>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>> those have been resolved, and the disk that was giving issues has been
>> scrubbed and rebuilt. rom what we can tell based on other file systems,
>> this did not affect data integrity but we can't make that statement with
>> 100% certainty, as such we are expecting some data loss here but it
>> would be better if we can recover at least some of this data.
>>
>> Other filesystems which also resides on the same PV that was affected by
>> the RAID6 problem either received a clean bill of health, or were
>> successfully repaired by e2fsck (the system did crash however, it's
>> unclear whether the RAID6 assembly problem was the cause or merely
>> another consequence, and as a result, whether the corruption on the
>> repaired filesystem was a consequence of the kernel or the RAID).
>>
>> I'm continuing onwards with e2fsck code to try and figure this out, am
>> hopeful though that someone could perhaps provide some much needed
>> insight and pointers for me.
>>
>> Kind Regards,
>> Jaco
>>

2020-01-27 23:16:18

by Andreas Dilger

Subject: Re: e2fsck fails with unable to set superblock

On Jan 27, 2020, at 10:52 AM, Jaco Kroon <[email protected]> wrote:
>
> Hi Andreas,
>
> Ok, so there were a few calc errors below. I believe I've rectified those. I've written code to both locate and read the GD, as well as generate the fields I plan to update, comparing the fields I plan to update against those of all known-good blocks. The only issue I've picked up is that I've realised I don't understand bg_flags at all. Extract from output of my program (it uses the read GD as template and then just overwrites the fields I'm calculating).
> Group 65542, bg_flags differs: 0 (actual) != 4 (generated)

The EXT2_BG_{INODE,BLOCK}_UNINIT flags mean that the inode/block bitmaps are
all zero and do not need to be read from disk. That should only be used for
groups that have no blocks or inodes allocated to them.

The EXT2_BG_INODE_ZEROED flag means that the inode table has been zeroed out
on disk after the filesystem has been formatted. It is safest to set this
flag on the group, so that the kernel does not try to zero out the inode tables
again, especially if your bitmaps are not correct.
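
For reference, the flag values being discussed here (quoted from memory
rather than copied from the headers, so double-check them against
ext2_fs.h):

/* bg_flags bits as defined in the ext4 / e2fsprogs headers (verify): */
#define EXT2_BG_INODE_UNINIT	0x0001	/* inode table/bitmap not initialized */
#define EXT2_BG_BLOCK_UNINIT	0x0002	/* block bitmap not initialized */
#define EXT2_BG_INODE_ZEROED	0x0004	/* on-disk inode table zeroed out */

/* So "Flags: 4 (0x4)" in the generated descriptor above is simply
 * EXT2_BG_INODE_ZEROED, with neither UNINIT bit set. */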

> Group 65542 (actual):
> Block Bitmap: 2147483654 (0x80000006)
> Inode Bitmap: 2147483670 (0x80000016)
> Inode Table: 2147483872 (0x800000e0)
> Free Blocks: 0 (0x0)
> Free Inodes: 315 (0x13b)
> Dirs Count: 197 (0xc5)
> Flags: 0 (0x0)
> Exclusion Bmap: 0 (0x0)
> Block Bitmap Csum: 629120042 (0x257f9c2a)
> Inode Bitmap Csum: 330504133 (0x13b317c5)
> itable unused: 0 (0x0)
> checksum: 17392 (0x43f0)
> Group 65542 (calculated):
> Block Bitmap: 2147483654 (0x80000006)
> Inode Bitmap: 2147483670 (0x80000016)
> Inode Table: 2147483872 (0x800000e0)
> Free Blocks: 0 (0x0)
> Free Inodes: 315 (0x13b)
> Dirs Count: 197 (0xc5)
> Flags: 4 (0x4)
> Exclusion Bmap: 0 (0x0)
> Block Bitmap Csum: 629120042 (0x257f9c2a)
> Inode Bitmap Csum: 330504133 (0x13b317c5)
> itable unused: 0 (0x0)
> checksum: 17392 (0x43f0)
> Basically, I'm not sure how I should go about "setting" (or unsetting) the bg_flags values. Advise would be greatly appreciated. My best-effort guess after this is to set BOTH the UNINIT flags (to force recalc of the bitmaps),

e2fsck will already regenerate the bitmaps on each full run. Setting the
UNINIT flags will confuse the kernel and should be avoided.

> and to possibly inspect the data at inode_table looking for some kind of "magic" as to whether or not it's already initialized or not...

The group descriptors also have a checksum that needs to be valid.
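
For the metadata_csum case, this is roughly how that per-descriptor
checksum is computed, as I read the e2fsprogs code.  The crc32c() below
is a minimal bitwise stand-in rather than a library call, and the byte
ranges are from memory, so verify against lib/ext2fs/csum.c before
trusting the result:

#include <stddef.h>
#include <stdint.h>

/* Minimal bitwise CRC32c (Castagnoli, reflected), same update convention
 * as the kernel/e2fsprogs crc32c_le helpers: no pre/post inversion. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1));
	}
	return crc;
}

/* Sketch of the group descriptor checksum with metadata_csum:
 *   seed = crc32c(~0, fs UUID)
 *   crc  = crc32c(seed, le32 group number)
 *   crc  = crc32c(crc, descriptor up to bg_checksum)
 *   crc  = crc32c(crc, two zero bytes in place of bg_checksum)
 *   crc  = crc32c(crc, rest of the 64-byte descriptor)
 *   stored value = crc & 0xFFFF
 * Assumes a little-endian host for the group number. */
uint16_t gd_csum(const uint8_t uuid[16], uint32_t group, const uint8_t desc[64])
{
	const size_t csum_off = 0x1E;	/* offset of bg_checksum */
	uint16_t zero = 0;
	uint32_t crc;

	crc = crc32c(~0U, uuid, 16);
	crc = crc32c(crc, &group, sizeof(group));
	crc = crc32c(crc, desc, csum_off);
	crc = crc32c(crc, &zero, sizeof(zero));
	crc = crc32c(crc, desc + csum_off + 2, 64 - csum_off - 2);
	return (uint16_t)(crc & 0xFFFF);
}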

> My code is available at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c for anyone that cares. Currently some work pending.

It probably makes more sense to include this into e2fsck, so that it will be
usable for everyone. IMHO, there is a design flaw in the meta_bg disk format,
in that the last group(s) of the filesystem may have few/no backup copies, so
cannot easily be recovered in case of an error. There is also the secondary
issue that meta_bg causes the metadata to be spread out over the whole disk,
which causes a *LOT* of seeking on a very large filesystem, on the order of
millions of seeks, which takes a lot of time to read.

> On 2020/01/27 12:24, Jaco Kroon wrote:
>> Hi Andreas,
>>
>> Thank you. The filesystem uses meta_bg and is larger than 16TB, my issue also doesn't exhibit in the first 16TB covered by those blocks.
>> On 2020/01/26 23:45, Andreas Dilger wrote:
>>> There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.
>>>
>>> Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (eg. all the backups were also corrupted for some reason).
>>>
>> You are right. It's however tricky to wrap your head around it. I think I've got it, but if you don't mind double checking me please:
>>
>> Given a group number g. sb = superblock.
>> gds_per_block = 2^(10+sb.s_log_block_size) / sb.s_desc_size = 4096 / 64 = 64
>> gd_block = floor(g / gds_per_block)
>> if (gd_block < ${sb.s_reserved_gdt_blocks})
>> phys_gd_block = gd_block + 1;
>> else
>> phys_gd_block = floor(gd_block / gds_per_block) * gds_per_block * sb.s_blocks_per_group
>>
>> phys_gd_block_offset = sb.s_desc_size * (g % gds_per_block)
>>
>> Backup blocks, are either with every superblock backup (groups 0, and groups being a power of 3, 5 and 7, ie, 0, 1, 3, 5, 7, 9, 25, 27, ...) where gd_block < ${sb.s_reserved_gdt_blocks); or
>> phys_gd_backup1 = phys_gd_block + sb.s_blocks_per_group
>> phys_gd_backup2 = phys_gd_block + sb.s_blocks_per_group * (gds_per_block - 1)
>>
>> offset stays the same.
>>
>> To reconstruct it's critical to fill the following fields, with the required calculations (gd = group descriptor, calculations for groups < 2^17, using 131072 as example):
>>
>> bitmap_group = floor(g / flex_groups) * flex_groups
>> => floor(131072 / 16) * 16 => 131072
>> gd.bg_block_bitmap = bitmap_group * blocks_per_group + g % flex_groups
>> => 131072 * 3768 + 131072 % 16
>> => 493879296
>> if (bitmap_group % gds_per_block == 0) /* if bitmap_group also houses a meta group block */
>> gd.bg_block_bitmap++;
>> => if (131072 % 64 == 0)
>> => if (0 == 0)
>> => gd.bg_block_bitmap = 493879297
>>
>> gd.bg_inode_bitmap = gd.bg_block_bitmap + flex_groups
>> => gd.bg_inode_bitmap = 493879297 + 16 = 493879313
>> gd.bg_inode_table = gd.bg_inode_bitmap + flex_groups - (g % flex_groups) + (g % flex_groups * inode_blocks_per_group)
>> => 493879313 + 16 - 0 + 0 * 32 = 493879329
>> Bad example, for g=131074:
>> =>493879315 + 16 - 2 + 2 * 32 = 493879393
>>
>> gd.bg_flags = 0x4
>>
>> I suspect it's OK to just zero (or leave them zero) these:
>>
>> bg_free_blocks_count
>> bg_free_inodes_count
>> bg_used_dirs_count
>> bg_exclude_bitmap
>> bg_itable_unused
>>
>> As well as all checksum fields (hopefully e2fsck will correct those).
>> I did action this calculation for a few non-destructed GDs and my manual calculations seems OK for the groups I checked (all into meta_bg area):
>>
>> 131072 (multiple of 64, meta_bg + fleg_bg)
>> 131073 (multiple of 64 + 1 - first meta_bg backup)
>> 131074 (none of the others, ie, plain group with data blocks only)
>> 131088 (multiple of 16 but not 64, flex_bg)
>> 131089 (multiple of 16 but not 64, +1)
>> 131135 (multiple of 64 + 63 - second meta_bg backup)
>> It is however a very limited sample, but should cover all the corner cases I could think of based on the specification and my understanding thereof.
>>
>> Given the above I should be able to write a small program that will produce the 128 4KiB blocks that's required, and then I can use dd to place them into the correct locations.
>> As an aside, debugfs refuses to open the filesystem:
>>
>> crowsnest ~ # debugfs /dev/lvm/home
>> debugfs 1.45.4 (23-Sep-2019)
>> /dev/lvm/home: Block bitmap checksum does not match bitmap while reading allocation bitmaps
>> debugfs: chroot /
>> chroot: Filesystem not open
>> debugfs: quit
>> Which is probably fair. So for this one I'll have to go make modifications using either some programatic tool that opens the underlying block device, or use dd trickery (ie, construct a GD and dd it into the right location, a a sequence of 64 GDs as it's always 64 GDs that's destroyed in my case, exactly 1 4KiB block).
>>
>> Kind Regards,
>> Jaco
>>> Cheers, Andreas
>>>
>>>
>>>> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> So working through the dumpe2fs file, the group mentioned by dmesg
>>>> contains this:
>>>>
>>>> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>>>> Group descriptor at 13243514880
>>>> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>> Inode table at 0-31 (bg #0 + 0)
>>>> 0 free blocks, 0 free inodes, 0 directories
>>>> Free blocks: 13243514880-13243547647
>>>> Free inodes: 206929921-206930432
>>>>
>>>> Based on that it's quite simple to see that during the array
>>>> reconstruction we apparently wiped a bunch of data blocks with all
>>>> zeroes. This is obviously bad. During reconstruction we had to zero one
>>>> of the disks before we could get the array to reassemble. What I'm
>>>> wondering is whether this process was a good choice now, and whether the
>>>> right disk was zeroed. Obviously this implies major data loss (at least
>>>> 4TB, probably more assuming that directory structures may well have been
>>>> destroyed as well, maybe less if some of those blocks weren't in use).
>>>>
>>>> I'm hoping that it's possible to recreate these group descriptors (there
>>>> are a few of them) to at least point to the correct locations on disk,
>>>> and to then attempt a cleanup with e2fsck. Again, data loss here is to
>>>> be expected, but if we can limit it at least that would be great.
>>>>
>>>> There are unfortunately a large bunch of groups affected (128 cases of
>>>> 64 consecutive group blocks).
>>>>
>>>> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
>>>> 4KB/block => 1TB of data lost. However, this is extremely conservative
>>>> seeing that this could include directory structures with cascading effect.
>>>>
>>>> Based on the patterns of the first 64 group descriptors (GDs) it looks
>>>> like it should be possible to reconstruct the 8192 affected GDs, or
>>>> alternatively possibly "uninit" them
>>>> (
>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization
>>>> ).
>>>> I'm inclined to reason that it's probably safer to repair in the GDs the
>>>> following fields:
>>>>
>>>> bg_block_bitmap_{lo,hi}
>>>> bg_inode_bitmap_{lo,hi}
>>>> bg_inode_table_{lo,hi}
>>>>
>>>> I'm not sure about:
>>>>
>>>> bg_flags (I'm guessing the safest is to leave this zeroed).
>>>> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>>>>
>>>> The following should (as far as my understanding goes) then be "fixable"
>>>> by e2fsck:
>>>>
>>>> bg_free_blocks_count_{lo,hi}
>>>> bg_free_inodes_count_{lo,hi}
>>>> bg_used_dirs_count_{lo,hi}
>>>> bg_block_bitmap_csum_{lo,hi}
>>>> bg_inode_bitmap_csum_{lo,hi}
>>>> bg_itable_unused_{lo,hi}
>>>> bg_checksum
>>>>
>>>> And of course, tracking down the GD on disk will be tricky it seems. It
>>>> seems some blocks have the GD in the block, and a bunch of others don't
>>>> (nor does dumpe2fs say where exactly they are). There is 2048 blocks of
>>>> GDs (131072 or 2^17 GDs) with every superblock backup, however, rom
>>>> group 2^17 onwards there are additional groups simply stating "Group
>>>> descriptor at ${frist_block_of_group}", so it's unclear how to track
>>>> down the GD for a given block group.
>>>>
>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
>>>>
>>>> does not describe this particularly well either, and there seems to be
>>>> confusion w.r.t. flex_bg and meta_bg features and this.
>>>>
>>>> I do have an LVM snapshot of the affected LV currently, so happy to try
>>>> things.
>>>>
>>>> Kind Regards,
>>>> Jaco
>>>>
>>>>
>>>>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>>>>> cases of same error I could find was from what I can find due to an SD
>>>>> card "swallowing" writes (ie, the card goes into a read-only mode but
>>>>> doesn't report write failure).
>>>>>
>>>>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>>>>
>>>>> e2fsck 1.45.4 (23-Sep-2019)
>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>> /dev/lvm/home: recovering journal
>>>>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>>>>
>>>>>
>>>>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>>>>
>>>>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>>>>
>>>>> I have also (using dumpe2fs) obtained the location of the backup super
>>>>> blocks and tried same against a few other superblocks using -b. -y (as
>>>>> per suggestion from at least one post) make absolutely no difference,
>>>>> our understanding is that this simply answers yes to all questions, so
>>>>> we didn't expect this to have impact but decided it was worth a try anyway.
>>>>>
>>>>> Looking at the code for the unable to set superblock error it looks like
>>>>> the code is in e2fsck/unix.c, specifically this:
>>>>>
>>>>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>>>>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>>>>> ...
>>>>> 1771 } else {
>>>>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>>>>> 1773 /*
>>>>> 1774 * Whoops, we attempted to run the
>>>>> 1775 * journal twice. This should never
>>>>> 1776 * happen, unless the hardware or
>>>>> 1777 * device driver is being bogus.
>>>>> 1778 */
>>>>> 1779 com_err(ctx->program_name, 0,
>>>>> 1780 _("unable to set superblock flags "
>>>>> 1781 "on %s\n"), ctx->device_name);
>>>>> 1782 fatal_error(ctx, 0);
>>>>> 1783 }
>>>>>
>>>>> That comment has me somewhat confused. I'm assuming the implication
>>>>> there is that e2fsck tried to update the superblock, but after reading
>>>>> it back, it's either unchanged or still wrong (In line with the
>>>>> description of the SD card I found online). None of our arrays are
>>>>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>>>>> (we downgraded back to 5.1.15, which we're on currently, after
>>>>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>>>>> didn't seem to fix those, and the 4.14.13 kernel that was used
>>>>> previously is known to cause ext4 corruption of the kind we saw on the
>>>>> other filesystems):
>>>>>
>>>>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>>>>> group 404160 overlaps superblock
>>>>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>>>>
>>>>> I created a dumpe2fs file as well:
>>>>>
>>>>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>>>>> dumpe2fs 1.45.4 (23-Sep-2019)
>>>>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>>>>> read '/dev/lvm/home' bitmaps
>>>>>
>>>>> Available at
>>>>> https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>>>>> (1.2GB,
>>>>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>>>>
>>>>> A strace of e2fsck -y -f /dev/lvm/home at
>>>>>
>>>>> https://downloads.uls.co.za/85T/fsck.strace.txt
>>>>> (13MB,
>>>>> md5:60aa91b0c47dd2837260218eb774152d)
>>>>>
>>>>> crowsnest ~ # tune2fs -l /dev/lvm/home
>>>>> tune2fs 1.45.4 (23-Sep-2019)
>>>>> Filesystem volume name: <none>
>>>>> Last mounted on: /home
>>>>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>>>>> Filesystem magic number: 0xEF53
>>>>> Filesystem revision #: 1 (dynamic)
>>>>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>>>>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>>>>> metadata_csum
>>>>> Filesystem flags: signed_directory_hash
>>>>> Default mount options: user_xattr acl
>>>>> Filesystem state: clean
>>>>> Errors behavior: Continue
>>>>> Filesystem OS type: Linux
>>>>> Inode count: 356515840
>>>>> Block count: 22817013760
>>>>> Reserved block count: 0
>>>>> Free blocks: 6874204745
>>>>> Free inodes: 202183498
>>>>> First block: 0
>>>>> Block size: 4096
>>>>> Fragment size: 4096
>>>>> Group descriptor size: 64
>>>>> Blocks per group: 32768
>>>>> Fragments per group: 32768
>>>>> Inodes per group: 512
>>>>> Inode blocks per group: 32
>>>>> RAID stride: 128
>>>>> RAID stripe width: 1024
>>>>> First meta block group: 2048
>>>>> Flex block group size: 16
>>>>> Filesystem created: Thu Jul 26 12:19:07 2018
>>>>> Last mount time: Sat Jan 18 18:58:50 2020
>>>>> Last write time: Sun Jan 26 11:38:56 2020
>>>>> Mount count: 2
>>>>> Maximum mount count: -1
>>>>> Last checked: Wed Oct 30 17:37:27 2019
>>>>> Check interval: 0 (<none>)
>>>>> Lifetime writes: 976 TB
>>>>> Reserved blocks uid: 0 (user root)
>>>>> Reserved blocks gid: 0 (group root)
>>>>> First inode: 11
>>>>> Inode size: 256
>>>>> Required extra isize: 32
>>>>> Desired extra isize: 32
>>>>> Journal inode: 8
>>>>> Default directory hash: half_md4
>>>>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>>>>> Journal backup: inode blocks
>>>>> Checksum type: crc32c
>>>>> Checksum: 0xfbd895e9
>>>>>
>>>>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>>>>> 10TB disks (100TB usable total). These are combined into a single VG
>>>>> using LVM, and then carved up into a number of LVs, the largest of which
>>>>> is this 85TB chunk. We have tried in the past to carve this into
>>>>> smaller LVs but failed. So we're aware that this is very large and not
>>>>> ideal.
>>>>>
>>>>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>>>>> those have been resolved, and the disk that was giving issues has been
>>>>> scrubbed and rebuilt. rom what we can tell based on other file systems,
>>>>> this did not affect data integrity but we can't make that statement with
>>>>> 100% certainty, as such we are expecting some data loss here but it
>>>>> would be better if we can recover at least some of this data.
>>>>>
>>>>> Other filesystems which also resides on the same PV that was affected by
>>>>> the RAID6 problem either received a clean bill of health, or were
>>>>> successfully repaired by e2fsck (the system did crash however, it's
>>>>> unclear whether the RAID6 assembly problem was the cause or merely
>>>>> another consequence, and as a result, whether the corruption on the
>>>>> repaired filesystem was a consequence of the kernel or the RAID).
>>>>>
>>>>> I'm continuing onwards with e2fsck code to try and figure this out, am
>>>>> hopeful though that someone could perhaps provide some much needed
>>>>> insight and pointers for me.
>>>>>
>>>>> Kind Regards,
>>>>> Jaco
>>>>>
>>>>>


Cheers, Andreas







2020-01-28 12:53:59

by Jaco Kroon

Subject: Re: Re: e2fsck fails with unable to set superblock

Hi Andreas,

On 2020/01/28 01:15, Andreas Dilger wrote:
> On Jan 27, 2020, at 10:52 AM, Jaco Kroon <[email protected]> wrote:
>> Hi Andreas,
>>
>> Ok, so there were a few calc errors below. I believe I've rectified those. I've written code to both locate and read the GD, as well as generate the fields I plan to update, comparing the fields I plan to update against those of all known-good blocks. The only issue I've picked up is that I've realised I don't understand bg_flags at all. Extract from output of my program (it uses the read GD as template and then just overwrites the fields I'm calculating).
>> Group 65542, bg_flags differs: 0 (actual) != 4 (generated)
> The EXT2_BG_{INODE,BLOCK}_UNINIT flags mean that the inode/block bitmaps are
> all zero and do not need to be read from disk. That should only be used for
> groups that have no blocks or inodes allocated to them.
>
> The EXT2_BG_INODE_ZEROED flag means that the inode table has been zeroed out
> on disk after the filesystem has been formatted. It is safest to set this
> flag on the group, so that the kernel does not try to zero out the inode tables
> again, especially if your bitmaps are not correct.
That makes sense.  Good explanation, thank you.

>> and to possibly inspect the data at inode_table looking for some kind of "magic" as to whether or not it's already initialized or not...
> The group descriptors also have a checksum that needs to be valid.
I'm hoping fsck will fix those after I've applied my changes?
>
>> My code is available at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c for anyone that cares. Currently some work pending.
> It probably makes more sense to include this into e2fsck, so that it will be
> usable for everyone.
I agree.  For now I needed something quick and nasty.  I have no
objection to doing a patch against e2fsck at some point, or someone else
can use my throw-away code and refine it.  For now I'm taking the stdout
from that and using dd to clobber the meta_bg GDT blocks, as well as to
replace the backups.  One reason for not doing this in e2fsck from the
start was simply that I had no idea where exactly it would need to go.
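
Roughly what that dd step amounts to, sketched as a tiny pwrite() loop.
The device path and block numbers below are placeholders (the real
numbers come from my program and the dumpe2fs output), and this only
ever gets run against the snapshot first:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Write one rebuilt 4KiB GDT block (read from stdin, e.g. piped from the
 * reconstruction program) to its primary meta_bg location and the two
 * backups.  Block numbers and the device path are placeholders. */
int main(void)
{
	unsigned char gdt[4096];
	uint64_t where[3] = { 4294967296ULL,                   /* primary (placeholder) */
	                      4294967296ULL + 32768,           /* + 1 group             */
	                      4294967296ULL + 63ULL * 32768 }; /* + 63 groups           */

	if (fread(gdt, 1, sizeof(gdt), stdin) != sizeof(gdt)) {
		fprintf(stderr, "short read on stdin\n");
		return 1;
	}
	int fd = open("/dev/lvm/home-snapshot", O_WRONLY);    /* placeholder path */
	if (fd < 0) { perror("open"); return 1; }
	for (int i = 0; i < 3; i++) {
		off_t off = (off_t)where[i] * 4096;
		if (pwrite(fd, gdt, sizeof(gdt), off) != (ssize_t)sizeof(gdt)) {
			perror("pwrite");
			close(fd);
			return 1;
		}
	}
	return close(fd) ? 1 : 0;
}
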
> IMHO, there is a design flaw in the meta_bg disk format,
> in that the last group(s) of the filesystem may have few/no backup copies, so
> cannot easily be recovered in case of an error.

1 copy if there is only one group, 2 if there are fewer than a full
meta_bg's worth of groups ... 3 otherwise.  Frankly I'm OK(ish) with
three backups, one of which is typically ~128MB away from the first two.
Perhaps not quite good enough; I get where you're coming from.  This is
the first time that I've personally experienced GD corruption failures,
or at least, the first time that I *know* what I'm looking at.  Having
said that, in >15 years this is only the third ext* filesystem that I've
managed to corrupt to the point where fsck would not recover it.  The
previous one was a few years back, and that was due to a major hardware
issue.  The time before that was sheer incompetence and stupidity.

> There is also the secondary
> issue that meta_bg causes the metadata to be spread out over the whole disk,
> which causes a *LOT* of seeking on a very large filesystem, on the order of
> millions of seeks, which takes a lot of time to read.

I can't comment on that.  I don't know how often these things are
actually read from disk, whether it's once-off at mount time or whether
there are further reads of these blocks after that.  If it's once-off I
don't particularly care personally; if an 85TB filesystem takes a minute
to mount ... so be it.  If it's continuous ... yes, that could become
problematic, I reckon.

Thank you very much for the information; even though you haven't told me
exactly how to fix my issue, you've at the very least confirmed some
things for me.  I'm aware this is goodwill, and that's more than I could
ask for in all reality.

Kind Regards,
Jaco

>
>> On 2020/01/27 12:24, Jaco Kroon wrote:
>>> Hi Andreas,
>>>
>>> Thank you. The filesystem uses meta_bg and is larger than 16TB, my issue also doesn't exhibit in the first 16TB covered by those blocks.
>>> On 2020/01/26 23:45, Andreas Dilger wrote:
>>>> There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.
>>>>
>>>> Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (eg. all the backups were also corrupted for some reason).
>>>>
>>> You are right. It's however tricky to wrap your head around it. I think I've got it, but if you don't mind double checking me please:
>>>
>>> Given a group number g. sb = superblock.
>>> gds_per_block = 2^(10+sb.s_log_block_size) / sb.s_desc_size = 4096 / 64 = 64
>>> gd_block = floor(g / gds_per_block)
>>> if (gd_block < ${sb.s_reserved_gdt_blocks})
>>> phys_gd_block = gd_block + 1;
>>> else
>>> phys_gd_block = floor(gd_block / gds_per_block) * gds_per_block * sb.s_blocks_per_group
>>>
>>> phys_gd_block_offset = sb.s_desc_size * (g % gds_per_block)
>>>
>>> Backup blocks, are either with every superblock backup (groups 0, and groups being a power of 3, 5 and 7, ie, 0, 1, 3, 5, 7, 9, 25, 27, ...) where gd_block < ${sb.s_reserved_gdt_blocks); or
>>> phys_gd_backup1 = phys_gd_block + sb.s_blocks_per_group
>>> phys_gd_backup2 = phys_gd_block + sb.s_blocks_per_group * (gds_per_block - 1)
>>>
>>> offset stays the same.
>>>
>>> To reconstruct it's critical to fill the following fields, with the required calculations (gd = group descriptor, calculations for groups < 2^17, using 131072 as example):
>>>
>>> bitmap_group = floor(g / flex_groups) * flex_groups
>>> => floor(131072 / 16) * 16 => 131072
>>> gd.bg_block_bitmap = bitmap_group * blocks_per_group + g % flex_groups
>>> => 131072 * 3768 + 131072 % 16
>>> => 493879296
>>> if (bitmap_group % gds_per_block == 0) /* if bitmap_group also houses a meta group block */
>>> gd.bg_block_bitmap++;
>>> => if (131072 % 64 == 0)
>>> => if (0 == 0)
>>> => gd.bg_block_bitmap = 493879297
>>>
>>> gd.bg_inode_bitmap = gd.bg_block_bitmap + flex_groups
>>> => gd.bg_inode_bitmap = 493879297 + 16 = 493879313
>>> gd.bg_inode_table = gd.bg_inode_bitmap + flex_groups - (g % flex_groups) + (g % flex_groups * inode_blocks_per_group)
>>> => 493879313 + 16 - 0 + 0 * 32 = 493879329
>>> Bad example, for g=131074:
>>> =>493879315 + 16 - 2 + 2 * 32 = 493879393
>>>
>>> gd.bg_flags = 0x4
>>>
>>> I suspect it's OK to just zero (or leave them zero) these:
>>>
>>> bg_free_blocks_count
>>> bg_free_inodes_count
>>> bg_used_dirs_count
>>> bg_exclude_bitmap
>>> bg_itable_unused
>>>
>>> As well as all checksum fields (hopefully e2fsck will correct those).
>>> I did action this calculation for a few non-destructed GDs and my manual calculations seems OK for the groups I checked (all into meta_bg area):
>>>
>>> 131072 (multiple of 64, meta_bg + fleg_bg)
>>> 131073 (multiple of 64 + 1 - first meta_bg backup)
>>> 131074 (none of the others, ie, plain group with data blocks only)
>>> 131088 (multiple of 16 but not 64, flex_bg)
>>> 131089 (multiple of 16 but not 64, +1)
>>> 131135 (multiple of 64 + 63 - second meta_bg backup)
>>> It is however a very limited sample, but should cover all the corner cases I could think of based on the specification and my understanding thereof.
>>>
>>> Given the above I should be able to write a small program that will produce the 128 4KiB blocks that's required, and then I can use dd to place them into the correct locations.
>>> As an aside, debugfs refuses to open the filesystem:
>>>
>>> crowsnest ~ # debugfs /dev/lvm/home
>>> debugfs 1.45.4 (23-Sep-2019)
>>> /dev/lvm/home: Block bitmap checksum does not match bitmap while reading allocation bitmaps
>>> debugfs: chroot /
>>> chroot: Filesystem not open
>>> debugfs: quit
>>> Which is probably fair. So for this one I'll have to go make modifications using either some programatic tool that opens the underlying block device, or use dd trickery (ie, construct a GD and dd it into the right location, a a sequence of 64 GDs as it's always 64 GDs that's destroyed in my case, exactly 1 4KiB block).
>>>
>>> Kind Regards,
>>> Jaco
>>>> Cheers, Andreas
>>>>
>>>>
>>>>> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> So working through the dumpe2fs file, the group mentioned by dmesg
>>>>> contains this:
>>>>>
>>>>> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>>>>> Group descriptor at 13243514880
>>>>> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>> Inode table at 0-31 (bg #0 + 0)
>>>>> 0 free blocks, 0 free inodes, 0 directories
>>>>> Free blocks: 13243514880-13243547647
>>>>> Free inodes: 206929921-206930432
>>>>>
>>>>> Based on that it's quite simple to see that during the array
>>>>> reconstruction we apparently wiped a bunch of data blocks with all
>>>>> zeroes. This is obviously bad. During reconstruction we had to zero one
>>>>> of the disks before we could get the array to reassemble. What I'm
>>>>> wondering is whether this process was a good choice now, and whether the
>>>>> right disk was zeroed. Obviously this implies major data loss (at least
>>>>> 4TB, probably more assuming that directory structures may well have been
>>>>> destroyed as well, maybe less if some of those blocks weren't in use).
>>>>>
>>>>> I'm hoping that it's possible to recreate these group descriptors (there
>>>>> are a few of them) to at least point to the correct locations on disk,
>>>>> and to then attempt a cleanup with e2fsck. Again, data loss here is to
>>>>> be expected, but if we can limit it at least that would be great.
>>>>>
>>>>> There are unfortunately a large bunch of groups affected (128 cases of
>>>>> 64 consecutive group blocks).
>>>>>
>>>>> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
>>>>> 4KB/block => 1TB of data lost. However, this is extremely conservative
>>>>> seeing that this could include directory structures with cascading effect.
>>>>>
>>>>> Based on the patterns of the first 64 group descriptors (GDs) it looks
>>>>> like it should be possible to reconstruct the 8192 affected GDs, or
>>>>> alternatively possibly "uninit" them
>>>>> (
>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization
>>>>> ).
>>>>> I'm inclined to reason that it's probably safer to repair in the GDs the
>>>>> following fields:
>>>>>
>>>>> bg_block_bitmap_{lo,hi}
>>>>> bg_inode_bitmap_{lo,hi}
>>>>> bg_inode_table_{lo,hi}
>>>>>
>>>>> I'm not sure about:
>>>>>
>>>>> bg_flags (I'm guessing the safest is to leave this zeroed).
>>>>> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>>>>>
>>>>> The following should (as far as my understanding goes) then be "fixable"
>>>>> by e2fsck:
>>>>>
>>>>> bg_free_blocks_count_{lo,hi}
>>>>> bg_free_inodes_count_{lo,hi}
>>>>> bg_used_dirs_count_{lo,hi}
>>>>> bg_block_bitmap_csum_{lo,hi}
>>>>> bg_inode_bitmap_csum_{lo,hi}
>>>>> bg_itable_unused_{lo,hi}
>>>>> bg_checksum
>>>>>
>>>>> And of course, tracking down the GD on disk will be tricky it seems. It
>>>>> seems some blocks have the GD in the block, and a bunch of others don't
>>>>> (nor does dumpe2fs say where exactly they are). There is 2048 blocks of
>>>>> GDs (131072 or 2^17 GDs) with every superblock backup, however, rom
>>>>> group 2^17 onwards there are additional groups simply stating "Group
>>>>> descriptor at ${frist_block_of_group}", so it's unclear how to track
>>>>> down the GD for a given block group.
>>>>>
>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
>>>>>
>>>>> does not describe this particularly well either, and there seems to be
>>>>> confusion w.r.t. flex_bg and meta_bg features and this.
>>>>>
>>>>> I do have an LVM snapshot of the affected LV currently, so happy to try
>>>>> things.
>>>>>
>>>>> Kind Regards,
>>>>> Jaco
>>>>>
>>>>>
>>>>>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>>>>>> cases of same error I could find was from what I can find due to an SD
>>>>>> card "swallowing" writes (ie, the card goes into a read-only mode but
>>>>>> doesn't report write failure).
>>>>>>
>>>>>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>>>>>
>>>>>> e2fsck 1.45.4 (23-Sep-2019)
>>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>>> /dev/lvm/home: recovering journal
>>>>>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>>>>>
>>>>>>
>>>>>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>>>>>
>>>>>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>>>>>
>>>>>> I have also (using dumpe2fs) obtained the location of the backup super
>>>>>> blocks and tried same against a few other superblocks using -b. -y (as
>>>>>> per suggestion from at least one post) make absolutely no difference,
>>>>>> our understanding is that this simply answers yes to all questions, so
>>>>>> we didn't expect this to have impact but decided it was worth a try anyway.
>>>>>>
>>>>>> Looking at the code for the unable to set superblock error it looks like
>>>>>> the code is in e2fsck/unix.c, specifically this:
>>>>>>
>>>>>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>>>>>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>>>>>> ...
>>>>>> 1771 } else {
>>>>>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>>>>>> 1773 /*
>>>>>> 1774 * Whoops, we attempted to run the
>>>>>> 1775 * journal twice. This should never
>>>>>> 1776 * happen, unless the hardware or
>>>>>> 1777 * device driver is being bogus.
>>>>>> 1778 */
>>>>>> 1779 com_err(ctx->program_name, 0,
>>>>>> 1780 _("unable to set superblock flags "
>>>>>> 1781 "on %s\n"), ctx->device_name);
>>>>>> 1782 fatal_error(ctx, 0);
>>>>>> 1783 }
>>>>>>
>>>>>> That comment has me somewhat confused. I'm assuming the implication
>>>>>> there is that e2fsck tried to update the superblock, but after reading
>>>>>> it back, it's either unchanged or still wrong (In line with the
>>>>>> description of the SD card I found online). None of our arrays are
>>>>>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>>>>>> (we downgraded back to 5.1.15, which we're on currently, after
>>>>>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>>>>>> didn't seem to fix those, and the 4.14.13 kernel that was used
>>>>>> previously is known to cause ext4 corruption of the kind we saw on the
>>>>>> other filesystems):
>>>>>>
>>>>>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>>>>>> group 404160 overlaps superblock
>>>>>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>>>>>
>>>>>> I created a dumpe2fs file as well:
>>>>>>
>>>>>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>>>>>> dumpe2fs 1.45.4 (23-Sep-2019)
>>>>>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>>>>>> read '/dev/lvm/home' bitmaps
>>>>>>
>>>>>> Available at
>>>>>> https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>>>>>> (1.2GB,
>>>>>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>>>>>
>>>>>> A strace of e2fsck -y -f /dev/lvm/home at
>>>>>>
>>>>>> https://downloads.uls.co.za/85T/fsck.strace.txt
>>>>>> (13MB,
>>>>>> md5:60aa91b0c47dd2837260218eb774152d)
>>>>>>
>>>>>> crowsnest ~ # tune2fs -l /dev/lvm/home
>>>>>> tune2fs 1.45.4 (23-Sep-2019)
>>>>>> Filesystem volume name: <none>
>>>>>> Last mounted on: /home
>>>>>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>>>>>> Filesystem magic number: 0xEF53
>>>>>> Filesystem revision #: 1 (dynamic)
>>>>>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>>>>>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>>>>>> metadata_csum
>>>>>> Filesystem flags: signed_directory_hash
>>>>>> Default mount options: user_xattr acl
>>>>>> Filesystem state: clean
>>>>>> Errors behavior: Continue
>>>>>> Filesystem OS type: Linux
>>>>>> Inode count: 356515840
>>>>>> Block count: 22817013760
>>>>>> Reserved block count: 0
>>>>>> Free blocks: 6874204745
>>>>>> Free inodes: 202183498
>>>>>> First block: 0
>>>>>> Block size: 4096
>>>>>> Fragment size: 4096
>>>>>> Group descriptor size: 64
>>>>>> Blocks per group: 32768
>>>>>> Fragments per group: 32768
>>>>>> Inodes per group: 512
>>>>>> Inode blocks per group: 32
>>>>>> RAID stride: 128
>>>>>> RAID stripe width: 1024
>>>>>> First meta block group: 2048
>>>>>> Flex block group size: 16
>>>>>> Filesystem created: Thu Jul 26 12:19:07 2018
>>>>>> Last mount time: Sat Jan 18 18:58:50 2020
>>>>>> Last write time: Sun Jan 26 11:38:56 2020
>>>>>> Mount count: 2
>>>>>> Maximum mount count: -1
>>>>>> Last checked: Wed Oct 30 17:37:27 2019
>>>>>> Check interval: 0 (<none>)
>>>>>> Lifetime writes: 976 TB
>>>>>> Reserved blocks uid: 0 (user root)
>>>>>> Reserved blocks gid: 0 (group root)
>>>>>> First inode: 11
>>>>>> Inode size: 256
>>>>>> Required extra isize: 32
>>>>>> Desired extra isize: 32
>>>>>> Journal inode: 8
>>>>>> Default directory hash: half_md4
>>>>>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>>>>>> Journal backup: inode blocks
>>>>>> Checksum type: crc32c
>>>>>> Checksum: 0xfbd895e9
>>>>>>
>>>>>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>>>>>> 10TB disks (100TB usable total). These are combined into a single VG
>>>>>> using LVM, and then carved up into a number of LVs, the largest of which
>>>>>> is this 85TB chunk. We have tried in the past to carve this into
>>>>>> smaller LVs but failed. So we're aware that this is very large and not
>>>>>> ideal.
>>>>>>
>>>>>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>>>>>> those have been resolved, and the disk that was giving issues has been
>>>>>> scrubbed and rebuilt. From what we can tell based on other file systems,
>>>>>> this did not affect data integrity but we can't make that statement with
>>>>>> 100% certainty, as such we are expecting some data loss here but it
>>>>>> would be better if we can recover at least some of this data.
>>>>>>
>>>>>> Other filesystems which also resides on the same PV that was affected by
>>>>>> the RAID6 problem either received a clean bill of health, or were
>>>>>> successfully repaired by e2fsck (the system did crash however, it's
>>>>>> unclear whether the RAID6 assembly problem was the cause or merely
>>>>>> another consequence, and as a result, whether the corruption on the
>>>>>> repaired filesystem was a consequence of the kernel or the RAID).
>>>>>>
>>>>>> I'm continuing onwards with e2fsck code to try and figure this out, am
>>>>>> hopeful though that someone could perhaps provide some much needed
>>>>>> insight and pointers for me.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Jaco
>>>>>>
>>>>>>
>
> Cheers, Andreas
>
>
>
>
>

2020-01-28 21:13:21

by Jaco Kroon

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

Hi,

Just to provide feedback, and perhaps some usage tips for anyone reading
the thread (and before I forget how all of this stuck together in the end).

I used the code at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c to
recover.  Firstly I generated a list of GDs that needed to be repaired:

./ext4_rebuild_gds /dev/lvm/snap_home 2>&1 | tee /var/tmp/rebuild_checks.txt

snap_home is marked read-only in LVM (my safety net; as close to a
block-level backup as I can get).
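
For reference, a rough sketch of how such a read-only snapshot can be
set up (the snapshot size here is illustrative, not what was actually
used):

lvcreate --snapshot --size 500G --name snap_home /dev/lvm/home
lvchange --permission r /dev/lvm/snap_home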

cd /var/tmp
grep differs: rebuild_checks.txt | grep -v bg_flags \
    | sed -re 's/^Group ([0-9]+), .*/\1/p' | uniq \
    | awk 'BEGIN{ start=-1; prev=-10 }
           { if (prev < $1-1) { if (start >= 0) { print start " " prev }; start=$1 }; prev=$1 }
           END{ print start " " prev }' > blocks.txt

exec 3<blocks.txt
while read A B <&3; do
    oblock=$(~/ext4_rebuild_gds /dev/lvm/snap_home $A $B 2>&1 > .$A.tmp \
        | sed -nre 's/^Physical GD block for output start: //p') \
    && mv .$A.tmp $A-$oblock.block
done

for i in *.block; do
    sblock=$(basename ${i#*-} .block)
    for o in 0 32768; do
        dd if=$i of=/dev/lvm/home bs=4096 seek=$(( sblock + o ))
    done
done

I also validated that for all pairs in blocks.txt B-A == 64 and A%64
== 0 (to make sure it's a full GDT block that I need to update).  If
this is not the case, you should reduce A such that it becomes a
multiple of 64, then adjust B such that B-A == 64, adding additional
pairs so that the entire original A to B range is covered (a rough
sketch of that normalisation follows below).  This may set a ZEROED
bg_flags value on some GDs as a side effect, but hopefully this won't
cause problems.  I also verified that all A >= first_meta_bg;
otherwise you need to find the appropriate block directly following
the superblock.
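
A rough, untested sketch of that normalisation, assuming the same
"A B" pair convention as blocks.txt above (the output file name is
arbitrary):

awk '{ a = $1 - ($1 % 64);
       do { print a, a + 64; a += 64 } while (a < $2) }' \
    blocks.txt > blocks.fixed.txt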

I only replaced the first two copies of the GDT blocks in each case;
the third could be replaced as well by adding a 32768*63 entry for o in
the inner for loop (see the sketch below).
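
For reference, the loop above with the third copy included would look
something like this (untested sketch):

for i in *.block; do
    sblock=$(basename ${i#*-} .block)
    for o in 0 32768 $(( 32768 * 63 )); do
        dd if=$i of=/dev/lvm/home bs=4096 seek=$(( sblock + o ))
    done
done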

fsck is now running and getting past the previous failure point.  I am
hopeful that we'll at least recover some data.

I do have a few questions regarding fsck though:

1.  Will it notice that the backup GDT blocks are not the same as the
primaries and replace the backup blocks?

2.  I'm assuming it will rebuild the reachability for all inodes
starting at / and thus be able to determine which inodes should be
free/in use, and rebuild bitmaps accordingly (as well as mark the inodes
themselves free)?

3.  Will it sort out all the buggered-up checksums in the GDs I've changed?

Anyway, perhaps someone can answer offhand; otherwise time will tell.
To date:

# e2fsck -y -f /dev/lvm/home
e2fsck 1.45.4 (23-Sep-2019)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 13844923 was part of the orphaned inode list.  FIXED.
Inode 13844932 has an invalid extent node (blk 13666881850, lblk 27684864)
Clear? yes

Kind Regards,
Jaco

On 2020/01/28 14:53, Jaco Kroon wrote:

> Hi Andreas,
>
> On 2020/01/28 01:15, Andreas Dilger wrote:
>> On Jan 27, 2020, at 10:52 AM, Jaco Kroon <[email protected]> wrote:
>>> Hi Andreas,
>>>
>>> Ok, so there were a few calc errors below. I believe I've rectified those. I've written code to both locate and read the GD, as well as generate the fields I plan to update, comparing the fields I plan to update against those of all known-good blocks. The only issue I've picked up is that I've realised I don't understand bg_flags at all. Extract from output of my program (it uses the read GD as template and then just overwrites the fields I'm calculating).
>>> Group 65542, bg_flags differs: 0 (actual) != 4 (generated)
>> The EXT2_BG_{INODE,BLOCK}_UNINIT flags mean that the inode/block bitmaps are
>> all zero and do not need to be read from disk. That should only be used for
>> groups that have no blocks or inodes allocated to them.
>>
>> The EXT2_BG_INODE_ZEROED flag means that the inode table has been zeroed out
>> on disk after the filesystem has been formatted. It is safest to set this
>> flag on the group, so that the kernel does not try to zero out the inode tables
>> again, especially if your bitmaps are not correct.
> That makes sense.  Good explanation thank you.
>
>>> and to possibly inspect the data at inode_table looking for some kind of "magic" as to whether or not it's already initialized or not...
>> The group descriptors also have a checksum that needs to be valid.
> I'm hoping fsck will fix this after I've actioned mine?
>>> My code is available at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c for anyone that cares. Currently some work pending.
>> It probably makes more sense to include this into e2fsck, so that it will be
>> usable for everyone.
> I agree.  For now I needed quick and nasty.  I have no objection to do a
> patch at some point on e2fsck, or someone else can use my throw-away
> code and refine it.  For now I'm taking stdout from that and using dd to
> clobber the meta_bg GDT blocks, as well as replace the backups.  One
> reason for not doing that from the start was simply that I had no idea
> where exactly this would need to go into e2fsck.
>> IMHO, there is a design flaw in the meta_bg disk format,
>> in that the last group(s) of the filesystem may have few/no backup copies, so
>> cannot easily be recovered in case of an error.
> 1 copy if there is only one group, 2 if fewer than meta_bg groups ... 3
> otherwise.  Frankly I'm OK(ish) with three backups, one of which is
> typically ~128MB away from the first two.  Perhaps not quite good
> enough.  I get where you're coming from.  This is the first time that
> I've personally experienced GD corruption failures, or at least, that I
> *know* what I'm looking at.  Having said that, in >15 years this is only
> the third ext* filesystem that I've managed to corrupt to the point
> where fsck would not recover it.  The previous one was a few years back,
> and that was due to a major hardware issue.  The time before that was
> sheer incompetence and stupidity.
>
>> There is also the secondary
>> issue that meta_bg causes the metadata to be spread out over the whole disk,
>> which causes a *LOT* of seeking on a very large filesystem, on the order of
>> millions of seeks, which takes a lot of time to read.
> I can't comment to that.  I don't know how often these things are
> actually read from disk, whether it's once-off during mount time, or
> whether there are further reads of these blocks after that.  If it's
> once off I don't particularly care personally, if an 85TB file system
> takes a minute to mount ... so be it.  If it's continuous ... yes that
> can become problematic I reckon.
>
> Thank you very much for the information, you've managed to (even though
> not tell me how to fix my issue) at the very least confirm some things
> for me.  Aware this is goodwill, and that's more than I could ask for in
> all reality.
>
> Kind Regards,
> Jaco
>
>>> On 2020/01/27 12:24, Jaco Kroon wrote:
>>>> Hi Andreas,
>>>>
>>>> Thank you. The filesystem uses meta_bg and is larger than 16TB, my issue also doesn't exhibit in the first 16TB covered by those blocks.
>>>> On 2020/01/26 23:45, Andreas Dilger wrote:
>>>>> There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.
>>>>>
>>>>> Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (eg. all the backups were also corrupted for some reason).
>>>>>
>>>> You are right. It's however tricky to wrap your head around it. I think I've got it, but if you don't mind double checking me please:
>>>>
>>>> Given a group number g. sb = superblock.
>>>> gds_per_block = 2^(10+sb.s_log_block_size) / sb.s_desc_size = 4096 / 64 = 64
>>>> gd_block = floor(g / gds_per_block)
>>>> if (gd_block < ${sb.s_reserved_gdt_blocks})
>>>> phys_gd_block = gd_block + 1;
>>>> else
>>>> phys_gd_block = floor(gd_block / gds_per_block) * gds_per_block * sb.s_blocks_per_group
>>>>
>>>> phys_gd_block_offset = sb.s_desc_size * (g % gds_per_block)
>>>>
>>>> Backup blocks, are either with every superblock backup (groups 0, and groups being a power of 3, 5 and 7, ie, 0, 1, 3, 5, 7, 9, 25, 27, ...) where gd_block < ${sb.s_reserved_gdt_blocks); or
>>>> phys_gd_backup1 = phys_gd_block + sb.s_blocks_per_group
>>>> phys_gd_backup2 = phys_gd_block + sb.s_blocks_per_group * (gds_per_block - 1)
>>>>
>>>> offset stays the same.
>>>>
>>>> To reconstruct it's critical to fill the following fields, with the required calculations (gd = group descriptor, calculations for groups < 2^17, using 131072 as example):
>>>>
>>>> bitmap_group = floor(g / flex_groups) * flex_groups
>>>> => floor(131072 / 16) * 16 => 131072
>>>> gd.bg_block_bitmap = bitmap_group * blocks_per_group + g % flex_groups
>>>> => 131072 * 3768 + 131072 % 16
>>>> => 493879296
>>>> if (bitmap_group % gds_per_block == 0) /* if bitmap_group also houses a meta group block */
>>>> gd.bg_block_bitmap++;
>>>> => if (131072 % 64 == 0)
>>>> => if (0 == 0)
>>>> => gd.bg_block_bitmap = 493879297
>>>>
>>>> gd.bg_inode_bitmap = gd.bg_block_bitmap + flex_groups
>>>> => gd.bg_inode_bitmap = 493879297 + 16 = 493879313
>>>> gd.bg_inode_table = gd.bg_inode_bitmap + flex_groups - (g % flex_groups) + (g % flex_groups * inode_blocks_per_group)
>>>> => 493879313 + 16 - 0 + 0 * 32 = 493879329
>>>> Bad example, for g=131074:
>>>> =>493879315 + 16 - 2 + 2 * 32 = 493879393
>>>>
>>>> gd.bg_flags = 0x4
>>>>
>>>> I suspect it's OK to just zero (or leave them zero) these:
>>>>
>>>> bg_free_blocks_count
>>>> bg_free_inodes_count
>>>> bg_used_dirs_count
>>>> bg_exclude_bitmap
>>>> bg_itable_unused
>>>>
>>>> As well as all checksum fields (hopefully e2fsck will correct those).
>>>> I did action this calculation for a few non-destructed GDs and my manual calculations seems OK for the groups I checked (all into meta_bg area):
>>>>
>>>> 131072 (multiple of 64, meta_bg + flex_bg)
>>>> 131073 (multiple of 64 + 1 - first meta_bg backup)
>>>> 131074 (none of the others, ie, plain group with data blocks only)
>>>> 131088 (multiple of 16 but not 64, flex_bg)
>>>> 131089 (multiple of 16 but not 64, +1)
>>>> 131135 (multiple of 64 + 63 - second meta_bg backup)
>>>> It is however a very limited sample, but should cover all the corner cases I could think of based on the specification and my understanding thereof.
>>>>
>>>> Given the above I should be able to write a small program that will produce the 128 4KiB blocks that's required, and then I can use dd to place them into the correct locations.
>>>> As an aside, debugfs refuses to open the filesystem:
>>>>
>>>> crowsnest ~ # debugfs /dev/lvm/home
>>>> debugfs 1.45.4 (23-Sep-2019)
>>>> /dev/lvm/home: Block bitmap checksum does not match bitmap while reading allocation bitmaps
>>>> debugfs: chroot /
>>>> chroot: Filesystem not open
>>>> debugfs: quit
>>>> Which is probably fair. So for this one I'll have to go make modifications using either some programmatic tool that opens the underlying block device, or use dd trickery (ie, construct a GD and dd it into the right location, as a sequence of 64 GDs as it's always 64 GDs that's destroyed in my case, exactly 1 4KiB block).
>>>>
>>>> Kind Regards,
>>>> Jaco
>>>>> Cheers, Andreas
>>>>>
>>>>>
>>>>>> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> So working through the dumpe2fs file, the group mentioned by dmesg
>>>>>> contains this:
>>>>>>
>>>>>> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>>>>>> Group descriptor at 13243514880
>>>>>> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>> Inode table at 0-31 (bg #0 + 0)
>>>>>> 0 free blocks, 0 free inodes, 0 directories
>>>>>> Free blocks: 13243514880-13243547647
>>>>>> Free inodes: 206929921-206930432
>>>>>>
>>>>>> Based on that it's quite simple to see that during the array
>>>>>> reconstruction we apparently wiped a bunch of data blocks with all
>>>>>> zeroes. This is obviously bad. During reconstruction we had to zero one
>>>>>> of the disks before we could get the array to reassemble. What I'm
>>>>>> wondering is whether this process was a good choice now, and whether the
>>>>>> right disk was zeroed. Obviously this implies major data loss (at least
>>>>>> 4TB, probably more assuming that directory structures may well have been
>>>>>> destroyed as well, maybe less if some of those blocks weren't in use).
>>>>>>
>>>>>> I'm hoping that it's possible to recreate these group descriptors (there
>>>>>> are a few of them) to at least point to the correct locations on disk,
>>>>>> and to then attempt a cleanup with e2fsck. Again, data loss here is to
>>>>>> be expected, but if we can limit it at least that would be great.
>>>>>>
>>>>>> There are unfortunately a large bunch of groups affected (128 cases of
>>>>>> 64 consecutive group blocks).
>>>>>>
>>>>>> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
>>>>>> 4KB/block => 1TB of data lost. However, this is extremely conservative
>>>>>> seeing that this could include directory structures with cascading effect.
>>>>>>
>>>>>> Based on the patterns of the first 64 group descriptors (GDs) it looks
>>>>>> like it should be possible to reconstruct the 8192 affected GDs, or
>>>>>> alternatively possibly "uninit" them
>>>>>> (
>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization
>>>>>> ).
>>>>>> I'm inclined to reason that it's probably safer to repair in the GDs the
>>>>>> following fields:
>>>>>>
>>>>>> bg_block_bitmap_{lo,hi}
>>>>>> bg_inode_bitmap_{lo,hi}
>>>>>> bg_inode_table_{lo,hi}
>>>>>>
>>>>>> I'm not sure about:
>>>>>>
>>>>>> bg_flags (I'm guessing the safest is to leave this zeroed).
>>>>>> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>>>>>>
>>>>>> The following should (as far as my understanding goes) then be "fixable"
>>>>>> by e2fsck:
>>>>>>
>>>>>> bg_free_blocks_count_{lo,hi}
>>>>>> bg_free_inodes_count_{lo,hi}
>>>>>> bg_used_dirs_count_{lo,hi}
>>>>>> bg_block_bitmap_csum_{lo,hi}
>>>>>> bg_inode_bitmap_csum_{lo,hi}
>>>>>> bg_itable_unused_{lo,hi}
>>>>>> bg_checksum
>>>>>>
>>>>>> And of course, tracking down the GD on disk will be tricky it seems. It
>>>>>> seems some blocks have the GD in the block, and a bunch of others don't
>>>>>> (nor does dumpe2fs say where exactly they are). There are 2048 blocks of
>>>>>> GDs (131072 or 2^17 GDs) with every superblock backup, however, from
>>>>>> group 2^17 onwards there are additional groups simply stating "Group
>>>>>> descriptor at ${first_block_of_group}", so it's unclear how to track
>>>>>> down the GD for a given block group.
>>>>>>
>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
>>>>>>
>>>>>> does not describe this particularly well either, and there seems to be
>>>>>> confusion w.r.t. flex_bg and meta_bg features and this.
>>>>>>
>>>>>> I do have an LVM snapshot of the affected LV currently, so happy to try
>>>>>> things.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Jaco
>>>>>>
>>>>>>
>>>>>>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>>>>>>> cases of same error I could find was from what I can find due to an SD
>>>>>>> card "swallowing" writes (ie, the card goes into a read-only mode but
>>>>>>> doesn't report write failure).
>>>>>>>
>>>>>>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>>>>>>
>>>>>>> e2fsck 1.45.4 (23-Sep-2019)
>>>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>>>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>>>> /dev/lvm/home: recovering journal
>>>>>>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>>>>>>
>>>>>>>
>>>>>>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>>>>>>
>>>>>>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>>>>>>
>>>>>>> I have also (using dumpe2fs) obtained the location of the backup super
>>>>>>> blocks and tried same against a few other superblocks using -b. -y (as
>>>>>>> per suggestion from at least one post) make absolutely no difference,
>>>>>>> our understanding is that this simply answers yes to all questions, so
>>>>>>> we didn't expect this to have impact but decided it was worth a try anyway.
>>>>>>>
>>>>>>> Looking at the code for the unable to set superblock error it looks like
>>>>>>> the code is in e2fsck/unix.c, specifically this:
>>>>>>>
>>>>>>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>>>>>>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>>>>>>> ...
>>>>>>> 1771 } else {
>>>>>>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>>>>>>> 1773 /*
>>>>>>> 1774 * Whoops, we attempted to run the
>>>>>>> 1775 * journal twice. This should never
>>>>>>> 1776 * happen, unless the hardware or
>>>>>>> 1777 * device driver is being bogus.
>>>>>>> 1778 */
>>>>>>> 1779 com_err(ctx->program_name, 0,
>>>>>>> 1780 _("unable to set superblock flags "
>>>>>>> 1781 "on %s\n"), ctx->device_name);
>>>>>>> 1782 fatal_error(ctx, 0);
>>>>>>> 1783 }
>>>>>>>
>>>>>>> That comment has me somewhat confused. I'm assuming the implication
>>>>>>> there is that e2fsck tried to update the superblock, but after reading
>>>>>>> it back, it's either unchanged or still wrong (In line with the
>>>>>>> description of the SD card I found online). None of our arrays are
>>>>>>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>>>>>>> (we downgraded back to 5.1.15, which we're on currently, after
>>>>>>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>>>>>>> didn't seem to fix those, and the 4.14.13 kernel that was used
>>>>>>> previously is known to cause ext4 corruption of the kind we saw on the
>>>>>>> other filesystems):
>>>>>>>
>>>>>>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>>>>>>> group 404160 overlaps superblock
>>>>>>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>>>>>>
>>>>>>> I created a dumpe2fs file as well:
>>>>>>>
>>>>>>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>>>>>>> dumpe2fs 1.45.4 (23-Sep-2019)
>>>>>>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>>>>>>> read '/dev/lvm/home' bitmaps
>>>>>>>
>>>>>>> Available at
>>>>>>> https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>>>>>>> (1.2GB,
>>>>>>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>>>>>>
>>>>>>> A strace of e2fsck -y -f /dev/lvm/home at
>>>>>>>
>>>>>>> https://downloads.uls.co.za/85T/fsck.strace.txt
>>>>>>> (13MB,
>>>>>>> md5:60aa91b0c47dd2837260218eb774152d)
>>>>>>>
>>>>>>> crowsnest ~ # tune2fs -l /dev/lvm/home
>>>>>>> tune2fs 1.45.4 (23-Sep-2019)
>>>>>>> Filesystem volume name: <none>
>>>>>>> Last mounted on: /home
>>>>>>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>>>>>>> Filesystem magic number: 0xEF53
>>>>>>> Filesystem revision #: 1 (dynamic)
>>>>>>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>>>>>>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>>>>>>> metadata_csum
>>>>>>> Filesystem flags: signed_directory_hash
>>>>>>> Default mount options: user_xattr acl
>>>>>>> Filesystem state: clean
>>>>>>> Errors behavior: Continue
>>>>>>> Filesystem OS type: Linux
>>>>>>> Inode count: 356515840
>>>>>>> Block count: 22817013760
>>>>>>> Reserved block count: 0
>>>>>>> Free blocks: 6874204745
>>>>>>> Free inodes: 202183498
>>>>>>> First block: 0
>>>>>>> Block size: 4096
>>>>>>> Fragment size: 4096
>>>>>>> Group descriptor size: 64
>>>>>>> Blocks per group: 32768
>>>>>>> Fragments per group: 32768
>>>>>>> Inodes per group: 512
>>>>>>> Inode blocks per group: 32
>>>>>>> RAID stride: 128
>>>>>>> RAID stripe width: 1024
>>>>>>> First meta block group: 2048
>>>>>>> Flex block group size: 16
>>>>>>> Filesystem created: Thu Jul 26 12:19:07 2018
>>>>>>> Last mount time: Sat Jan 18 18:58:50 2020
>>>>>>> Last write time: Sun Jan 26 11:38:56 2020
>>>>>>> Mount count: 2
>>>>>>> Maximum mount count: -1
>>>>>>> Last checked: Wed Oct 30 17:37:27 2019
>>>>>>> Check interval: 0 (<none>)
>>>>>>> Lifetime writes: 976 TB
>>>>>>> Reserved blocks uid: 0 (user root)
>>>>>>> Reserved blocks gid: 0 (group root)
>>>>>>> First inode: 11
>>>>>>> Inode size: 256
>>>>>>> Required extra isize: 32
>>>>>>> Desired extra isize: 32
>>>>>>> Journal inode: 8
>>>>>>> Default directory hash: half_md4
>>>>>>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>>>>>>> Journal backup: inode blocks
>>>>>>> Checksum type: crc32c
>>>>>>> Checksum: 0xfbd895e9
>>>>>>>
>>>>>>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>>>>>>> 10TB disks (100TB usable total). These are combined into a single VG
>>>>>>> using LVM, and then carved up into a number of LVs, the largest of which
>>>>>>> is this 85TB chunk. We have tried in the past to carve this into
>>>>>>> smaller LVs but failed. So we're aware that this is very large and not
>>>>>>> ideal.
>>>>>>>
>>>>>>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>>>>>>> those have been resolved, and the disk that was giving issues has been
>>>>>>> scrubbed and rebuilt. From what we can tell based on other file systems,
>>>>>>> this did not affect data integrity but we can't make that statement with
>>>>>>> 100% certainty, as such we are expecting some data loss here but it
>>>>>>> would be better if we can recover at least some of this data.
>>>>>>>
>>>>>>> Other filesystems which also resides on the same PV that was affected by
>>>>>>> the RAID6 problem either received a clean bill of health, or were
>>>>>>> successfully repaired by e2fsck (the system did crash however, it's
>>>>>>> unclear whether the RAID6 assembly problem was the cause or merely
>>>>>>> another consequence, and as a result, whether the corruption on the
>>>>>>> repaired filesystem was a consequence of the kernel or the RAID).
>>>>>>>
>>>>>>> I'm continuing onwards with e2fsck code to try and figure this out, am
>>>>>>> hopeful though that someone could perhaps provide some much needed
>>>>>>> insight and pointers for me.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Jaco
>>>>>>>
>>>>>>>
>> Cheers, Andreas
>>
>>
>>
>>
>>

2020-01-29 04:36:21

by Jaco Kroon

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

Hi,

Inode 181716301 block 33554947 conflicts with critical metadata,
skipping block checks.
Inode 181716301 block 524296 conflicts with critical metadata, skipping
block checks.
Inode 181716301 block 2 conflicts with critical metadata, skipping block
checks.
Inode 181716301 block 294 conflicts with critical metadata, skipping
block checks.
Inode 181716301 block 1247805839 conflicts with critical metadata,
skipping block checks.
Inode 181716301 block 288 conflicts with critical metadata, skipping
block checks.
Inode 181716301 block 103285040 conflicts with critical metadata,
skipping block checks.
Inode 181716301 block 872415232 conflicts with critical metadata,
skipping block checks.
Inode 181716301 block 2560 conflicts with critical metadata, skipping
block checks.
Inode 181716301 block 479199248 conflicts with critical metadata,
skipping block checks.
Inode 181716301 block 1006632963 conflicts with critical metadata,
skipping block checks.
Killed

So the "conflicts with critical metadata" messages can, I'm guessing, be
expected, since a bunch of those tree structures probably got zeroed too.

It got killed because it ran out of RAM (OOM killer), 32GB physical +
16GB swap.  I've extended swap to 512GB now and restarted.  It's
probably overkill (I hope).

Any ideas on what might be consuming the RAM like this?  Unfortunately
my scroll-back doesn't go back far enough to see what other inodes, if
any, are also affected.  I've restarted with 2>&1 | tee /var/tmp/fsck.txt
now.

Happy to go hunting to look for possible optimization ideas here ...
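
One option I may try if it runs out of memory again (untested here, but
documented in e2fsck.conf(5)) is letting e2fsck keep some of its
tracking tables in on-disk scratch files rather than in RAM, trading
speed for memory:

[scratch_files]
	directory = /var/cache/e2fsck

The directory has to exist beforehand.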

Another idea is to use debugfs to mark inode 181716301 as deleted, but
I'm not sure that's safe at this stage?


Kind Regards,
Jaco


On 2020/01/28 23:12, Jaco Kroon wrote:

> Hi,
>
> Just to provide feedback, and perhaps some usage tips for anyone reading
> the thread (and before I forget how all of this stuck together in the end).
>
> I used the code at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c to
> recover.  Firstly I generated a list of GDs that needed to be repaired:
>
> ./ext4_rebuild_gds /dev/lvm/snap_home 2>&1 | tee /var/tmp/rebuild_checks.txt
>
> snap_home is marked read-only in LVM (my safety/as close to a
> block-level backup I can come).
>
> cd /var/tmp
> grep differs: rebuild_checks.txt | grep -v bg_flags | sed -re 's/^Group
> ([0-9]+), .*/\1/p' | uniq | awk 'BEGIN{ start=-1; prev=-10 } { if (prev
> < $1-1) { if (start >= 0) { print start " " prev }; start=$1 }; prev=$1
> } END{ print start " " prev }' > blocks.txt
>
> exec 3<blocks.txt; while read A B <&3; do oblock=$(~/ext4_rebuild_gds
> /dev/lvm/snap_home $A $B 2>&1 > .$A.tmp | sed -nre 's/^Physical GD block
> for output start: //p') && mv .$A.tmp $A-$oblock.block; done
>
> for i in *.block; do sblock=$(basename ${i#*-} .block); for o in 0
> 32768; do dd if=$i of=/dev/lvm/home bs=4096 seek=$(( sblock + o ));
> done; done
>
> I also did validate that for all pairs in blocks.txt B-A == 64, and A%64
> == 0 (to make sure it's a full GDT block that I need to update).  If
> this is not the case, you should reduce A such that it becomes a
> multiple of 64, then adjust B such that B-A==64 (add additional pairs
> such that the entire original A to B range is covered).  This may set
> ZEROED bg_flags value on some GDs as a side effect but hopefully this
> won't cause problems.  I also verified that all A >= first_meta_bg -
> else you need to find appropriate block directly following super-block.
>
> I only replaced the first two copies of the GDT blocks in each case,
> could replace the third as well by adding a 32768*63 entry for o in the
> inner for loop.
>
> fsck is now running and getting past the previous failure point.  I am
> hopeful that we'll at least recover some data.
>
> I do have a few questions regarding fsck though:
>
> 1.  Will it notice that the backups GDT blocks are not the same as the
> primaries and replace the backup blocks?
>
> 2.  I'm assuming it will rebuild the reachability for all inodes
> starting at / and thus be able to determine which inodes should be
> free/in use, and rebuild bitmaps accordingly (as well as mark the inodes
> themselves free)?
>
> 3.  Sort out all buggered up checksums in the GDs I've changed?
>
> Anyway, perhaps someone can answer out of hand, otherwise time will
> tell, to date:
>
> # e2fsck -y -f /dev/lvm/home
> e2fsck 1.45.4 (23-Sep-2019)
> Pass 1: Checking inodes, blocks, and sizes
> Inodes that were part of a corrupted orphan linked list found.  Fix? yes
>
> Inode 13844923 was part of the orphaned inode list.  FIXED.
> Inode 13844932 has an invalid extent node (blk 13666881850, lblk 27684864)
> Clear? yes
>
> Kind Regards,
> Jaco
>
> On 2020/01/28 14:53, Jaco Kroon wrote:
>
>> Hi Andreas,
>>
>> On 2020/01/28 01:15, Andreas Dilger wrote:
>>> On Jan 27, 2020, at 10:52 AM, Jaco Kroon <[email protected]> wrote:
>>>> Hi Andreas,
>>>>
>>>> Ok, so there were a few calc errors below. I believe I've rectified those. I've written code to both locate and read the GD, as well as generate the fields I plan to update, comparing the fields I plan to update against those of all known-good blocks. The only issue I've picked up is that I've realised I don't understand bg_flags at all. Extract from output of my program (it uses the read GD as template and then just overwrites the fields I'm calculating).
>>>> Group 65542, bg_flags differs: 0 (actual) != 4 (generated)
>>> The EXT2_BG_{INODE,BLOCK}_UNINIT flags mean that the inode/block bitmaps are
>>> all zero and do not need to be read from disk. That should only be used for
>>> groups that have no blocks or inodes allocated to them.
>>>
>>> The EXT2_BG_INODE_ZEROED flag means that the inode table has been zeroed out
>>> on disk after the filesystem has been formatted. It is safest to set this
>>> flag on the group, so that the kernel does not try to zero out the inode tables
>>> again, especially if your bitmaps are not correct.
>> That makes sense.  Good explanation thank you.
>>
>>>> and to possibly inspect the data at inode_table looking for some kind of "magic" as to whether or not it's already initialized or not...
>>> The group descriptors also have a checksum that needs to be valid.
>> I'm hoping fsck will fix this after I've actioned mine?
>>>> My code is available at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c for anyone that cares. Currently some work pending.
>>> It probably makes more sense to include this into e2fsck, so that it will be
>>> usable for everyone.
>> I agree.  For now I needed quick and nasty.  I have no objection to do a
>> patch at some point on e2fsck, or someone else can use my throw-away
>> code and refine it.  For now I'm taking stdout from that and using dd to
>> clobber the meta_bg GDT blocks, as well as replace the backups.  One
>> reason for not doing that from the start was simply that I had no idea
>> where exactly this would need to go into e2fsck.
>>> IMHO, there is a design flaw in the meta_bg disk format,
>>> in that the last group(s) of the filesystem may have few/no backup copies, so
>>> cannot easily be recovered in case of an error.
>> 1 copy if there is only one group, 2 if fewer than meta_bg groups ... 3
>> otherwise.  Frankly I'm OK(ish) with three backups, one of which is
>> typically ~128MB away from the first two.  Perhaps not quite good
>> enough.  I get where you're coming from.  This is the first time that
>> I've personally experienced GD corruption failures, or at least, that I
>> *know* what I'm looking at.  Having said that, in >15 years this is only
>> the third ext* filesystem that I've managed to corrupt to the point
>> where fsck would not recover it.  The previous one was a few years back,
>> and that was due to a major hardware issue.  The time before that was
>> sheer incompetence and stupidity.
>>
>>> There is also the secondary
>>> issue that meta_bg causes the metadata to be spread out over the whole disk,
>>> which causes a *LOT* of seeking on a very large filesystem, on the order of
>>> millions of seeks, which takes a lot of time to read.
>> I can't comment to that.  I don't know how often these things are
>> actually read from disk, whether it's once-off during mount time, or
>> whether there are further reads of these blocks after that.  If it's
>> once off I don't particularly care personally, if an 85TB file system
>> takes a minute to mount ... so be it.  If it's continuous ... yes that
>> can become problematic I reckon.
>>
>> Thank you very much for the information, you've managed to (even though
>> not tell me how to fix my issue) at the very least confirm some things
>> for me.  Aware this is goodwill, and that's more than I could ask for in
>> all reality.
>>
>> Kind Regards,
>> Jaco
>>
>>>> On 2020/01/27 12:24, Jaco Kroon wrote:
>>>>> Hi Andreas,
>>>>>
>>>>> Thank you. The filesystem uses meta_bg and is larger than 16TB, my issue also doesn't exhibit in the first 16TB covered by those blocks.
>>>>> On 2020/01/26 23:45, Andreas Dilger wrote:
>>>>>> There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.
>>>>>>
>>>>>> Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (eg. all the backups were also corrupted for some reason).
>>>>>>
>>>>> You are right. It's however tricky to wrap your head around it. I think I've got it, but if you don't mind double checking me please:
>>>>>
>>>>> Given a group number g. sb = superblock.
>>>>> gds_per_block = 2^(10+sb.s_log_block_size) / sb.s_desc_size = 4096 / 64 = 64
>>>>> gd_block = floor(g / gds_per_block)
>>>>> if (gd_block < ${sb.s_reserved_gdt_blocks})
>>>>> phys_gd_block = gd_block + 1;
>>>>> else
>>>>> phys_gd_block = floor(gd_block / gds_per_block) * gds_per_block * sb.s_blocks_per_group
>>>>>
>>>>> phys_gd_block_offset = sb.s_desc_size * (g % gds_per_block)
>>>>>
>>>>> Backup blocks, are either with every superblock backup (groups 0, and groups being a power of 3, 5 and 7, ie, 0, 1, 3, 5, 7, 9, 25, 27, ...) where gd_block < ${sb.s_reserved_gdt_blocks); or
>>>>> phys_gd_backup1 = phys_gd_block + sb.s_blocks_per_group
>>>>> phys_gd_backup2 = phys_gd_block + sb.s_blocks_per_group * (gds_per_block - 1)
>>>>>
>>>>> offset stays the same.
>>>>>
>>>>> To reconstruct it's critical to fill the following fields, with the required calculations (gd = group descriptor, calculations for groups < 2^17, using 131072 as example):
>>>>>
>>>>> bitmap_group = floor(g / flex_groups) * flex_groups
>>>>> => floor(131072 / 16) * 16 => 131072
>>>>> gd.bg_block_bitmap = bitmap_group * blocks_per_group + g % flex_groups
>>>>> => 131072 * 3768 + 131072 % 16
>>>>> => 493879296
>>>>> if (bitmap_group % gds_per_block == 0) /* if bitmap_group also houses a meta group block */
>>>>> gd.bg_block_bitmap++;
>>>>> => if (131072 % 64 == 0)
>>>>> => if (0 == 0)
>>>>> => gd.bg_block_bitmap = 493879297
>>>>>
>>>>> gd.bg_inode_bitmap = gd.bg_block_bitmap + flex_groups
>>>>> => gd.bg_inode_bitmap = 493879297 + 16 = 493879313
>>>>> gd.bg_inode_table = gd.bg_inode_bitmap + flex_groups - (g % flex_groups) + (g % flex_groups * inode_blocks_per_group)
>>>>> => 493879313 + 16 - 0 + 0 * 32 = 493879329
>>>>> Bad example, for g=131074:
>>>>> =>493879315 + 16 - 2 + 2 * 32 = 493879393
>>>>>
>>>>> gd.bg_flags = 0x4
>>>>>
>>>>> I suspect it's OK to just zero (or leave them zero) these:
>>>>>
>>>>> bg_free_blocks_count
>>>>> bg_free_inodes_count
>>>>> bg_used_dirs_count
>>>>> bg_exclude_bitmap
>>>>> bg_itable_unused
>>>>>
>>>>> As well as all checksum fields (hopefully e2fsck will correct those).
>>>>> I did action this calculation for a few non-destructed GDs and my manual calculations seems OK for the groups I checked (all into meta_bg area):
>>>>>
>>>>> 131072 (multiple of 64, meta_bg + flex_bg)
>>>>> 131073 (multiple of 64 + 1 - first meta_bg backup)
>>>>> 131074 (none of the others, ie, plain group with data blocks only)
>>>>> 131088 (multiple of 16 but not 64, flex_bg)
>>>>> 131089 (multiple of 16 but not 64, +1)
>>>>> 131135 (multiple of 64 + 63 - second meta_bg backup)
>>>>> It is however a very limited sample, but should cover all the corner cases I could think of based on the specification and my understanding thereof.
>>>>>
>>>>> Given the above I should be able to write a small program that will produce the 128 4KiB blocks that's required, and then I can use dd to place them into the correct locations.
>>>>> As an aside, debugfs refuses to open the filesystem:
>>>>>
>>>>> crowsnest ~ # debugfs /dev/lvm/home
>>>>> debugfs 1.45.4 (23-Sep-2019)
>>>>> /dev/lvm/home: Block bitmap checksum does not match bitmap while reading allocation bitmaps
>>>>> debugfs: chroot /
>>>>> chroot: Filesystem not open
>>>>> debugfs: quit
>>>>> Which is probably fair. So for this one I'll have to go make modifications using either some programmatic tool that opens the underlying block device, or use dd trickery (ie, construct a GD and dd it into the right location, as a sequence of 64 GDs as it's always 64 GDs that's destroyed in my case, exactly 1 4KiB block).
>>>>>
>>>>> Kind Regards,
>>>>> Jaco
>>>>>> Cheers, Andreas
>>>>>>
>>>>>>
>>>>>>> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> So working through the dumpe2fs file, the group mentioned by dmesg
>>>>>>> contains this:
>>>>>>>
>>>>>>> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>>>>>>> Group descriptor at 13243514880
>>>>>>> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>>> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>>> Inode table at 0-31 (bg #0 + 0)
>>>>>>> 0 free blocks, 0 free inodes, 0 directories
>>>>>>> Free blocks: 13243514880-13243547647
>>>>>>> Free inodes: 206929921-206930432
>>>>>>>
>>>>>>> Based on that it's quite simple to see that during the array
>>>>>>> reconstruction we apparently wiped a bunch of data blocks with all
>>>>>>> zeroes. This is obviously bad. During reconstruction we had to zero one
>>>>>>> of the disks before we could get the array to reassemble. What I'm
>>>>>>> wondering is whether this process was a good choice now, and whether the
>>>>>>> right disk was zeroed. Obviously this implies major data loss (at least
>>>>>>> 4TB, probably more assuming that directory structures may well have been
>>>>>>> destroyed as well, maybe less if some of those blocks weren't in use).
>>>>>>>
>>>>>>> I'm hoping that it's possible to recreate these group descriptors (there
>>>>>>> are a few of them) to at least point to the correct locations on disk,
>>>>>>> and to then attempt a cleanup with e2fsck. Again, data loss here is to
>>>>>>> be expected, but if we can limit it at least that would be great.
>>>>>>>
>>>>>>> There are unfortunately a large bunch of groups affected (128 cases of
>>>>>>> 64 consecutive group blocks).
>>>>>>>
>>>>>>> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
>>>>>>> 4KB/block => 1TB of data lost. However, this is extremely conservative
>>>>>>> seeing that this could include directory structures with cascading effect.
>>>>>>>
>>>>>>> Based on the patterns of the first 64 group descriptors (GDs) it looks
>>>>>>> like it should be possible to reconstruct the 8192 affected GDs, or
>>>>>>> alternatively possibly "uninit" them
>>>>>>> (
>>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization
>>>>>>> ).
>>>>>>> I'm inclined to reason that it's probably safer to repair in the GDs the
>>>>>>> following fields:
>>>>>>>
>>>>>>> bg_block_bitmap_{lo,hi}
>>>>>>> bg_inode_bitmap_{lo,hi}
>>>>>>> bg_inode_table_{lo,hi}
>>>>>>>
>>>>>>> I'm not sure about:
>>>>>>>
>>>>>>> bg_flags (I'm guessing the safest is to leave this zeroed).
>>>>>>> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>>>>>>>
>>>>>>> The following should (as far as my understanding goes) then be "fixable"
>>>>>>> by e2fsck:
>>>>>>>
>>>>>>> bg_free_blocks_count_{lo,hi}
>>>>>>> bg_free_inodes_count_{lo,hi}
>>>>>>> bg_used_dirs_count_{lo,hi}
>>>>>>> bg_block_bitmap_csum_{lo,hi}
>>>>>>> bg_inode_bitmap_csum_{lo,hi}
>>>>>>> bg_itable_unused_{lo,hi}
>>>>>>> bg_checksum
>>>>>>>
>>>>>>> And of course, tracking down the GD on disk will be tricky it seems. It
>>>>>>> seems some blocks have the GD in the block, and a bunch of others don't
>>>>>>> (nor does dumpe2fs say where exactly they are). There are 2048 blocks of
>>>>>>> GDs (131072 or 2^17 GDs) with every superblock backup, however, from
>>>>>>> group 2^17 onwards there are additional groups simply stating "Group
>>>>>>> descriptor at ${first_block_of_group}", so it's unclear how to track
>>>>>>> down the GD for a given block group.
>>>>>>>
>>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
>>>>>>>
>>>>>>> does not describe this particularly well either, and there seems to be
>>>>>>> confusion w.r.t. flex_bg and meta_bg features and this.
>>>>>>>
>>>>>>> I do have an LVM snapshot of the affected LV currently, so happy to try
>>>>>>> things.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Jaco
>>>>>>>
>>>>>>>
>>>>>>>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>>>>>>>> cases of same error I could find was from what I can find due to an SD
>>>>>>>> card "swallowing" writes (ie, the card goes into a read-only mode but
>>>>>>>> doesn't report write failure).
>>>>>>>>
>>>>>>>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>>>>>>>
>>>>>>>> e2fsck 1.45.4 (23-Sep-2019)
>>>>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>>>>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>>>>> /dev/lvm/home: recovering journal
>>>>>>>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>>>>>>>
>>>>>>>>
>>>>>>>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>>>>>>>
>>>>>>>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>>>>>>>
>>>>>>>> I have also (using dumpe2fs) obtained the location of the backup super
>>>>>>>> blocks and tried same against a few other superblocks using -b. -y (as
>>>>>>>> per suggestion from at least one post) make absolutely no difference,
>>>>>>>> our understanding is that this simply answers yes to all questions, so
>>>>>>>> we didn't expect this to have impact but decided it was worth a try anyway.
>>>>>>>>
>>>>>>>> Looking at the code for the unable to set superblock error it looks like
>>>>>>>> the code is in e2fsck/unix.c, specifically this:
>>>>>>>>
>>>>>>>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>>>>>>>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>>>>>>>> ...
>>>>>>>> 1771 } else {
>>>>>>>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>>>>>>>> 1773 /*
>>>>>>>> 1774 * Whoops, we attempted to run the
>>>>>>>> 1775 * journal twice. This should never
>>>>>>>> 1776 * happen, unless the hardware or
>>>>>>>> 1777 * device driver is being bogus.
>>>>>>>> 1778 */
>>>>>>>> 1779 com_err(ctx->program_name, 0,
>>>>>>>> 1780 _("unable to set superblock flags "
>>>>>>>> 1781 "on %s\n"), ctx->device_name);
>>>>>>>> 1782 fatal_error(ctx, 0);
>>>>>>>> 1783 }
>>>>>>>>
>>>>>>>> That comment has me somewhat confused. I'm assuming the implication
>>>>>>>> there is that e2fsck tried to update the superblock, but after reading
>>>>>>>> it back, it's either unchanged or still wrong (In line with the
>>>>>>>> description of the SD card I found online). None of our arrays are
>>>>>>>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>>>>>>>> (we downgraded back to 5.1.15, which we're on currently, after
>>>>>>>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>>>>>>>> didn't seem to fix those, and the 4.14.13 kernel that was used
>>>>>>>> previously is known to cause ext4 corruption of the kind we saw on the
>>>>>>>> other filesystems):
>>>>>>>>
>>>>>>>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>>>>>>>> group 404160 overlaps superblock
>>>>>>>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>>>>>>>
>>>>>>>> I created a dumpe2fs file as well:
>>>>>>>>
>>>>>>>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>>>>>>>> dumpe2fs 1.45.4 (23-Sep-2019)
>>>>>>>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>>>>>>>> read '/dev/lvm/home' bitmaps
>>>>>>>>
>>>>>>>> Available at
>>>>>>>> https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>>>>>>>> (1.2GB,
>>>>>>>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>>>>>>>
>>>>>>>> A strace of e2fsck -y -f /dev/lvm/home at
>>>>>>>>
>>>>>>>> https://downloads.uls.co.za/85T/fsck.strace.txt
>>>>>>>> (13MB,
>>>>>>>> md5:60aa91b0c47dd2837260218eb774152d)
>>>>>>>>
>>>>>>>> crowsnest ~ # tune2fs -l /dev/lvm/home
>>>>>>>> tune2fs 1.45.4 (23-Sep-2019)
>>>>>>>> Filesystem volume name: <none>
>>>>>>>> Last mounted on: /home
>>>>>>>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>>>>>>>> Filesystem magic number: 0xEF53
>>>>>>>> Filesystem revision #: 1 (dynamic)
>>>>>>>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>>>>>>>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>>>>>>>> metadata_csum
>>>>>>>> Filesystem flags: signed_directory_hash
>>>>>>>> Default mount options: user_xattr acl
>>>>>>>> Filesystem state: clean
>>>>>>>> Errors behavior: Continue
>>>>>>>> Filesystem OS type: Linux
>>>>>>>> Inode count: 356515840
>>>>>>>> Block count: 22817013760
>>>>>>>> Reserved block count: 0
>>>>>>>> Free blocks: 6874204745
>>>>>>>> Free inodes: 202183498
>>>>>>>> First block: 0
>>>>>>>> Block size: 4096
>>>>>>>> Fragment size: 4096
>>>>>>>> Group descriptor size: 64
>>>>>>>> Blocks per group: 32768
>>>>>>>> Fragments per group: 32768
>>>>>>>> Inodes per group: 512
>>>>>>>> Inode blocks per group: 32
>>>>>>>> RAID stride: 128
>>>>>>>> RAID stripe width: 1024
>>>>>>>> First meta block group: 2048
>>>>>>>> Flex block group size: 16
>>>>>>>> Filesystem created: Thu Jul 26 12:19:07 2018
>>>>>>>> Last mount time: Sat Jan 18 18:58:50 2020
>>>>>>>> Last write time: Sun Jan 26 11:38:56 2020
>>>>>>>> Mount count: 2
>>>>>>>> Maximum mount count: -1
>>>>>>>> Last checked: Wed Oct 30 17:37:27 2019
>>>>>>>> Check interval: 0 (<none>)
>>>>>>>> Lifetime writes: 976 TB
>>>>>>>> Reserved blocks uid: 0 (user root)
>>>>>>>> Reserved blocks gid: 0 (group root)
>>>>>>>> First inode: 11
>>>>>>>> Inode size: 256
>>>>>>>> Required extra isize: 32
>>>>>>>> Desired extra isize: 32
>>>>>>>> Journal inode: 8
>>>>>>>> Default directory hash: half_md4
>>>>>>>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>>>>>>>> Journal backup: inode blocks
>>>>>>>> Checksum type: crc32c
>>>>>>>> Checksum: 0xfbd895e9
>>>>>>>>
>>>>>>>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>>>>>>>> 10TB disks (100TB usable total). These are combined into a single VG
>>>>>>>> using LVM, and then carved up into a number of LVs, the largest of which
>>>>>>>> is this 85TB chunk. We have tried in the past to carve this into
>>>>>>>> smaller LVs but failed. So we're aware that this is very large and not
>>>>>>>> ideal.
>>>>>>>>
>>>>>>>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>>>>>>>> those have been resolved, and the disk that was giving issues has been
>>>>>>>> scrubbed and rebuilt. From what we can tell based on other file systems,
>>>>>>>> this did not affect data integrity but we can't make that statement with
>>>>>>>> 100% certainty, as such we are expecting some data loss here but it
>>>>>>>> would be better if we can recover at least some of this data.
>>>>>>>>
>>>>>>>> Other filesystems which also resides on the same PV that was affected by
>>>>>>>> the RAID6 problem either received a clean bill of health, or were
>>>>>>>> successfully repaired by e2fsck (the system did crash however, it's
>>>>>>>> unclear whether the RAID6 assembly problem was the cause or merely
>>>>>>>> another consequence, and as a result, whether the corruption on the
>>>>>>>> repaired filesystem was a consequence of the kernel or the RAID).
>>>>>>>>
>>>>>>>> I'm continuing onwards with e2fsck code to try and figure this out, am
>>>>>>>> hopeful though that someone could perhaps provide some much needed
>>>>>>>> insight and pointers for me.
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Jaco
>>>>>>>>
>>>>>>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>

2020-01-29 20:05:36

by Andreas Dilger

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

On Jan 28, 2020, at 9:35 PM, Jaco Kroon <[email protected]> wrote:
>
> Hi,
>
> Inode 181716301 block 33554947 conflicts with critical metadata,
> skipping block checks.
> Inode 181716301 block 524296 conflicts with critical metadata, skipping
> block checks.
> Inode 181716301 block 2 conflicts with critical metadata, skipping block
> checks.
> Inode 181716301 block 294 conflicts with critical metadata, skipping
> block checks.
> Inode 181716301 block 1247805839 conflicts with critical metadata,
> skipping block checks.
> Inode 181716301 block 288 conflicts with critical metadata, skipping
> block checks.
> Inode 181716301 block 103285040 conflicts with critical metadata,
> skipping block checks.
> Inode 181716301 block 872415232 conflicts with critical metadata,
> skipping block checks.
> Inode 181716301 block 2560 conflicts with critical metadata, skipping
> block checks.
> Inode 181716301 block 479199248 conflicts with critical metadata,
> skipping block checks.
> Inode 181716301 block 1006632963 conflicts with critical metadata,
> skipping block checks.

This inode is probably just random garbage. Erase that inode with:

debugfs -w -R "clri <181716301>" /dev/sdX

There may be multiple such inodes with nearby numbers in the likely
case that a whole block is corrupted. There has been some discussion
about the best way to handle such corruption of a whole inode table
block, but nothing has been implemented in e2fsck yet.
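
As a rough illustration only (inspect before clearing anything): with
256-byte inodes and a 4KiB block size there are 16 inodes per inode
table block, so the inodes sharing a block with 181716301 can be listed
and, if they also turn out to be garbage, cleared one at a time:

base=$(( (181716301 - 1) / 16 * 16 + 1 ))
for i in $(seq $base $(( base + 15 ))); do
    debugfs -R "stat <$i>" /dev/sdX       # look at each one first
    # debugfs -w -R "clri <$i>" /dev/sdX  # clear only the ones that are garbage
done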

> So the critical block stuff I'm guessing can be expected since a bunch
> of those tree structures probably got zeroed too.
>
> It got killed because it ran out of RAM (OOM killer), 32GB physical +
> 16GB swap. I've extended swap to 512GB now and restarted. It's
> probably overkill (I hope).
>
> Any ideas on what might be consuming the RAM like this? Unfortunately
> my scroll-back doesn't go back far enough to see what other inodes if
> any are also affected. I've restarted with 2>&1 | tee /var/tmp/fsck.txt
> now.
>
> Happy to go hunting to look for possible optimization ideas here ...
>
> Another idea is to use debugfs to mark inode 181716301 as deleted, but
> I'm not sure that's safe at this stage?

Marking it "deleted" isn't really the right thing, since (AFAIR) that
will just update the inode bitmap and possibly set the "i_dtime" field
in the inode. The "clri" command will zero out the inode, which erases
all of the bad block allocation references for that inode. This is no
"loss" since the inode is already garbage.

Cheers, Andreas

>
>
> Kind Regards,
> Jaco
>
>
> On 2020/01/28 23:12, Jaco Kroon wrote:
>
>> Hi,
>>
>> Just to provide feedback, and perhaps some usage tips for anyone reading
>> the thread (and before I forget how all of this stuck together in the end).
>>
>> I used the code at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c to
>> recover. Firstly I generated a list of GDs that needed to be repaired:
>>
>> ./ext4_rebuild_gds /dev/lvm/snap_home 2>&1 | tee /var/tmp/rebuild_checks.txt
>>
>> snap_home is marked read-only in LVM (my safety/as close to a
>> block-level backup I can come).
>>
>> cd /var/tmp
>> grep differs: rebuild_checks.txt | grep -v bg_flags | sed -re 's/^Group
>> ([0-9]+), .*/\1/p' | uniq | awk 'BEGIN{ start=-1; prev=-10 } { if (prev
>> < $1-1) { if (start >= 0) { print start " " prev }; start=$1 }; prev=$1
>> } END{ print start " " prev }' > blocks.txt
>>
>> exec 3<blocks.txt; while read A B <&3; do oblock=$(~/ext4_rebuild_gds
>> /dev/lvm/snap_home $A $B 2>&1 > .$A.tmp | sed -nre 's/^Physical GD block
>> for output start: //p') && mv .$A.tmp $A-$oblock.block; done
>>
>> for i in *.block; do sblock=$(basename ${i#*-} .block); for o in 0
>> 32768; do dd if=$i of=/dev/lvm/home bs=4096 seek=$(( sblock + o ));
>> done; done
>>
>> I also did validate that for all pairs in blocks.txt B-A == 64, and A%64
>> == 0 (to make sure it's a full GDT block that I need to update). If
>> this is not the case, you should reduce A such that it becomes a
>> multiple of 64, then adjust B such that B-A==64 (add additional pairs
>> such that the entire original A to B range is covered). This may set
>> ZEROED bg_flags value on some GDs as a side effect but hopefully this
>> won't cause problems. I also verified that all A >= first_meta_bg -
>> else you need to find appropriate block directly following super-block.
>>
>> I only replaced the first two copies of the GDT blocks in each case,
>> could replace the third as well by adding a 32768*63 entry for o in the
>> inner for loop.
>>
>> fsck is now running and getting past the previous failure point. I am
>> hopeful that we'll at least recover some data.
>>
>> I do have a few questions regarding fsck though:
>>
>> 1. Will it notice that the backup GDT blocks are not the same as the
>> primaries and replace the backup blocks?
>>
>> 2. I'm assuming it will rebuild the reachability for all inodes
>> starting at / and thus be able to determine which inodes should be
>> free/in use, and rebuild bitmaps accordingly (as well as mark the inodes
>> themselves free)?
>>
>> 3. Sort out all buggered up checksums in the GDs I've changed?
>>
>> Anyway, perhaps someone can answer out of hand, otherwise time will
>> tell, to date:
>>
>> # e2fsck -y -f /dev/lvm/home
>> e2fsck 1.45.4 (23-Sep-2019)
>> Pass 1: Checking inodes, blocks, and sizes
>> Inodes that were part of a corrupted orphan linked list found. Fix? yes
>>
>> Inode 13844923 was part of the orphaned inode list. FIXED.
>> Inode 13844932 has an invalid extent node (blk 13666881850, lblk 27684864)
>> Clear? yes
>>
>> Kind Regards,
>> Jaco
>>
>> On 2020/01/28 14:53, Jaco Kroon wrote:
>>
>>> Hi Andreas,
>>>
>>> On 2020/01/28 01:15, Andreas Dilger wrote:
>>>> On Jan 27, 2020, at 10:52 AM, Jaco Kroon <[email protected]> wrote:
>>>>> Hi Andreas,
>>>>>
>>>>> Ok, so there were a few calc errors below. I believe I've rectified those. I've written code to both locate and read the GD, as well as generate the fields I plan to update, comparing the fields I plan to update against those of all known-good blocks. The only issue I've picked up is that I've realised I don't understand bg_flags at all. Extract from output of my program (it uses the read GD as template and then just overwrites the fields I'm calculating).
>>>>> Group 65542, bg_flags differs: 0 (actual) != 4 (generated)
>>>> The EXT2_BG_{INODE,BLOCK}_UNINIT flags mean that the inode/block bitmaps are
>>>> all zero and do not need to be read from disk. That should only be used for
>>>> groups that have no blocks or inodes allocated to them.
>>>>
>>>> The EXT2_BG_INODE_ZEROED flag means that the inode table has been zeroed out
>>>> on disk after the filesystem has been formatted. It is safest to set this
>>>> flag on the group, so that the kernel does not try to zero out the inode tables
>>>> again, especially if your bitmaps are not correct.
>>> That makes sense. Good explanation thank you.
>>>
>>>>> and to possibly inspect the data at inode_table looking for some kind of "magic" as to whether or not it's already initialized or not...
>>>> The group descriptors also have a checksum that needs to be valid.
>>> I'm hoping fsck will fix this after I've actioned mine?
>>>>> My code is available at http://downloads.uls.co.za/85T/ext4_rebuild_gds.c for anyone that cares. Currently some work pending.
>>>> It probably makes more sense to include this into e2fsck, so that it will be
>>>> usable for everyone.
>>> I agree. For now I needed quick and nasty. I have no objection to do a
>>> patch at some point on e2fsck, or someone else can use my throw-away
>>> code and refine it. For now I'm taking stdout from that and using dd to
>>> clobber the meta_bg GDT blocks, as well as replace the backups. One
>>> reason for not doing that from the start was simply that I had no idea
>>> where exactly this would need to go into e2fsck.
>>>> IMHO, there is a design flaw in the meta_bg disk format,
>>>> in that the last group(s) of the filesystem may have few/no backup copies, so
>>>> cannot easily be recovered in case of an error.
>>> 1 copy if there is only one group, 2 if there are fewer than meta_bg groups ... 3
>>> otherwise. Frankly I'm OK(ish) with three backups, one of which is
>>> typically ~128MB away from the first two. Perhaps not quite good
>>> enough. I get where you're coming from. This is the first time that
>>> I've personally experienced GD corruption failures, or at least, that I
>>> *know* what I'm looking at. Having said that, in >15 years this is only
>>> the third ext* filesystem that I've managed to corrupt to the point
>>> where fsck would not recover it. The previous one was a few years back,
>>> and that was due to a major hardware issue. The time before that was
>>> sheer incompetence and stupidity.
>>>
>>>> There is also the secondary
>>>> issue that meta_bg causes the metadata to be spread out over the whole disk,
>>>> which causes a *LOT* of seeking on a very large filesystem, on the order of
>>>> millions of seeks, which takes a lot of time to read.
>>> I can't comment to that. I don't know how often these things are
>>> actually read from disk, whether it's once-off during mount time, or
>>> whether there are further reads of these blocks after that. If it's
>>> once off I don't particularly care personally, if an 85TB file system
>>> takes a minute to mount ... so be it. If it's continuous ... yes that
>>> can become problematic I reckon.
>>>
>>> Thank you very much for the information, you've managed to (even though
>>> not tell me how to fix my issue) at the very least confirm some things
>>> for me. Aware this is goodwill, and that's more than I could ask for in
>>> all reality.
>>>
>>> Kind Regards,
>>> Jaco
>>>
>>>>> On 2020/01/27 12:24, Jaco Kroon wrote:
>>>>>> Hi Andreas,
>>>>>>
>>>>>> Thank you. The filesystem uses meta_bg and is larger than 16TB, my issue also doesn't exhibit in the first 16TB covered by those blocks.
>>>>>> On 2020/01/26 23:45, Andreas Dilger wrote:
>>>>>>> There are backups of all the group descriptors that can be used in such cases, immediately following the backup superblocks.
>>>>>>>
>>>>>>> Failing that, the group descriptors follow a very regular pattern and could be recreated by hand if needed (eg. all the backups were also corrupted for some reason).
>>>>>>>
>>>>>> You are right. It's however tricky to wrap your head around it. I think I've got it, but if you don't mind double checking me please:
>>>>>>
>>>>>> Given a group number g. sb = superblock.
>>>>>> gds_per_block = 2^(10+sb.s_log_block_size) / sb.s_desc_size = 4096 / 64 = 64
>>>>>> gd_block = floor(g / gds_per_block)
>>>>>> if (gd_block < ${sb.s_reserved_gdt_blocks})
>>>>>> phys_gd_block = gd_block + 1;
>>>>>> else
>>>>>> phys_gd_block = floor(gd_block / gds_per_block) * gds_per_block * sb.s_blocks_per_group
>>>>>>
>>>>>> phys_gd_block_offset = sb.s_desc_size * (g % gds_per_block)
>>>>>>
>>>>>> Backup blocks, are either with every superblock backup (groups 0, and groups being a power of 3, 5 and 7, ie, 0, 1, 3, 5, 7, 9, 25, 27, ...) where gd_block < ${sb.s_reserved_gdt_blocks); or
>>>>>> phys_gd_backup1 = phys_gd_block + sb.s_blocks_per_group
>>>>>> phys_gd_backup2 = phys_gd_block + sb.s_blocks_per_group * (gds_per_block - 1)
>>>>>>
>>>>>> offset stays the same.
>>>>>>
>>>>>> To reconstruct it's critical to fill the following fields, with the required calculations (gd = group descriptor, calculations for groups < 2^17, using 131072 as example):
>>>>>>
>>>>>> bitmap_group = floor(g / flex_groups) * flex_groups
>>>>>> => floor(131072 / 16) * 16 => 131072
>>>>>> gd.bg_block_bitmap = bitmap_group * blocks_per_group + g % flex_groups
>>>>>> => 131072 * 3768 + 131072 % 16
>>>>>> => 493879296
>>>>>> if (bitmap_group % gds_per_block == 0) /* if bitmap_group also houses a meta group block */
>>>>>> gd.bg_block_bitmap++;
>>>>>> => if (131072 % 64 == 0)
>>>>>> => if (0 == 0)
>>>>>> => gd.bg_block_bitmap = 493879297
>>>>>>
>>>>>> gd.bg_inode_bitmap = gd.bg_block_bitmap + flex_groups
>>>>>> => gd.bg_inode_bitmap = 493879297 + 16 = 493879313
>>>>>> gd.bg_inode_table = gd.bg_inode_bitmap + flex_groups - (g % flex_groups) + (g % flex_groups * inode_blocks_per_group)
>>>>>> => 493879313 + 16 - 0 + 0 * 32 = 493879329
>>>>>> Bad example, for g=131074:
>>>>>> =>493879315 + 16 - 2 + 2 * 32 = 493879393
>>>>>>
>>>>>> gd.bg_flags = 0x4
>>>>>>
>>>>>> I suspect it's OK to just zero (or leave them zero) these:
>>>>>>
>>>>>> bg_free_blocks_count
>>>>>> bg_free_inodes_count
>>>>>> bg_used_dirs_count
>>>>>> bg_exclude_bitmap
>>>>>> bg_itable_unused
>>>>>>
>>>>>> As well as all checksum fields (hopefully e2fsck will correct those).
>>>>>> I did action this calculation for a few non-destructed GDs and my manual calculations seems OK for the groups I checked (all into meta_bg area):
>>>>>>
>>>>>> 131072 (multiple of 64, meta_bg + flex_bg)
>>>>>> 131073 (multiple of 64 + 1 - first meta_bg backup)
>>>>>> 131074 (none of the others, ie, plain group with data blocks only)
>>>>>> 131088 (multiple of 16 but not 64, flex_bg)
>>>>>> 131089 (multiple of 16 but not 64, +1)
>>>>>> 131135 (multiple of 64 + 63 - second meta_bg backup)
>>>>>> It is however a very limited sample, but should cover all the corner cases I could think of based on the specification and my understanding thereof.
>>>>>>
>>>>>> Given the above I should be able to write a small program that will produce the 128 4KiB blocks that's required, and then I can use dd to place them into the correct locations.
>>>>>> As an aside, debugfs refuses to open the filesystem:
>>>>>>
>>>>>> crowsnest ~ # debugfs /dev/lvm/home
>>>>>> debugfs 1.45.4 (23-Sep-2019)
>>>>>> /dev/lvm/home: Block bitmap checksum does not match bitmap while reading allocation bitmaps
>>>>>> debugfs: chroot /
>>>>>> chroot: Filesystem not open
>>>>>> debugfs: quit
>>>>>> Which is probably fair. So for this one I'll have to go make modifications using either some programmatic tool that opens the underlying block device, or use dd trickery (ie, construct a GD and dd it into the right location, as a sequence of 64 GDs as it's always 64 GDs that's destroyed in my case, exactly 1 4KiB block).
>>>>>>
>>>>>> Kind Regards,
>>>>>> Jaco
>>>>>>> Cheers, Andreas
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 26, 2020, at 13:44, Jaco Kroon <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> So working through the dumpe2fs file, the group mentioned by dmesg
>>>>>>>> contains this:
>>>>>>>>
>>>>>>>> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>>>>>>>> Group descriptor at 13243514880
>>>>>>>> Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>>>> Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>>>>>>>> Inode table at 0-31 (bg #0 + 0)
>>>>>>>> 0 free blocks, 0 free inodes, 0 directories
>>>>>>>> Free blocks: 13243514880-13243547647
>>>>>>>> Free inodes: 206929921-206930432
>>>>>>>>
>>>>>>>> Based on that it's quite simple to see that during the array
>>>>>>>> reconstruction we apparently wiped a bunch of data blocks with all
>>>>>>>> zeroes. This is obviously bad. During reconstruction we had to zero one
>>>>>>>> of the disks before we could get the array to reassemble. What I'm
>>>>>>>> wondering is whether this process was a good choice now, and whether the
>>>>>>>> right disk was zeroed. Obviously this implies major data loss (at least
>>>>>>>> 4TB, probably more assuming that directory structures may well have been
>>>>>>>> destroyed as well, maybe less if some of those blocks weren't in use).
>>>>>>>>
>>>>>>>> I'm hoping that it's possible to recreate these group descriptors (there
>>>>>>>> are a few of them) to at least point to the correct locations on disk,
>>>>>>>> and to then attempt a cleanup with e2fsck. Again, data loss here is to
>>>>>>>> be expected, but if we can limit it at least that would be great.
>>>>>>>>
>>>>>>>> There are unfortunately a large bunch of groups affected (128 cases of
>>>>>>>> 64 consecutive group blocks).
>>>>>>>>
>>>>>>>> 32768 blocks/group => 128 * 64 * 32768 blocks => 268m blocks, at
>>>>>>>> 4KB/block => 1TB of data lost. However, this is extremely conservative
>>>>>>>> seeing that this could include directory structures with cascading effect.
>>>>>>>>
>>>>>>>> Based on the patterns of the first 64 group descriptors (GDs) it looks
>>>>>>>> like it should be possible to reconstruct the 8192 affected GDs, or
>>>>>>>> alternatively possibly "uninit" them
>>>>>>>> (
>>>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization
>>>>>>>> ).
>>>>>>>> I'm inclined to reason that it's probably safer to repair in the GDs the
>>>>>>>> following fields:
>>>>>>>>
>>>>>>>> bg_block_bitmap_{lo,hi}
>>>>>>>> bg_inode_bitmap_{lo,hi}
>>>>>>>> bg_inode_table_{lo,hi}
>>>>>>>>
>>>>>>>> I'm not sure about:
>>>>>>>>
>>>>>>>> bg_flags (I'm guessing the safest is to leave this zeroed).
>>>>>>>> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>>>>>>>>
>>>>>>>> The following should (as far as my understanding goes) then be "fixable"
>>>>>>>> by e2fsck:
>>>>>>>>
>>>>>>>> bg_free_blocks_count_{lo,hi}
>>>>>>>> bg_free_inodes_count_{lo,hi}
>>>>>>>> bg_used_dirs_count_{lo,hi}
>>>>>>>> bg_block_bitmap_csum_{lo,hi}
>>>>>>>> bg_inode_bitmap_csum_{lo,hi}
>>>>>>>> bg_itable_unused_{lo,hi}
>>>>>>>> bg_checksum
>>>>>>>>
>>>>>>>> And of course, tracking down the GD on disk will be tricky it seems. It
>>>>>>>> seems some blocks have the GD in the block, and a bunch of others don't
>>>>>>>> (nor does dumpe2fs say where exactly they are). There are 2048 blocks of
>>>>>>>> GDs (131072 or 2^17 GDs) with every superblock backup, however, from
>>>>>>>> group 2^17 onwards there are additional groups simply stating "Group
>>>>>>>> descriptor at ${first_block_of_group}", so it's unclear how to track
>>>>>>>> down the GD for a given block group.
>>>>>>>>
>>>>>>>> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
>>>>>>>>
>>>>>>>> does not describe this particularly well either, and there seems to be
>>>>>>>> confusion w.r.t. flex_bg and meta_bg features and this.
>>>>>>>>
>>>>>>>> I do have an LVM snapshot of the affected LV currently, so happy to try
>>>>>>>> things.
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Jaco
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>>>>>>>>> cases of same error I could find was from what I can find due to an SD
>>>>>>>>> card "swallowing" writes (ie, the card goes into a read-only mode but
>>>>>>>>> doesn't report write failure).
>>>>>>>>>
>>>>>>>>> crowsnest ~ # e2fsck -f /dev/lvm/home
>>>>>>>>>
>>>>>>>>> e2fsck 1.45.4 (23-Sep-2019)
>>>>>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>>>>>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>>>>>> /dev/lvm/home: recovering journal
>>>>>>>>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>>>>>>>>>
>>>>>>>>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>>>>>>>>
>>>>>>>>> I have also (using dumpe2fs) obtained the location of the backup super
>>>>>>>>> blocks and tried same against a few other superblocks using -b. -y (as
>>>>>>>>> per suggestion from at least one post) make absolutely no difference,
>>>>>>>>> our understanding is that this simply answers yes to all questions, so
>>>>>>>>> we didn't expect this to have impact but decided it was worth a try anyway.
>>>>>>>>>
>>>>>>>>> Looking at the code for the unable to set superblock error it looks like
>>>>>>>>> the code is in e2fsck/unix.c, specifically this:
>>>>>>>>>
>>>>>>>>> 1765 if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>>>>>>>>> 1766 if (ctx->options & E2F_OPT_READONLY) {
>>>>>>>>> ...
>>>>>>>>> 1771 } else {
>>>>>>>>> 1772 if (ctx->flags & E2F_FLAG_RESTARTED) {
>>>>>>>>> 1773 /*
>>>>>>>>> 1774 * Whoops, we attempted to run the
>>>>>>>>> 1775 * journal twice. This should never
>>>>>>>>> 1776 * happen, unless the hardware or
>>>>>>>>> 1777 * device driver is being bogus.
>>>>>>>>> 1778 */
>>>>>>>>> 1779 com_err(ctx->program_name, 0,
>>>>>>>>> 1780 _("unable to set superblock flags "
>>>>>>>>> 1781 "on %s\n"), ctx->device_name);
>>>>>>>>> 1782 fatal_error(ctx, 0);
>>>>>>>>> 1783 }
>>>>>>>>>
>>>>>>>>> That comment has me somewhat confused. I'm assuming the implication
>>>>>>>>> there is that e2fsck tried to update the superblock, but after reading
>>>>>>>>> it back, it's either unchanged or still wrong (In line with the
>>>>>>>>> description of the SD card I found online). None of our arrays are
>>>>>>>>> reflecting R/O in /proc/mdstat. We did pick out this in kernel bootup
>>>>>>>>> (we downgraded back to 5.1.15, which we're on currently, after
>>>>>>>>> experiencing major performance issues on 5.3.6 and subsequently 5.4.8
>>>>>>>>> didn't seem to fix those, and the 4.14.13 kernel that was used
>>>>>>>>> previously is known to cause ext4 corruption of the kind we saw on the
>>>>>>>>> other filesystems):
>>>>>>>>>
>>>>>>>>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for
>>>>>>>>> group 404160 overlaps superblock
>>>>>>>>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
>>>>>>>>>
>>>>>>>>> I created a dumpe2fs file as well:
>>>>>>>>>
>>>>>>>>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>>>>>>>>> dumpe2fs 1.45.4 (23-Sep-2019)
>>>>>>>>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>>>>>>>>> read '/dev/lvm/home' bitmaps
>>>>>>>>>
>>>>>>>>> Available at
>>>>>>>>> https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>>>>>>>>> (1.2GB,
>>>>>>>>> md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted)
>>>>>>>>>
>>>>>>>>> A strace of e2fsck -y -f /dev/lvm/home at
>>>>>>>>>
>>>>>>>>> https://downloads.uls.co.za/85T/fsck.strace.txt
>>>>>>>>> (13MB,
>>>>>>>>> md5:60aa91b0c47dd2837260218eb774152d)
>>>>>>>>>
>>>>>>>>> crowsnest ~ # tune2fs -l /dev/lvm/home
>>>>>>>>> tune2fs 1.45.4 (23-Sep-2019)
>>>>>>>>> Filesystem volume name: <none>
>>>>>>>>> Last mounted on: /home
>>>>>>>>> Filesystem UUID: 522a9faf-7992-4888-93d5-7fe49a9762d6
>>>>>>>>> Filesystem magic number: 0xEF53
>>>>>>>>> Filesystem revision #: 1 (dynamic)
>>>>>>>>> Filesystem features: has_journal ext_attr filetype meta_bg extent
>>>>>>>>> 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
>>>>>>>>> metadata_csum
>>>>>>>>> Filesystem flags: signed_directory_hash
>>>>>>>>> Default mount options: user_xattr acl
>>>>>>>>> Filesystem state: clean
>>>>>>>>> Errors behavior: Continue
>>>>>>>>> Filesystem OS type: Linux
>>>>>>>>> Inode count: 356515840
>>>>>>>>> Block count: 22817013760
>>>>>>>>> Reserved block count: 0
>>>>>>>>> Free blocks: 6874204745
>>>>>>>>> Free inodes: 202183498
>>>>>>>>> First block: 0
>>>>>>>>> Block size: 4096
>>>>>>>>> Fragment size: 4096
>>>>>>>>> Group descriptor size: 64
>>>>>>>>> Blocks per group: 32768
>>>>>>>>> Fragments per group: 32768
>>>>>>>>> Inodes per group: 512
>>>>>>>>> Inode blocks per group: 32
>>>>>>>>> RAID stride: 128
>>>>>>>>> RAID stripe width: 1024
>>>>>>>>> First meta block group: 2048
>>>>>>>>> Flex block group size: 16
>>>>>>>>> Filesystem created: Thu Jul 26 12:19:07 2018
>>>>>>>>> Last mount time: Sat Jan 18 18:58:50 2020
>>>>>>>>> Last write time: Sun Jan 26 11:38:56 2020
>>>>>>>>> Mount count: 2
>>>>>>>>> Maximum mount count: -1
>>>>>>>>> Last checked: Wed Oct 30 17:37:27 2019
>>>>>>>>> Check interval: 0 (<none>)
>>>>>>>>> Lifetime writes: 976 TB
>>>>>>>>> Reserved blocks uid: 0 (user root)
>>>>>>>>> Reserved blocks gid: 0 (group root)
>>>>>>>>> First inode: 11
>>>>>>>>> Inode size: 256
>>>>>>>>> Required extra isize: 32
>>>>>>>>> Desired extra isize: 32
>>>>>>>>> Journal inode: 8
>>>>>>>>> Default directory hash: half_md4
>>>>>>>>> Directory Hash Seed: 876a7d14-bce8-4bef-9569-82e7d573b7aa
>>>>>>>>> Journal backup: inode blocks
>>>>>>>>> Checksum type: crc32c
>>>>>>>>> Checksum: 0xfbd895e9
>>>>>>>>>
>>>>>>>>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>>>>>>>>> 10TB disks (100TB usable total). These are combined into a single VG
>>>>>>>>> using LVM, and then carved up into a number of LVs, the largest of which
>>>>>>>>> is this 85TB chunk. We have tried in the past to carve this into
>>>>>>>>> smaller LVs but failed. So we're aware that this is very large and not
>>>>>>>>> ideal.
>>>>>>>>>
>>>>>>>>> We did experience an assembly issue on one of the underlying RAID6 PVs,
>>>>>>>>> those have been resolved, and the disk that was giving issues has been
>>>>>>>>> scrubbed and rebuilt. From what we can tell based on other file systems,
>>>>>>>>> this did not affect data integrity but we can't make that statement with
>>>>>>>>> 100% certainty, as such we are expecting some data loss here but it
>>>>>>>>> would be better if we can recover at least some of this data.
>>>>>>>>>
>>>>>>>>> Other filesystems which also resides on the same PV that was affected by
>>>>>>>>> the RAID6 problem either received a clean bill of health, or were
>>>>>>>>> successfully repaired by e2fsck (the system did crash however, it's
>>>>>>>>> unclear whether the RAID6 assembly problem was the cause or merely
>>>>>>>>> another consequence, and as a result, whether the corruption on the
>>>>>>>>> repaired filesystem was a consequence of the kernel or the RAID).
>>>>>>>>>
>>>>>>>>> I'm continuing onwards with e2fsck code to try and figure this out, am
>>>>>>>>> hopeful though that someone could perhaps provide some much needed
>>>>>>>>> insight and pointers for me.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Jaco
>>>>>>>>>
>>>>>>>>>
>>>> Cheers, Andreas
>>>>
>>>>
>>>>
>>>>
>>>>


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-01-29 20:53:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

On Wed, Jan 29, 2020 at 06:35:41AM +0200, Jaco Kroon wrote:
> Hi,
>
> Inode 181716301 block 33554947 conflicts with critical metadata,
> skipping block checks.
>
> So the critical block stuff I'm guessing can be expected since a bunch
> of those tree structures probably got zeroed too.

It's possible that this was caused by the tree structures getting
written with garbage (33554947 is not zero, so it's not the extent
tree structure getting zeroed, by definition). If metadata checksums
are enabled, then the kernel would notice (and flag them with EXT4-fs
error reports) if extent trees were not correctly set up.

Another possibility is that the heuristics you used for guessing how to
reconstruct the block group descriptors were incorrect. Note that if
the file system has been grown, using on-line or off-line resize2fs,
the results may not be identical to how mke2fs would have laid out the
block groups. So trying to use the existing pattern of block group
descriptors to reconstruct missing ones is fraught with potential
problems.

If the file system has never been resized, and if you have exactly the
same version of e2fsprogs used to initially create the file system,
and if you have the exact same version of /etc/mke2fs.conf, and the
exact same command-line options to mke2fs, you might be able to use
"mke2fs -S" (see the mke2fs manpage) to rewrite the superblock and
block group descriptors. But if any of the listed assumptions can't
be assured, it's a dangerous thing to do.
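
Purely to illustrate the shape of that, a sketch follows; every option
below is a placeholder and would have to match exactly what was used
when the filesystem was created and last resized:

# DANGEROUS: -S rewrites the superblock and group descriptors in place.
# Dry-run first with -n (prints the layout without writing anything) and
# compare the result against dumpe2fs output taken from a snapshot.
mke2fs -n -b 4096 -I 256 -G 16 -O 64bit,meta_bg,flex_bg,metadata_csum /dev/lvm/home
# Only if that layout matches the on-disk layout exactly:
#   mke2fs -S -b 4096 -I 256 -G 16 -O 64bit,meta_bg,flex_bg,metadata_csum /dev/lvm/home
#   e2fsck -fy /dev/lvm/home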

> Another idea is to use debugfs to mark inode 181716301 as deleted, but
> I'm not sure that's safe at this stage?

Well, you'll lose whatever was in that inode, but the bigger problem is
that if the block group descriptors are incorrect, you'll cause even
more damage.

Did you make a full image backup of the good disks so you can revert any
experiments that you might try?

Good luck,

- Ted

P.S. For future reference, please take a look at the man page of
e2image for how you can back up an ext4 filesystem's critical metadata
blocks.

2020-01-30 10:56:27

by Jaco Kroon

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

Hi Ted,

On 2020/01/29 22:50, Theodore Y. Ts'o wrote:
> On Wed, Jan 29, 2020 at 06:35:41AM +0200, Jaco Kroon wrote:
>> Hi,
>>
>> Inode 181716301 block 33554947 conflicts with critical metadata,
>> skipping block checks.
>>
>> So the critical block stuff I'm guessing can be expected since a bunch
>> of those tree structures probably got zeroed too.
> It's possible that this was caused by the tree structures getting
> written with garbage (33554947 is not zero, so it's not the extent
> tree structure getting zeroed, by definition). If metadata checksums
> are enabled, then the kernel would notice (and flag them with EXT4-fs
> error reports) if extent trees were not correctly set up.
>
> Another possibility is that the heuristics you used for guessing how to
> reconstruct the block group descriptors were incorrect. Note that if
> the file system has been grown, using on-line or off-line resize2fs,
> the results may not be identical to how mke2fs would have laid out the
> block groups. So trying to use the existing pattern of block group
> descriptors to reconstruct missing ones is fraught with potential
> problems.
So my code did some extra work in that it regenerated the existing ones
too, and the only issues it picked up were with those GDs which were
"all zero".  So I'm fairly confident that what I've done is OK.  The
descriptions on the links I've previously posted made more and more
sense as I re-read them a few times, and were spot on with what was
found on disk for the non-damaged GDT blocks.  Other than bg_flags ...
which Andreas explained quite well.
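
For reference, the core of the location calculation my throw-away code
does is roughly the following (a sketch only, using the geometry from
the tune2fs output earlier in the thread; group 404160 is the one dmesg
complained about, and this reproduces the GD block dumpe2fs reported
for it):

g=404160
bpg=32768                        # blocks per group
dpb=64                           # descriptors per 4KiB block (4096 / 64)
first_grp=$(( g / dpb * dpb ))   # first group of this meta_bg metagroup
gd_block=$(( first_grp * bpg ))  # add 1 if first_grp carries a superblock
                                 # copy (group 0, 1, or a power of 3, 5, 7)
offset=$(( (g % dpb) * 64 ))     # byte offset of this group's descriptor
echo "group $g: primary GD block $gd_block, offset $offset"
echo "backups live in groups $(( first_grp + 1 )) and $(( first_grp + dpb - 1 ))"
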
>
> If the file system has never been resized, and if you have exactly the
> same version of e2fsprogs used to initially create the file system,
> and if you have the exact same version of /etc/mke2fs.conf, and the
> exact same command-line options to mke2fs, you might be able to use
> "mke2fs -S" (see the mke2fs manpage) to rewrite the superblock and
> block group descriptors. But if any of the listed assumptions can't
> be assured, it's a dangerous thing to do.

It has been resized a few times, always online, generally in increments
of 1TB at a time.  I can't remember all the arguments and such though,
and I have definitely upgraded e2fsprogs in the meantime.

Hehehe, "dangerous" is still an option at this point: compared to
reformatting and definitely losing all the data, I can only win.  And LVM
snapshots are helpful w.r.t. being able to roll back, but it can't get
worse than "complete data loss", which is where I'm currently at.

>
>> Another idea is to use debugfs to mark inode 181716301 as deleted, but
>> I'm not sure that's safe at this stage?
> Well, you'll lose whatever was in that inode, but the bigger problem is
> that if the block group descriptors are incorrect, you'll cause even
> more damage.
>
> Did you make a full image backup of the good disks so you can revert any
> experiments that you might try?

LVM snapshot yes.  Don't have 85T just lying around elsewhere to dd onto.
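
For anyone following along, the rollback arrangement is roughly the
following (a sketch of the idea rather than the exact sequence used
here; the snapshot size is a placeholder for however much CoW space the
experiments end up dirtying):

lvcreate -s -n snap_home -L 2T lvm/home   # point-in-time copy of the origin
lvchange -p r lvm/snap_home               # mark it read-only as a safety net
# ... run experiments against /dev/lvm/home ...
# to roll the origin back to the snapshot if an experiment goes wrong:
#   lvconvert --merge lvm/snap_home
#   (the merge completes on the next activation if the origin is busy)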

>
> Good luck,
>
> - Ted
>
> P.S. For future reference, please take a look at the man page of
> e2image for how you can back up an ext4 filesystem's critical metadata
> blocks.
>
This is great!  I'll definitely add that to my bag of tricks,
especially for this particular server, which houses most of our backups
for other hosts.
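
For my own notes, the basic usage would be along these lines (the path
is just an example; taken from the man page and not yet tested here):

e2image /dev/lvm/home /var/tmp/home.e2i    # metadata-only backup; much
                                           # smaller than the filesystem
# and, should it ever be needed, the saved metadata can be written back:
#   e2image -I /dev/lvm/home /var/tmp/home.e2i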

Kind Regards,
Jaco

2020-01-30 10:57:23

by Jaco Kroon

[permalink] [raw]
Subject: Re: e2fsck fails with unable to set superblock

Hi,

On 2020/01/29 22:00, Andreas Dilger wrote:
> On Jan 28, 2020, at 9:35 PM, Jaco Kroon <[email protected]> wrote:
>> Hi,
>>
>> Inode 181716301 block 33554947 conflicts with critical metadata,
>> skipping block checks.
>> Inode 181716301 block 524296 conflicts with critical metadata, skipping
>> block checks.
>> Inode 181716301 block 2 conflicts with critical metadata, skipping block
>> checks.
>> Inode 181716301 block 294 conflicts with critical metadata, skipping
>> block checks.
>> Inode 181716301 block 1247805839 conflicts with critical metadata,
>> skipping block checks.
>> Inode 181716301 block 288 conflicts with critical metadata, skipping
>> block checks.
>> Inode 181716301 block 103285040 conflicts with critical metadata,
>> skipping block checks.
>> Inode 181716301 block 872415232 conflicts with critical metadata,
>> skipping block checks.
>> Inode 181716301 block 2560 conflicts with critical metadata, skipping
>> block checks.
>> Inode 181716301 block 479199248 conflicts with critical metadata,
>> skipping block checks.
>> Inode 181716301 block 1006632963 conflicts with critical metadata,
>> skipping block checks.
> This inode is probably just random garbage. Erase that inode with:
>
> debugfs -w -R "clri <181716301>" /dev/sdX
>
> There may be multiple such inodes with nearby numbers in the likely
> case that a whole block is corrupted. There has been some discussion
> about the best way to handle such corruption of a whole inode table
> block, but nothing has been implemented in e2fsck yet.

crowsnest ~ # debugfs -w -R "clri <181716301>" /dev/lvm/home
debugfs 1.45.4 (23-Sep-2019)
/dev/lvm/home: Block bitmap checksum does not match bitmap while reading
allocation bitmaps
clri: Filesystem not open

Adding -n sorts that out.

There were a few other inodes too; I wiped those as well and have
restarted fsck now.
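
For completeness, the repeated clears were just a loop along these lines
(the range shown is only illustrative: the run of 16 inodes, at 256
bytes per inode in a 4KiB block, that share 181716301's inode-table
block; I only cleared the inodes fsck actually flagged):

# 181716289..181716304 is the 16-inode run containing inode 181716301
for ino in $(seq 181716289 181716304); do
    debugfs -n -w -R "clri <$ino>" /dev/lvm/home
done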

>
>> So the critical block stuff I'm guessing can be expected since a bunch
>> of those tree structures probably got zeroed too.
>>
>> It got killed because it ran out of RAM (OOM killer), 32GB physical +
>> 16GB swap. I've extended swap to 512GB now and restarted. It's
>> probably overkill (I hope).
>>
>> Any ideas on what might be consuming the RAM like this? Unfortunately
>> my scroll-back doesn't go back far enough to see what other inodes if
>> any are also affected. I've restarted with 2>&1 | tee /var/tmp/fsck.txt
>> now.
>>
>> Happy to go hunting to look for possible optimization ideas here ...
>>
>> Another idea is to use debugfs to mark inode 181716301 as deleted, but
>> I'm not sure that's safe at this stage?
> Marking it "deleted" isn't really the right thing, since (AFAIR) that
> will just update the inode bitmap and possibly set the "i_dtime" field
> in the inode. The "clri" command will zero out the inode, which erases
> all of the bad block allocation references for that inode. This is no
> "loss" since the inode is already garbage.

Agreed and makes perfect sense.

Kind Regards,
Jaco