2022-04-16 00:04:52

by Peter Urbanec

[permalink] [raw]
Subject: resize2fs on ext4 leads to corruption

I think I may have run into a resize2fs bug that leads to data loss. I
see this:

# mount -t ext4 /dev/md0 /mnt/RED8
mount: /mnt/RED8: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or helper program, or other error.

Besides reporting the issue and gathering as much information as I can
to help debug this, I'd also like to ask for some assistance trying to
recover
the data. I'm prepared to put in some effort. I'm on Gentoo and can
apply git patches and rebuild the kernel or compile e2fsprogs.

The system is a *32-bit* Gentoo installation built well over a decade
ago, but is kept reasonably up to date.

# uname -a
Linux gw 5.16.9-gentoo #2 SMP Sun Feb 13 21:19:40 AEDT 2022 i686
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz GenuineIntel GNU/Linux

sys-fs/e2fsprogs
     Installed versions:  1.46.4(11:13:09 04/01/22)(nls split-usr
threads -cron -fuse -lto -static-libs)

sys-libs/e2fsprogs-libs
     Installed versions:  1.46.4-r1(17:26:02 02/01/22)(split-usr
-static-libs ABI_MIPS="-n32 -n64 -o32" ABI_S390="-32 -64" ABI_X86="32
-64 -x32")


Here is the sequence of steps that lead to data loss:

Added one 8TB disk to a md raid5 array:

# mdadm --add /dev/md0 /dev/sdi1
# mdadm --grow --raid-devices=4
--backup-file=/root/grow_md0_20220410.bak  /dev/md0

[183222.697484] md: md0: reshape done.
[183222.866677] md0: detected capacity change from 31255572480 to
46883358720

md0 : active raid5 sdi1[4] sda1[3] sdh1[1] sdg1[0]
      23441679360 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] [UUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

# umount /mnt/RED8

# tune2fs -E stride=128,stripe_width=384  /dev/md0

# fsck.ext4 -f -v -C 0 -D /dev/md0

# mount -t ext4 /dev/md0 /mnt/RED8

At this stage I used the system for about a week without any issues.
Then earlier today:

# umount /mnt/RED8

# e2fsck -f /dev/md0
e2fsck 1.46.4 (18-Aug-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
RED8: 11665593/488370176 files (0.6% non-contiguous),
3434311347/3906946560 blocks

# resize2fs /dev/md0
resize2fs 1.46.4 (18-Aug-2021)
Resizing the filesystem on /dev/md0 to 5860419840 (4k) blocks.
The filesystem on /dev/md0 is now 5860419840 (4k) blocks long.

So far so good. Everything appears to be working just fine until now.

# mount -t ext4 /dev/md0 /mnt/RED8
mount: /mnt/RED8: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or helper program, or other error.

# dumpe2fs -h /dev/md0
dumpe2fs 1.46.4 (18-Aug-2021)
dumpe2fs: Bad magic number in super-block while trying to open /dev/md0
Couldn't find valid filesystem superblock.

# dumpe2fs -o superblock=32768  -h /dev/md0
dumpe2fs 1.46.4 (18-Aug-2021)
Filesystem volume name:   RED8
Last mounted on:          /exported/Music
Filesystem UUID:          1e999cb8-12b2-4ab7-b41b-c77fd267a102
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
sparse_super2 filetype extent 64bit flex_bg large_dir inline_data
sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              732553216
Block count:              5860419840
Reserved block count:     0
Free blocks:              2410725583
Free inodes:              720887623
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4096
Inode blocks per group:   256
RAID stride:              128
RAID stripe width:        384
Flex block group size:    16
Filesystem created:       Wed Jan  2 01:42:39 2019
Last mount time:          Mon Apr 11 00:39:58 2022
Last write time:          Fri Apr 15 17:53:06 2022
Mount count:              0
Maximum mount count:      -1
Last checked:             Fri Apr 15 17:04:07 2022
Check interval:           0 (<none>)
Lifetime writes:          9 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      7f7889a7-4ff4-4bbb-a7d0-5e9821e7e70b
Journal backup:           inode blocks
Backup block groups:      1 178845
Checksum type:            crc32c
Checksum:                 0x47b714d9
Journal features:         journal_incompat_revoke journal_64bit
journal_checksum_v3
Total journal size:       1024M
Total journal blocks:     262144
Max transaction length:   262144
Fast commit length:       0
Journal sequence:         0x00064746
Journal start:            0
Journal checksum type:    crc32c
Journal checksum:         0x262ca522

# e2fsck -f -C 0 -b 32768 -z /root/20220415_2015_e2fsck-b_32768.e2undo
/dev/md0
e2fsck 1.46.4 (18-Aug-2021)
Overwriting existing filesystem; this can be undone using the command:
    e2undo /root/20220415_2015_e2fsck-b_32768.e2undo /dev/md0

e2fsck: Undo file corrupt while trying to open /dev/md0

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

In light of recent mailing list traffic, I suspect that the issue may be
caused by sparse_super2 .

Any suggestions as to what I could try to recover? Unfortunately, I do
not have an undo file for the resize2fs run (which is a bit unusual for
me, since I usually tend to take advantage of safety features).

Thanks,

    Peter Urbanec


2022-04-16 02:09:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: resize2fs on ext4 leads to corruption

On Sat, Apr 16, 2022 at 12:37:29AM +1000, Peter Urbanec wrote:
> # e2fsck -f -C 0 -b 32768 -z /root/20220415_2015_e2fsck-b_32768.e2undo
> /dev/md0
> e2fsck 1.46.4 (18-Aug-2021)
> Overwriting existing filesystem; this can be undone using the command:
> ??? e2undo /root/20220415_2015_e2fsck-b_32768.e2undo /dev/md0
>
> e2fsck: Undo file corrupt while trying to open /dev/md0
>
> The superblock could not be read or does not describe a valid ext2/ext3/ext4
> filesystem.? If the device is valid and it really contains an ext2/ext3/ext4
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
> ??? e2fsck -b 8193 <device>
> ?or
> ??? e2fsck -b 32768 <device>

So the failure of "e2fsck -f -C 0 -b 32768 -z /root/e2fsck.e2undo
/dev/md0" appears to be a bug where e2fsck doesn't work correctly with
an undo file when using a backup superblock. I can replicate this
using these commands:

mke2fs -q -t ext4 /tmp/foo.img 2G
e2fsck -b 32768 -z /tmp/undo /tmp/foo.img

Running e2fsck without the -z option succeeds. The combination of the
-b and -z option seems to be broken. As a workaround, I would suggest
doing is to try running e2fsck with -n, which will open the block
device read-only, e.g. "e2fsck -b 32768 -n /dev/mdXX". If the changes
e2fsck look safe, then you can run e2fsck without the -n option.


As far as what happens, I wasn't able to replicate the problem when
resizing an empty file system from 3906946560 to 5860419840 blocks,
using a 64-bit binary. I've stopped testing 32-bit builds quite a
while ago, and filling the file system would take more time than I
have at the moment. I will note that 3906946560 fits in 32-bits,
while 5860419840 is larger than 2**32. So it could very much be some
kind of a 32-bit block number overflow.

For better or for worse, I don't have the resources or time to test
the full set of combinations of file system features, and as I
mentioned, I don't test 32-bit builds any more, either. Enterprise
distributions will provide paid support, but they tend to only support
a limited number of file system features, and not the full set of
combination of features.

While I'm grateful that Gentoo users seem to be super adventurous in
terms of turning every single feature they can find, and then send in
bug reports so we can improve the case base --- there are reasons why
features like sparse_super2 and inline_data are not enabled by
default, and I feel bad when people turn on non-standard features, and
then lose data because they aren't doing backups and they enable
features that haven't received as much testing as the default
features.

So I don't know if the problemw as due to some kind of bug in
resize2fs caused by the use of 64-bit block numbers on a 32-bit
binary, or due to the enablement of the sparse_super2 feature. But
for now, let's see if we can get recover your file system, hopefully
with minimal data loss. And since the using an undo file seems to be
problematic with running e2fsck -b, the alternatives are (a) do a full
block level backup of the file system before running e2fsck --- which
I know will be hard since this is a very large file system, and (b)
using e2fsck -n first and looking at what e2fsck would do first.

I will look at why "e2fsck -b 32768 -z /tmp/foo.undo /tmp/foo.img"
fails, but I may not get to it in a week or two.

Cheers,

- Ted





2022-04-18 12:39:56

by Peter Urbanec

[permalink] [raw]
Subject: Re: resize2fs on ext4 leads to corruption

First, thank you for your help with this. I finally managed to get some
time today to take another look.

On 16/04/2022 07:22, Theodore Ts'o wrote:
> As a workaround, I would suggest
> doing is to try running e2fsck with -n, which will open the block
> device read-only, e.g. "e2fsck -b 32768 -n /dev/mdXX". If the changes
> e2fsck look safe, then you can run e2fsck without the -n option.

First I updated e2fsprogs to 1.46.5 and then tried:

# e2fsck -f -C 0 -b 32768 -n /dev/md0
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Error reading block 32 (Attempt to read block from filesystem resulted
in short read). Ignore error? no

e2fsck: aborted

Initially things looked promising and there were no errors being
reported even at 47% of the way through the check. The check terminated
at approximately the 50% mark with what appears to be a 32-bit overflow
issue.

Looking at the code shows that the `block` member of `struct_io_channel`
has a type of `unsigned long`, which is 4 bytes on a 32-bit system. As a
result, `e2fsck_handle_(read|write)_error()` can not actually handle
blocks past the 32-bit limit. `e2fsck_handle_read_error()` is called via
`channel->read_error()` from `raw_read_blk()` in `unix_io.c`. That
function uses `unsigned long long` for `block`. I have not looked
closely enough to determine whether the 32-bit overflow will only be an
issue when the error handler is invoked.

Ideally, the `block` member of `struct_io_channel` would be of type
`unsigned long long`, but changing this may introduce binary
compatibility issues. As an alternative, it may be prudent to perform an
early check for the total number of blocks in a file system and refuse
to run e2fsck (and other tools) if that number would cause an overflow.

I have not followed the code any further to see what may have happened
during resize2fs.

> As far as what happens, I wasn't able to replicate the problem when
> resizing an empty file system from 3906946560 to 5860419840 blocks,
> using a 64-bit binary. I've stopped testing 32-bit builds quite a
> while ago

If it is helpful, I can run some ad-hoc tests on the affected 32-bit
system, now with e2fsprogs 1.46.5. I've got enough free space to create
about ~2TB file for testing purposes. If that is enough for an empty
test file system, please let me know the sequence of test commands and
I'll provide the results.

> 3906946560 fits in 32-bits, while 5860419840 is larger than 2**32.
> So it could very much be some kind of a 32-bit block number overflow.

It almost certainly there is at least some aspect that is related to
32-bit overflows. For that reason I'm not going to test my luck running
anything that would further modify this file system on the 32-bit
installation. I'm in the process of building a 64-bit system and will
transplant the HDDs to this new machine. The plan is to go with kernel
5.17.3 + e2fsprogs 1.46.5 and try the above `e2fsck -n` command to see
if I have more luck on a 64-bit host.


> While I'm grateful that Gentoo users seem to be super adventurous in
> terms of turning every single feature they can find, and then send in
> bug reports so we can improve the case base --- there are reasons why
> features like sparse_super2 and inline_data are not enabled by
> default

Indeed, Gentoo tends to attract people who read man pages while tweaking
all the options, rather than go with defaults. I can't speak for others,
but in my case, when I created the file system, I systematically went
though the documented options and turned on anything that looked like it
would be applicable to the expected use case for the file system. I
tried to stay away from options that were documented as not robust, for
example `bigalloc`. I suppose it's a bit like premature optimisation - I
should have left most things at default values and only start making
changes if/when the need arises.

I should probably rethink my storage strategy. I certainly need to move
from 32-bit to 64-bit, the hardware is capable. One big ext4 file system
directly on top of md-raid5 has served me well for two decades. Maybe
that is no longer a smart option when that one file system has grown to
24TB.

> So I don't know if the problemw as due to some kind of bug in
> resize2fs caused by the use of 64-bit block numbers on a 32-bit
> binary, or due to the enablement of the sparse_super2 feature.

I used `dd` to made copies of superblocks 1 and 32768. I could not use
`dd` to get a copy of superblock 5860392960 - maybe another case of a
32-bit limit.

A quick hexdump comparison of superblocks 1 and 32768 shows that
superblock 1 has been zeroed out and only a few fields have been
repopulated with values that match superblock 32768. It also appears
that the checksum has been recomputed using the empty fields as it
differs between the two superblocks.

It looks like only the following fields have values set in superblock 1:
s_inodes_count = 00 E0 A9 2B
s_blocks_count_lo = 00 E9 4E 5D
s_free_blocks_count_lo = CF C0 B0 8F
s_free_inodes_count = 47 DF F7 2A
s_wtime = 62 24 00 00
s_state = 01 00
s_blocks_count_hi = 01 00 00 00
s_kbytes_written = 20 9E 25 4F 00 00 00 00
s_backup_bgs[1] = 9D BA 02 00

superblock 1 only has a value in s_backup_bgs[1] whereas superblock
32768 has both values set:
s_backup_bgs[0] = 01 00 00 00
s_backup_bgs[1] = 9D BA 02 00

A spot check of the `s_blocks_count_(lo|hi)` values indicates the
correct size in 64-bit number of 4k blocks.

If it will help, I can supply these two superblocks.


> alternatives are (a) do a full
> block level backup of the file system before running e2fsck --- which
> I know will be hard since this is a very large file system

I could probably scrounge up enough old HDDs and SATA controllers to
concatenate a non-redundant 16TB volume, but I don't have enough gear to
stretch it to 24TB.

Once I transplant the drives to a 64-bit machine, is there a way I could
use e2image to create a file that I can use to test whether an e2fsck
run will work?

Thank you,

Peter

2022-04-19 17:05:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: resize2fs on ext4 leads to corruption

On Mon, Apr 18, 2022 at 07:33:38PM +1000, Peter Urbanec wrote:
> First I updated e2fsprogs to 1.46.5 and then tried:
>
> # e2fsck -f -C 0 -b 32768 -n /dev/md0
> e2fsck 1.46.5 (30-Dec-2021)
> Pass 1: Checking inodes, blocks, and sizes
> Error reading block 32 (Attempt to read block from filesystem resulted in
> short read). Ignore error? no
>
> e2fsck: aborted
>
> Initially things looked promising and there were no errors being reported
> even at 47% of the way through the check. The check terminated at
> approximately the 50% mark with what appears to be a 32-bit overflow issue.

The 32-bit overflow issue is in the reporting of the error.
struct_io_channel's read_error() callback function, which is set to
e2fsck_handle_read_error(), is only called when there is an I/O error
(that's the short read).

So it's not the cause of the failure, but because of that bug, we
don't know what the block number was where the apparently I/O error
was found. Did you check the system logs to see if there was any
kernel messages that mgiht give us a hint?

Since you built e2fsprogs 1.42.6, can you build it making sure that
debugging symbols are included, and then run the binary out under a
debugger so when you get the error, you can ^C to escape out to the
debugger, and then print the stack trace and step up to see where in
the e2fsck prorgam we were trying to read the block, and what the
block number should have been. So something like:

./configure CFLAGS=-g ; make ; gdb e2fsck/e2fsck

> Ideally, the `block` member of `struct_io_channel` would be of type
> `unsigned long long`, but changing this may introduce binary compatibility
> issues. As an alternative, it may be prudent to perform an early check for
> the total number of blocks in a file system and refuse to run e2fsck (and
> other tools) if that number would cause an overflow.

Well, what we can do is to add new callback functions which are 64-bit
clean, and new versions of e2fsck can provide both the 32-bit and
64-bit error functions.

Refusing to run is probably a bit extreme, since this merely is a
cosmetic issue, and only when there is an I/O error.

Creating 64-bit clean calllback functions io_channel's read_error()
and write_error() was always part of the plan, and I thought we had
done it already --- but because it was less critical since RAID arrays
never fail ("What never? Well, hardly ever!"[1]) I just never got around
to it. :-/

[1] https://gsarchive.net/pinafore/web_opera/pin04.html

> Once I transplant the drives to a 64-bit machine, is there a way I could use
> e2image to create a file that I can use to test whether an e2fsck run will
> work?

An alternative which might be easier would be to create a scratch test
file system, using dm-thin which will allow you to simulate a very
large file system with thin-provisioning[2]. It will only take
(roughly) as much space as you write into it.

[2] https://wiki.archlinux.org/title/LVM#Thin_provisioning

Linux will tend to spread the use of its block groups over the LBA
space, as you start filling it and create a bunch of directories. So
if you, say, unpack a Linux kernel and then build it, and then unpack
a different Linux kernel version tree in to a different directory
hierarchy and build it, that should be enough to make sure that you
are using a variety of block groups spaced out through the block
device. You could then try running e2fsck on a 32-bit platform, and
see if you can replicate the problem, and then do an off-line resize,
etc., without risking your data and without requiring a huge amount of
space.

Cheers,

- Ted

2022-04-25 23:08:32

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH] e2fsck: Always probe filesystem blocksize with simple io_manager

"Theodore Ts'o" <[email protected]> writes:

> So the failure of "e2fsck -f -C 0 -b 32768 -z /root/e2fsck.e2undo
> /dev/md0" appears to be a bug where e2fsck doesn't work correctly with
> an undo file when using a backup superblock. I can replicate this
> using these commands:
>
> mke2fs -q -t ext4 /tmp/foo.img 2G
> e2fsck -b 32768 -z /tmp/undo /tmp/foo.img
>
> Running e2fsck without the -z option succeeds. The combination of the
> -b and -z option seems to be broken. As a workaround, I would suggest
> doing is to try running e2fsck with -n, which will open the block
> device read-only, e.g. "e2fsck -b 32768 -n /dev/mdXX". If the changes
> e2fsck look safe, then you can run e2fsck without the -n option.

Ted,

I think this is a fix for the combination of -b and -z.

Thanks,

>8

Combining superblock (-b) with undo file (-z) fails iff the block size
is not specified (-B) and is different from the first blocksize probed
in try_open_fs (1k). The reason is as follows:

try_open_fs will probe different blocksizes if none is provided on the
command line. It is done by opening and closing the filesystem until it
finds a blocksize that makes sense. This is fine for all io_managers,
but undo_io creates the undo file with that blocksize during
ext2fs_open. Once try_open_fs realizes it had the wrong blocksize and
retries with a different blocksize, undo_io will read the previously
created file and think it's corrupt for this filesystem.

Ideally, undo_io would know this is a probe and would fix the undo file.
It is not simple, though, because it would require undo_io to know the
file was just created by the probe code, since an undo file survives
through different fsck sessions. We'd have to pass this information
around somehow. This seems like a complex change to solve a corner
case.

Instead, this patch changes the blocksize probe to always use the
unix_io_manager. This way, we safely probe for the blocksize without
side effects. Once the blocksize is known, we can safely reopen the
filesystem under the proper io_manager.

An easily reproducer for this issue (from Ted, adapted by me) is:

mke2fs -b 4k -q -t ext4 /tmp/foo.img 2G
e2fsck -b 32768 -z /tmp/undo /tmp/foo.img

Reported-by: Peter Urbanec <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
e2fsck/unix.c | 41 ++++++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/e2fsck/unix.c b/e2fsck/unix.c
index ae231f93deb7..341b484e6ede 100644
--- a/e2fsck/unix.c
+++ b/e2fsck/unix.c
@@ -1171,25 +1171,32 @@ static errcode_t try_open_fs(e2fsck_t ctx, int flags, io_manager io_ptr,
errcode_t retval;

*ret_fs = NULL;
- if (ctx->superblock && ctx->blocksize) {
- retval = ext2fs_open2(ctx->filesystem_name, ctx->io_options,
- flags, ctx->superblock, ctx->blocksize,
- io_ptr, ret_fs);
- } else if (ctx->superblock) {
- int blocksize;
- for (blocksize = EXT2_MIN_BLOCK_SIZE;
- blocksize <= EXT2_MAX_BLOCK_SIZE; blocksize *= 2) {
- if (*ret_fs) {
- ext2fs_free(*ret_fs);
- *ret_fs = NULL;
+
+ if (ctx->superblock) {
+ unsigned long blocksize = ctx->blocksize;
+
+ if (!blocksize) {
+ for (blocksize = EXT2_MIN_BLOCK_SIZE;
+ blocksize <= EXT2_MAX_BLOCK_SIZE; blocksize *= 2) {
+
+ retval = ext2fs_open2(ctx->filesystem_name,
+ ctx->io_options, flags,
+ ctx->superblock, blocksize,
+ unix_io_manager, ret_fs);
+ if (*ret_fs) {
+ ext2fs_free(*ret_fs);
+ *ret_fs = NULL;
+ }
+ if (!retval)
+ break;
}
- retval = ext2fs_open2(ctx->filesystem_name,
- ctx->io_options, flags,
- ctx->superblock, blocksize,
- io_ptr, ret_fs);
- if (!retval)
- break;
+ if (retval)
+ return retval;
}
+
+ retval = ext2fs_open2(ctx->filesystem_name, ctx->io_options,
+ flags, ctx->superblock, blocksize,
+ io_ptr, ret_fs);
} else
retval = ext2fs_open2(ctx->filesystem_name, ctx->io_options,
flags, 0, 0, io_ptr, ret_fs);
--
2.35.1