2008-12-04 22:26:51

by Eric Sandeen

[permalink] [raw]
Subject: tune2fs -I seems dangerous

As a small experiment...

dd if=/dev/zero of=fsfile bs=1M count=16
mkfs.ext4 -F -I 128 fsfile
mkdir -p mnt
mount -o loop fsfile mnt
for I in `seq 1 4096`; do echo $I > mnt/file.$I; done
umount mnt
tune2fs -I 256 fsfile
e2fsck -fy fsfile

... this yields 10031 lines of fsck output, and results in about 38% of
the files that were on the filesystem going missing.

I don't have the strong sense that tune2fs -I has been shaken out at
all; should it be shipping as a useable option?

Thanks,
-Eric






Subject: Re: tune2fs -I seems dangerous

Eric Sandeen wrote:
> As a small experiment...
> [snip]
> ... this yields 10031 lines of fsck output, and results in about 38% of
> the files that were on the filesystem going missing.
I have had a similar experience converting from 128 to 256 bytes inodes.
After a while, tune2fs -I would simply stop doing IO and using a lot of
CPU with a few IO bursts every half-hour or so. This is on a recent/fast
x86-64 computer. I had to cancel the thing after leaving it running for
over 24 hours.
>
> I don't have the strong sense that tune2fs -I has been shaken out at
> all; should it be shipping as a useable option?
Maybe add a --accept-consequences-of-shooting-myself-in-the-foot flag ?

Cheers,
Jonathan

2008-12-05 01:24:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: tune2fs -I seems dangerous

On Thu, Dec 04, 2008 at 04:26:47PM -0600, Eric Sandeen wrote:
> As a small experiment...
>
> dd if=/dev/zero of=fsfile bs=1M count=16
> mkfs.ext4 -F -I 128 fsfile
> mkdir -p mnt
> mount -o loop fsfile mnt
> for I in `seq 1 4096`; do echo $I > mnt/file.$I; done
> umount mnt
> tune2fs -I 256 fsfile
> e2fsck -fy fsfile
>
> ... this yields 10031 lines of fsck output, and results in about 38% of
> the files that were on the filesystem going missing.

Looks like the problem is that tune2fs -I was only tested on ext3
filesystem. It blows up rather spectacularly on filesystems with the
flex_bg option, and it's apparently not updating the checksums if the
uninit_bg option is specified.

> I don't have the strong sense that tune2fs -I has been shaken out at
> all; should it be shipping as a useable option?

It needs some TLC, that's for certain. Move of the code was copied
from resize2fs, so it's pretty paranoid about error checking and so
on. The major problems from when the code was adapted for use in
expanding the inode table, and the algorithm that tries to do that
work.

The major problem is seems to be that it's not double checking to make
sure that all of the blocks that it needs to move in order to expand
the inode table are in fact moveable. Specifically, the code is not
checking and will blindly assume success when in fact things are *not*
successful under the following conditions:

1) Flex_bg is enabled, and there is an inode table for a subsequent
block group immediately following the inode table.

2) There is a block from the bad block inode immediately following
the inode table (which is really bad). Tune2fs -I will not notice,
relocate the block in the bad block, and then write the inode table
onto the bad block, possibly causing the loss of up to 16 inodes
per bad block immediately following the inode table.

3) The filesystem is formatted for RAID so there is stride setting
which causes the block or inode bitmap to be located immediately
following the inode. This will be caught be e2fsck, if the user is
paranoid enough to run e2fsck immediately after tune2fs -I.

I think is fair, though, to say that tune2fs -I code was written by
someone who wasn't sufficiently paranoid to think through all of the
failure cases. There is in fact a FIXME!! comment for case #2, but at
the very least what should have happend is that the move_block should
keep track of how many blocks were moved, and if it wasn't equal to
needed blocks, it should have signalled an error because it would have
indicated either a programming bug or a hardware bug or a filesystem
corruption bug. Either way, it shouldn't move forward because there
is the risk that users' files might get destroyed.

- Ted

2008-12-05 01:25:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: tune2fs -I seems dangerous

On Thu, Dec 04, 2008 at 06:14:36PM -0500, Jonathan Bastien-Filiatrault wrote:
> I have had a similar experience converting from 128 to 256 bytes inodes.
> After a while, tune2fs -I would simply stop doing IO and using a lot of
> CPU with a few IO bursts every half-hour or so. This is on a recent/fast
> x86-64 computer. I had to cancel the thing after leaving it running for
> over 24 hours.

The speed problems have been reported and mostly addressed in the
e2fsprogs git repository. There's one more patch that should
completely solve the tune2fs -I performance problem, but I haven't had
time to audit the patch just yet. It is on my to do list.

- Ted

2008-12-05 03:08:42

by Eric Sandeen

[permalink] [raw]
Subject: Re: tune2fs -I seems dangerous

Theodore Tso wrote:

> I think is fair, though, to say that tune2fs -I code was written by
> someone who wasn't sufficiently paranoid to think through all of the
> failure cases. There is in fact a FIXME!! comment for case #2, but at
> the very least what should have happend is that the move_block should
> keep track of how many blocks were moved, and if it wasn't equal to
> needed blocks, it should have signalled an error because it would have
> indicated either a programming bug or a hardware bug or a filesystem
> corruption bug. Either way, it shouldn't move forward because there
> is the risk that users' files might get destroyed.

So for kicks, when I try ext3:

mkfs.ext3 -F -I 128 -N 16384 fsfile
mount -o loop fsfile mnt
for I in `seq 1 16384`; do echo $I > mnt/file.$I; done
umount mnt
tune2fs -I 256 fsfile

tune2fs 1.41.3 (12-Oct-2008)
Error in resizing the inode size.
Run e2undo to undo the file system changes.

e2undo
Usage: e2undo <transaction file> <filesystem>

it seems to me that a) it'd be nice to know what the error was (likely
no room for the larger inodes?) and maybe not even begin if you know
you'll hit an error due to free space, and b) maybe e2undo (or the
previous command) should give you a hint about where the e2undo
transaction file is?

It's no my intent to pick on anyone who wrote the code, but it seems
like maybe this should be disabled in the next release unless these
issues get ironed out...

Thanks,
-Eric