2009-08-14 23:55:08

by Frank Mayhar

Subject: fsck infinite loop on corrupt ext4 file system

Hello, folks. We recently ran into a pretty severe ext4 crash (being
worked on by someone else) that caused some seriously corrupted file
systems, one of which in turn exposed an fsck problem. We noticed this
when fsck started looping endlessly trying to correct that file system.
Basically, the group descriptors were mangled; fsck complains about
invalid checksums, forces a full check and during pass 1 tries to
allocate some inode bitmap blocks (apparently). That allocation fails,
pass 1 errors out and starts the check over. Endlessly.

I've attached output from the first few loops; unfortunately the file
system image is far, far too large to transport. I've done some
analysis and it appears that check_super_block is noticing the problem
and hitting this case:

if (gd->bg_inode_table == 0) {
ctx->invalid_inode_table_flag[i]++;
ctx->invalid_bitmaps++;
}
free_blocks += gd->bg_free_blocks_count;
free_inodes += gd->bg_free_inodes_count;

(Around line 623 in super.c in the 1.41.8 source.)

Later, during pass 1, fsck calls handle_fs_bad_blocks because
ctx->invalid_bitmaps is set, and tries to allocate blocks for the
inode table. This allocation fails.

I suspect that the inode table blocks in question simply aren't marked
free and certainly fsck isn't so marking them before it does the
allocate. Should it try to first free the affected blocks? Isn't the
inode table static? Why is handle_fs_bad_blocks trying to reallocate it
without at least trying to free it first?
--
Frank Mayhar <[email protected]>
Google, Inc.


Attachments:
sdi3-fsck-output (20.86 kB)

2009-08-18 01:10:26

by Frank Mayhar

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Fri, 2009-08-14 at 16:55 -0700, Frank Mayhar wrote:
> Hello, folks. We recently ran into a pretty severe ext4 crash (being
> worked on by someone else) that caused some seriously corrupted file
> systems, one of which in turn exposed an fsck problem. We noticed this
> when fsck started looping endlessly trying to correct that file system.
> Basically, the group descriptors were mangled; fsck complains about
> invalid checksums, forces a full check and during pass 1 tries to
> allocate some inode bitmap blocks (apparently). That allocation fails,
> pass 1 errors out and starts the check over. Endlessly.

I've made a little more progress since Friday. I had grabbed a dumpe2fs
dump of the corrupted file system and one of a newly created file
system on the same device. Adjusting for normal variation (numbers of
free blocks, flags, etc.), there are no differences _except_ in the very
block groups that fsck complained about having bad checksums. For those
(and only those), the locations of the block bitmap and inode table
differ. I've attached the diff output.

In particular, block group 276 claims to have its inode table at blocks
0-204, which is clearly wrong. This is the block group for which the
allocation failed, causing the original loop.

It's clear that fsck is neither correcting the block groups nor is it
detecting the bad entries properly (a sanity check might be in order
here). It's not even noticing that it's looping, it just keeps failing
the allocation and retrying. While it may be that fsck can't recover
the file system in this case, it should at least notice and abort.
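
One way to make the checker notice that it is looping is to cap the number of restarts. The sketch below is not e2fsck code: `run_with_restart_cap`, `MAX_RESTARTS`, and the result codes are hypothetical names used only to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for one checking attempt. */
enum check_result { CHECK_OK, CHECK_RESTART, CHECK_ABORT };

/* Hypothetical cap on how many times a full check may be restarted. */
#define MAX_RESTARTS 3

typedef enum check_result (*attempt_fn)(void *state);

/*
 * Call `attempt` until it stops asking for a restart. After
 * MAX_RESTARTS restarts, give up with CHECK_ABORT instead of
 * retrying forever. The number of restarts is stored in
 * *restarts_out when that pointer is non-NULL.
 */
static enum check_result run_with_restart_cap(attempt_fn attempt, void *state,
					      int *restarts_out)
{
	int restarts = 0;
	enum check_result r;

	assert(attempt != NULL);
	for (;;) {
		r = attempt(state);
		if (r != CHECK_RESTART)
			break;
		if (restarts == MAX_RESTARTS) {
			r = CHECK_ABORT;
			break;
		}
		restarts++;
	}
	if (restarts_out != NULL)
		*restarts_out = restarts;
	return r;
}

/* Stand-in for a pass whose allocation fails every single time. */
static enum check_result always_restart(void *state)
{
	(void)state;
	return CHECK_RESTART;
}

/* Stand-in for a pass that needs exactly one restart to succeed. */
static enum check_result ok_on_second_try(void *state)
{
	int *calls = state;

	return (*calls)++ == 0 ? CHECK_RESTART : CHECK_OK;
}
```

With `always_restart` the wrapper ends in `CHECK_ABORT` after three restarts, while a pass that recovers after one restart still completes normally.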

My thinking is that the location of the inode tables should be invariant
over the life of the file system. Certainly there's no place in ext4
itself that changes those fields (that I can see, anyway). Why couldn't
fsck compute the proper values and compare those against what's there?
--
Frank Mayhar <[email protected]>
Google, Inc.


Attachments:
dump-diff (29.67 kB)

2009-08-18 02:47:41

by Andreas Dilger

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Aug 17, 2009 18:10 -0700, Frank Mayhar wrote:
> I've made a little more progress since Friday. I had grabbed a dumpe2fs
> dump of the corrupted file system and one of a newly created file
> system on the same device. Adjusting for normal variation (numbers of
> free blocks, flags, etc.), there are no differences _except_ in the very
> block groups that fsck complained about having bad checksums. For those
> (and only those), the locations of the block bitmap and inode table
> differ. I've attached the diff output.

It doesn't appear that the two filesystems were created with the same
options, or one of the filesystems was resized or something.

> In particular, block group 276 claims to have its inode table at blocks
> 0-204, which is clearly wrong. This is the block group for which the
> allocation failed, causing the original loop.
>
> It's clear that fsck is neither correcting the block groups nor is it
> detecting the bad entries properly (a sanity check might be in order
> here). It's not even noticing that it's looping, it just keeps failing
> the allocation and retrying. While it may be that fsck can't recover
> the file system in this case, it should at least notice and abort.
>
> My thinking is that the location of the inode tables should be invariant
> over the life of the file system. Certainly there's no place in ext4
> itself that changes those fields (that I can see, anyway). Why couldn't
> fsck compute the proper values and compare those against what's there?

With the addition of FLEX_BG there is no longer a hard-and-fast rule for
the location of the block groups' metadata. In the past it was always
guaranteed to be within the group itself; now it can be anywhere.

> Group 276: (Blocks 9043968-9076735)
> - Block bitmap at 9043968 (+0), Inode bitmap at 9043969 (+1)
> - Inode table at 0-204
> + Block bitmap at 8912900, Inode bitmap at 8912916
> + Inode table at 8913748-8913952

This is definitely bogus and should be detected/fixed by e2fsck. I
suspect it used to be handled (pre-flexbg) by the check that the inode
table is within the group, but now there is no sanity check for the
placement at all (including overlap with other groups, superblocks,
etc.).

It makes sense to still validate the sanity of the group descriptor
data, and then check the backup group descriptors if the primaries
are suspicious.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-18 16:02:00

by Theodore Ts'o

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Mon, Aug 17, 2009 at 06:10:22PM -0700, Frank Mayhar wrote:
> It's clear that fsck is neither correcting the block groups nor is it
> detecting the bad entries properly (a sanity check might be in order
> here). It's not even noticing that it's looping, it just keeps failing
> the allocation and retrying. While it may be that fsck can't recover
> the file system in this case, it should at least notice and abort.
>
> My thinking is that the location of the inode tables should be invariant
> over the life of the file system. Certainly there's no place in ext4
> itself that changes those fields (that I can see, anyway). Why couldn't
> fsck compute the proper values and compare those against what's there?

So there are a couple of things going on here. The first is that the
code which tries to allocate new inode/block allocation bitmaps or
inode tables wasn't taught that filesystems with the FLEX_BG feature
should have the metadata located at the beginning of the
flex-blockgroup, but if we can't find space for it there (allocating
the inode table is tricky since it requires possibly up to a few
hundred contiguous free blocks), we should try to find the space
anywhere in the filesystem. If it can't find the space, we should
indeed abort. Please find attached a patch which should fix e2fsck to
handle this case correctly. Could you test it and let me know if it
works correctly?
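
The allocation strategy described above can be sketched independently of libext2fs. Everything here is illustrative: `flex_group_range`, `find_free_run`, and `alloc_two_stage` are hypothetical helpers, the "bitmap" is a plain byte array with one entry per block, and a flex group size of 16 groups is an assumption, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

struct group_range {
	uint32_t first_group;
	uint32_t last_group;	/* inclusive */
};

/* Groups that share a flex group with `group`, clamped to the fs size. */
static struct group_range flex_group_range(uint32_t group,
					   unsigned log_groups_per_flex,
					   uint32_t group_count)
{
	uint32_t flex_size = 1u << log_groups_per_flex;
	struct group_range r;

	assert(group < group_count);
	r.first_group = group & ~(flex_size - 1);
	r.last_group = group | (flex_size - 1);
	if (r.last_group >= group_count)
		r.last_group = group_count - 1;
	return r;
}

/*
 * Find `num` contiguous free (zero) entries in used[first..last],
 * inclusive. Returns the first block of the run, or -1 if none exists.
 */
static int64_t find_free_run(const unsigned char *used, uint64_t first,
			     uint64_t last, uint64_t num)
{
	uint64_t run = 0;

	for (uint64_t b = first; b <= last; b++) {
		run = used[b] ? 0 : run + 1;
		if (run == num)
			return (int64_t)(b - num + 1);
	}
	return -1;
}

/* Try the preferred range first, then fall back to the whole fs. */
static int64_t alloc_two_stage(const unsigned char *used, uint64_t total,
			       uint64_t pref_first, uint64_t pref_last,
			       uint64_t num)
{
	int64_t got;

	assert(num > 0 && total > 0);
	assert(pref_first <= pref_last && pref_last < total);
	got = find_free_run(used, pref_first, pref_last, num);
	if (got < 0)
		got = find_free_run(used, 0, total - 1, num);
	return got;	/* -1 here is the case where the check should abort */
}
```

Under the assumed 16 groups per flex group, group 276 belongs to the range 272-287. With 32768 blocks per group (as the dumpe2fs excerpt for group 276 implies), group 272 starts at block 8912896, which is consistent with the freshly created file system placing group 276's block bitmap at 8912900.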

As far as assuming the inode tables are invariant over the life of the
filesystem --- this is normally true, but inode tables can be located
in places other than the default; for example, if bad blocks are located
where the inode tables should be, then the inode tables can be pushed
to non-standard locations. So this makes calculating where the inode
table "should" be a little tricky, especially since the contents of
the bad blocks can change after the filesystem is formatted.

In addition, e2fsck tries very hard not to destroy data, and so there
is the question of what to do if there are data blocks located where
the inode table "should" be. In theory e2fsck should be able to move
the inode data blocks elsewhere, or if there is no space, potentially
offer to delete a user file to make room for the inode table ---
after all, better sacrifice one or two data files rather than lose
potentially several hundred or thousand files. But this is a level of
complexity that I never had a chance to add to e2fsck, and in truth
the case where we run into this level of lossage is very rare.

After all, we have many copies of the block group
descriptors, and the backup group descriptors are rarely written, so
this level of corruption should be quite rare.
Making e2fsck smarter to deal with the most extreme cases of loss is
therefore desirable, but it's always been a "nice to have".

In any case, with ext4 and the flex_bg feature, the ability to
allocate the inode table anywhere in the filesystem should make the
really complex recovery code even more rarely required.

Please try this patch and see if it fixes things up for you or not.

Thanks!!

- Ted

diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 518c2ff..203468b 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -2376,9 +2376,10 @@ static void new_table_block(e2fsck_t ctx, blk_t first_block, int group,
 			    const char *name, int num, blk_t *new_block)
 {
 	ext2_filsys fs = ctx->fs;
+	dgrp_t last_grp;
 	blk_t old_block = *new_block;
 	blk_t last_block;
-	int i;
+	int i, is_flexbg, flexbg, flexbg_size;
 	char *buf;
 	struct problem_context pctx;

@@ -2388,19 +2389,44 @@ static void new_table_block(e2fsck_t ctx, blk_t first_block, int group,
 	pctx.blk = old_block;
 	pctx.str = name;

-	last_block = ext2fs_group_last_block(fs, group);
+	/*
+	 * For flex_bg filesystems, first try to allocate the metadata
+	 * within the flex_bg, and if that fails then try finding the
+	 * space anywhere in the filesystem.
+	 */
+	is_flexbg = EXT2_HAS_INCOMPAT_FEATURE(fs->super,
+					      EXT4_FEATURE_INCOMPAT_FLEX_BG);
+	if (is_flexbg) {
+		flexbg_size = 1 << fs->super->s_log_groups_per_flex;
+		flexbg = group / flexbg_size;
+		first_block = ext2fs_group_first_block(fs,
+						       flexbg_size * flexbg);
+		last_grp = group | (flexbg_size - 1);
+		if (last_grp >= fs->group_desc_count)
+			last_grp = fs->group_desc_count - 1;
+		last_block = ext2fs_group_last_block(fs, last_grp);
+	} else
+		last_block = ext2fs_group_last_block(fs, group);
 	pctx.errcode = ext2fs_get_free_blocks(fs, first_block, last_block,
-					      num, ctx->block_found_map, new_block);
+					      num, ctx->block_found_map,
+					      new_block);
+	if (is_flexbg && (pctx.errcode == EXT2_ET_BLOCK_ALLOC_FAIL))
+		pctx.errcode = ext2fs_get_free_blocks(fs,
+				fs->super->s_first_data_block,
+				fs->super->s_blocks_count,
+				num, ctx->block_found_map, new_block);
 	if (pctx.errcode) {
 		pctx.num = num;
 		fix_problem(ctx, PR_1_RELOC_BLOCK_ALLOCATE, &pctx);
 		ext2fs_unmark_valid(fs);
+		ctx->flags |= E2F_FLAG_ABORT;
 		return;
 	}
 	pctx.errcode = ext2fs_get_mem(fs->blocksize, &buf);
 	if (pctx.errcode) {
 		fix_problem(ctx, PR_1_RELOC_MEMORY_ALLOCATE, &pctx);
 		ext2fs_unmark_valid(fs);
+		ctx->flags |= E2F_FLAG_ABORT;
 		return;
 	}
 	ext2fs_mark_super_dirty(fs);

2009-08-18 16:31:16

by Frank Mayhar

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Tue, 2009-08-18 at 12:01 -0400, Theodore Tso wrote:
> On Mon, Aug 17, 2009 at 06:10:22PM -0700, Frank Mayhar wrote:
> > It's clear that fsck is neither correcting the block groups nor is it
> > detecting the bad entries properly (a sanity check might be in order
> > here). It's not even noticing that it's looping, it just keeps failing
> > the allocation and retrying. While it may be that fsck can't recover
> > the file system in this case, it should at least notice and abort.
> >
> > My thinking is that the location of the inode tables should be invariant
> > over the life of the file system. Certainly there's no place in ext4
> > itself that changes those fields (that I can see, anyway). Why couldn't
> > fsck compute the proper values and compare those against what's there?
>
> So there are a couple of things going on here. The first is that the
> code which tries to allocate new inode/block allocation bitmaps or
> inode tables wasn't taught that filesystems with the FLEX_BG feature
> should have the metadata located at the beginning of the
> flex-blockgroup, but if we can't find space for it there (allocating
> the inode table is tricky since it requires possibly up to a few
> hundred contiguous free blocks), we should try to find the space
> anywhere in the filesystem. If it can't find the space, we should
> indeed abort. Please find attached a patch which should fix e2fsck to
> handle this case correctly. Could you test it and let me know if it
> works correctly?

Will do. I wasn't able to keep a copy of the corrupted image but I
should be able to do _something_ with your patch. Thanks!

> As far as assuming the inode tables are invariant over the life of the
> filesystem --- this is normally true, but inode tables can be located
> in places other than the default; for example, if bad blocks are located
> where the inode tables should be, then the inode tables can be pushed
> to non-standard locations. So this makes calculating where the inode
> table "should" be a little tricky, especially since the contents of
> the bad blocks can change after the filesystem is formatted.

Ah, right. As far as I understand, though, bad blocks are the only
exception. (Note that resizing isn't an issue here, nor will it be in
the foreseeable future.)

> In addition, e2fsck tries very hard not to destroy data, and so there
> is the question of what to do if there are data blocks located where
> the inode table "should" be.

I would think that that case would be even more rare than the one we're
dealing with here. In fact outside of a resize operation I can't think
of how it might happen.

> In any case, with ext4 and the flex_bg feature, the ability to
> allocate the inode table anywhere in the filesystem should make the
> really complex recovery code even more rarely required.

Yeah, agreed. In fact just noticing that the allocation error is
unrecoverable and failing the fsck would be sufficient for our needs;
our problem was really that fsck was blindly looping until it got
killed. (I see that your patch does indeed abort the check if the
allocation fails.)

> Please try this patch and see if it fixes things up for you or not.

I'll do so; it might be a bit but I'll let you know how it goes.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-08-18 17:03:34

by Theodore Ts'o

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Tue, Aug 18, 2009 at 09:31:09AM -0700, Frank Mayhar wrote:
>
> Will do. I wasn't able to keep a copy of the corrupted image but I
> should be able to do _something_ with your patch. Thanks!
>

OK, I was hoping you had a test case handy. I'll try to generate one,
so I can check the changes into git. I had held off checking them in
just in case I had missed something that testing against your corrupted
image might have picked up.

> > In addition, e2fsck tries very hard not to destroy data, and so there
> > is the question of what to do if there are data blocks located where
> > the inode table "should" be.
>
> I would think that that case would be even more rare than the one we're
> dealing with here. In fact outside of a resize operation I can't think
> of how it might happen.

With ext3 and ext4 prior to 2.6.30 (when we added the block validity
check code), it was actually pretty easy for this to happen
--- all it would take is a corrupted block allocation bitmap. With
the latest ext4 code, I grant it's pretty unlikely to happen.

It still can happen if the block group descriptors get
corrupted such that the block allocation bitmap pointer points to a
mostly zero-filled block, and the inode table pointer for a block
group is also corrupted to some random place. If this doesn't get
noticed for some period of time while blocks are allocated, and then
later, e2fsck recovers by reading the backup block group descriptors,
this failure mode could very much happen. It does require multiple
simultaneous failures, though, so it's not likely, but over hundreds
of thousands or millions of deployed Linux systems, Murphy's Law has a
way of catching up with us. :-/

Something we *could* do to further reduce the chances would be to
compare the primary and backup group descriptors, either at
mount-time, or in e2fsck. This would add an extra level of paranoia,
although the people who are trying to do 5 second boots with HDD's
would probably complain about the extra seeks that we'd be introducing
as a result.
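
A rough sketch of what such a comparison might look like, assuming the primary and backup descriptors have already been read into memory. `struct gd_placement` and `first_placement_mismatch` are hypothetical names, not libext2fs API. Only the location fields are compared, since the backups are rarely rewritten and their free counts would be expected to drift from the primaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the location fields of a descriptor. */
struct gd_placement {
	uint64_t block_bitmap;
	uint64_t inode_bitmap;
	uint64_t inode_table;
};

/*
 * Return the index of the first group whose metadata locations differ
 * between the primary and backup descriptor arrays, or -1 if all
 * `count` groups agree.
 */
static long first_placement_mismatch(const struct gd_placement *primary,
				     const struct gd_placement *backup,
				     size_t count)
{
	assert(primary != NULL && backup != NULL);
	for (size_t i = 0; i < count; i++) {
		if (primary[i].block_bitmap != backup[i].block_bitmap ||
		    primary[i].inode_bitmap != backup[i].inode_bitmap ||
		    primary[i].inode_table != backup[i].inode_table)
			return (long)i;
	}
	return -1;
}
```

A mismatch does not say which copy is wrong; it only flags the group as suspicious so that further checks can decide which descriptor to trust.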

- Ted

2009-08-18 19:03:04

by Andreas Dilger

Subject: Re: fsck infinite loop on corrupt ext4 file system

On Aug 18, 2009 13:03 -0400, Theodore Ts'o wrote:
> Something we *could* do to further reduce the chances would be to
> compare the primary and backup group descriptors, either at
> mount-time, or in e2fsck. This would add an extra level of paranoia,
> although the people who are trying to do 5 second boots with HDD's
> would probably complain about the extra seeks that we'd be introducing
> as a result.

I've thought about this recently as well. Since the GDT blocks are
allocated contiguously (at least until we get META_BG filesystems) it
would only be a single extra seek and read at mount time. For a 16TB
filesystem there are 8MB of GDT blocks, so that isn't a huge amount of
extra IO as long as we do it with a single read instead of many seeks.
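
The 8MB figure can be reproduced with a little arithmetic. The parameters in this sketch are assumptions, not values stated in the thread: 4 KiB blocks, 32768 blocks per group, and 64-byte group descriptors. `gdt_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size in bytes of one copy of the group descriptor table for a file
 * system of `fs_bytes` bytes, given the block size, the number of
 * blocks per group, and the size of one group descriptor.
 */
static uint64_t gdt_bytes(uint64_t fs_bytes, uint32_t block_size,
			  uint32_t blocks_per_group, uint32_t desc_size)
{
	uint64_t blocks, groups;

	assert(block_size > 0 && blocks_per_group > 0);
	blocks = fs_bytes / block_size;
	/* Round up: a partial last group still needs a descriptor. */
	groups = (blocks + blocks_per_group - 1) / blocks_per_group;
	return groups * desc_size;
}
```

For 16 TiB this gives 2^32 blocks and 131072 groups, so 64-byte descriptors occupy 8 MiB. With 32-byte descriptors the same file system would need 4 MiB.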

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.