2006-10-18 21:11:07

by Eric Sandeen

Subject: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

I've been using Steve Grubb's purely evil "fsfuzzer" tool, at
http://people.redhat.com/sgrubb/files/fsfuzzer-0.4.tar.gz

Basically, it makes a filesystem, splats some random bits over it,
then tries to mount it and do some simple filesystem actions.

At best, the filesystem catches the corruption gracefully.
At worst, things spin out of control.

As you might guess, we found a couple places where things spin
out of control :) 2, to be exact.

First, we had a corrupted index directory that was never checked
for consistency... it was corrupt, and pointed to another "entry"
of length 0. The for() loop looped forever, since the length
of ext3_next_entry(de) was 0, and we kept looking at the same
pointer over and over and over and over... I modeled this check
and subsequent action on what is done for non-index directories
in ext3_readdir... but I also see a few places where this check
is deemed "too expensive" - any thoughts?

(also I'm not sure if "offset" is supposed to be offset in the
filesystem, or offset in the block, I think it's called both
ways...)
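To illustrate the failure mode described above, here is a minimal userspace sketch of a directory-block walk that validates each entry before advancing. It is loosely modeled on the kind of test ext3_check_dir_entry() performs, not a copy of it: `struct toy_dirent`, `toy_check_entry()` and `toy_count_entries()` are hypothetical names, and the layout is a simplified host-endian stand-in for the on-disk format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified directory entry header (8 bytes, host-endian). */
struct toy_dirent {
    uint32_t inode;
    uint16_t rec_len;   /* distance in bytes to the next entry */
    uint8_t  name_len;
    uint8_t  file_type;
};

#define TOY_HDR_LEN 8u

/* Return 1 if the entry at byte offset 'off' looks sane, 0 otherwise. */
static int toy_check_entry(const unsigned char *block, size_t blocksize,
                           size_t off)
{
    struct toy_dirent de;

    if (off + TOY_HDR_LEN > blocksize)
        return 0;
    memcpy(&de, block + off, TOY_HDR_LEN);
    if (de.rec_len < TOY_HDR_LEN)          /* also rejects rec_len == 0 */
        return 0;
    if (de.rec_len % 4 != 0)               /* entries are 4-byte aligned */
        return 0;
    if ((size_t)de.name_len + TOY_HDR_LEN > de.rec_len)
        return 0;                          /* name must fit in the record */
    if (off + de.rec_len > blocksize)      /* must not cross the block end */
        return 0;
    return 1;
}

/* Count entries until the end of the block or the first bad entry. */
static int toy_count_entries(const unsigned char *block, size_t blocksize)
{
    size_t off = 0;
    int count = 0;

    while (off < blocksize) {
        struct toy_dirent de;

        if (!toy_check_entry(block, blocksize, off))
            break;                         /* stop instead of spinning */
        memcpy(&de, block + off, TOY_HDR_LEN);
        count++;
        off += de.rec_len;
    }
    return count;
}
```

Without the check, an entry whose rec_len is 0 would leave `off` unchanged and the loop would never terminate, which is the hang fsfuzzer provoked.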

Index: linux-2.6.18/fs/ext3/namei.c
===================================================================
--- linux-2.6.18.orig/fs/ext3/namei.c
+++ linux-2.6.18/fs/ext3/namei.c
@@ -551,6 +551,15 @@ static int htree_dirblock_to_tree(struct
                                           dir->i_sb->s_blocksize -
                                           EXT3_DIR_REC_LEN(0));
        for (; de < top; de = ext3_next_entry(de)) {
+               if (!ext3_check_dir_entry("htree_dirblock_to_tree", dir, de, bh,
+                                       (block<<EXT3_BLOCK_SIZE_BITS(dir->i_sb))
+                                               +((char *)de - bh->b_data))) {
+                       /* On error, skip the f_pos to the next block. */
+                       dir_file->f_pos = (dir_file->f_pos |
+                                       (dir->i_sb->s_blocksize - 1)) + 1;
+                       brelse (bh);
+                       return count;
+               }
                ext3fs_dirhash(de->name, de->name_len, hinfo);
                if ((hinfo->hash < start_hash) ||
                    ((hinfo->hash == start_hash) &&
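
The error path in the hunk above advances f_pos to the start of the next block with an OR-and-increment. A small sketch of that arithmetic, assuming the block size is a power of two (`toy_skip_to_next_block()` is a hypothetical helper name, not a kernel function):

```c
#include <stdint.h>

/*
 * Move 'pos' to the first byte of the following block.
 * OR-ing with (blocksize - 1) sets all the in-block offset bits,
 * so adding 1 carries into the block number.
 * Only valid when blocksize is a power of two.
 */
static uint64_t toy_skip_to_next_block(uint64_t pos, uint64_t blocksize)
{
    return (pos | (blocksize - 1)) + 1;
}
```

With a 4096-byte block, position 5000 becomes 8192. Note that a position already on a block boundary (say 4096) also moves forward a full block, to 8192, so the corrupt block is always skipped.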

Next we had a root directory inode which had a corrupted size, claimed
to be > 200M on a 4M filesystem. ext3_get_blocks_handle() was returning 0,
meaning that lookup failed. (there was only really 1 block in the directory,
but because the size was so large, readdir kept coming back for more...)

Instead of catching the no-block-at-this-offset error, we fell into the
!bh case, which assumed that there had been an IO error, and kept on trying
200M+ of blocks that didn't exist. I -think- it makes more sense to realize
that if ext3_get_blocks_handle returns 0, there is a hole at this location,
(as described by the on-disk metadata) and something has gone wrong.

Index: linux-2.6.18/fs/ext3/dir.c
===================================================================
--- linux-2.6.18.orig/fs/ext3/dir.c
+++ linux-2.6.18/fs/ext3/dir.c
@@ -141,6 +141,11 @@ static int ext3_readdir(struct file * fi
                                        (PAGE_CACHE_SHIFT - inode->i_blkbits),
                                        1);
                        bh = ext3_bread(NULL, inode, blk, 0, &err);
+               } else {
+                       ext3_error(sb, "ext3_readdir",
+                               "directory #%lu block %lu lookup failed, corrupt dir",
+                               inode->i_ino, blk);
+                       return -EINVAL;
                }

                /*

I'm not so sure about this one, though - seems like maybe also it should test
for an actual error case (< 0) from ext3_get_blocks_handle as well.
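
The distinction being weighed here can be sketched as a three-way classification of the lookup's return value. This is an illustration only, assuming the convention implied above (negative means a real error, 0 means nothing is mapped at that offset, positive means a block was found); `enum toy_lookup` and `toy_classify_lookup()` are hypothetical names.

```c
/* Possible outcomes of a block lookup, under the assumed convention. */
enum toy_lookup {
    TOY_MAPPED,   /* ret > 0: a block exists at this offset */
    TOY_HOLE,     /* ret == 0: metadata says nothing is mapped here */
    TOY_ERROR     /* ret < 0: the lookup itself failed */
};

static enum toy_lookup toy_classify_lookup(int ret)
{
    if (ret < 0)
        return TOY_ERROR;
    if (ret == 0)
        return TOY_HOLE;
    return TOY_MAPPED;
}
```

The patch as posted only separates "mapped" from "everything else"; handling the negative case explicitly would be the extra test mentioned above.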

Comments?

Thanks,

-Eric


2006-10-18 21:40:26

by Andreas Dilger

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

On Oct 18, 2006 16:11 -0500, Eric Sandeen wrote:
> First, we had a corrupted index directory that was never checked
> for consistency... it was corrupt, and pointed to another "entry"
> of length 0. The for() loop looped forever, since the length
> of ext3_next_entry(de) was 0, and we kept looking at the same
> pointer over and over and over and over... I modeled this check
> and subsequent action on what is done for non-index directories
> in ext3_readdir... but I also see a few places where this check
> is deemed "too expensive" - any thoughts?

Hmm, in 2.6 ext2 this is handled somewhat differently - one of the main
places where ext2 and ext3 differ. The directory leaf data is kept in
the page cache and there is a helper function ext2_check_page() to mark
the page "checked". That means the page only needs to be checked once
after being read from disk, instead of each time through readdir.

That said, making ext3 run in that manner is major surgery, unlike your
fix. I've seen such errors in production so it is worthwhile to fix this.
It might be possible to have a helper function similar to ext3_bread()
when reading directory leaf blocks that checks only if !buffer_uptodate()?

> Index: linux-2.6.18/fs/ext3/namei.c
> ===================================================================
> --- linux-2.6.18.orig/fs/ext3/namei.c
> +++ linux-2.6.18/fs/ext3/namei.c
> @@ -551,6 +551,15 @@ static int htree_dirblock_to_tree(struct
>                                            dir->i_sb->s_blocksize -
>                                            EXT3_DIR_REC_LEN(0));
>         for (; de < top; de = ext3_next_entry(de)) {
> +               if (!ext3_check_dir_entry("htree_dirblock_to_tree", dir, de, bh,
> +                                       (block<<EXT3_BLOCK_SIZE_BITS(dir->i_sb))
> +                                               +((char *)de - bh->b_data))) {
> +                       /* On error, skip the f_pos to the next block. */
> +                       dir_file->f_pos = (dir_file->f_pos |
> +                                       (dir->i_sb->s_blocksize - 1)) + 1;
> +                       brelse (bh);
> +                       return count;
> +               }
>                 ext3fs_dirhash(de->name, de->name_len, hinfo);
>                 if ((hinfo->hash < start_hash) ||
>                     ((hinfo->hash == start_hash) &&
>
> Next we had a root directory inode which had a corrupted size, claimed
> to be > 200M on a 4M filesystem. ext3_get_blocks_handle() was returning 0,
> meaning that lookup failed. (there was only really 1 block in the directory,
> but because the size was so large, readdir kept coming back for more...)
>
> Instead of catching the no-block-at-this-offset error, we fell into the
> !bh case, which assumed that there had been an IO error, and kept on trying
> 200M+ of blocks that didn't exist. I -think- it makes more sense to realize
> that if ext3_get_blocks_handle returns 0, there is a hole at this location,
> (as described by the on-disk metadata) and something has gone wrong.
>
> Index: linux-2.6.18/fs/ext3/dir.c
> ===================================================================
> --- linux-2.6.18.orig/fs/ext3/dir.c
> +++ linux-2.6.18/fs/ext3/dir.c
> @@ -141,6 +141,11 @@ static int ext3_readdir(struct file * fi
>                                         (PAGE_CACHE_SHIFT - inode->i_blkbits),
>                                         1);
>                         bh = ext3_bread(NULL, inode, blk, 0, &err);
> +               } else {
> +                       ext3_error(sb, "ext3_readdir",
> +                               "directory #%lu block %lu lookup failed, corrupt dir",
> +                               inode->i_ino, blk);
> +                       return -EINVAL;
>                 }
>
>                 /*
>
> I'm not so sure about this one, though - seems like maybe also it should test
> for an actual error case (< 0) from ext3_get_blocks_handle as well.

I'm not sure whether this is a win or not. It means that if there is ever
a directory with a bad leaf block any entries beyond that block are not
accessible anymore. The existing !bh case already marks the filesystem in
error. Maybe as a special case we can check in "if (!bh)" if i_size and
i_blocks make sense. Something like:

        if (!bh) {
                :
                :
+               if (filp->f_pos > inode->i_blocks << 9)
+                       break;
                filp->f_pos += sb->s_blocksize - offset;
                continue;
        }

This obviously won't help if the whole inode is bogus, but then nothing
will catch all errors.
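
As a rough worked example of what that bound buys, the sketch below counts how many block-sized probes a readdir-like loop makes when every lookup fails, with and without the i_blocks test. It is a userspace model under stated assumptions, not kernel code: `toy_failed_probes()` is a hypothetical name, i_blocks is taken to be in 512-byte units, and every probe is pessimistically assumed to fail.

```c
#include <stdint.h>

/*
 * Count the probes made while stepping block by block up to i_size.
 * With use_bound set, stop once the position has passed the number of
 * bytes that i_blocks (in 512-byte units) could possibly cover.
 */
static uint64_t toy_failed_probes(uint64_t i_size, uint64_t i_blocks,
                                  uint64_t blocksize, int use_bound)
{
    uint64_t pos = 0;
    uint64_t probes = 0;

    while (pos < i_size) {
        probes++;                              /* lookup at pos fails */
        if (use_bound && pos > (i_blocks << 9))
            break;                             /* beyond any real block */
        pos += blocksize - (pos & (blocksize - 1));
    }
    return probes;
}
```

For a directory claiming 200 MiB with 4096-byte blocks and i_blocks = 8 (one real block), the unbounded loop makes 51200 probes, while the bounded one stops after 3.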

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-18 21:56:51

by Eric Sandeen

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

Andreas Dilger wrote:
> On Oct 18, 2006 16:11 -0500, Eric Sandeen wrote:
>> First, we had a corrupted index directory that was never checked
>> for consistency... it was corrupt, and pointed to another "entry"
>> of length 0. The for() loop looped forever, since the length
>> of ext3_next_entry(de) was 0, and we kept looking at the same
>> pointer over and over and over and over... I modeled this check
>> and subsequent action on what is done for non-index directories
>> in ext3_readdir... but I also see a few places where this check
>> is deemed "too expensive" - any thoughts?
>
> Hmm, in 2.6 ext2 this is handled somewhat differently - one of the main
> places where ext2 and ext3 differ. The directory leaf data is kept in
> the page cache and there is a helper function ext2_check_page() to mark
> the page "checked". That means the page only needs to be checked once
> after being read from disk, instead of each time through readdir.

ah, sure. Hm... well, this might be a bit of a performance hit if it's
checking cached data... let me think on that.

<... next patch ...>

> I'm not sure whether this is a win or not. It means that if there is ever
> a directory with a bad leaf block any entries beyond that block are not
> accessible anymore.

I'm amazed at how hard ext3 works to cope with bad blocks ;-)

Hm, yes, so just bailing out may not be so good.

> The existing !bh case already marks the filesystem in
> error. Maybe as a special case we can check in "if (!bh)" if i_size and
> i_blocks make sense. Something like:
>
>         if (!bh) {
>                 :
>                 :
> +               if (filp->f_pos > inode->i_blocks << 9)
> +                       break;
>                 filp->f_pos += sb->s_blocksize - offset;
>                 continue;
>         }
>
> This obviously won't help if the whole inode is bogus, but then nothing
> will catch all errors.

Yep, I'd thought maybe a size vs. blocks test might make sense; I think
there can never legitimately be a sparse directory?

I guess if the intent is to soldier on in the face of adversity, it
doesn't matter if it's an unmappable offset or an IO error; ext3 wants to
go ahead & try the next block anyway. So the size test probably
makes sense as a stopping point.

Thanks for the comments,

-Eric

2006-10-18 22:24:52

by Andreas Dilger

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

On Oct 18, 2006 16:56 -0500, Eric Sandeen wrote:
> Andreas Dilger wrote:
> > The directory leaf data is kept in
> > the page cache and there is a helper function ext2_check_page() to mark
> > the page "checked". That means the page only needs to be checked once
> > after being read from disk, instead of each time through readdir.
>
> ah, sure. Hm... well, this might be a bit of a performance hit if it's
> checking cached data... let me think on that.

Well, having something like "ext3_dir_bread()" that verifies the leaf block
once if (!uptodate()) would be almost the same as ext2 with fairly little
effort. It would help performance in several places, at the slight risk
of not handling in-memory corruption after the block is read...
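
A minimal sketch of that idea, assuming a helper that validates a leaf block only when it was not already up to date in the cache. Everything here is hypothetical and userspace-only: `struct toy_buffer`, `toy_verify()` and `toy_dir_bread()` stand in for the buffer head, the directory check, and the proposed "ext3_dir_bread()".

```c
#include <stddef.h>

/* Hypothetical stand-in for a cached block. */
struct toy_buffer {
    int uptodate;       /* nonzero once valid data is in the cache */
    int corrupt;        /* test knob: pretend the contents are bad */
    int verify_calls;   /* instrumentation: how often we validated */
};

/* Stand-in for the (expensive) per-block directory validation. */
static int toy_verify(struct toy_buffer *b)
{
    b->verify_calls++;
    return !b->corrupt;
}

/*
 * Return the buffer, validating it only on a fresh read.
 * Returns NULL if the freshly read block fails validation.
 */
static struct toy_buffer *toy_dir_bread(struct toy_buffer *b)
{
    if (!b->uptodate) {
        /* a real implementation would read from disk here */
        if (!toy_verify(b))
            return NULL;
        b->uptodate = 1;
    }
    return b;
}
```

Repeated calls on a good buffer validate it once. The trade-off is the one noted above: corruption that happens in memory after the first read goes unnoticed.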

> > I'm not sure whether this is a win or not. It means that if there is ever
> > a directory with a bad leaf block any entries beyond that block are not
> > accessible anymore.
>
> I'm amazed at how hard ext3 works to cope with bad blocks ;-)

It would fail all of your tests otherwise, right? That is one virtue of
ext2 having grown up in the days when bad blocks existed. Those days are
(sadly) coming back again, hence desire for fs-level checksums, etc.

> > The existing !bh case already marks the filesystem in
> > error. Maybe as a special case we can check in "if (!bh)" if i_size and
> > i_blocks make sense. Something like:
> >
> >         if (!bh) {
> >                 :
> >                 :
> > +               if (filp->f_pos > inode->i_blocks << 9)
> > +                       break;
> >                 filp->f_pos += sb->s_blocksize - offset;
> >                 continue;
> >         }
> >
> > This obviously won't help if the whole inode is bogus, but then nothing
> > will catch all errors.
>
> Yep, I'd thought maybe a size vs. blocks test might make sense; I think
> there can never legitimately be a sparse directory?

Not currently, though there was some desire to allow this during htree
development, to allow shrinking large-but-empty directories. Since this
already provokes an ext3_error() (which might be a panic()) to hit a hole
we can assume that this needs to be carefully implemented.

> I guess if the intent is to soldier on in the face of adversity, it
> doesn't matter if it's an unmappable offset or an IO error; ext3 wants to
> go ahead & try the next block anyway. So the size test probably
> makes sense as a stopping point.

Well, it would also be possible to look into inode->i_blocks to see what
blocks exist past this offset, but that is complicated by the introduction

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-19 00:26:45

by Eric Sandeen

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

Andreas Dilger wrote:
> On Oct 18, 2006 16:56 -0500, Eric Sandeen wrote:
>> Andreas Dilger wrote:
>>> The directory leaf data is kept in
>>> the page cache and there is a helper function ext2_check_page() to mark
>>> the page "checked". That means the page only needs to be checked once
>>> after being read from disk, instead of each time through readdir.
>> ah, sure. Hm... well, this might be a bit of a performance hit if it's
>> checking cached data... let me think on that.
>
> Well, having something like "ext3_dir_bread()" that verifies the leaf block
> once if (!uptodate()) would be almost the same as ext2 with fairly little
> effort. It would help performance in several places, at the slight risk
> of not handling in-memory corruption after the block is read...

Right, I understand what you meant; I meant that adding the check as I had it
was extra overhead & a performance risk. I think missing in-memory corruption
is ok; if memory is getting corrupted then there are almost surely bigger
problems looming.

>>> I'm not sure whether this is a win or not. It means that if there is ever
>>> a directory with a bad leaf block any entries beyond that block are not
>>> accessible anymore.
>> I'm amazed at how hard ext3 works to cope with bad blocks ;-)
>
> It would fail all of your tests otherwise, right?

Well, this test basically just looks for oopses or hangs. If the filesystem
shut down at the first sign of trouble, that would satisfy this test.

> That is one virtue of
> ext2 having grown up in the days when bad blocks existed. Those days are
> (sadly) coming back again, hence desire for fs-level checksums, etc.

*nod*

...

>>> This obviously won't help if the whole inode is bogus, but then nothing
>>> will catch all errors.
>> Yep, I'd thought maybe a size vs. blocks test might make sense; I think
>> there can never legitimately be a sparse directory?
>
> Not currently, though there was some desire to allow this during htree
> development, to allow shrinking large-but-empty directories.

Yep, I wondered about that. Any chance it'll happen?

> Since this
> already provokes an ext3_error() (which might be a panic()) to hit a hole
> we can assume that this needs to be carefully implemented.

Well, adding it as you suggested is in a case where it will -already- be calling
an ext3_error; adding the test just keeps it from going too much further after
that. I think it's safe, and a good idea.

>> I guess if the intent is to soldier on in the face of adversity, it
>> doesn't matter if it's an unmappable offset or an IO error; ext3 wants to
>> go ahead & try the next block anyway. So the size test probably
>> makes sense as a stopping point.
>
> Well, it would also be possible to look into inode->i_blocks to see what
> blocks exist past this offset, but that is complicated by the introduction
> <eom>
introduction of ...? :)

Your suggested test seems pretty sane to me.

-Eric

2006-10-19 07:35:21

by Andreas Dilger

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

On Oct 18, 2006 19:26 -0500, Eric Sandeen wrote:
> >Well, it would also be possible to look into inode->i_blocks to see what
> >blocks exist past this offset, but that is complicated by the introduction
> > <eom>
> introduction of ...? :)

Sorry - introduction of extents. So we can't just look into the i_blocks
{d,t,}indirect blocks to work out the maximum reasonable size for an inode
without adding decoding of extents into this code. Maybe if "SEEK_DATA"
is added to ext3 (patch was proposed this past week) then we could seek
past the hole efficiently. For now I'm happy to assume i_blocks * 512 is
a safe upper limit on the file size.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-19 16:04:19

by Eric Sandeen

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

Andreas Dilger wrote:
> On Oct 18, 2006 16:56 -0500, Eric Sandeen wrote:
>> Andreas Dilger wrote:
>>> The directory leaf data is kept in
>>> the page cache and there is a helper function ext2_check_page() to mark
>>> the page "checked". That means the page only needs to be checked once
>>> after being read from disk, instead of each time through readdir.
>> ah, sure. Hm... well, this might be a bit of a performance hit if it's
>> checking cached data... let me think on that.
>
> Well, having something like "ext3_dir_bread()" that verifies the leaf block
> once if (!uptodate()) would be almost the same as ext2 with fairly little
> effort. It would help performance in several places, at the slight risk
> of not handling in-memory corruption after the block is read...

How about just tweaking the existing ext3_bread so that it lets the
caller know whether or not it found an uptodate buffer? Seems
conceivable that more than just the dir code might want to do a data
sanity check, based on if this is a fresh read or not.

Could maybe even change the *err argument to a *retval; negative on
errors, else 0 == not read (found uptodate), 1 == fresh read (not found
uptodate). Or is that too much overloading...
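
The overloaded return value floated above might look like the following. This is only a sketch of the proposed convention (negative on error, 0 when the buffer was already up to date, 1 on a fresh read); `struct toy_buf` and `toy_bread()` are hypothetical and do not reflect the real ext3_bread() signature.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached block. */
struct toy_buf {
    int uptodate;   /* nonzero once valid data is in the cache */
    int io_fails;   /* test knob: pretend the disk read fails */
};

/*
 * Read a block and report how it was obtained through *retval:
 *   < 0  error (negative errno), NULL returned
 *     0  found already up to date, no read issued
 *     1  fresh read from disk
 */
static struct toy_buf *toy_bread(struct toy_buf *b, int *retval)
{
    if (b->uptodate) {
        *retval = 0;
        return b;
    }
    if (b->io_fails) {
        *retval = -EIO;
        return NULL;
    }
    b->uptodate = 1;    /* pretend the read completed */
    *retval = 1;
    return b;
}
```

A caller would then run its sanity check only when `*retval == 1`, which keeps the decision out of the read helper and lets non-directory callers reuse it.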

-Eric

2006-10-19 22:43:44

by Eric Sandeen

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

Eric Sandeen wrote:
>> Well, having something like "ext3_dir_bread()" that verifies the leaf block
>> once if (!uptodate()) would be almost the same as ext2 with fairly little
>> effort. It would help performance in several places, at the slight risk
>> of not handling in-memory corruption after the block is read...
>
> How about just tweaking the existing ext3_bread so that it lets the
> caller know whether or not it found an uptodate buffer? Seems
> conceivable that more than just the dir code might want to do a data
> sanity check, based on if this is a fresh read or not.
>
> Could maybe even change the *err argument to a *retval; negative on
> errors, else 0 == not read (found uptodate), 1 == fresh read (not found
> uptodate). Or is that too much overloading...

I played around with this a little bit today, and it seems to have some
tangible results. A fairly unsophisticated test of running "find" over
my whole root filesystem 10 times :) with and without re-checking cached
directory entries, yielded about a 10% speedup when skipping the re-checks.

Is this something we want to do? Are we comfortable with only checking
directory entries the first time they are read from disk?

-Eric

2006-10-20 03:50:20

by Andreas Dilger

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

On Oct 19, 2006 17:43 -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > How about just tweaking the existing ext3_bread so that it lets the
> > caller know whether or not it found an uptodate buffer? Seems
> > conceivable that more than just the dir code might want to do a data
> > sanity check, based on if this is a fresh read or not.
> >
> > Could maybe even change the *err argument to a *retval; negative on
> > errors, else 0 == not read (found uptodate), 1 == fresh read (not found
> > uptodate). Or is that too much overloading...
>
> I played around with this a little bit today, and it seems to have some
> tangible results. A fairly unsophisticated test of running "find" over
> my whole root filesystem 10 times :) with and without re-checking cached
> directory entries, yielded about a 10% speedup when skipping the re-checks.
>
> Is this something we want to do? Are we comfortable with only checking
> directory entries the first time they are read from disk?

Well, we already do this on ext2 without noticeable problems. As you say,
if we are getting memory corruption we are in for a world of hurt in other
areas. The only case that might be worth checking inside the loop is if
rec_len == 0, so that we don't spin on a bad entry forever.
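
A sketch of that cheaper per-iteration guard: once a block has been fully validated on its first read, the walk only needs to refuse a zero record length so it cannot stall. This is a userspace illustration with assumptions of its own: `toy_walk_prechecked()` is a hypothetical name, and rec_len is taken to be a host-endian 16-bit field at byte offset 4 of each 8-byte entry header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a block that is assumed to have passed a full check earlier.
 * Returns the number of entries, or -1 if a zero rec_len is found
 * (which would otherwise make the loop spin on the same entry).
 */
static int toy_walk_prechecked(const unsigned char *block, size_t blocksize)
{
    size_t off = 0;
    int count = 0;

    while (off + 8 <= blocksize) {
        uint16_t rec_len;

        memcpy(&rec_len, block + off + 4, sizeof rec_len);
        if (rec_len == 0)
            return -1;      /* the only check done on every pass */
        count++;
        off += rec_len;
    }
    return count;
}
```

This deliberately skips the alignment and bounds tests, so it is only safe in combination with a full check at read time.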

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-20 04:00:25

by Eric Sandeen

Subject: Re: [PATCH/RFC] - make ext3 more robust in the face of filesystem corruption

Andreas Dilger wrote:
> On Oct 19, 2006 17:43 -0500, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> How about just tweaking the existing ext3_bread so that it lets the
>>> caller know whether or not it found an uptodate buffer? Seems
>>> conceivable that more than just the dir code might want to do a data
>>> sanity check, based on if this is a fresh read or not.
>>>
>>> Could maybe even change the *err argument to a *retval; negative on
>>> errors, else 0 == not read (found uptodate), 1 == fresh read (not found
>>> uptodate). Or is that too much overloading...
>> I played around with this a little bit today, and it seems to have some
>> tangible results. A fairly unsophisticated test of running "find" over
>> my whole root filesystem 10 times :) with and without re-checking cached
>> directory entries, yielded about a 10% speedup when skipping the re-checks.
>>
>> Is this something we want to do? Are we comfortable with only checking
>> directory entries the first time they are read from disk?
>
> Well, we already do this on ext2 without noticeable problems. As you say,
> if we are getting memory corruption we are in for a world of hurt in other
> areas. The only case that might be worth checking inside the loop is if
> rec_len == 0, so that we don't spin on a bad entry forever.

Sounds good, I'll whip up a patch; probably one patch first to add the checks &
fix the corruptor tests, and follow up with one to be smarter about the checks
in all cases.

Thanks,

-eric