2009-11-23 18:17:45

by Curt Wohlgemuth

[permalink] [raw]
Subject: Bug in extent zeroout: blocks not marked as new

I believe I've found a bug in ext4's extent conversion design that results
in block corruption in inodes using previously freed up metadata blocks.

Yes, this is similar to earlier related corruptions we've seen, but this
turned out to be far nastier to detect and analyze. All the corruptions
were simply blocks of all zeroes -- there was no signature data pattern that
would lead to a source of the corruption.

The problem is that, during conversion of extents from uninitialized to
initialized, ext4 will at some point decide that the remaining uninitialized
extent is too small to split, and will just write zeroes for it, and mark it
as initialized. This is fine -- until blocks from this "14-block tail
extent" are returned from ext4_ext_get_blocks().

The problem is that, since these blocks are not from an uninitialized
extent, the aren't marked as "new" -- i.e., set_buffer_new() is not called
on bh_result -- and the callers further up in the call stack don't call
unmap_underlying_metadata() on them.

(I've finally got around to understanding that the handling of old dirty
metadata blocks is ultimately handled on the allocating side, not on the
freeing side. Although bforget() is called on freed up metadata blocks,
because bforget() doesn't lock the buffer before it clears the dirty bit,
there's a known race with the writeout path in __block_write_full_page().
I've seen this race, and seen a block that's had bforget() called on it be
written to disk.)

Handling of old metadata blocks is supposed to be dealt with on the
allocation side, when unmap_underlying_metadata() is called for buffers
marked new on return from the get_block callback. But this won't get called
for blocks in an extent created with ext4_ext_zeroout().

This is why every corruption we've seen has been a block residing in a
"14-block tail extent" in a newly written file.

I've verified that blocks returned to do_direct_IO() that are in a 14-block
zeroout extent do *not* have unmap_underlying_metadata() called for them.
The same should be the case with buffered I/O as well.

The fix for this is difficult, though. The simplest thing I can think of is
to stop calling ext4_ext_zeroout() at all; easily done by setting
EXT4_EXT_ZERO_LEN to 0. But I'm not at all sure of the ramifications of
this. How expensive is extent split/initialization, especially in light of
Mingming's recent patches in this area?

The other obvious solution is to call unmap_underlying_metadata() at the
time that we zeroout the blocks.

Thoughts?

Thanks,
Curt


2009-11-23 19:50:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in extent zeroout: blocks not marked as new

On Mon, Nov 23, 2009 at 10:17:46AM -0800, Curt Wohlgemuth wrote:
> I believe I've found a bug in ext4's extent conversion design that results
> in block corruption in inodes using previously freed up metadata blocks.
>
> (I've finally got around to understanding that the handling of old dirty
> metadata blocks is ultimately handled on the allocating side, not on the
> freeing side. Although bforget() is called on freed up metadata blocks,
> because bforget() doesn't lock the buffer before it clears the dirty bit,
> there's a known race with the writeout path in __block_write_full_page().
> I've seen this race, and seen a block that's had bforget() called on it be
> written to disk.)

Note that these problems are much rarer (and perhaps impossible) when
journalling is enabled, because we don't allow any freed blocks to be
reused until after a transaction commit --- this prevents blocks from
getting overwritten, thus damaging existing files that don't end up
getting deleted because the system cashed before the transaction could
complete.

As a result, blocks which are deleted can't get reused immediately,
and this should defang the bforget() race that you mention. Without
journalling we don't have this protection, so this exposes us to the
sorts of problems which you've been running into.

> The problem is that, during conversion of extents from uninitialized to
> initialized, ext4 will at some point decide that the remaining uninitialized
> extent is too small to split, and will just write zeroes for it, and mark it
> as initialized. This is fine -- until blocks from this "14-block tail
> extent" are returned from ext4_ext_get_blocks().
>
> The problem is that, since these blocks are not from an uninitialized
> extent, the aren't marked as "new" -- i.e., set_buffer_new() is not called
> on bh_result -- and the callers further up in the call stack don't call
> unmap_underlying_metadata() on them.

OK, I'm confused. The corruption is that you're having blocks that
are all zeroes show up where they should be non-zero data, right? But
in what you are describing, you should have garbage from old metadata
getting written over the newly initialized blocks --- garbage that
would have been discarded if unmap_underlying_metadata() had been
called. But that's not what you're seeing.

Blocks from an uninitialized extent aren't new, because they've been
around for a while, and they've been attached to the inode and
allocated. One thing we could do is call unmap_underlying_metadata
when those uninitialized blocks are *allocated*, because that's when
they their "allegiance" is transfered to the inode. However, since we
don't care about the contents of these uninitialized blocks your
suggestion:

> The other obvious solution is to call unmap_underlying_metadata() at the
> time that we zeroout the blocks.

Is is a good one. We should also call unmap_underlying_metadata() for
any blocks that are transitioning from uninitialized to initialized,
just in case there were any recently-freed buffers that might be
undergoing writeout. So I agree this is a good thing to do, and it's
probably a hole that we should plug.

However, that doesn't explaining the corruption which you are seeing,
which is zero blocks showing up where they shouldn't --- since the
freed dirty metadata blocks that we would be zapping with
unmap_underlying_metadata() are highly unlikely to be all zero's.

Am I missing something?

- Ted

2009-11-23 21:17:27

by Frank Mayhar

[permalink] [raw]
Subject: Re: Bug in extent zeroout: blocks not marked as new

On Mon, 2009-11-23 at 14:50 -0500, [email protected] wrote:
> However, that doesn't explaining the corruption which you are seeing,
> which is zero blocks showing up where they shouldn't --- since the
> freed dirty metadata blocks that we would be zapping with
> unmap_underlying_metadata() are highly unlikely to be all zero's.
>
> Am I missing something?

I think the missing bit of information here is that this is happening on
converted ext2 partitions. Older non-extent-based files are being
gradually removed while new extent-based files are being created. Last
week, Curt found that in certain cases newly-reallocated metadata blocks
are being hit; these metadata blocks just happen to be newly-emptied
indirect blocks (freed by a recent delete), which just happen to have
been filled with zeroes as they were emptied. As I recall, Curt found
that these were pretty clearly the source of the blocks being written in
this case.

To elaborate (per Curt), we're seeing these dirty metadata writes from
freed indirect-block inodes because ext4_free_branches() calls
ext4_handle_dirty_metadata() on an indirect block after freeing (some or
all of) its data blocks. We do the same thing in ext3_free_branches()
but ext2_free_branches() DOES NOT mark the metadata block as dirty, so
there will be a lot more dirty metadata floating about in ext[34] than
in ext2 and hence a lot more chance for the race we've been seeing.

Finally, we have a question about the zero-out path: Is there any
known, concrete improvement given by doing the zero-out as opposed to
just continuing to split the extents? At the moment, by the way, there
is one definite problem: Since it doesn't try to do a merge left (which
it should) it invariably leaves a 14-block extent fragment, thus
increasing fragmentation of the file. It's not a huge problem (since
the extents are in fact contiguous) but it's there.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-11-23 21:45:34

by Andreas Dilger

[permalink] [raw]
Subject: Re: Bug in extent zeroout: blocks not marked as new

On 2009-11-23, at 14:17, Frank Mayhar wrote:
> Finally, we have a question about the zero-out path: Is there any
> known, concrete improvement given by doing the zero-out as opposed
> to just continuing to split the extents? At the moment, by the way,
> there is one definite problem: Since it doesn't try to do a merge
> left (which it should) it invariably leaves a 14-block extent
> fragment, thus increasing fragmentation of the file. It's not a
> huge problem (since the extents are in fact contiguous) but it's
> there.


The intent is to avoid splitting the uninitialized extent further when
there is no longer any benefit to do so. Writing out 8kB vs. 64kB is
in the noise these days, but splitting the extent is extra overhead
(larger extent tree, more lookups, etc). If we were to continue
splitting it would leave smaller and smaller uninitialized extents.

As you point out, the newly-initialized extent should be merged with
its left neighbor. If we do the zero-out at the point where actually
writing zeroes is cost effective. At that point there is no longer an
uninitialized extent to track, and it can be merged entirely with its
left neighbor and avoid any extra overhead.

The other time we HAVE to zero out the uninitialized extent is if the
filesystem does not have any free blocks to add a new extent, so the
uninitialized extent is zeroed entirely and then converted to
initialized.

Using 64kB as the cutoff for uninitialized extents is very reasonable
these days, though we may in fact want to make that dynamic based on
the superblock s_raid_stripe_width for SSDs with 128kB erase blocks
and/or avoiding read-modify-write within a single RAID stripe.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-23 23:20:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in extent zeroout: blocks not marked as new

On Mon, Nov 23, 2009 at 01:17:23PM -0800, Frank Mayhar wrote:
> I think the missing bit of information here is that this is happening on
> converted ext2 partitions. Older non-extent-based files are being
> gradually removed while new extent-based files are being created. Last
> week, Curt found that in certain cases newly-reallocated metadata blocks
> are being hit; these metadata blocks just happen to be newly-emptied
> indirect blocks (freed by a recent delete), which just happen to have
> been filled with zeroes as they were emptied. As I recall, Curt found
> that these were pretty clearly the source of the blocks being written in
> this case.

Ah, right, the zero'ed blocks are coming from indirect blocks of files
that were being unlinked/truncated. Got it.

> Finally, we have a question about the zero-out path: Is there any
> known, concrete improvement given by doing the zero-out as opposed to
> just continuing to split the extents? At the moment, by the way, there
> is one definite problem: Since it doesn't try to do a merge left (which
> it should) it invariably leaves a 14-block extent fragment, thus
> increasing fragmentation of the file. It's not a huge problem (since
> the extents are in fact contiguous) but it's there.

Reducing extents "fragmentation" (the number of extents in the extent
tree, even though the block allocation remains contiguous) was one
reason. Another was reducing the number of blocks that need to be
journaled, and in the case where adding an extent might cause a split;
this can be problematic if we don't have any more free blocks
available, and at the moment, not only do we not have any tree merging
code, we don't have code to decrease the tree depth once an extent
split forces us to increase the height of the tree.

Also, a contiguous write should be faster than needing to seek to the
extent tree leaf block as well as to the journal, although we haven't
measured whether the breakpoints for when we split the extent and when
we do the zero-out are ideal. That's something we should do.

- Ted