On the off chance that this sounds familiar to anyone out there...
I've got a situation in which data files written by an application are
showing very occasional checksum errors sometimes. The data files are
all around 8MB long, written using O_DIRECT into fallocated space.
(The entire fallocated space for the example file below is written to
with valid data; i.e., no holes, no truncation, no uninitialized
extents.)
When these occasional checksum failures show up, the data in the files
is rather odd. I've seen 4 cases of this so far, and the "bad" data
always starts on a block boundary, and always has the first 12 bytes
that are identical to what an extent header would look like (for a
header at the start of a block of extents or extent indexes):
Here's the "od -Ad -x" output from one such file:
8388608 f30a 0000 0154 0000 0000 0000 0000 0000
(I.e., the first 2 bytes are EXT4_EXT_MAGIC, and bytes 4-5 are 0x154,
or what eh_max would be for a block size of 4096 bytes.)
In this case, the "bad" data starts at block 2048. Two cases have
this pattern at block 2048; two at block 2050. A syscall trace of one
such corrupted file shows that this block was written with a single
write encompassing many adjacent blocks:
write(fd=10, size=192512, offset=8204288)
The file in question above has only two (in-inode) extents, which I
verified look valid. The block in question (2048) above is covered by
the second extent: logical blocks 2037-2050.
I've seen the amount of "bad" data (including the "extent header"
above) to be pretty variable: between 70 and 800 bytes; I haven't been
able to correlate the rest of the bad data to any particular ext4 data
structures.
My guess is that a block of extents from a truncated or removed file
was reused for data for this file, and somehow was not written
correctly. This seems (slightly) more plausible to me than the extent
metadata of an existing file was "leaked" into this one.
Does any of this ring a bell to anybody?
Thanks,
Curt
Hi Valerie:
On Tue, Sep 8, 2009 at 10:56 AM, Valerie Aurora<[email protected]> wrote:
> Hey, did you figure this out? ?If not, I want to have a bug open
> somewhere.
Yes, sorry. I was going to post a patch for this, but have been
waiting to verify that it really fixes the issue. And see the thread
started by Frank Mayhar about fsync issues as well...
The problem is a race, between the last write to a to-be-freed
metadata block (to update the extent header) and the block being
marked free in the on-disk/buddy bitmaps. Note that this only happens
without a journal, since *with* a journal the ordering is done
correctly.
Without a journal, the block buffer_head is written to, the
buffer_head is marked dirty, and the bitmaps are updated via
ext4_free_blocks(). In rare cases, the block is re-allocated for
another inode and written to -- subsequently, the writeback mechanism
will then flush the dirty extent header back to disk. That's why it
looks like "leaked extent data" in the data block.
I'm discussing with Frank whether we should handle this in
ext4_handle_dirty_metadata(), as per Ted's suggestion, or in separate
one-off patches, or what.
Thanks,
Curt
>
> Thanks,
>
> -VAL
>
> On Sat, Aug 22, 2009 at 04:10:56PM -0700, Curt Wohlgemuth wrote:
>> On the off chance that this sounds familiar to anyone out there...
>>
>> I've got a situation in which data files written by an application are
>> showing very occasional checksum errors sometimes. ?The data files are
>> all around 8MB long, written using O_DIRECT into fallocated space.
>> (The entire fallocated space for the example file below is written to
>> with valid data; i.e., no holes, no truncation, no uninitialized
>> extents.)
>>
>> When these occasional checksum failures show up, the data in the files
>> is rather odd. ?I've seen 4 cases of this so far, and the "bad" data
>> always starts on a block boundary, and always has the first 12 bytes
>> that are identical to what an extent header would look like (for a
>> header at the start of a block of extents or extent indexes):
>>
>> Here's the "od -Ad -x" output from one such file:
>>
>> ? ? ? ? ? ? ?8388608 f30a 0000 0154 0000 0000 0000 0000 0000
>>
>> (I.e., the first 2 bytes are EXT4_EXT_MAGIC, and bytes 4-5 are 0x154,
>> or what eh_max would be for a block size of 4096 bytes.)
>>
>> In this case, the "bad" data starts at block 2048. ?Two cases have
>> this pattern at block 2048; two at block 2050. ?A syscall trace of one
>> such corrupted file shows that this block was written with a single
>> write encompassing many adjacent blocks:
>>
>> ? ? ? ? ?write(fd=10, size=192512, offset=8204288)
>>
>> The file in question above has only two (in-inode) extents, which I
>> verified look valid. ?The block in question (2048) above is covered by
>> the second extent: ?logical blocks 2037-2050.
>>
>> I've seen the amount of "bad" data (including the "extent header"
>> above) to be pretty variable: between 70 and 800 bytes; I haven't been
>> able to correlate the rest of the bad data to any particular ext4 data
>> structures.
>>
>> My guess is that a block of extents from a truncated or removed file
>> was reused for data for this file, and somehow was not written
>> correctly. ?This seems (slightly) more plausible to me than the extent
>> metadata of an existing file was "leaked" into this one.
>>
>> Does any of this ring a bell to anybody?
>>
>> Thanks,
>> Curt
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>
On Tue, Sep 08, 2009 at 11:21:11AM -0700, Curt Wohlgemuth wrote:
> Hi Valerie:
>
> On Tue, Sep 8, 2009 at 10:56 AM, Valerie Aurora<[email protected]> wrote:
> > Hey, did you figure this out? ?If not, I want to have a bug open
> > somewhere.
>
> Yes, sorry. I was going to post a patch for this, but have been
> waiting to verify that it really fixes the issue. And see the thread
> started by Frank Mayhar about fsync issues as well...
>
> The problem is a race, between the last write to a to-be-freed
> metadata block (to update the extent header) and the block being
> marked free in the on-disk/buddy bitmaps. Note that this only happens
> without a journal, since *with* a journal the ordering is done
> correctly.
Just to clarify, this a race that shows up even without an unclean
shutdown, right?
> Without a journal, the block buffer_head is written to, the
> buffer_head is marked dirty, and the bitmaps are updated via
> ext4_free_blocks(). In rare cases, the block is re-allocated for
> another inode and written to -- subsequently, the writeback mechanism
> will then flush the dirty extent header back to disk. That's why it
> looks like "leaked extent data" in the data block.
If this shows up even without an unclean shutdown, then it sounds like
the problem is a missing bforget() call.
- Ted
Hi Ted:
On Tue, Sep 8, 2009 at 12:40 PM, Theodore Tso<[email protected]> wrote:
> On Tue, Sep 08, 2009 at 11:21:11AM -0700, Curt Wohlgemuth wrote:
>> Hi Valerie:
>>
>> On Tue, Sep 8, 2009 at 10:56 AM, Valerie Aurora<[email protected]> wrote:
>> > Hey, did you figure this out? ?If not, I want to have a bug open
>> > somewhere.
>>
>> Yes, sorry. ?I was going to post a patch for this, but have been
>> waiting to verify that it really fixes the issue. ?And see the thread
>> started by Frank Mayhar about fsync issues as well...
>>
>> The problem is a race, between the last write to a to-be-freed
>> metadata block (to update the extent header) and the block being
>> marked free in the on-disk/buddy bitmaps. ?Note that this only happens
>> without a journal, since *with* a journal the ordering is done
>> correctly.
>
> Just to clarify, this a race that shows up even without an unclean
> shutdown, right?
Correct.
>> Without a journal, the block buffer_head is written to, the
>> buffer_head is marked dirty, and the bitmaps are updated via
>> ext4_free_blocks(). ?In rare cases, the block is re-allocated for
>> another inode and written to -- subsequently, the writeback mechanism
>> will then flush the dirty extent header back to disk. ?That's why it
>> looks like "leaked extent data" in the data block.
>
> If this shows up even without an unclean shutdown, then it sounds like
> the problem is a missing bforget() call.
I looked into this, and it may be merely my ignorance, but I don't see
how bforget() would solve the race.
All bforget() does is clear the buffer's dirty bit. Meanwhile, the
page is still marked dirty, and can be in the middle of writeback;
it's true that __block_write_full_page() will check the dirty bit for
each buffer in the page, but there doesn't seem to be any
synchronization to ensure that the write won't take place at some
point in time after bforget() is called. Which means it can be called
after the bitmap is changed.
This is why I opted to wait for the buffer to be written out before
continuing on to ext4_free_blocks().
Am I missing something?
Thanks,
Curt
On Tue, Sep 08, 2009 at 02:18:35PM -0700, Curt Wohlgemuth wrote:
>
> All bforget() does is clear the buffer's dirty bit. Meanwhile, the
> page is still marked dirty, and can be in the middle of writeback;
> it's true that __block_write_full_page() will check the dirty bit for
> each buffer in the page, but there doesn't seem to be any
> synchronization to ensure that the write won't take place at some
> point in time after bforget() is called. Which means it can be called
> after the bitmap is changed.
Let me make sure I got this right. The problem that you're worried
about is a block that had previously contained an extent tree node for
an inode that gets deleted, and then that blocks gets reallocated for
use as a data block. In ext3 and ext4, metadata blocks (such as
extent tree blocks), aren't stored in the page cache.
So I'm not sure why you're worried about the page being marked dirty.
What's the scenario you are concerned about?
If it's the case where a data block for a deleted inode getting
rewritten after the inode is deleted, when the inode is deleted,
truncate_inode_apges() end up dropping the pages from the page cache
*before* the block allocation bitmap is dropped.
> This is why I opted to wait for the buffer to be written out before
> continuing on to ext4_free_blocks().
Just to be clear, which buffer are you talking about here?
- Ted
Hi Ted:
On Tue, Sep 8, 2009 at 4:36 PM, Theodore Tso<[email protected]> wrote:
> On Tue, Sep 08, 2009 at 02:18:35PM -0700, Curt Wohlgemuth wrote:
>>
>> All bforget() does is clear the buffer's dirty bit. ?Meanwhile, the
>> page is still marked dirty, and can be in the middle of writeback;
>> it's true that __block_write_full_page() will check the dirty bit for
>> each buffer in the page, but there doesn't seem to be any
>> synchronization to ensure that the write won't take place at some
>> point in time after bforget() is called. ?Which means it can be called
>> after the bitmap is changed.
>
> Let me make sure I got this right. ?The problem that you're worried
> about is a block that had previously contained an extent tree node for
> an inode that gets deleted, and then that blocks gets reallocated for
> use as a data block.
Correct.
>?In ext3 and ext4, metadata blocks (such as
> extent tree blocks), aren't stored in the page cache.
Hmm. You're saying that in the absence of a journal, all metadata
writes go direct to disk? Where should I look for this in the code?
Looking at ext4_ext_new_meta_block() and code that uses it, I don't
see anything that prevents the use of the page cache. And if this
were the case, wouldn't the call to mark_buffer_dirty() in
__ext4_handle_dirty_metadata() (when there's no journal) do nothing?
I also put in code in submit_bio() to scan all pages for the extent
header pattern that I was seeing ("leaking" into the data pages).
When I saw it, the stack trace was always from pdflush() (from
wb_kupdate()). I.e., these are from the page cache.
> So I'm not sure why you're worried about the page being marked dirty.
> What's the scenario you are concerned about?
If you're right that metadata writes are not through the page cache,
then there is no scenario I'm worried about :-) .
The problem is that I've seen this in real life. And the patch below
seems to fix it. (Unfortunately, I haven't been able to recreate this
in a simple example, after several days work. I've only seen this in
a *very* small number of cases on heavily loaded machines.)
> If it's the case where a data block for a deleted inode getting
> rewritten after the inode is deleted, when the inode is deleted,
> truncate_inode_apges() end up dropping the pages from the page cache
> *before* the block allocation bitmap is dropped.
It's quite possible that there's an interaction with older code that
we have in our 2.6.26-based kernels -- our ext4/jbd2 code is pretty
up-to-date, but the rest of the code base is not. But really, in my
case -- and you'll have to trust me -- I've seen this pattern:
1. file A (~8MB) is written out and closed, with a final mod time of 12:08 p.m.
2. My submit_bio() scan sees the "bad extent header" written out to
physical block B at 12:15 p.m.
3. Looking at file A later, its logical block 2048 corresponds to
physical block B -- and contains the "bad extent header" pattern.
truncate_inode_pages() only deals with data blocks, right? So it
should have no effect on metadata...
>> This is why I opted to wait for the buffer to be written out before
>> continuing on to ext4_free_blocks().
>
> Just to be clear, which buffer are you talking about here?
The leaf extent blocks buffer_head. Here's the patch, as applied to a
2.6.30.3 version of extents.c:
diff -Naur orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-09-08 20:28:46.000000000 -0700
+++ new/fs/ext4/extents.c 2009-09-08 20:31:42.000000000 -0700
@@ -1958,6 +1958,15 @@
return err;
ext_debug("index is empty, remove it, free block %llu\n", leaf);
bh = sb_find_get_block(inode->i_sb, leaf);
+
+ /*
+ * If we don't have a journal, then we've dirtied the BH for the leaf
+ * block, but we're freeing the block now. We need to wait here for
+ * the page to be written out before we proceed.
+ */
+ if (!ext4_handle_valid(handle) && bh)
+ sync_dirty_buffer(bh);
+
ext4_forget(handle, 1, inode, bh, leaf);
ext4_free_blocks(handle, inode, leaf, 1, 1);
return err;
Thanks,
Curt
On Tue, Sep 08, 2009 at 09:00:50PM -0700, Curt Wohlgemuth wrote:
>
> >?In ext3 and ext4, metadata blocks (such as
> > extent tree blocks), aren't stored in the page cache.
>
> Hmm. You're saying that in the absence of a journal, all metadata
> writes go direct to disk? Where should I look for this in the code?
Sorry, let me be more precise. All metadata writes, regardless of
whether a journal is present or not, are written via the buffer head
(bh) abstraction. They have to, because that's how we do our
journalling; the jbd/jbd2 layer is built on top of the bh I/O request
layer, and even when a journal is not present, we are still doing our
metadata I/O via the submit_bh and ll_rw_block interface.
It used to be the case (in Linux 2.4) that the buffer cache was stored
separately from the page cache. In Linux 2.6, the buffer cache is
implemented on top of the page cache, so technically, the metadata
blocks are stored in the page cache; however, they are only *accessed*
via the buffer cache abstraction.
> The problem is that I've seen this in real life. And the patch below
> seems to fix it. (Unfortunately, I haven't been able to recreate this
> in a simple example, after several days work. I've only seen this in
> a *very* small number of cases on heavily loaded machines.)
I believe that you have a problem. The problem is you have a dirty bh
which is getting written out after the block gets reallocated for use
as a data block. But a bforget() call should have the problem just as
as well. In fact, I think the real fix should be this.
commit 1b58b00e02893b4bbab2b5f137316b82feadac52
Author: Theodore Ts'o <[email protected]>
Date: Wed Sep 9 11:18:42 2009 -0400
ext4: Use bforget() in no journal mode when in ext4_journal_forget()
When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be "revoked". The
jbd2_journal_forget() function is also responsible for calling
bforget(). Without a journal, ext4_journal_forget() must call
bforget(), to avoid a race from a dirty metadata block getting written
back after it has been reallocated and reused for another inode's data
block.
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index eb27fd0..d4f4b39 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
handle, err);
}
else
- brelse(bh);
+ bforget(bh);
return err;
}
- Ted