I'm seeing some corruption in data files during heavy use on ext4 file
systems, which appears to be a bug. The symptom is this:
A random block in the middle of an otherwise undistinguished 8MB data file
has a pattern like this:
$ od -Ax -x <file>
...
001000 b4aa 0005 000c 0201 002e 0000 4e31 0005
001010 0ff4 0202 2e2e 0000 ce67 0004 000c 0102
001020 6e69 0000 ce69 0004 0fdc 0103 756f 0074
001030 0000 0000 0000 0000 0000 0000 0000 0000
*
002000 8b83 f727 10d0 b918 ad2a 8edc 67f7 e178
...
The block from 0x1000 to 0x2000 looks an awful lot like a block of directory
entries, with the dirents:
inode : 373930
rec_len : 12
name_len : 1
file_type : 2 (dir)
name : "."
inode : 347697
rec_len : 4084 (i.e., all the rest of the block
name_len : 2
file_type : 2 (dir)
name : ".."
with remnants of other, deleted dirents following it.
These corruptions are pretty rare, and I can't replicate the problem in any
sort of simple test case. But looking at the code, it seems that there's a
problem with deletion of "metadata" blocks full of dirents: ext4_forget()
is never called for them.
For all other blocks used as metadata, ext4_forget() seems to be called when
they're about to be freed up:
- extent blocks
- indirect blocks
- xattr blocks
But I don't see anywhere that we call ext4_forget() (or ext4_journal_forget
directly) for directory entries.
So when a directory is removed with "rm -rf foo" , as the files are deleted,
the directory block(s) are marked dirty. But when the directory blocks
themselves are freed up, bforget() isn't called for their bufferheads, and
so they remain dirty in the page cache, and can be written down later, after
their blocks have been reused.
This is the same problem I saw with extent metadata blocks "leaking" into
data blocks, fixed with c7acb4c16646943180bd221c167a077e0a084f9c , which
added calls to bforget() in ext4_journal_{forget,revoke}() . But in this
new case, it would seem to be an issue both with and without a journal, and
with both extent- and non-extent based directories.
Am I missing something? And if not, suggestions on the best place to fix
this? I was thinking of doing this in ext4_truncate() for all truncated
blocks if this is a directory; or in ext4_mb_free_blocks() if "metadata" is
1. But this latter one would be overkill for all those "normal" metadata
blocks which already have been "forgotten."
Thanks,
Curt
On Fri, Nov 13, 2009 at 03:46:09PM -0800, Curt Wohlgemuth wrote:
> So when a directory is removed with "rm -rf foo" , as the files are deleted,
> the directory block(s) are marked dirty. But when the directory blocks
> themselves are freed up, bforget() isn't called for their bufferheads, and
> so they remain dirty in the page cache, and can be written down later, after
> their blocks have been reused.
Well, it should be happening as part of the call to ext4_forget(). It
looks like it's happening on the code paths that release the blocks.
This is critically important if journalling is enabled, since we have
to call jbd2_journal_revoke() on directory blocks before they can be
reused as data blocks. Hmm, if the buffer was part of a transaction,
we don't call __bforget() on it in jbd2_journal_forget(). So I can
see how you might be seeing a problem with journalling enabled, but
I'm puzzled why you were also seeing a problem without journalling.
So to help debug what is going on, I tried adding the a new tracepoint
to ext4_forget(). I've attached it to the end of this message. Using
it, I can confirm that ext4_forget() is getting called for
directories. I do see a problem though that we're not checking to see
if the inode is a directory; in that case, we need to treat it as if
it were metadata, and call ext4_journal_revoke() instead of
ext4_journal_forget(). Otherwise we could have the problem that after
a crash, on a journal replay, we might end up replaying the directory
block after it has been reallocated and used as a data block.
(Hmm.... I think this might be a problem for ext3 as well.)
I am very puzzled why you are seeing a problem in no journal mode,
though. It looks like the right thing should be happening. Is the
8MB data file is getting written via direct I/O or buffered I/O?
- Ted
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 554c679..13de1dd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -89,6 +89,7 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
might_sleep();
+ trace_ext4_forget(inode, is_metadata, blocknr);
BUFFER_TRACE(bh, "enter");
jbd_debug(4, "forgetting bh %p: is_metadata = %d, mode %o, "
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d09550b..b390e1f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -907,6 +907,32 @@ TRACE_EVENT(ext4_mballoc_free,
__entry->result_len, __entry->result_logical)
);
+TRACE_EVENT(ext4_forget,
+ TP_PROTO(struct inode *inode, int is_metadata, __u64 block),
+
+ TP_ARGS(inode, is_metadata, block),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( umode_t, mode )
+ __field( int, is_metadata )
+ __field( __u64, block )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->mode = inode->i_mode;
+ __entry->is_metadata = is_metadata;
+ __entry->block = block;
+ ),
+
+ TP_printk("dev %s ino %lu mode %d is_metadata %d block %llu",
+ jbd2_dev_to_name(__entry->dev), (unsigned long) __entry->ino,
+ __entry->mode, __entry->is_metadata, __entry->block)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
When a directory gets unlinked, ext4_forget() is called on any buffer
heads corresponding to its data blocks. Data blocks from directories
must be treated as metadata, so that they are revoked by
jbd2_journal_revoke, and not just forgotten via ext4_journal_forget().
Thanks to Curt Wohlgemuth for pointing out potential problems in this
area.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/inode.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 13de1dd..639bb84 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -97,6 +97,10 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
bh, is_metadata, inode->i_mode,
test_opt(inode->i_sb, DATA_FLAGS));
+ /* Directory blocks must be treated as metadata */
+ if (S_ISDIR(inode->i_mode))
+ is_metadata = 1;
+
/* Never use the revoke function if we are doing full data
* journaling: there is no need to, and a V1 superblock won't
* support it. Otherwise, only skip the revoke on un-journaled
--
1.6.5.216.g5288a.dirty
On Sat, Nov 14, 2009 at 07:30:59PM -0500, Theodore Ts'o wrote:
> When a directory gets unlinked, ext4_forget() is called on any buffer
> heads corresponding to its data blocks. Data blocks from directories
> must be treated as metadata, so that they are revoked by
> jbd2_journal_revoke, and not just forgotten via ext4_journal_forget().
>
> Thanks to Curt Wohlgemuth for pointing out potential problems in this
> area.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/inode.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 13de1dd..639bb84 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -97,6 +97,10 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
> bh, is_metadata, inode->i_mode,
> test_opt(inode->i_sb, DATA_FLAGS));
>
> + /* Directory blocks must be treated as metadata */
> + if (S_ISDIR(inode->i_mode))
> + is_metadata = 1;
> +
> /* Never use the revoke function if we are doing full data
> * journaling: there is no need to, and a V1 superblock won't
> * support it. Otherwise, only skip the revoke on un-journaled
I guess we need to make sure we call ext4_forget with correct is_metadata values. I
did the below patch. The xattr changes in the patch should be split as a separate one.
I am not sure why we do a get_bh there.
Another question i have is, do we actually supporting freeing directory blocks
when we delete directory entries ? I remember reading we don't have support for that.
So may be Curt is not seeing the ext4_forget being called because he is trying delete
of directory entries. I guess he will have to do a rmdir directory to see the
directory blocks freed.
If you think the changes are correct i will send proper patches with s-o-b
-aneesh
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 715264b..74dcff8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2074,7 +2074,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
ext_debug("free last %u blocks starting %llu\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
- ext4_forget(handle, 0, inode, bh, start + i);
+ ext4_forget(handle, metadata, inode, bh, start + i);
}
ext4_free_blocks(handle, inode, start, num, metadata);
} else if (from == le32_to_cpu(ex->ee_block)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2c8caa5..87232a5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -96,6 +96,13 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
bh, is_metadata, inode->i_mode,
test_opt(inode->i_sb, DATA_FLAGS));
+ /*
+ * The caller should have set is_metadata properly. But
+ * being extra careful here
+ */
+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+ is_metadata = 1;
+
/* Never use the revoke function if we are doing full data
* journaling: there is no need to, and a V1 superblock won't
* support it. Otherwise, only skip the revoke on un-journaled
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index fed5b01..3c93a9a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -482,9 +482,8 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
ea_bdebug(bh, "refcount now=0; freeing");
if (ce)
mb_cache_entry_free(ce);
- ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
- get_bh(bh);
ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
+ ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
} else {
le32_add_cpu(&BHDR(bh)->h_refcount, -1);
error = ext4_handle_dirty_metadata(handle, inode, bh);
On Sun, Nov 15, 2009 at 12:34:48PM +0530, Aneesh Kumar K.V wrote:
> On Sat, Nov 14, 2009 at 07:30:59PM -0500, Theodore Ts'o wrote:
> > When a directory gets unlinked, ext4_forget() is called on any buffer
> > heads corresponding to its data blocks. Data blocks from directories
> > must be treated as metadata, so that they are revoked by
> > jbd2_journal_revoke, and not just forgotten via ext4_journal_forget().
> >
> > Thanks to Curt Wohlgemuth for pointing out potential problems in this
> > area.
> >
> > Signed-off-by: "Theodore Ts'o" <[email protected]>
> > ---
> > fs/ext4/inode.c | 4 ++++
> > 1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 13de1dd..639bb84 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -97,6 +97,10 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
> > bh, is_metadata, inode->i_mode,
> > test_opt(inode->i_sb, DATA_FLAGS));
> >
> > + /* Directory blocks must be treated as metadata */
> > + if (S_ISDIR(inode->i_mode))
> > + is_metadata = 1;
> > +
> > /* Never use the revoke function if we are doing full data
> > * journaling: there is no need to, and a V1 superblock won't
> > * support it. Otherwise, only skip the revoke on un-journaled
>
> I guess we need to make sure we call ext4_forget with correct is_metadata values. I
> did the below patch. The xattr changes in the patch should be split as a separate one.
> I am not sure why we do a get_bh there.
The callers of ext4_xattr_release_block is also doing a brelse on the buffer_head.
So we need that get_bh.
-aneesh
On Sun, Nov 15, 2009 at 12:34:48PM +0530, Aneesh Kumar K.V wrote:
>
> I guess we need to make sure we call ext4_forget with correct
> is_metadata values. I did the below patch. The xattr changes in the
> patch should be split as a separate one. I am not sure why we do a
> get_bh there.
It doesn't hurt to call ext4_forget() with the correct values, but I
figured it was easier just to make ext4_forget() DTRT thing by
checking the inode type since it has access to i_mode. My patch
didn't take into account symlinks, though. Good catch on your part.
> Another question i have is, do we actually supporting freeing
> directory blocks when we delete directory entries ? I remember
> reading we don't have support for that.
No, we don't.
> So may be Curt is not
> seeing the ext4_forget being called because he is trying delete of
> directory entries. I guess he will have to do a rmdir directory to
> see the directory blocks freed.
I'm assuming the problem that Curt was seeing was due to directories
being deleted, and the blocks getting reused immediately afterwards
for data blocks. I'm guessing the right was done via direct I/O,
which means it would have been posted right away, and somehow the
dirty buffer head some managed to not get forgotten via bforget(). In
the non-journal case, I don't see how that could happen, but I must be
missing something with the code paths. My experiments show that
ext4_forget() is getting called, but apparently somehow bforget() must
be getting called after that point.
> If you think the changes are correct i will send proper patches with s-o-b
I already have a patch in the patch queue, and I'll just update it to
include checking for S_ISLNK(inode->i_mode). I suppose I can add your
change to set is_metadata in ext4_remove_blocks(), but that only
handles the extents case. The direct/indirect mapped case also has a
similar issue, which is why decided it was most straightforward to fix
it in ext4_forget().
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index fed5b01..3c93a9a 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -482,9 +482,8 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
> ea_bdebug(bh, "refcount now=0; freeing");
> if (ce)
> mb_cache_entry_free(ce);
> - ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
> - get_bh(bh);
> ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
> + ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
> } else {
> le32_add_cpu(&BHDR(bh)->h_refcount, -1);
> error = ext4_handle_dirty_metadata(handle, inode, bh);
This change isn't needed, as you pointed out in a later e-mail,
ext4_xattr_release_block() isn't supposed to change the refcount of
the buffer_head; it is brelse'ed by its caller.
- Ted
When a directory gets unlinked, ext4_forget() is called on any buffer
heads corresponding to its data blocks. Data blocks from directories
must be treated as metadata, so that they are revoked by
jbd2_journal_revoke, and not just forgotten via ext4_journal_forget().
Thanks to Curt Wohlgemuth for pointing out potential problems in this
area.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
---
fs/ext4/inode.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 13de1dd..24729ed 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -97,6 +97,10 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
bh, is_metadata, inode->i_mode,
test_opt(inode->i_sb, DATA_FLAGS));
+ /* Directory or symlink blocks must be treated as metadata */
+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+ is_metadata = 1;
+
/* Never use the revoke function if we are doing full data
* journaling: there is no need to, and a V1 superblock won't
* support it. Otherwise, only skip the revoke on un-journaled
--
1.6.5.216.g5288a.dirty
On Sun, Nov 15, 2009 at 12:43 PM, Theodore Tso <[email protected]> wrote:
> On Sun, Nov 15, 2009 at 12:34:48PM +0530, Aneesh Kumar K.V wrote:
>>
>> I guess we need to make sure we call ext4_forget with correct
>> is_metadata values. I did the below patch. The xattr changes in the
>> patch should be split as a separate one. ?I am not sure why we do a
>> get_bh there.
>
> It doesn't hurt to call ext4_forget() with the correct values, but I
> figured it was easier just to make ext4_forget() DTRT thing by
> checking the inode type since it has access to i_mode. ?My patch
> didn't take into account symlinks, though. ? Good catch on your part.
>
>> Another question i have is, do we actually supporting freeing
>> directory blocks when we delete directory entries ? I remember
>> reading we don't have support for that.
>
> No, we don't.
>
>> So may be Curt is not
>> seeing the ext4_forget being called because he is trying delete of
>> directory entries. I guess he will have to do a rmdir directory to
>> see the directory blocks freed.
>
> I'm assuming the problem that Curt was seeing was due to directories
> being deleted, and the blocks getting reused immediately afterwards
> for data blocks. ?I'm guessing the right was done via direct I/O,
> which means it would have been posted right away, and somehow the
> dirty buffer head some managed to not get forgotten via bforget(). ?In
> the non-journal case, I don't see how that could happen, but I must be
> missing something with the code paths. ?My experiments show that
> ext4_forget() is getting called, but apparently somehow bforget() must
> be getting called after that point.
Yes, I'm also assuming that the problem is with deleting directories.
And yes, DIO is used for the 8MB files that are being corrupted.
I'm not sure how I missed the call to ext4_forget() in
ext4_remove_blocks() and ext4_clear_blocks(), but I did -- or at least
I didn't realize they were called for all data blocks being freed.
Thanks.
As to why this happened on our systems where no journal is being used:
I believe these were running older kernels that didn't have the
patches in c7acb4c16646943180bd221c167a077e0a084f9c, and hence weren't
calling bforget() properly.
Thanks,
Curt
>
>> If you think the changes are correct i will send proper patches with s-o-b
>
> I already have a patch in the patch queue, and I'll just update it to
> include checking for S_ISLNK(inode->i_mode). ?I suppose I can add your
> change to set is_metadata in ext4_remove_blocks(), but that only
> handles the extents case. ?The direct/indirect mapped case also has a
> similar issue, which is why decided it was most straightforward to fix
> it in ext4_forget().
>
>> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
>> index fed5b01..3c93a9a 100644
>> --- a/fs/ext4/xattr.c
>> +++ b/fs/ext4/xattr.c
>> @@ -482,9 +482,8 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>> ? ? ? ? ? ? ? ea_bdebug(bh, "refcount now=0; freeing");
>> ? ? ? ? ? ? ? if (ce)
>> ? ? ? ? ? ? ? ? ? ? ? mb_cache_entry_free(ce);
>> - ? ? ? ? ? ? ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
>> - ? ? ? ? ? ? get_bh(bh);
>> ? ? ? ? ? ? ? ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
>> + ? ? ? ? ? ? ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
>> ? ? ? } else {
>> ? ? ? ? ? ? ? le32_add_cpu(&BHDR(bh)->h_refcount, -1);
>> ? ? ? ? ? ? ? error = ext4_handle_dirty_metadata(handle, inode, bh);
>
> This change isn't needed, as you pointed out in a later e-mail,
> ext4_xattr_release_block() isn't supposed to change the refcount of
> the buffer_head; it is brelse'ed by its caller.
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>
On Sun, Nov 15, 2009 at 03:43:46PM -0500, Theodore Tso wrote:
> On Sun, Nov 15, 2009 at 12:34:48PM +0530, Aneesh Kumar K.V wrote:
> >
> > I guess we need to make sure we call ext4_forget with correct
> > is_metadata values. I did the below patch. The xattr changes in the
> > patch should be split as a separate one. I am not sure why we do a
> > get_bh there.
>
> It doesn't hurt to call ext4_forget() with the correct values, but I
> figured it was easier just to make ext4_forget() DTRT thing by
> checking the inode type since it has access to i_mode. My patch
> didn't take into account symlinks, though. Good catch on your part.
May be you want to merge the ext4_remove_blocks changes also. That make
sure anybody reading code doesn't have to spent time in figuring out
why ext4_forget is called with metadata = 0 and ext4_free_blocks is called
with metadata = 1.
-aneesh
On Mon, Nov 16, 2009 at 12:31:46PM +0530, Aneesh Kumar K.V wrote:
> May be you want to merge the ext4_remove_blocks changes also. That make
> sure anybody reading code doesn't have to spent time in figuring out
> why ext4_forget is called with metadata = 0 and ext4_free_blocks is called
> with metadata = 1.
Yeah, I suppose so. The reason why I didn't was because currently
ext4_forget() doesn't get called with metadata = 1 on the
direct/indirect-mapped path for directories and symlinks, and I
figured why not keep things consistent between those two callers of
ext4_forget().
Long term we should probably clean up the indirect path as well, I
suppose, and then remove the safety checks in ext4_free_blocks() and
ext4_forget(). That will save a tiny amount of CPU, which I doubt
anyone except Google will be able to measure or notice. :-)
- Ted