2009-09-10 01:35:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
>
> First, ext4_journal_forget() is called from ext4_forget() only when
> we're journalling; without a journal, ext4_journal_forget() is only
> called for various non-extent paths. ext4_forget() could be changed,
> of course...

Ext4_forget() calls either ext4_journal_forget() or
ext4_journal_revoke(). So we need to fix up both functions.

- Ted

commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
Author: Theodore Ts'o <[email protected]>
Date: Wed Sep 9 21:32:41 2009 -0400

ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()

When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be dropped from the
current transaction and/or revoked. This is done by calling the
functions ext4_journal_forget() and ext4_journal_revoke(), which call
jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.

Since the jbd2_journal_forget() and jbd2_journal_revoke() call
bforget(), if ext4 is not using a journal, ext4_journal_forget() and
ext4_journal_revoke() must call bforget() to avoid a dirty metadata
block overwriting a block after it has been reallocated and reused for
another inode's data block.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index eb27fd0..ecb9ca4 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
handle, err);
}
else
- brelse(bh);
+ bforget(bh);
return err;
}

@@ -60,7 +60,7 @@ int __ext4_journal_revoke(const char *where, handle_t *handle,
handle, err);
}
else
- brelse(bh);
+ bforget(bh);
return err;
}



2009-09-10 06:54:00

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
> >
> > First, ext4_journal_forget() is called from ext4_forget() only when
> > we're journalling; without a journal, ext4_journal_forget() is only
> > called for various non-extent paths. ext4_forget() could be changed,
> > of course...
>
> Ext4_forget() calls either ext4_journal_forget() or
> ext4_journal_revoke(). So we need to fix up both functions.
>
> - Ted
>
> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> Author: Theodore Ts'o <[email protected]>
> Date: Wed Sep 9 21:32:41 2009 -0400
>
> ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>
> When ext4 is using a journal, a metadata block which is deallocated
> must be passed into the journal layer so it can be dropped from the
> current transaction and/or revoked. This is done by calling the
> functions ext4_journal_forget() and ext4_journal_revoke(), which call
> jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>
> Since the jbd2_journal_forget() and jbd2_journal_revoke() call
> bforget(), if ext4 is not using a journal, ext4_journal_forget() and
> ext4_journal_revoke() must call bforget() to avoid a dirty metadata
> block overwriting a block after it has been reallocated and reused for
> another inode's data block.
>

I am sure i am missing something. But where are we adding the buffer_head
to the mapping->private_list ?. For ext2 when we allocate meta data blocks
we do mark_buffer_dirty_inode which add the buffer_head to the inodes
private_list. Shouldn't we do something similar with Ext4 without journal ?

-aneesh

2009-09-10 15:46:43

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Wed, Sep 9, 2009 at 11:54 PM, Aneesh Kumar
K.V<[email protected]> wrote:
> On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
>> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
>> >
>> > First, ext4_journal_forget() is called from ext4_forget() only when
>> > we're journalling; without a journal, ext4_journal_forget() is only
>> > called for various non-extent paths. ?ext4_forget() could be changed,
>> > of course...
>>
>> Ext4_forget() calls either ext4_journal_forget() or
>> ext4_journal_revoke(). ?So we need to fix up both functions.
>>
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? - Ted
>>
>> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
>> Author: Theodore Ts'o <[email protected]>
>> Date: ? Wed Sep 9 21:32:41 2009 -0400
>>
>> ? ? ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>>
>> ? ? When ext4 is using a journal, a metadata block which is deallocated
>> ? ? must be passed into the journal layer so it can be dropped from the
>> ? ? current transaction and/or revoked. ?This is done by calling the
>> ? ? functions ext4_journal_forget() and ext4_journal_revoke(), which call
>> ? ? jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>>
>> ? ? Since the jbd2_journal_forget() and jbd2_journal_revoke() call
>> ? ? bforget(), if ext4 is not using a journal, ext4_journal_forget() and
>> ? ? ext4_journal_revoke() must call bforget() to avoid a dirty metadata
>> ? ? block overwriting a block after it has been reallocated and reused for
>> ? ? another inode's data block.
>>
>
> I am sure i am missing something. But where are we adding the buffer_head
> to the mapping->private_list ?. For ext2 when we allocate meta data blocks
> we do mark_buffer_dirty_inode which add the buffer_head to the inodes
> private_list. Shouldn't we do something similar with Ext4 without journal ?

As Ted explained to me, all buffer heads pointing to metadata blocks
are attached to the block device inode. So pdflush writes of these
pages go through the block device address space ops. Explicit
sync_dirty_buffer() calls for the metadata buffer heads still work, of
course.

Curt

2009-09-10 15:55:15

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Wed, Sep 9, 2009 at 6:35 PM, Theodore Tso<[email protected]> wrote:
> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
>>
>> First, ext4_journal_forget() is called from ext4_forget() only when
>> we're journalling; without a journal, ext4_journal_forget() is only
>> called for various non-extent paths. ?ext4_forget() could be changed,
>> of course...
>
> Ext4_forget() calls either ext4_journal_forget() or
> ext4_journal_revoke(). ?So we need to fix up both functions.
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>
> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> Author: Theodore Ts'o <[email protected]>
> Date: ? Wed Sep 9 21:32:41 2009 -0400
>
> ? ?ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>
> ? ?When ext4 is using a journal, a metadata block which is deallocated
> ? ?must be passed into the journal layer so it can be dropped from the
> ? ?current transaction and/or revoked. ?This is done by calling the
> ? ?functions ext4_journal_forget() and ext4_journal_revoke(), which call
> ? ?jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>
> ? ?Since the jbd2_journal_forget() and jbd2_journal_revoke() call
> ? ?bforget(), if ext4 is not using a journal, ext4_journal_forget() and
> ? ?ext4_journal_revoke() must call bforget() to avoid a dirty metadata
> ? ?block overwriting a block after it has been reallocated and reused for
> ? ?another inode's data block.
>
> ? ?Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index eb27fd0..ecb9ca4 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?handle, err);
> ? ? ? ?}
> ? ? ? ?else
> - ? ? ? ? ? ? ? brelse(bh);
> + ? ? ? ? ? ? ? bforget(bh);
> ? ? ? ?return err;
> ?}
>
> @@ -60,7 +60,7 @@ int __ext4_journal_revoke(const char *where, handle_t *handle,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?handle, err);
> ? ? ? ?}
> ? ? ? ?else
> - ? ? ? ? ? ? ? brelse(bh);
> + ? ? ? ? ? ? ? bforget(bh);
> ? ? ? ?return err;
> ?}

This looks quite reasonable to me. And of course, your patch fixes
the brelse() calls that I added back in July.

Of course, validating this patch on my end will be tough; I need at
least a week of no corruptions to convince myself that the problem is
fixed, to the same degree that my call to sync_dirty_buffer() works.

Thanks,
Curt

2009-09-10 16:24:45

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Thu, Sep 10, 2009 at 08:46:41AM -0700, Curt Wohlgemuth wrote:
> On Wed, Sep 9, 2009 at 11:54 PM, Aneesh Kumar
> K.V<[email protected]> wrote:
> > On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
> >> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
> >> >
> >> > First, ext4_journal_forget() is called from ext4_forget() only when
> >> > we're journalling; without a journal, ext4_journal_forget() is only
> >> > called for various non-extent paths. ?ext4_forget() could be changed,
> >> > of course...
> >>
> >> Ext4_forget() calls either ext4_journal_forget() or
> >> ext4_journal_revoke(). ?So we need to fix up both functions.
> >>
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? - Ted
> >>
> >> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> >> Author: Theodore Ts'o <[email protected]>
> >> Date: ? Wed Sep 9 21:32:41 2009 -0400
> >>
> >> ? ? ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
> >>
> >> ? ? When ext4 is using a journal, a metadata block which is deallocated
> >> ? ? must be passed into the journal layer so it can be dropped from the
> >> ? ? current transaction and/or revoked. ?This is done by calling the
> >> ? ? functions ext4_journal_forget() and ext4_journal_revoke(), which call
> >> ? ? jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
> >>
> >> ? ? Since the jbd2_journal_forget() and jbd2_journal_revoke() call
> >> ? ? bforget(), if ext4 is not using a journal, ext4_journal_forget() and
> >> ? ? ext4_journal_revoke() must call bforget() to avoid a dirty metadata
> >> ? ? block overwriting a block after it has been reallocated and reused for
> >> ? ? another inode's data block.
> >>
> >
> > I am sure i am missing something. But where are we adding the buffer_head
> > to the mapping->private_list ?. For ext2 when we allocate meta data blocks
> > we do mark_buffer_dirty_inode which add the buffer_head to the inodes
> > private_list. Shouldn't we do something similar with Ext4 without journal ?
>
> As Ted explained to me, all buffer heads pointing to metadata blocks
> are attached to the block device inode. So pdflush writes of these
> pages go through the block device address space ops. Explicit
> sync_dirty_buffer() calls for the metadata buffer heads still work, of
> course.

But how would it work for fsync ? I mean

I would expect for no journal mode ext4_sync_file should be doing
simple_fsync(). That should be forcing the metadata buffer_heads
via sync_mapping_buffers. And if we reuse these meta buffers we
drop them the inode->mapping->private_list using bforget.

But I don't see any of the above in code

-aneesh

2009-09-10 18:58:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
>
> But how would it work for fsync ? I mean
>
> I would expect for no journal mode ext4_sync_file should be doing
> simple_fsync(). That should be forcing the metadata buffer_heads
> via sync_mapping_buffers. And if we reuse these meta buffers we
> drop them the inode->mapping->private_list using bforget.
>
> But I don't see any of the above in code

Aneesh, you're addressing a different problem than the one that Curt
were trying to deal with this patch. The problem we are worry about
is one where an inode's extent tree or indirect blocks are modified
right before the inode is deleted, and then one or more of those
metadata blocks get reallocated and written right away (most likely
this will happen via an O_DIRECT write), and then, because we didn't
use bforget(), the dirty metadata block in the buffer cache would get
written out, overwriting the O_DIRECT block.

What you're worrying about, is a different issue. You're concerned
about the fact that since we are not associating an inode's extent
tree or indirect blocks with the inode, those blocks won't get forced
out to disk on an fsync() in ext4 no-journal mode. This may not be a
big deal for applications which expect to recover from an unclean
using mke2fs (and thus probably don't use fsync in any case), but
here's a patch to deal with the problem you've raised.

- Ted

commit 417cf58253fbf3e36df7b3aca11c120e8367f5e6
Author: Theodore Ts'o <[email protected]>
Date: Thu Sep 10 14:58:02 2009 -0400

ext4: Assure that metadata blocks are written during fsync in no journal mode

When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list so that fsync() will write out the inode's
metadata blocks. This is done via mark_buffer_dirty_inode().

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index ecb9ca4..6a94099 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
ext4_journal_abort_handle(where, __func__, bh,
handle, err);
} else {
- mark_buffer_dirty(bh);
+ if (inode && bh)
+ mark_buffer_dirty_inode(bh, inode);
+ else
+ mark_buffer_dirty(bh);
if (inode && inode_needs_sync(inode)) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {


2009-09-11 17:21:39

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
> >
> > But how would it work for fsync ? I mean
> >
> > I would expect for no journal mode ext4_sync_file should be doing
> > simple_fsync(). That should be forcing the metadata buffer_heads
> > via sync_mapping_buffers. And if we reuse these meta buffers we
> > drop them the inode->mapping->private_list using bforget.
> >
> > But I don't see any of the above in code
>
> Aneesh, you're addressing a different problem than the one that Curt
> were trying to deal with this patch. The problem we are worry about
> is one where an inode's extent tree or indirect blocks are modified
> right before the inode is deleted, and then one or more of those
> metadata blocks get reallocated and written right away (most likely
> this will happen via an O_DIRECT write), and then, because we didn't
> use bforget(), the dirty metadata block in the buffer cache would get
> written out, overwriting the O_DIRECT block.
>
> What you're worrying about, is a different issue. You're concerned
> about the fact that since we are not associating an inode's extent
> tree or indirect blocks with the inode, those blocks won't get forced
> out to disk on an fsync() in ext4 no-journal mode. This may not be a
> big deal for applications which expect to recover from an unclean
> using mke2fs (and thus probably don't use fsync in any case), but
> here's a patch to deal with the problem you've raised.
>
> - Ted

But the patch you posted is using bforget which is removing the
buffer_head from the inode->mapping->private_list. What i am
trying to figure out is where does the buffer_head getting added
to the private_list. ?

-aneesh

2009-09-11 17:36:26

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Fri, Sep 11, 2009 at 10:21 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
>> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
>> >
>> > But how would it work for fsync ? I mean
>> >
>> > I would expect for no journal mode ext4_sync_file ?should be doing
>> > simple_fsync(). That should be forcing the metadata buffer_heads
>> > via sync_mapping_buffers. And if we reuse these meta buffers we
>> > drop them the inode->mapping->private_list using bforget.
>> >
>> > But I don't see any of the above in code
>>
>> Aneesh, you're addressing a different problem than the one that Curt
>> were trying to deal with this patch. ?The problem we are worry about
>> is one where an inode's extent tree or indirect blocks are modified
>> right before the inode is deleted, and then one or more of those
>> metadata blocks get reallocated and written right away (most likely
>> this will happen via an O_DIRECT write), and then, because we didn't
>> use bforget(), the dirty metadata block in the buffer cache would get
>> written out, overwriting the O_DIRECT block.
>>
>> What you're worrying about, is a different issue. ?You're concerned
>> about the fact that since we are not associating an inode's extent
>> tree or indirect blocks with the inode, those blocks won't get forced
>> out to disk on an fsync() in ext4 no-journal mode. ?This may not be a
>> big deal for applications which expect to recover from an unclean
>> using mke2fs (and thus probably don't use fsync in any case), but
>> here's a patch to deal with the problem you've raised.
>>
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
>
> But the patch you posted is using bforget which is removing the
> buffer_head from the inode->mapping->private_list. What i am
> trying to figure out is where does the buffer_head getting added
> to the private_list. ?

I don't think the buffer_head's b_assoc_map is set, because as you
say, mark_buffer_dirty_inode() isn't called from ext4.

All the bforget() call does in this case is clear the BH dirty bit,
which prevents it from being written out during writeback.

Unless I'm missing something too...

Curt

2009-09-11 18:08:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Fri, Sep 11, 2009 at 10:51:25PM +0530, Aneesh Kumar K.V wrote:
>
> But the patch you posted is using bforget which is removing the
> buffer_head from the inode->mapping->private_list. What i am
> trying to figure out is where does the buffer_head getting added
> to the private_list. ?
>

bforget() does three things; 1) it clears the buffer dirty flag, 2) it
drops the buffer from inode->mapping->private_list (if it is on such a
list), and 3) it drops the refcount (when it calls __brelse()). (n.b.
brelse() doesn't actually "release" or "free" a buffer; the name is a
holdover from the days when buffer cache was a stand-alone entity,
instead of being implemented in the page cache.)

The bug that Curt and I was chasing was related to (1) not happening,
and the fix that I proposed to fix it is "ext4: Use bforget() in no
journal mode for ext4_journal_{forget,revoke}()". This bug was caused
by the fact that brelse() never releases the bh which is part of the
problem. Even if bh->b_count is zero, the page in the buffer cache is
still present, and the buffer head will still be marked dirty. So if
you don't want a dirty buffer to be written out to disk (because its
inode has been deleted and the block might be reallocated for some
other purpose), it's not enough to brelse() it; you have to explicitly
call clear_buffer_dirty() or bforget().

Your question about where does the buffer_head is getting added to
private_list is reltaed to (2), and the answer (at least before my
proposed patch, "ext4: Assure that metadata blocks are written during
fsync in no journal mode" is 'nowhere'. So that was an entirely
separate problem, albeit one that I'm not sure Curt cares about for
his workloads. (Why take the performance hit of fsync() in your
applications if your recovery strategy in case of a crash is mke2fs
and copy from one of the other two redundant servers?)

Does that help clarify matters? Basically, there are three separate
bugs related to no journal mode that are being addressed by patches in
the ext4 patch queue:

ext4: Make non-journal fsync work properly
(Found and fixed by Frank; we need to explicitly write out the
inode structure to disk during an fsync since we can't depend
on the journal doing this for us in no-journal mode. So this is
an issue of the inode itself not getting written out by
ext4_write_inode, which is called by pdflush and fsync.
Since the inode table buffer is marked dirty, the inode will
*eventually* be written out, but on a much greater time scale.
This caused the increased fragility of ext4 in no-journal mode
after a power failure.)

ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
(Pointed out by Aneesh, fixed by Ted; this fix makes sure that
fsync will write out an inode's extent tree and/or indirect blocks,
which is kinda important. :-)

ext4: Assure that metadata blocks are written during fsync in no journal mode
(Found by Curt, with an initial fix that worked by forcing
the dirty buffer to be written to disk, and fixed in a better way
by Ted by using bforget. The problem here relates to an inode that
is being deleted, which is why there's no reason to write the
dirty block to disk; when we were about to deallocate the block ---
the better fix is to drop the dirty bit by using bforget.)

Hopefully this quick Cliff Notes(tm) summary of the ext4 no-journal
patches in the ext4 patch queue is helpful.

Regards,

- Ted

2009-09-11 18:15:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Fri, Sep 11, 2009 at 02:08:27PM -0400, Theodore Tso wrote:
> Does that help clarify matters? Basically, there are three separate
> bugs related to no journal mode that are being addressed by patches in
> the ext4 patch queue:

Whoops, I mixed up the patch names and description for two of them.
Let me try again:

ext4: Make non-journal fsync work properly
(Found and fixed by Frank; we need to explicitly write out the
inode structure to disk during an fsync since we can't depend
on the journal doing this for us in no-journal mode. So this is
an issue of the inode itself not getting written out by
ext4_write_inode, which is called by pdflush and fsync.
Since the inode table buffer is marked dirty, the inode will
*eventually* be written out, but on a much greater time scale.
This caused the increased fragility of ext4 in no-journal mode
after a power failure.)

ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
(Found by Curt, with an initial fix that worked by forcing
the dirty buffer to be written to disk, and fixed in a better way
by Ted by using bforget. The problem here relates to an inode that
is being deleted, which is why there's no reason to write the
dirty block to disk; when we were about to deallocate the block ---
the better fix is to drop the dirty bit by using bforget.)

ext4: Assure that metadata blocks are written during fsync in no journal mode
(Pointed out by Aneesh, fixed by Ted; this fix makes sure that
fsync will write out an inode's extent tree and/or indirect blocks,
which is kinda important. :-)

(I swapped the patch names/description for the last two.)

> Hopefully this quick Cliff Notes(tm) summary of the ext4 no-journal
> patches in the ext4 patch queue is helpful.

- Ted

2009-09-12 15:00:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
> >
> > But how would it work for fsync ? I mean
> >
> > I would expect for no journal mode ext4_sync_file should be doing
> > simple_fsync(). That should be forcing the metadata buffer_heads
> > via sync_mapping_buffers. And if we reuse these meta buffers we
> > drop them the inode->mapping->private_list using bforget.
> >
> > But I don't see any of the above in code
>
> Aneesh, you're addressing a different problem than the one that Curt
> were trying to deal with this patch. The problem we are worry about
> is one where an inode's extent tree or indirect blocks are modified
> right before the inode is deleted, and then one or more of those
> metadata blocks get reallocated and written right away (most likely
> this will happen via an O_DIRECT write), and then, because we didn't
> use bforget(), the dirty metadata block in the buffer cache would get
> written out, overwriting the O_DIRECT block.
>
> What you're worrying about, is a different issue. You're concerned
> about the fact that since we are not associating an inode's extent
> tree or indirect blocks with the inode, those blocks won't get forced
> out to disk on an fsync() in ext4 no-journal mode. This may not be a
> big deal for applications which expect to recover from an unclean
> using mke2fs (and thus probably don't use fsync in any case), but
> here's a patch to deal with the problem you've raised.
>
> - Ted
>
> commit 417cf58253fbf3e36df7b3aca11c120e8367f5e6
> Author: Theodore Ts'o <[email protected]>
> Date: Thu Sep 10 14:58:02 2009 -0400
>
> ext4: Assure that metadata blocks are written during fsync in no journal mode
>
> When there is no journal present, we must attach buffer heads
> associated with extent tree and indirect blocks to the inode's
> mapping->private_list so that fsync() will write out the inode's
> metadata blocks. This is done via mark_buffer_dirty_inode().
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index ecb9ca4..6a94099 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
> ext4_journal_abort_handle(where, __func__, bh,
> handle, err);
> } else {
> - mark_buffer_dirty(bh);
> + if (inode && bh)
> + mark_buffer_dirty_inode(bh, inode);
> + else
> + mark_buffer_dirty(bh);
> if (inode && inode_needs_sync(inode)) {
> sync_dirty_buffer(bh);
> if (buffer_req(bh) && !buffer_uptodate(bh)) {
>


This does add the meta data buffer_head to the inode->mapping->private_list.
But ext4_sync_file is not writing them. I guess we need to call sync_mapping_buffers
for no-journal mode in ext4_sync_file

-aneesh

2009-09-12 17:59:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Can we talk about bforget() and metadata blocks

On Sat, Sep 12, 2009 at 08:30:36PM +0530, Aneesh Kumar K.V wrote:
>
> This does add the meta data buffer_head to the
> inode->mapping->private_list. But ext4_sync_file is not writing
> them. I guess we need to call sync_mapping_buffers for no-journal
> mode in ext4_sync_file.

Good point, thanks for catching this. Here's a revised patch which
adds the call to ext4_sync_file().

- Ted

commit 6c6e80dc88568f4e49004967f8fd56b69c86e715
Author: Theodore Ts'o <[email protected]>
Date: Sat Sep 12 13:41:55 2009 -0400

ext4: Assure that metadata blocks are written during fsync in no journal mode

When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list via mark_buffer_dirty_inode() so that
ext4_sync_file() --- which is called to service fsync() and
fdatasync() system calls --- can write out the inode's metadata blocks
by calling sync_mapping_buffers().

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index ecb9ca4..6a94099 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
ext4_journal_abort_handle(where, __func__, bh,
handle, err);
} else {
- mark_buffer_dirty(bh);
+ if (inode && bh)
+ mark_buffer_dirty_inode(bh, inode);
+ else
+ mark_buffer_dirty(bh);
if (inode && inode_needs_sync(inode)) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index ab418c0..0747574 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -50,7 +50,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
- int ret = 0;
+ int err, ret = 0;

J_ASSERT(ext4_journal_current_handle() == NULL);

@@ -79,6 +79,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
goto out;
}

+ if (!journal)
+ ret = sync_mapping_buffers(inode->i_mapping);
+
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
goto out;

@@ -91,7 +94,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
- ret = sync_inode(inode, &wbc);
+ err = sync_inode(inode, &wbc);
+ if (ret == 0)
+ ret = err;
}
out:
if (journal && (journal->j_flags & JBD2_BARRIER))