2009-09-05 02:55:05

by Frank Mayhar

[permalink] [raw]
Subject: [PATCH] Make non-journal fsync work properly.

Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode: If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter. If that parameter is nonzero
ext4_do_update_inode() calls sync_dirty_buffer() instead of
ext4_handle_dirty_metadata().

This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync(). It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.

Signed-off-by: Frank Mayhar <[email protected]>

fs/ext4/inode.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d87f6a0..24dbfb6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4741,7 +4741,8 @@ static int ext4_inode_blocks_set(handle_t *handle,
*/
static int ext4_do_update_inode(handle_t *handle,
struct inode *inode,
- struct ext4_iloc *iloc)
+ struct ext4_iloc *iloc,
+ int do_sync)
{
struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -4843,10 +4844,20 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
}

- BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- rc = ext4_handle_dirty_metadata(handle, inode, bh);
- if (!err)
- err = rc;
+ /*
+ * This is only true if we're not using a journal and we were called
+ * from ext4_write_inode() to sync the inode. We can therefore just
+ * use sync_dirty_buffer() directly to do the dirty work.
+ */
+ if (do_sync) {
+ BUFFER_TRACE(bh, "call sync_dirty_buffer");
+ sync_dirty_buffer(bh);
+ } else {
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ rc = ext4_handle_dirty_metadata(handle, inode, bh);
+ if (!err)
+ err = rc;
+ }
ei->i_state &= ~EXT4_STATE_NEW;

out_brelse:
@@ -4892,19 +4903,32 @@ out_brelse:
*/
int ext4_write_inode(struct inode *inode, int wait)
{
+ int err;
+
if (current->flags & PF_MEMALLOC)
return 0;

- if (ext4_journal_current_handle()) {
- jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
- dump_stack();
- return -EIO;
- }
+ if (EXT4_SB(inode->i_sb)->s_journal) {
+ if (ext4_journal_current_handle()) {
+ jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
+ dump_stack();
+ return -EIO;
+ }

- if (!wait)
- return 0;
+ if (!wait)
+ return 0;
+
+ err = ext4_force_commit(inode->i_sb);
+ } else {
+ struct ext4_iloc iloc;

- return ext4_force_commit(inode->i_sb);
+ err = ext4_get_inode_loc(inode, &iloc);
+ if (err)
+ return err;
+ err = ext4_do_update_inode(EXT4_NOJOURNAL_HANDLE,
+ inode, &iloc, wait);
+ }
+ return err;
}

/*
@@ -5198,7 +5222,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
get_bh(iloc->bh);

/* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
- err = ext4_do_update_inode(handle, inode, iloc);
+ err = ext4_do_update_inode(handle, inode, iloc, 0);
put_bh(iloc->bh);
return err;
}

--
Frank Mayhar <[email protected]>
Google, Inc.



2009-09-08 05:06:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Fri, Sep 04, 2009 at 07:55:00PM -0700, Frank Mayhar wrote:
> Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> mode: If we're not using a journal, ext4_write_inode() now calls
> ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> with a new "do_sync" parameter. If that parameter is nonzero
> ext4_do_update_inode() calls sync_dirty_buffer() instead of
> ext4_handle_dirty_metadata().

Hi Frank,

The problem with this patch is that it's only safe to call
sync_dirty_buffer() if we are not journalling. If we are using the
journal, we must *not* call sync_dirty_buffer(), but instead must use
jbd2_journal_dirty_metadata().

The problem is that there are paths where ext4_do_update_inode() can
get called with do_sync==1, even when journalling is enabled.
Specifically, if ext4_write_inode() is called with wait==1, wait is
passed to ext4_do_update_inode() as do_sync, and then when a journal
is present, we will end up calling sync_dirty_buffer(), which means we
will be writing out the modified metadata *before* the transaction has
committed.

If you try using your patch with journalling enabled, and you try
doing some power fail testing, my code inspection leads me to believe
with 99% certainty that the filesystem will be corrupted as a result.

I think what you need to do instead is to add an extra parameter
do_sync to ext4_handle_dirty_metadata(), and continue to call
ext4_handle_dirty_metadata. However in code paths where we will later
force a commit to guarantee that the metadata has been written out
(i.e., in the fsync() code path), ext4_handle_dirty_metadata() should
be called with the new do_sync parameter set to 1.

Does that make sense?

- Ted

2009-09-08 14:57:05

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

Hi Ted:

On Mon, Sep 7, 2009 at 10:06 PM, Theodore Tso<[email protected]> wrote:
> On Fri, Sep 04, 2009 at 07:55:00PM -0700, Frank Mayhar wrote:
>> Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
>> mode: ?If we're not using a journal, ext4_write_inode() now calls
>> ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
>> with a new "do_sync" parameter. ?If that parameter is nonzero
>> ext4_do_update_inode() calls sync_dirty_buffer() instead of
>> ext4_handle_dirty_metadata().
>
> Hi Frank,
>
> The problem with this patch is that it's only safe to call
> sync_dirty_buffer() if we are not journalling. ?If we are using the
> journal, we must *not* call sync_dirty_buffer(), but instead must use
> jbd2_journal_dirty_metadata().
>
> The problem is that there are paths where ext4_do_update_inode() can
> get called with do_sync==1, even when journalling is enabled.
> Specifically, if ext4_write_inode() is called with wait==1, wait is
> passed to ext4_do_update_inode() as do_sync, and then when a journal
> is present, we will end up calling sync_dirty_buffer(), which means we
> will be writing out the modified metadata *before* the transaction has
> committed.
>
> If you try using your patch with journalling enabled, and you try
> doing some power fail testing, my code inspection leads me to believe
> with 99% certainty that the filesystem will be corrupted as a result.
>
> I think what you need to do instead is to add an extra parameter
> do_sync to ext4_handle_dirty_metadata(), and continue to call
> ext4_handle_dirty_metadata. ?However in code paths where we will later
> force a commit to guarantee that the metadata has been written out
> (i.e., in the fsync() code path), ext4_handle_dirty_metadata() should
> be called with the new do_sync parameter set to 1.
>
> Does that make sense?

I think we can take a look at this, but there are a lot of calls to
ext4_handle_dirty_metadata(), and it's not clear on a quick inspection
that we'd be able to determine which would need to be called with
do_sync = 1...

On the other hand, this would take care of a similar problem that I
was going to be sending a patch for this week: where removing an
extent block without a journal requires a sync_dirty_buffer() in order
to avoid writeback of the extent header in the block, *after* the
block is marked free in the bitmap.

There are probably other cases where, without a journal, an explicit
sync_dirty_buffer() is needed for metadata. Handling this in
ext4_handle_dirty_metadata() may be the best way to solve this.

Thanks,
Curt

>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>

2009-09-08 15:41:07

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Tue, 2009-09-08 at 01:06 -0400, Theodore Tso wrote:
> On Fri, Sep 04, 2009 at 07:55:00PM -0700, Frank Mayhar wrote:
> > Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> > mode: If we're not using a journal, ext4_write_inode() now calls
> > ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> > with a new "do_sync" parameter. If that parameter is nonzero
> > ext4_do_update_inode() calls sync_dirty_buffer() instead of
> > ext4_handle_dirty_metadata().
>
> Hi Frank,
>
> The problem with this patch is that it's only safe to call
> sync_dirty_buffer() if we are not journalling. If we are using the
> journal, we must *not* call sync_dirty_buffer(), but instead must use
> jbd2_journal_dirty_metadata().
>
> The problem is that there are paths where ext4_do_update_inode() can
> get called with do_sync==1, even when journalling is enabled.
> Specifically, if ext4_write_inode() is called with wait==1, wait is
> passed to ext4_do_update_inode() as do_sync, and then when a journal
> is present, we will end up calling sync_dirty_buffer(), which means we
> will be writing out the modified metadata *before* the transaction has
> committed.

I needed to doublecheck before answering but I think I've covered that
angle. Specifically, in ext4_write_inode the patch only calls
ext4_do_update_inode() if s_journal is NULL, otherwise it takes the
current path.

So I think your concern is covered by the current patch. Can you take
another look and let me know if you agree? Thanks.

> I think what you need to do instead is to add an extra parameter
> do_sync to ext4_handle_dirty_metadata(), and continue to call
> ext4_handle_dirty_metadata. However in code paths where we will later
> force a commit to guarantee that the metadata has been written out
> (i.e., in the fsync() code path), ext4_handle_dirty_metadata() should
> be called with the new do_sync parameter set to 1.
>
> Does that make sense?

Actually, yes it does (my above comment notwithstanding) and I
considered that approach. Unfortunately, as Curt pointed out, there are
a metric buttload of calls to ext4_handle_dirty_metadata(). The "right"
way to fix this might be to change all of them and use
handle_dirty_metadata to deal with this but it seems like an awfully
intrusive change to fix this one problem.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-08 21:41:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Tue, Sep 08, 2009 at 07:57:02AM -0700, Curt Wohlgemuth wrote:
>
> I think we can take a look at this, but there are a lot of calls to
> ext4_handle_dirty_metadata(), and it's not clear on a quick inspection
> that we'd be able to determine which would need to be called with
> do_sync = 1...

Well, it's already the case that ext4_handle_dirty_metadata is a
#define:

#define ext4_handle_dirty_metadata(handle, inode, bh) \
__ext4_handle_dirty_metadata(__func__, (handle), (inode), (bh))

So all we need to do is change it to be:

#define ext4_handle_dirty_metadata(handle, inode, bh) \
__ext4_handle_dirty_metadata(__func__, (handle), (inode), (bh), 0)

#define ext4_handle_dirty_metadata_sync(handle, inode, bh) \
__ext4_handle_dirty_metadata(__func__, (handle), (inode), (bh), 1)

And then add the extra function parameter to
__ext4_handle_dirty_metadata(). Hey, presto! :-)

> On the other hand, this would take care of a similar problem that I
> was going to be sending a patch for this week: where removing an
> extent block without a journal requires a sync_dirty_buffer() in order
> to avoid writeback of the extent header in the block, *after* the
> block is marked free in the bitmap.

As I mentioned in the other message, the other thing we can do is to
use bforget(); that is more efficient than using sync_dirty_buffer(),
since it eliminates the write altogether. Since the file has been
deleted, there's no point writing the dirty buffer out; simply using
bforget() to drop the dirty buffer is all that is necessary.

Regards,

- Ted

2009-09-08 22:05:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Tue, Sep 08, 2009 at 08:41:05AM -0700, Frank Mayhar wrote:
> I needed to doublecheck before answering but I think I've covered that
> angle. Specifically, in ext4_write_inode the patch only calls
> ext4_do_update_inode() if s_journal is NULL, otherwise it takes the
> current path.
>
> So I think your concern is covered by the current patch. Can you take
> another look and let me know if you agree? Thanks.

It wasn't obvious from reading the diff, but after I applied the patch
and looked more closely, you're right. I'm still worried though that
the code is a bit fragile. At the very *least* the restriction that
ext4_do_update_inode's do_sync flag should only be called when there
is no journal needs to be explicitly documented. Possibly we should
have a BUG() check to enforce this restriction; although a comment
before ext4_do_update_inode() is probably enough.

- Ted




2009-09-08 22:39:56

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Tue, 2009-09-08 at 18:05 -0400, Theodore Tso wrote:
> On Tue, Sep 08, 2009 at 08:41:05AM -0700, Frank Mayhar wrote:
> > I needed to doublecheck before answering but I think I've covered that
> > angle. Specifically, in ext4_write_inode the patch only calls
> > ext4_do_update_inode() if s_journal is NULL, otherwise it takes the
> > current path.
> >
> > So I think your concern is covered by the current patch. Can you take
> > another look and let me know if you agree? Thanks.
>
> It wasn't obvious from reading the diff, but after I applied the patch
> and looked more closely, you're right. I'm still worried though that
> the code is a bit fragile. At the very *least* the restriction that
> ext4_do_update_inode's do_sync flag should only be called when there
> is no journal needs to be explicitly documented. Possibly we should
> have a BUG() check to enforce this restriction; although a comment
> before ext4_do_update_inode() is probably enough.

I agree that the code as-is is a bit fragile. Commentary is good but it
would probably be better to enforce the journal/no journal distinction
in ext4_do_update_inode() itself.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-09 17:34:27

by Frank Mayhar

[permalink] [raw]
Subject: [PATCH] ext4: Make non-journal fsync work properly. REPOST

Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode: If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
instead of ext4_handle_dirty_metadata().

This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync(). It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.

Signed-off-by: Frank Mayhar <[email protected]>

fs/ext4/inode.c | 54 ++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d87f6a0..ef2e780 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4741,7 +4741,8 @@ static int ext4_inode_blocks_set(handle_t *handle,
*/
static int ext4_do_update_inode(handle_t *handle,
struct inode *inode,
- struct ext4_iloc *iloc)
+ struct ext4_iloc *iloc,
+ int do_sync)
{
struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -4843,10 +4844,22 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
}

- BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- rc = ext4_handle_dirty_metadata(handle, inode, bh);
- if (!err)
- err = rc;
+ /*
+ * If we're not using a journal and we were called from
+ * ext4_write_inode() to sync the inode (making do_sync true),
+ * we can just use sync_dirty_buffer() directly to do our dirty
+ * work. Testing s_journal here is a bit redundant but it's
+ * worth it to avoid potential future trouble.
+ */
+ if (EXT4_SB(inode->i_sb)->s_journal == NULL && do_sync) {
+ BUFFER_TRACE(bh, "call sync_dirty_buffer");
+ sync_dirty_buffer(bh);
+ } else {
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ rc = ext4_handle_dirty_metadata(handle, inode, bh);
+ if (!err)
+ err = rc;
+ }
ei->i_state &= ~EXT4_STATE_NEW;

out_brelse:
@@ -4892,19 +4905,32 @@ out_brelse:
*/
int ext4_write_inode(struct inode *inode, int wait)
{
+ int err;
+
if (current->flags & PF_MEMALLOC)
return 0;

- if (ext4_journal_current_handle()) {
- jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
- dump_stack();
- return -EIO;
- }
+ if (EXT4_SB(inode->i_sb)->s_journal) {
+ if (ext4_journal_current_handle()) {
+ jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
+ dump_stack();
+ return -EIO;
+ }

- if (!wait)
- return 0;
+ if (!wait)
+ return 0;
+
+ err = ext4_force_commit(inode->i_sb);
+ } else {
+ struct ext4_iloc iloc;

- return ext4_force_commit(inode->i_sb);
+ err = ext4_get_inode_loc(inode, &iloc);
+ if (err)
+ return err;
+ err = ext4_do_update_inode(EXT4_NOJOURNAL_HANDLE,
+ inode, &iloc, wait);
+ }
+ return err;
}

/*
@@ -5198,7 +5224,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
get_bh(iloc->bh);

/* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
- err = ext4_do_update_inode(handle, inode, iloc);
+ err = ext4_do_update_inode(handle, inode, iloc, 0);
put_bh(iloc->bh);
return err;
}

--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-10 02:55:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Make non-journal fsync work properly. REPOST

On Wed, Sep 09, 2009 at 10:34:24AM -0700, Frank Mayhar wrote:
> Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> mode: If we're not using a journal, ext4_write_inode() now calls
> ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
> not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
> instead of ext4_handle_dirty_metadata().
>
> This problem was found in power-fail testing, checking the amount of
> loss of files and blocks after a power failure when using fsync() and
> when not using fsync(). It turned out that using fsync() was actually
> worse than not doing so, possibly because it increased the likelihood
> that the inodes would remain unflushed and would therefore be lost at
> the power failure.
>
> Signed-off-by: Frank Mayhar <[email protected]>

Added to the ext4 patch queue, thanks.

- Ted

2009-09-10 06:57:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Fri, Sep 04, 2009 at 07:55:00PM -0700, Frank Mayhar wrote:
> Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> mode: If we're not using a journal, ext4_write_inode() now calls
> ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> with a new "do_sync" parameter. If that parameter is nonzero
> ext4_do_update_inode() calls sync_dirty_buffer() instead of
> ext4_handle_dirty_metadata().
>
> This problem was found in power-fail testing, checking the amount of
> loss of files and blocks after a power failure when using fsync() and
> when not using fsync(). It turned out that using fsync() was actually
> worse than not doing so, possibly because it increased the likelihood
> that the inodes would remain unflushed and would therefore be lost at
> the power failure.
>

I think this is related to the other thread discussing the extent leak
with non journal mode. I don't find ext4 without journal adding meta
data blocks to the inode's address space mapping private_list. That
would mean sync_mapping_buffers -> fsync_buffers_list won't sync
the related metadata blocks.Tell me what i am missing

-aneesh

2009-09-10 15:33:12

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Thu, 2009-09-10 at 12:27 +0530, Aneesh Kumar K.V wrote:
> On Fri, Sep 04, 2009 at 07:55:00PM -0700, Frank Mayhar wrote:
> > Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> > mode: If we're not using a journal, ext4_write_inode() now calls
> > ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> > with a new "do_sync" parameter. If that parameter is nonzero
> > ext4_do_update_inode() calls sync_dirty_buffer() instead of
> > ext4_handle_dirty_metadata().
> >
> > This problem was found in power-fail testing, checking the amount of
> > loss of files and blocks after a power failure when using fsync() and
> > when not using fsync(). It turned out that using fsync() was actually
> > worse than not doing so, possibly because it increased the likelihood
> > that the inodes would remain unflushed and would therefore be lost at
> > the power failure.
> >
>
> I think this is related to the other thread discussing the extent leak
> with non journal mode. I don't find ext4 without journal adding meta
> data blocks to the inode's address space mapping private_list. That
> would mean sync_mapping_buffers -> fsync_buffers_list won't sync
> the related metadata blocks.Tell me what i am missing

I've been following the other thread as well. I think I'm beginning to
get a handle on just how the buffer_heads and ext4 inodes work but I
still have some learning to do. That having been said, however, it's
clear that this change does make things work much, much better, as seen
by the improvement in our power-fail tests. One way or another, the
inodes are getting flushed. After reading the other thread, I'm
beginning to suspect that it's more as a side effect of the current
tangle rather than because of it. I'll have to look further to
understand just why it's working, though.

In any event, I think this change does the right thing or is at least a
step in the right direction.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-10 19:45:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Make non-journal fsync work properly.

On Thu, Sep 10, 2009 at 08:33:06AM -0700, Frank Mayhar wrote:
>
> I've been following the other thread as well. I think I'm beginning to
> get a handle on just how the buffer_heads and ext4 inodes work but I
> still have some learning to do. That having been said, however, it's
> clear that this change does make things work much, much better, as seen
> by the improvement in our power-fail tests. One way or another, the
> inodes are getting flushed. After reading the other thread, I'm
> beginning to suspect that it's more as a side effect of the current
> tangle rather than because of it. I'll have to look further to
> understand just why it's working, though.
>
> In any event, I think this change does the right thing or is at least a
> step in the right direction.

Could you also test and give feedback for my patch, "ext4: Assure that
metadata blocks are written during fsync in no journal mode"? I'm not
sure how much your workloads care about fsync() working correctly, but
it's something we should get right for no-journal mode.

Thanks,

- Ted

2009-09-14 16:54:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext4: Make non-journal fsync work properly. REPOST

On Wed, Sep 09, 2009 at 10:34:24AM -0700, Frank Mayhar wrote:
> Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> mode: If we're not using a journal, ext4_write_inode() now calls
> ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
> not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
> instead of ext4_handle_dirty_metadata().
>
> This problem was found in power-fail testing, checking the amount of
> loss of files and blocks after a power failure when using fsync() and
> when not using fsync(). It turned out that using fsync() was actually
> worse than not doing so, possibly because it increased the likelihood
> that the inodes would remain unflushed and would therefore be lost at
> the power failure.
>
> Signed-off-by: Frank Mayhar <[email protected]>
>
> fs/ext4/inode.c | 54 ++++++++++++++++++++++++++++++++++++++++--------------
> 1 files changed, 40 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d87f6a0..ef2e780 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4741,7 +4741,8 @@ static int ext4_inode_blocks_set(handle_t *handle,
> */
> static int ext4_do_update_inode(handle_t *handle,
> struct inode *inode,
> - struct ext4_iloc *iloc)
> + struct ext4_iloc *iloc,
> + int do_sync)
> {
> struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
> struct ext4_inode_info *ei = EXT4_I(inode);
> @@ -4843,10 +4844,22 @@ static int ext4_do_update_inode(handle_t *handle,
> raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> }
>
> - BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> - rc = ext4_handle_dirty_metadata(handle, inode, bh);
> - if (!err)
> - err = rc;
> + /*
> + * If we're not using a journal and we were called from
> + * ext4_write_inode() to sync the inode (making do_sync true),
> + * we can just use sync_dirty_buffer() directly to do our dirty
> + * work. Testing s_journal here is a bit redundant but it's
> + * worth it to avoid potential future trouble.
> + */
> + if (EXT4_SB(inode->i_sb)->s_journal == NULL && do_sync) {
> + BUFFER_TRACE(bh, "call sync_dirty_buffer");
> + sync_dirty_buffer(bh);
> + } else {
> + BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> + rc = ext4_handle_dirty_metadata(handle, inode, bh);
> + if (!err)
> + err = rc;
> + }
> ei->i_state &= ~EXT4_STATE_NEW;
>
> out_brelse:
> @@ -4892,19 +4905,32 @@ out_brelse:
> */
> int ext4_write_inode(struct inode *inode, int wait)
> {
> + int err;
> +
> if (current->flags & PF_MEMALLOC)
> return 0;
>
> - if (ext4_journal_current_handle()) {
> - jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
> - dump_stack();
> - return -EIO;
> - }
> + if (EXT4_SB(inode->i_sb)->s_journal) {
> + if (ext4_journal_current_handle()) {
> + jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
> + dump_stack();
> + return -EIO;
> + }
>
> - if (!wait)
> - return 0;
> + if (!wait)
> + return 0;
> +
> + err = ext4_force_commit(inode->i_sb);
> + } else {
> + struct ext4_iloc iloc;
>
> - return ext4_force_commit(inode->i_sb);
> + err = ext4_get_inode_loc(inode, &iloc);
> + if (err)
> + return err;
> + err = ext4_do_update_inode(EXT4_NOJOURNAL_HANDLE,
> + inode, &iloc, wait);
> + }
> + return err;
> }


Why not just do

err = ext4_get_inode_loc(inode, &iloc);
if (err)
return err;
if (wait)
sync_dirty_buffer(iloc.bh);


because when we updated inode we would have called ext4_mark_inode_dirty which
internally does ext4_mark_iloc_dirty -> ext4_do_update_inode

May be we also want to check the error from sync_dirty_buffer

sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
ext4_error(inode->i_sb, __func__,
"IO error syncing inode, "
"inode=%lu, block=%llu",
inode->i_ino,
(unsigned long long) bh->b_blocknr);
err = -EIO;
}


>
> /*
> @@ -5198,7 +5224,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> get_bh(iloc->bh);
>
> /* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
> - err = ext4_do_update_inode(handle, inode, iloc);
> + err = ext4_do_update_inode(handle, inode, iloc, 0);
> put_bh(iloc->bh);
> return err;
> }
>

-aneesh

2009-09-14 17:43:30

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH] ext4: Make non-journal fsync work properly. REPOST

On Mon, 2009-09-14 at 22:24 +0530, Aneesh Kumar K.V wrote:
> On Wed, Sep 09, 2009 at 10:34:24AM -0700, Frank Mayhar wrote:
> > Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
> > mode: If we're not using a journal, ext4_write_inode() now calls
> > ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
> > with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
> > not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
> > instead of ext4_handle_dirty_metadata().
> >
> > This problem was found in power-fail testing, checking the amount of
> > loss of files and blocks after a power failure when using fsync() and
> > when not using fsync(). It turned out that using fsync() was actually
> > worse than not doing so, possibly because it increased the likelihood
> > that the inodes would remain unflushed and would therefore be lost at
> > the power failure.
> >
> > Signed-off-by: Frank Mayhar <[email protected]>
> >
> > fs/ext4/inode.c | 54 ++++++++++++++++++++++++++++++++++++++++--------------
> > 1 files changed, 40 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index d87f6a0..ef2e780 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4741,7 +4741,8 @@ static int ext4_inode_blocks_set(handle_t *handle,
> > */
> > static int ext4_do_update_inode(handle_t *handle,
> > struct inode *inode,
> > - struct ext4_iloc *iloc)
> > + struct ext4_iloc *iloc,
> > + int do_sync)
> > {
> > struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
> > struct ext4_inode_info *ei = EXT4_I(inode);
> > @@ -4843,10 +4844,22 @@ static int ext4_do_update_inode(handle_t *handle,
> > raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> > }
> >
> > - BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> > - rc = ext4_handle_dirty_metadata(handle, inode, bh);
> > - if (!err)
> > - err = rc;
> > + /*
> > + * If we're not using a journal and we were called from
> > + * ext4_write_inode() to sync the inode (making do_sync true),
> > + * we can just use sync_dirty_buffer() directly to do our dirty
> > + * work. Testing s_journal here is a bit redundant but it's
> > + * worth it to avoid potential future trouble.
> > + */
> > + if (EXT4_SB(inode->i_sb)->s_journal == NULL && do_sync) {
> > + BUFFER_TRACE(bh, "call sync_dirty_buffer");
> > + sync_dirty_buffer(bh);
> > + } else {
> > + BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> > + rc = ext4_handle_dirty_metadata(handle, inode, bh);
> > + if (!err)
> > + err = rc;
> > + }
> > ei->i_state &= ~EXT4_STATE_NEW;
> >
> > out_brelse:
> > @@ -4892,19 +4905,32 @@ out_brelse:
> > */
> > int ext4_write_inode(struct inode *inode, int wait)
> > {
> > + int err;
> > +
> > if (current->flags & PF_MEMALLOC)
> > return 0;
> >
> > - if (ext4_journal_current_handle()) {
> > - jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
> > - dump_stack();
> > - return -EIO;
> > - }
> > + if (EXT4_SB(inode->i_sb)->s_journal) {
> > + if (ext4_journal_current_handle()) {
> > + jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
> > + dump_stack();
> > + return -EIO;
> > + }
> >
> > - if (!wait)
> > - return 0;
> > + if (!wait)
> > + return 0;
> > +
> > + err = ext4_force_commit(inode->i_sb);
> > + } else {
> > + struct ext4_iloc iloc;
> >
> > - return ext4_force_commit(inode->i_sb);
> > + err = ext4_get_inode_loc(inode, &iloc);
> > + if (err)
> > + return err;
> > + err = ext4_do_update_inode(EXT4_NOJOURNAL_HANDLE,
> > + inode, &iloc, wait);
> > + }
> > + return err;
> > }
>
>
> Why not just do
>
> err = ext4_get_inode_loc(inode, &iloc);
> if (err)
> return err;
> if (wait)
> sync_dirty_buffer(iloc.bh);
>
>
> because when we updated inode we would have called ext4_mark_inode_dirty which
> internally does ext4_mark_iloc_dirty -> ext4_do_update_inode

Hmm. Yeah, you're right. I was thinking that the inode could be
dirtied without calling do_update_inode() but that's apparently not the
case.

Another version of the patch will be forthcoming shortly.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-26 00:39:08

by Frank Mayhar

[permalink] [raw]
Subject: [PATCH] ext4: Make non-journal fsync work properly. (Version 3)

This patch follows Aneesh' suggestion of just calling
sync_dirty_buffer() directly. Sorry about the delay, I've been busy

Teach ext4_write_inode() about non-journal mode in a better way,
suggested upstream. Instead of using ext4_do_update_inode(), it
now calls sync_dirty_buffer() directly. Also handle possible
error return from that function.

Signed-off-by: Frank Mayhar <[email protected]>

fs/ext4/inode.c | 37 +++++++++++++++++++++++++++++--------
1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d87f6a0..ab31cc4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4892,19 +4892,40 @@ out_brelse:
*/
int ext4_write_inode(struct inode *inode, int wait)
{
+ int err;
+
if (current->flags & PF_MEMALLOC)
return 0;

- if (ext4_journal_current_handle()) {
- jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
- dump_stack();
- return -EIO;
- }
+ if (EXT4_SB(inode->i_sb)->s_journal) {
+ if (ext4_journal_current_handle()) {
+ jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
+ dump_stack();
+ return -EIO;
+ }

- if (!wait)
- return 0;
+ if (!wait)
+ return 0;
+
+ err = ext4_force_commit(inode->i_sb);
+ } else {
+ struct ext4_iloc iloc;

- return ext4_force_commit(inode->i_sb);
+ err = ext4_get_inode_loc(inode, &iloc);
+ if (err)
+ return err;
+ if (wait)
+ sync_dirty_buffer(iloc.bh);
+ if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
+ ext4_error(inode->i_sb, __func__,
+ "IO error syncing inode, "
+ "inode=%lu, block=%llu",
+ inode->i_ino,
+ (unsigned long long)iloc.bh->b_blocknr);
+ err = -EIO;
+ }
+ }
+ return err;
}

/*

--
Frank Mayhar <[email protected]>
Google, Inc.


2009-09-29 14:09:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Make non-journal fsync work properly. (Version 3)

On Fri, Sep 25, 2009 at 05:39:05PM -0700, Frank Mayhar wrote:
> This patch follows Aneesh' suggestion of just calling
> sync_dirty_buffer() directly. Sorry about the delay, I've been busy
>
> Teach ext4_write_inode() about non-journal mode in a better way,
> suggested upstream. Instead of using ext4_do_update_inode(), it
> now calls sync_dirty_buffer() directly. Also handle possible
> error return from that function.
>
> Signed-off-by: Frank Mayhar <[email protected]>

Since the V2 version of your patch has already been merged, what I've
added to the ext4 patch queue was the delta patch between V2 and V3
version of the patch.

- Ted

commit 49130f6fbdaf8d9f3a39e316ed285ca2ff520122
Author: Frank Mayhar <[email protected]>
Date: Tue Sep 29 10:07:47 2009 -0400

ext4: Avoid updating the inode table bh twice in no journal mode

This is a cleanup of commit 91ac6f4. Since ext4_mark_inode_dirty()
has already called ext4_mark_iloc_dirty(), which in turn calls
ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
call ext4_do_update_inode() in no journal mode. Indeed, it would be
duplicated work.

Reviewed-by: "Aneesh Kumar K.V" <[email protected]>
Signed-off-by: Frank Mayhar <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bfc867a..7afb697 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4986,8 +4986,7 @@ static int ext4_inode_blocks_set(handle_t *handle,
*/
static int ext4_do_update_inode(handle_t *handle,
struct inode *inode,
- struct ext4_iloc *iloc,
- int do_sync)
+ struct ext4_iloc *iloc)
{
struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -5088,22 +5087,10 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
}

- /*
- * If we're not using a journal and we were called from
- * ext4_write_inode() to sync the inode (making do_sync true),
- * we can just use sync_dirty_buffer() directly to do our dirty
- * work. Testing s_journal here is a bit redundant but it's
- * worth it to avoid potential future trouble.
- */
- if (EXT4_SB(inode->i_sb)->s_journal == NULL && do_sync) {
- BUFFER_TRACE(bh, "call sync_dirty_buffer");
- sync_dirty_buffer(bh);
- } else {
- BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- rc = ext4_handle_dirty_metadata(handle, inode, bh);
- if (!err)
- err = rc;
- }
+ BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+ rc = ext4_handle_dirty_metadata(handle, inode, bh);
+ if (!err)
+ err = rc;
ei->i_state &= ~EXT4_STATE_NEW;

out_brelse:
@@ -5171,8 +5158,16 @@ int ext4_write_inode(struct inode *inode, int wait)
err = ext4_get_inode_loc(inode, &iloc);
if (err)
return err;
- err = ext4_do_update_inode(EXT4_NOJOURNAL_HANDLE,
- inode, &iloc, wait);
+ if (wait)
+ sync_dirty_buffer(iloc.bh);
+ if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
+ ext4_error(inode->i_sb, __func__,
+ "IO error syncing inode, "
+ "inode=%lu, block=%llu",
+ inode->i_ino,
+ (unsigned long long)iloc.bh->b_blocknr);
+ err = -EIO;
+ }
}
return err;
}
@@ -5468,7 +5463,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
get_bh(iloc->bh);

/* ext4_do_update_inode() does jbd2_journal_dirty_metadata */
- err = ext4_do_update_inode(handle, inode, iloc, 0);
+ err = ext4_do_update_inode(handle, inode, iloc);
put_bh(iloc->bh);
return err;
}