2010-03-11 15:40:34

by Dmitry Monakhov

Subject: [PATCH] ext4: fix io-barrier logic for external journal case

We have to submit the barrier before we start the journal commit process,
otherwise the transaction may be committed before the data is flushed to disk.
There is no difference from a performance point of view, but fsync
definitely becomes more correct.

If jbd2_log_start_commit returns 0, it means that the transaction
was already committed. So we don't have to issue a barrier for
ordered mode, because it was already done during commit.

For an unknown reason we ignored the return value from jbd2_log_wait_commit(),
so even in case of EIO fsync will succeed.

Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/ext4/fsync.c | 28 +++++++++++++---------------
1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..621a8ed 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,21 +88,19 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
-		/*
-		 * When the journal is on a different device than the
-		 * fs data disk, we need to issue the barrier in
-		 * writeback mode. (In ordered mode, the jbd2 layer
-		 * will take care of issuing the barrier. In
-		 * data=journal, all of the data blocks are written to
-		 * the journal device.)
-		 */
-		if (ext4_should_writeback_data(inode) &&
-		    (journal->j_fs_dev != journal->j_dev) &&
-		    (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
-		jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
+	/*
+	 * When the journal is on a different device than the
+	 * fs data disk, we need to issue the barrier in
+	 * writeback mode. (In ordered mode, the jbd2 layer
+	 * will take care of issuing the barrier. In
+	 * data=journal, all of the data blocks are written to
+	 * the journal device.)
+	 */
+	if (ext4_should_writeback_data(inode) &&
+	    (journal->j_fs_dev != journal->j_dev) &&
+	    (journal->j_flags & JBD2_BARRIER))
 		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+	if (jbd2_log_start_commit(journal, commit_tid))
+		ret = jbd2_log_wait_commit(journal, commit_tid);
 	return ret;
 }
--
1.6.6



2010-03-11 16:27:10

by Jan Kara

Subject: Re: [PATCH] ext4: fix io-barrier logic for external journal case

> We have to submit the barrier before we start the journal commit process,
> otherwise the transaction may be committed before the data is flushed to disk.
> There is no difference from a performance point of view, but fsync
> definitely becomes more correct.
>
> If jbd2_log_start_commit returns 0, it means that the transaction
> was already committed. So we don't have to issue a barrier for
> ordered mode, because it was already done during commit.
Umm, we have to - when a file has just been rewritten (i.e. no block
allocation), then i_datasync_tid is not updated and thus we won't commit
any transaction as a part of fdatasync (and that is correct because there
are no metadata that need to be written for that fdatasync). But we still
have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.

> For an unknown reason we ignored the return value from jbd2_log_wait_commit(),
> so even in case of EIO fsync will succeed.
I just forgot jbd2_log_wait_commit can return a failure...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-03-12 17:20:10

by Dmitry Monakhov

Subject: Re: [PATCH] ext4: check missed return value ext4_sync_file

Dmitry Monakhov <[email protected]> writes:

> Jan Kara <[email protected]> writes:
>
>>> We have to submit the barrier before we start the journal commit process,
>>> otherwise the transaction may be committed before the data is flushed to disk.
>>> There is no difference from a performance point of view, but fsync
>>> definitely becomes more correct.
> Unfortunately this change does affect performance, because latency
> will be increased since we have to wait for the barrier before we start
> the journal commit.
>>>
>>> If jbd2_log_start_commit returns 0, it means that the transaction
>>> was already committed. So we don't have to issue a barrier for
>>> ordered mode, because it was already done during commit.
>> Umm, we have to - when a file has just been rewritten (i.e. no block
>> allocation), then i_datasync_tid is not updated and thus we won't commit
>> any transaction as a part of fdatasync (and that is correct because there
>> are no metadata that need to be written for that fdatasync). But we still
>> have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> Yepp. I've missed that. I thought that the transaction id is updated
> even in that case.
> The most unpleasant part of the ext4_sync_file implementation is that
> a barrier is issued on each fsync() call. So some bad user may perform:
> while(1) fsync(fd);
> which results in bad system performance. And since the barrier request is
> empty it is hard to detect the cause of the trouble.
> Of course we may solve it by introducing some sort of dirty flag
> which is set in write_page and cleared in fsync. But it looks like
> an ugly workaround.
>>
>>> For an unknown reason we ignored the return value from jbd2_log_wait_commit(),
>>> so even in case of EIO fsync will succeed.
>> I just forgot jbd2_log_wait_commit can return a failure...
> In respect to previous comments the patch reduced to simple missed
> error check fix.
It is funny, but I've found that journalled mode is still broken in ext4
in case of an external journal. We forget to issue an io-barrier to j_fs_dev
if the transaction has only metadata and has no data blocks :)
This affects all data modes.
It is easy to reproduce with the classic test-case with data=journal:
for (i = 0; i < 3; i++) {
	memset(buf, 'a' + i, 1024*1024);
	pwrite(fd, buf, 1024*1024, 0);
	fsync(fd);
}
/* At this time transaction was committed so journal is empty */
<<POWER_OFF
Later I found old data ('b' chars) at the end of the file.
So I've prepared another patch which supersedes the previous one.
> BTW: While investigating similar code in ext3 I've found that
> fsync is broken in case of an external journal. JBD itself does not
> send a barrier to j_fs_dev. So if fsync goes via the
> log_start_commit/log_wait_commit path, data loss is still possible.
> I'm able to reproduce this via a simple write test
> while (1) {
> 	write(fd, buf, 1024*1024);
> 	fsync(fd);
> }
> and then reboot in the middle of the operation.
> A later file content check spotted data inconsistency.
> Will send a fix ASAP.
>
> From 1f7382ea4a8b8e3880e1938d161f924ea572a1e1 Mon Sep 17 00:00:00 2001
> From: Dmitry Monakhov <[email protected]>
> Date: Thu, 11 Mar 2010 20:14:13 +0300
> Subject: [PATCH] ext4: check missed return value ext4_sync_file
>
>
> Signed-off-by: Dmitry Monakhov <[email protected]>
> ---
> fs/ext4/fsync.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 0d0c323..42bd94a 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -101,7 +101,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
>  		    (journal->j_fs_dev != journal->j_dev) &&
>  		    (journal->j_flags & JBD2_BARRIER))
>  			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> -		jbd2_log_wait_commit(journal, commit_tid);
> +		ret = jbd2_log_wait_commit(journal, commit_tid);
>  	} else if (journal->j_flags & JBD2_BARRIER)
>  		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>  	return ret;
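
The control flow that the one-line patch above leaves in place can be modeled
outside the kernel. This is a rough user-space sketch, not kernel code: the
struct sync_model and the function model_sync_tail are hypothetical names,
and each field only stands in for the kernel check named in its comment. It
shows the two things discussed in the thread: the result of waiting for the
commit is returned to the caller, and a flush is still issued when no commit
had to be started.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state consulted at the end of ext4_sync_file(). */
struct sync_model {
	bool writeback_mode;	/* models ext4_should_writeback_data(inode) */
	bool external_journal;	/* models journal->j_fs_dev != journal->j_dev */
	bool barrier_enabled;	/* models journal->j_flags & JBD2_BARRIER */
	bool commit_started;	/* models jbd2_log_start_commit() != 0 */
	int wait_commit_result;	/* models the return of jbd2_log_wait_commit() */
	int flushes;		/* counts modeled blkdev_issue_flush() calls */
};

static int model_sync_tail(struct sync_model *m)
{
	int ret = 0;

	if (m->commit_started) {
		/* External journal in writeback mode: flush the fs data device. */
		if (m->writeback_mode && m->external_journal &&
		    m->barrier_enabled)
			m->flushes++;
		/* The fix: keep the result instead of dropping it. */
		ret = m->wait_commit_result;
	} else if (m->barrier_enabled) {
		/* No commit needed, but data may still sit in the disk cache. */
		m->flushes++;
	}
	return ret;
}
```

With commit_started set and wait_commit_result set to -EIO the model returns
-EIO, which corresponds to fsync failing instead of silently succeeding.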

2010-03-17 11:23:03

by Jan Kara

Subject: Re: [PATCH] ext4: check missed return value ext4_sync_file

> Jan Kara <[email protected]> writes:
> >>
> >> If jbd2_log_start_commit return 0 then it means that transaction
> >> was already committed. So we don't have to issue barrier for
> >> ordered mode, because it was already done during commit.
> > Umm, we have to - when a file has just been rewritten (i.e. no block
> > allocation), then i_datasync_tid is not updated and thus we won't commit
> > any transaction as a part of fdatasync (and that is correct because there
> > are no metadata that need to be written for that fdatasync). But we still
> > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> Yepp. I've missed that. i thought that transaction id updated even in that
> case. The most unpleasant part in ext4_sync_file implementation is that
> barrier is issued on each fsync() call. So some bad user may perform:
> while(1) fsync(fd); which result in bad system performance. And since barrier
> request is empty it is hard to detect the reason of troubles.
Actually, you'll be able to see the barrier requests in the blktrace dump
so it won't be that hard to detect.

> Of course we may solve it by introducing some sort of dirty flag which is
> set in write_page and cleared in fsync. But it looks like an ugly workaround.
I agree that sending a barrier request on each fsync isn't very nice, but
in the common case I'd assume that an application calls fsync only if it has
written something to the file previously. So I wouldn't invest much into
solving this until I see a realistic use case where it matters...

> >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
> >> so even in case of EIO fsync will succeed.
> > I just forgot jbd2_log_wait_commit can return a failure...
> In respect to previous comments the patch reduced to simple missed
> error check fix.
I guess you can resend the fix to Ted directly to catch his attention.

> BTW: While investigating similar code in ext3 i've found what fsync is broken
> in case of external journal.
Yes, I've noticed this recently as well. So will you send a fix or should
I go and backport ext4 fixes of this?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-03-17 11:24:59

by Jan Kara

Subject: Re: [PATCH] ext4: check missed return value ext4_sync_file

> > Jan Kara <[email protected]> writes:
> > >>
> > >> If jbd2_log_start_commit return 0 then it means that transaction
> > >> was already committed. So we don't have to issue barrier for
> > >> ordered mode, because it was already done during commit.
> > > Umm, we have to - when a file has just been rewritten (i.e. no block
> > > allocation), then i_datasync_tid is not updated and thus we won't commit
> > > any transaction as a part of fdatasync (and that is correct because there
> > > are no metadata that need to be written for that fdatasync). But we still
> > > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> > Yepp. I've missed that. i thought that transaction id updated even in that
> > case. The most unpleasant part in ext4_sync_file implementation is that
> > barrier is issued on each fsync() call. So some bad user may perform:
> > while(1) fsync(fd); which result in bad system performance. And since barrier
> > request is empty it is hard to detect the reason of troubles.
> Actually, you'll be able to see the barrier requests in the blktrace dump
> so it won't be that hard to detect.
>
> > Of course we may solve it by introducing some sort of dirty flag which is
> > set in write_page and cleared in fsync. But it looks like an ugly workaround.
> I agree that sending barrier request on each fsync isn't very nice but
> in common case, I'd assume that an application calls fsync only if it has
> written something to the file previously. So I wouldn't invest much into
> solving this until I see a realistic use case where it matters...
>
> > >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
> > >> so even in case of EIO fsync will succeed.
> > > I just forgot jbd2_log_wait_commit can return a failure...
> > In respect to previous comments the patch reduced to simple missed
> > error check fix.
> I guess you can resend the fix to Ted directly to catch his attention.
>
> > BTW: While investigating similar code in ext3 i've found what fsync is broken
> > in case of external journal.
> Yes, I've noticed this recently as well. So will you send a fix or should
> I go and backport ext4 fixes of this?
Oops, sorry, I've noticed you sent the patches to the list already...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-03-17 11:38:15

by Dmitry Monakhov

Subject: Re: [PATCH] ext4: check missed return value ext4_sync_file

Jan Kara <[email protected]> writes:

>> Jan Kara <[email protected]> writes:
>> >>
>> >> If jbd2_log_start_commit return 0 then it means that transaction
>> >> was already committed. So we don't have to issue barrier for
>> >> ordered mode, because it was already done during commit.
>> > Umm, we have to - when a file has just been rewritten (i.e. no block
>> > allocation), then i_datasync_tid is not updated and thus we won't commit
>> > any transaction as a part of fdatasync (and that is correct because there
>> > are no metadata that need to be written for that fdatasync). But we still
>> > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
>> Yepp. I've missed that. i thought that transaction id updated even in that
>> case. The most unpleasant part in ext4_sync_file implementation is that
>> barrier is issued on each fsync() call. So some bad user may perform:
>> while(1) fsync(fd); which result in bad system performance. And since barrier
>> request is empty it is hard to detect the reason of troubles.
> Actually, you'll be able to see the barrier requests in the blktrace dump
> so it won't be that hard to detect.
>
>> Of course we may solve it by introducing some sort of dirty flag which is
>> set in write_page and cleared in fsync. But it looks like an ugly workaround.
> I agree that sending barrier request on each fsync isn't very nice but
> in common case, I'd assume that an application calls fsync only if it has
> written something to the file previously. So I wouldn't invest much into
> solving this until I see a realistic use case where it matters...
>
>> >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
>> >> so even in case of EIO fsync will succeed.
>> > I just forgot jbd2_log_wait_commit can return a failure...
>> In respect to previous comments the patch reduced to simple missed
>> error check fix.
> I guess you can resend the fix to Ted directly to catch his attention.
Ohh.. After this letter I found new issues with metadata; as a result
a new patch version was sent.
http://marc.info/?l=linux-ext4&m=126841481923132&w=2
>
>> BTW: While investigating similar code in ext3 i've found what fsync is broken
>> in case of external journal.
> Yes, I've noticed this recently as well. So will you send a fix or should
> I go and backport ext4 fixes of this?
I've already done that:
http://marc.info/?l=linux-ext4&m=126841482023138&w=2
It already contains a fix for the metadata handling logic.


>
> Honza

2010-03-22 02:12:18

by Theodore Ts'o

Subject: Re: [PATCH] ext4: check missed return value ext4_sync_file

On Fri, Mar 12, 2010 at 11:37:43AM +0300, Dmitry Monakhov wrote:
> The most unpleasant part in ext4_sync_file implementation is that
> barrier is issued on each fsync() call. So some bad user may perform:
> while(1) fsync(fd);
> which result in bad system performance. And since barrier request is
> empty it is hard to detect the reason of troubles.
> Of course we may solve it by introducing some sort of dirty flag
> which is set in write_page and cleared in fsync. But it looks like
> an ugly workaround.

We could potentially put the dirty flag in the inode instead, and set
it in the write_prepare() and writepages() code paths. I'm not entirely sure
it's worth it, though.

> In respect to previous comments the patch reduced to simple missed
> error check fix.

I've added this to the ext4 patch queue, and I will ignore your
earlier version of the patch.

- Ted
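
The per-inode dirty flag idea mentioned in this message can be sketched as a
small user-space model. Everything here is hypothetical: struct inode_model,
model_write and model_fsync are illustrative names, and a real implementation
would also need locking and would have to cover every path that dirties
data. The sketch only shows why a loop of fsync() calls with no intervening
writes would stop generating flush requests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode state: was data written since the last flush? */
struct inode_model {
	bool data_dirty;
	int flushes;	/* counts modeled cache flush requests */
};

/* Models the write paths: any write marks the inode as needing a flush. */
static void model_write(struct inode_model *ino)
{
	ino->data_dirty = true;
}

/* Models fsync: flush only if something was written since the last flush. */
static void model_fsync(struct inode_model *ino)
{
	if (!ino->data_dirty)
		return;
	/* Clear first, so a write arriving later marks the inode dirty again. */
	ino->data_dirty = false;
	ino->flushes++;
}
```

In this model one write followed by three fsync calls results in a single
flush, while alternating write and fsync results in one flush per pair.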