2009-01-13 13:14:28

by Pavel Machek

[permalink] [raw]
Subject: ext2 + -osync: not as easy as it seems

Hi!

I tried to use ext2+ -osync, and was surprised that my data do not hit
the (SATA) disk. Fortunately I soon realized that I need hdparm -W0
/dev/sda.

Could the ext2 be teached to use barriers? -W0 hurts performance more
than neccessary...

Plus it has a nasty behaviour where it reverts to -W1 if disk
connection is momentarily lost. (If you unplug/replug SATA disk, linux
will happily rediscover and use it, but -W0 was already forgotten at
that point, right?)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


2009-01-13 13:45:28

by Alan

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

> Plus it has a nasty behaviour where it reverts to -W1 if disk
> connection is momentarily lost. (If you unplug/replug SATA disk, linux
> will happily rediscover and use it, but -W0 was already forgotten at
> that point, right?)

If you momentarily lose power your disk state isn't defined anyway. If
you do that with the SATA code it should treat it the same as a scsi disk
swap so it'll get a new device and the old fs will go down.

2009-01-13 14:04:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

Adding a barrier shouldn't be that hard; just a matter adding a call
to blkdev_issue_flush() to ext2_sync_file() before it returns.

BTW, I think there's a stale documentation bug in in
block/blk-barrier.c, around line 305:

* Description:
* Issue a flush for the block device in question. Caller can supply
* room for storing the error offset in case of a flush error, if they
- * wish to. Caller must run wait_for_completion() on its own.
+ * wish to.
*/
int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
{

Jens, is that right?

- Ted

2009-01-13 14:08:41

by Jens Axboe

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Tue, Jan 13 2009, Theodore Tso wrote:
> Adding a barrier shouldn't be that hard; just a matter adding a call
> to blkdev_issue_flush() to ext2_sync_file() before it returns.
>
> BTW, I think there's a stale documentation bug in in
> block/blk-barrier.c, around line 305:
>
> * Description:
> * Issue a flush for the block device in question. Caller can supply
> * room for storing the error offset in case of a flush error, if they
> - * wish to. Caller must run wait_for_completion() on its own.
> + * wish to.
> */
> int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> {
>
> Jens, is that right?

Yep, that is indeed stale!

--
Jens Axboe

2009-01-13 14:26:37

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] block: Fix documentation for blkdev_issue_flush()

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
block/blk-barrier.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 8eba4e4..f7dae57 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -302,7 +302,7 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
* Description:
* Issue a flush for the block device in question. Caller can supply
* room for storing the error offset in case of a flush error, if they
- * wish to. Caller must run wait_for_completion() on its own.
+ * wish to.
*/
int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
{
--
1.6.0.4.8.g36f27.dirty

2009-01-13 14:30:21

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] block: Fix documentation for blkdev_issue_flush()

On Tue, Jan 13 2009, Theodore Ts'o wrote:
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> block/blk-barrier.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index 8eba4e4..f7dae57 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -302,7 +302,7 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
> * Description:
> * Issue a flush for the block device in question. Caller can supply
> * room for storing the error offset in case of a flush error, if they
> - * wish to. Caller must run wait_for_completion() on its own.
> + * wish to.
> */
> int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
> {

Thanks Ted, applied.

--
Jens Axboe

2009-01-13 14:30:45

by Jan Kara

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> Adding a barrier shouldn't be that hard; just a matter adding a call
> to blkdev_issue_flush() to ext2_sync_file() before it returns.
Yes. Something like the patch below?

But it's not the whole story. Strictly speaking we should also call
blkdev_issue_flush() whenever we write things because of O_SYNC or
O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
patch I've sent a while before).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

---
>From 0f5a1865a9f1538886699c5d13d3527343c88c11 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Tue, 13 Jan 2009 15:09:18 +0100
Subject: [PATCH] ext2: Add blk_issue_flush() to syncing paths

To be really safe that the data hit the platter, we should also flush drive's
writeback caches on fsync and for O_SYNC files or O_DIRSYNC inodes.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/dir.c | 5 ++++-
fs/ext2/fsync.c | 7 +++++--
fs/ext2/inode.c | 7 +++++++
fs/ext2/xattr.c | 6 +++++-
4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7fba549..9ab9347 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -25,6 +25,7 @@
#include <linux/buffer_head.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
+#include <linux/blkdev.h> /* for blkdev_issue_flush() */

typedef struct ext2_dir_entry_2 ext2_dirent;

@@ -96,8 +97,10 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
}
unlock_page(page);

- if (IS_DIRSYNC(dir))
+ if (IS_DIRSYNC(dir)) {
err = ext2_sync_inode(dir);
+ blkdev_issue_flush(dir->i_sb->s_bdev, NULL);
+ }

return err;
}
diff --git a/fs/ext2/fsync.c b/fs/ext2/fsync.c
index fc66c93..9cd1838 100644
--- a/fs/ext2/fsync.c
+++ b/fs/ext2/fsync.c
@@ -24,6 +24,7 @@

#include "ext2.h"
#include <linux/buffer_head.h> /* for sync_mapping_buffers() */
+#include <linux/blkdev.h> /* for blkdev_issue_flush() */


/*
@@ -39,12 +40,14 @@ int ext2_sync_file(struct file *file, struct dentry *dentry, int datasync)

ret = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
- return ret;
+ goto out;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- return ret;
+ goto out;

err = ext2_sync_inode(inode);
if (ret == 0)
ret = err;
+out:
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
return ret;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 23fff2f..49b479e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -33,6 +33,7 @@
#include <linux/mpage.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/blkdev.h> /* for blkdev_issue_flush() */
#include "ext2.h"
#include "acl.h"
#include "xip.h"
@@ -68,9 +69,14 @@ void ext2_delete_inode (struct inode * inode)
mark_inode_dirty(inode);
ext2_update_inode(inode, inode_needs_sync(inode));

+ /* Make sure inode deletion really gets to disk. Disk write caches
+ * are flushed either in ext2_truncate() or we do it explicitly */
inode->i_size = 0;
if (inode->i_blocks)
ext2_truncate (inode);
+ else if (inode_needs_sync(inode))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+
ext2_free_inode (inode);

return;
@@ -1104,6 +1110,7 @@ do_indirects:
if (inode_needs_sync(inode)) {
sync_mapping_buffers(inode->i_mapping);
ext2_sync_inode (inode);
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
} else {
mark_inode_dirty(inode);
}
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 987a526..d480216 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -60,6 +60,7 @@
#include <linux/mbcache.h>
#include <linux/quotaops.h>
#include <linux/rwsem.h>
+#include <linux/blkdev.h> /* for blkdev_issue_flush() */
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -702,6 +703,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
DQUOT_FREE_BLOCK(inode, 1);
goto cleanup;
}
+ blkdev_issue_flush(sb->s_bdev, NULL);
} else
mark_inode_dirty(inode);

@@ -792,8 +794,10 @@ ext2_xattr_delete_inode(struct inode *inode)
le32_to_cpu(HDR(bh)->h_refcount));
unlock_buffer(bh);
mark_buffer_dirty(bh);
- if (IS_SYNC(inode))
+ if (IS_SYNC(inode)) {
sync_dirty_buffer(bh);
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ }
DQUOT_FREE_BLOCK(inode, 1);
}
EXT2_I(inode)->i_file_acl = 0;
--
1.6.0.2

2009-01-13 14:40:40

by Pavel Machek

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Tue 2009-01-13 13:45:03, Alan Cox wrote:
> > Plus it has a nasty behaviour where it reverts to -W1 if disk
> > connection is momentarily lost. (If you unplug/replug SATA disk, linux
> > will happily rediscover and use it, but -W0 was already forgotten at
> > that point, right?)
>
> If you momentarily lose power your disk state isn't defined anyway. If
> you do that with the SATA code it should treat it the same as a scsi disk
> swap so it'll get a new device and the old fs will go down.

No, that is not what happened :-(. If I replug disk within 10 seconds,
it just behaves as if nothing happened, and continues operating on the
disk. AMD machine... but I don't think that matters.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-01-13 14:46:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Tue, Jan 13, 2009 at 03:30:12PM +0100, Jan Kara wrote:
>
> But it's not the whole story. Strictly speaking we should also call
> blkdev_issue_flush() whenever we write things because of O_SYNC or
> O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> patch I've sent a while before).

Looks good to me.

Acked-by: "Theodore Ts'o" <[email protected]>

- Ted

Subject: Re: ext2 + -osync: not as easy as it seems

(CCing Eric Sandeen)

On Tue, 2009-01-13 at 15:30 +0100, Jan Kara wrote:
> On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> > Adding a barrier shouldn't be that hard; just a matter adding a call
> > to blkdev_issue_flush() to ext2_sync_file() before it returns.
> Yes. Something like the patch below?
>
> But it's not the whole story. Strictly speaking we should also call
> blkdev_issue_flush() whenever we write things because of O_SYNC or
> O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> patch I've sent a while before).

Commit d755fb384250d6bd7fd18a0930e71965acc8e72e added a call to
blkdev_issue_flush to ext4_sync_file, and looking at its ext3
counterpart it seems it might be needed there too.

I may be missing something, but is it possible to ensure the inode hits
the platter without the patch below?

--

From: Fernando Luis Vazquez Cao <[email protected]>
Subject: ext3: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync, we should call
blkdev_issue_flush if barriers are supported.

This is a straight port of a similar patch written by Eric Sandeen for
ext4 (commit d755fb384250d6bd7fd18a0930e71965acc8e72e).

Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
---

diff -urNp linux-2.6.29-rc1-orig/fs/ext3/fsync.c linux-2.6.29-rc1/fs/ext3/fsync.c
--- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-14 11:45:47.000000000 +0900
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd.h>
+#include <linux/blkdev.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>

@@ -45,6 +46,7 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
int ret = 0;

J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, s
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+
+ if (journal && (journal->j_flags & JFS_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
out:
return ret;

2009-01-14 10:35:46

by Jan Kara

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed 14-01-09 12:37:19, Fernando Luis V?zquez Cao wrote:
> (CCing Eric Sandeen)
>
> On Tue, 2009-01-13 at 15:30 +0100, Jan Kara wrote:
> > On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> > > Adding a barrier shouldn't be that hard; just a matter adding a call
> > > to blkdev_issue_flush() to ext2_sync_file() before it returns.
> > Yes. Something like the patch below?
> >
> > But it's not the whole story. Strictly speaking we should also call
> > blkdev_issue_flush() whenever we write things because of O_SYNC or
> > O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> > patch I've sent a while before).
>
> Commit d755fb384250d6bd7fd18a0930e71965acc8e72e added a call to
> blkdev_issue_flush to ext4_sync_file, and looking at its ext3
> counterpart it seems it might be needed there too.
>
> I may be missing something, but is it possible to ensure the inode hits
> the platter without the patch below?
Yes, I noticed that yesterday as well. But then I was puzzled why ext4
would need the flush where it has it... sync_inode() has started and
committed a transaction which issued a barrier when the commit was done.
The only reason I could imagine is that barrier (although it is usually
translated to flushing writeback caches) actually means just an ordering
requirement and hence does not necessarily mean that the caches are
properly flushed. Is that so Eric?
BTW: We should also issue the flush in the fdatasync() case, shouldn't
we?

Honza

> From: Fernando Luis Vazquez Cao <[email protected]>
> Subject: ext3: call blkdev_issue_flush on fsync
>
> To ensure that bits are truly on-disk after an fsync, we should call
> blkdev_issue_flush if barriers are supported.
>
> This is a straight port of a similar patch written by Eric Sandeen for
> ext4 (commit d755fb384250d6bd7fd18a0930e71965acc8e72e).
>
> Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> ---
>
> diff -urNp linux-2.6.29-rc1-orig/fs/ext3/fsync.c linux-2.6.29-rc1/fs/ext3/fsync.c
> --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-14 11:45:47.000000000 +0900
> @@ -27,6 +27,7 @@
> #include <linux/sched.h>
> #include <linux/writeback.h>
> #include <linux/jbd.h>
> +#include <linux/blkdev.h>
> #include <linux/ext3_fs.h>
> #include <linux/ext3_jbd.h>
>
> @@ -45,6 +46,7 @@
> int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> {
> struct inode *inode = dentry->d_inode;
> + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> int ret = 0;
>
> J_ASSERT(ext3_journal_current_handle() == NULL);
> @@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, s
> .nr_to_write = 0, /* sys_fsync did this */
> };
> ret = sync_inode(inode, &wbc);
> +
> + if (journal && (journal->j_flags & JFS_BARRIER))
> + blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> }
> out:
> return ret;
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-01-14 13:24:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14, 2009 at 11:35:33AM +0100, Jan Kara wrote:
> Yes, I noticed that yesterday as well. But then I was puzzled why ext4
> would need the flush where it has it... sync_inode() has started and
> committed a transaction which issued a barrier when the commit was done.

You're right; I'm not convinced we need the flush in ext4 (or ext3) at
all. We write the data blocks, *then* we call ext4_write_inode(),
which will force a commit. Now, if we apply that patch which
optimizes out commits if there are no dirty blocks, then we'll be
trouble, because we won't know for sure whether or not
ext4_write_inode() will have forced a journal commit.

If we optimize out the journal commit when there are no blocks
attached to the transaction, we could change the patch to only force a
flush if inode->i_state did not have I_DIRTY before the call to
sync_inode(). Does that sound sane?

> The only reason I could imagine is that barrier (although it is usually
> translated to flushing writeback caches) actually means just an ordering
> requirement and hence does not necessarily mean that the caches are
> properly flushed. Is that so Eric?

I'm not sure what you mean; if the barrier operation isn't flushing
all of the caches all the way out to the iron oxide, it's not going to
be working properly no matter where it is being called, whether it's
in ext4_sync_file() or in jbd2's journal_submit_commit_record().

- Ted

2009-01-14 14:05:43

by Jan Kara

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 11:35:33AM +0100, Jan Kara wrote:
> > Yes, I noticed that yesterday as well. But then I was puzzled why ext4
> > would need the flush where it has it... sync_inode() has started and
> > committed a transaction which issued a barrier when the commit was done.
>
> You're right; I'm not convinced we need the flush in ext4 (or ext3) at
> all. We write the data blocks, *then* we call ext4_write_inode(),
> which will force a commit. Now, if we apply that patch which
> optimizes out commits if there are no dirty blocks, then we'll be
> trouble, because we won't know for sure whether or not
> ext4_write_inode() will have forced a journal commit.
>
> If we optimize out the journal commit when there are no blocks
> attached to the transaction, we could change the patch to only force a
> flush if inode->i_state did not have I_DIRTY before the call to
> sync_inode(). Does that sound sane?
Yes. And also add a flush in case of fdatasync().

> > The only reason I could imagine is that barrier (although it is usually
> > translated to flushing writeback caches) actually means just an ordering
> > requirement and hence does not necessarily mean that the caches are
> > properly flushed. Is that so Eric?
>
> I'm not sure what you mean; if the barrier operation isn't flushing
> all of the caches all the way out to the iron oxide, it's not going to
> be working properly no matter where it is being called, whether it's
> in ext4_sync_file() or in jbd2's journal_submit_commit_record().
Well, I thought that a barrier, as an abstraction, only guarantees that
any IO which happened before the barrier hits the iron before any IO which
has been submitted after a barrier. This is actually enough for a
journalling to work correctly but it's not enough for fsync() guarantees.
But I might be wrong...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-01-14 14:09:33

by Jens Axboe

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14 2009, Jan Kara wrote:
> > I'm not sure what you mean; if the barrier operation isn't flushing
> > all of the caches all the way out to the iron oxide, it's not going to
> > be working properly no matter where it is being called, whether it's
> > in ext4_sync_file() or in jbd2's journal_submit_commit_record().
> Well, I thought that a barrier, as an abstraction, only guarantees that
> any IO which happened before the barrier hits the iron before any IO which
> has been submitted after a barrier. This is actually enough for a
> journalling to work correctly but it's not enough for fsync() guarantees.
> But I might be wrong...

It also guarentees that when you get a completion for that barrier
write, it's on safe storage. Think of it as a flush-write-flush
operation, in the presence of write back caching.

--
Jens Axboe

2009-01-14 14:13:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14, 2009 at 03:05:32PM +0100, Jan Kara wrote:
> On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> >
> > If we optimize out the journal commit when there are no blocks
> > attached to the transaction, we could change the patch to only force a
> > flush if inode->i_state did not have I_DIRTY before the call to
> > sync_inode(). Does that sound sane?
> Yes. And also add a flush in case of fdatasync().

Um, we have that already; the sync_inode() followed by
blkdev_issue_flush() is the path taken by fdatasync(), I do believe.

> Well, I thought that a barrier, as an abstraction, only guarantees that
> any IO which happened before the barrier hits the iron before any IO which
> has been submitted after a barrier. This is actually enough for a
> journalling to work correctly but it's not enough for fsync() guarantees.
> But I might be wrong...

Ah, yes, that's what you're getting at. True, but for better or for
worse, we have no other interface other than blkdev_issue_flush().
This will guarantee that the data has made it to the disk controller,
but it won't necessarily guarantee that it will have made it onto the
disk platter, as I understand things; but I don't think we have any
other interfaces available to us at this point.

- Ted

2009-01-14 14:36:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14, 2009 at 03:08:04PM +0100, Jens Axboe wrote:
>
> It also guarentees that when you get a completion for that barrier
> write, it's on safe storage. Think of it as a flush-write-flush
> operation, in the presence of write back caching.
>

Is that true even if the barrier isn't attached to a write operation,
i.e., when using

blkdev_issue_flush(sb, NULL);

?

- Ted

2009-01-14 14:38:15

by Jan Kara

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed 14-01-09 09:12:04, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 03:05:32PM +0100, Jan Kara wrote:
> > On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> > >
> > > If we optimize out the journal commit when there are no blocks
> > > attached to the transaction, we could change the patch to only force a
> > > flush if inode->i_state did not have I_DIRTY before the call to
> > > sync_inode(). Does that sound sane?
> > Yes. And also add a flush in case of fdatasync().
>
> Um, we have that already; the sync_inode() followed by
> blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
Maybe ext4-patch-queue changes that area but in Linus's tree I see:

if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
goto out;

So if we just overwrite some data, we send them to disk via fdatawrite()
and then we quickly bail out from ext4_sync_file() without doing
blkdev_issue_flush().

> > Well, I thought that a barrier, as an abstraction, only guarantees that
> > any IO which happened before the barrier hits the iron before any IO which
> > has been submitted after a barrier. This is actually enough for a
> > journalling to work correctly but it's not enough for fsync() guarantees.
> > But I might be wrong...
>
> Ah, yes, that's what you're getting at. True, but for better or for
> worse, we have no other interface other than blkdev_issue_flush().
> This will guarantee that the data has made it to the disk controller,
> but it won't necessarily guarantee that it will have made it onto the
> disk platter, as I understand things; but I don't think we have any
> other interfaces available to us at this point.
As Jens wrote, it seems barrier guarantees more than I thought so we are
correct.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-01-14 14:45:25

by Jens Axboe

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14 2009, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 03:08:04PM +0100, Jens Axboe wrote:
> >
> > It also guarentees that when you get a completion for that barrier
> > write, it's on safe storage. Think of it as a flush-write-flush
> > operation, in the presence of write back caching.
> >
>
> Is that true even if the barrier isn't attached to a write operation,
> i.e., when using
>
> blkdev_issue_flush(sb, NULL);
>
> ?

No, since there's no specific write to protect in that case. The above
will flush existing writes to platter, but has no bearing on writes
submitted later (other than the obvious effect that they will be on disk
later then previously submitted writes). So for blkdev_issue_flush(),
when that returns with good status, then the writes are safe. In the
absence of the barrier flag on a write, completion of that write may
just mean that the disk drive is aware of it, not that it is in any way
safe.

The hardware may even treat the flush as a noop and do absolutely
nothing, which is perfectly valid if it guarantees us that it will never
lose a submitted write.

--
Jens Axboe

2009-01-14 17:02:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > Um, we have that already; the sync_inode() followed by
> > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
>
> Maybe ext4-patch-queue changes that area but in Linus's tree I see:
>
> if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> goto out;
>
> So if we just overwrite some data, we send them to disk via fdatawrite()
> and then we quickly bail out from ext4_sync_file() without doing
> blkdev_issue_flush().

So you're thinking about fdatawrite() being called by some code path
other than ext4_sync_file() before we call fsync()? Yeah, that could
happen.... I think that will only happen if the file is opened
O_SYNC, but that raises another issue, which is that we're not forcing
a flush for writes when the file is opened O_SYNC.

- Ted

Subject: Re: ext2 + -osync: not as easy as it seems

On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > Um, we have that already; the sync_inode() followed by
> > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> >
> > Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> >
> > if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > goto out;
> >
> > So if we just overwrite some data, we send them to disk via fdatawrite()
> > and then we quickly bail out from ext4_sync_file() without doing
> > blkdev_issue_flush().
>
> So you're thinking about fdatawrite() being called by some code path
> other than ext4_sync_file() before we call fsync()? Yeah, that could
> happen.... I think that will only happen if the file is opened
> O_SYNC, but that raises another issue, which is that we're not forcing
> a flush for writes when the file is opened O_SYNC.

Hi Jan, Ted

Is something like the patch below what you had in mind?

--

From: Fernando Luis Vazquez Cao <[email protected]>
Subject: ext3: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should call blkdev_issue_flush if barriers are supported.

Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
---

--- linux-2.6.29-rc1-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext4/fsync.c 2009-01-15 21:03:19.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;

J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
goto out;
}

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC))
+ goto flush_blkdev;

/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ /*
+ * When there are no blocks attached to the journal transaction
+ * some optimizations are possible, but if there were dirty
+ * pages sync_inode() should have ensured that all data gets
+ * actually written to disk. Thus, we can skip
+ * blkdev_issue_flush() below.
+ */
+ if (!(i_state & I_DIRTY_PAGES))
+ goto flush_blkdev;
}
+
+ goto out;
+
+flush_blkdev:
+ if (journal && (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
out:
return ret;
}

2009-01-15 23:46:34

by Jan Kara

[permalink] [raw]
Subject: Re: ext2 + -osync: not as easy as it seems

On Thu 15-01-09 21:06:51, Fernando Luis V?zquez Cao wrote:
> On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote:
> > On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > > Um, we have that already; the sync_inode() followed by
> > > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> > >
> > > Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> > >
> > > if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > goto out;
> > >
> > > So if we just overwrite some data, we send them to disk via fdatawrite()
> > > and then we quickly bail out from ext4_sync_file() without doing
> > > blkdev_issue_flush().
> >
> > So you're thinking about fdatawrite() being called by some code path
> > other than ext4_sync_file() before we call fsync()? Yeah, that could
> > happen.... I think that will only happen if the file is opened
> > O_SYNC, but that raises another issue, which is that we're not forcing
> > a flush for writes when the file is opened O_SYNC.
>
> Hi Jan, Ted
>
> Is something like the patch below what you had in mind?
>
> --
>
> From: Fernando Luis Vazquez Cao <[email protected]>
> Subject: ext3: call blkdev_issue_flush on fsync
>
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should call blkdev_issue_flush if barriers are supported.
>
> Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> ---
>
> --- linux-2.6.29-rc1-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext4/fsync.c 2009-01-15 21:03:19.000000000 +0900
> @@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
> {
> struct inode *inode = dentry->d_inode;
> journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + unsigned long i_state = inode->i_state;
> int ret = 0;
>
> J_ASSERT(ext4_journal_current_handle() == NULL);
> @@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
> goto out;
> }
>
> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - goto out;
> + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> + goto flush_blkdev;
>
> /*
> * The VFS has written the file data. If the inode is unaltered
> * then we need not start a commit.
> */
> - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_ALL,
> .nr_to_write = 0, /* sys_fsync did this */
> };
> ret = sync_inode(inode, &wbc);
> - if (journal && (journal->j_flags & JBD2_BARRIER))
> - blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> + /*
> + * When there are no blocks attached to the journal transaction
> + * some optimizations are possible, but if there were dirty
> + * pages sync_inode() should have ensured that all data gets
> + * actually written to disk. Thus, we can skip
> + * blkdev_issue_flush() below.
> + */
> + if (!(i_state & I_DIRTY_PAGES))
> + goto flush_blkdev;
Uh. Here I don't get it. When we did sync_inode(), blkdev_issue_flush()
is needed only if the journal does not do barriers. So I'd expect here:
if (!(journal->j_flags & JBD2_BARRIER))
goto flush_blkdev;
goto out;

> }
> +
> + goto out;
But here we might need to issue a flush if there are some data written. So
I'd have here:
if (!(i_state & I_DIRTY_PAGES))
goto out;

> +
> +flush_blkdev:
> + if (journal && (journal->j_flags & JBD2_BARRIER))
> + blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> out:
> return ret;
> }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: ext2 + -osync: not as easy as it seems

On Fri, 2009-01-16 at 00:45 +0100, Jan Kara wrote:
> On Thu 15-01-09 21:06:51, Fernando Luis Vázquez Cao wrote:
> > On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote:
> > > On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > > > Um, we have that already; the sync_inode() followed by
> > > > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> > > >
> > > > Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> > > >
> > > > if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > goto out;
> > > >
> > > > So if we just overwrite some data, we send them to disk via fdatawrite()
> > > > and then we quickly bail out from ext4_sync_file() without doing
> > > > blkdev_issue_flush().
> > >
> > > So you're thinking about fdatawrite() being called by some code path
> > > other than ext4_sync_file() before we call fsync()? Yeah, that could
> > > happen.... I think that will only happen if the file is opened
> > > O_SYNC, but that raises another issue, which is that we're not forcing
> > > a flush for writes when the file is opened O_SYNC.
> >
> > Hi Jan, Ted
> >
> > Is something like the patch below what you had in mind?
> >
> > --
> >
> > From: Fernando Luis Vazquez Cao <[email protected]>
> > Subject: ext3: call blkdev_issue_flush on fsync
> >
> > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > should call blkdev_issue_flush if barriers are supported.
> >
> > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > ---
> >
> > --- linux-2.6.29-rc1-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > +++ linux-2.6.29-rc1/fs/ext4/fsync.c 2009-01-15 21:03:19.000000000 +0900
> > @@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
> > {
> > struct inode *inode = dentry->d_inode;
> > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > + unsigned long i_state = inode->i_state;
> > int ret = 0;
> >
> > J_ASSERT(ext4_journal_current_handle() == NULL);
> > @@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
> > goto out;
> > }
> >
> > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > - goto out;
> > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > + goto flush_blkdev;
> >
> > /*
> > * The VFS has written the file data. If the inode is unaltered
> > * then we need not start a commit.
> > */
> > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > struct writeback_control wbc = {
> > .sync_mode = WB_SYNC_ALL,
> > .nr_to_write = 0, /* sys_fsync did this */
> > };
> > ret = sync_inode(inode, &wbc);
> > - if (journal && (journal->j_flags & JBD2_BARRIER))
> > - blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> > + /*
> > + * When there are no blocks attached to the journal transaction
> > + * some optimizations are possible, but if there were dirty
> > + * pages sync_inode() should have ensured that all data gets
> > + * actually written to disk. Thus, we can skip
> > + * blkdev_issue_flush() below.
> > + */
> > + if (!(i_state & I_DIRTY_PAGES))
> > + goto flush_blkdev;
> Uh. Here I don't get it. When we did sync_inode(), blkdev_issue_flush()
> is needed only if the journal does not do barriers. So I'd expect here:
> if (!(journal->j_flags & JBD2_BARRIER))
> goto flush_blkdev;
> goto out;

Ups, you are right. I somehow managed to mangle the logic that I
intended to put here and under flush_blkdev. By the way, I think that
the same check may be needed for the data==journal case too.

Thank you for the feedback, Jan.

I'll be replying to this email with new patches for ext2/ext3.

- Fernando

Subject: ext3: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
---

--- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd.h>
+#include <linux/blkdev.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>

@@ -45,6 +46,8 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;

J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
*/
if (ext3_should_journal_data(inode)) {
ret = ext3_force_commit(inode->i_sb);
+ if (!(journal->j_flags & JFS_BARRIER))
+ goto no_journal_barrier;
goto out;
}

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC))
+ goto flush_blkdev;

/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && !(journal->j_flags & JFS_BARRIER))
+ goto no_journal_barrier;
}
+
+flush_blkdev:
+ if (!(i_state & I_DIRTY_PAGES))
+ goto out;
+no_journal_barrier:
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
out:
return ret;
}

Subject: ext4: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
---

--- linux-2.6.29-rc1-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext4/fsync.c 2009-01-16 22:18:51.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;

J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,33 @@ int ext4_sync_file(struct file *file, st
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
+ if (!(journal->j_flags & JBD2_BARRIER))
+ goto no_journal_barrier;
goto out;
}

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC))
+ goto flush_blkdev;

/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ if (journal && !(journal->j_flags & JBD2_BARRIER))
+ goto no_journal_barrier;
}
+
+flush_blkdev:
+ if (!(i_state & I_DIRTY_PAGES))
+ goto out;
+no_journal_barrier:
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
out:
return ret;
}

2009-01-16 16:31:20

by Jan Kara

[permalink] [raw]
Subject: Re: ext3: call blkdev_issue_flush on fsync

On Fri 16-01-09 22:55:01, Fernando Luis V?zquez Cao wrote:
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should force a disk flush explicitly when there is dirty data/metadata
> and the journal didn't emit a write barrier (either because metadata is
> not being synched or barriers are disabled).
>
> Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> ---
Only two minor nits:

> --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> @@ -27,6 +27,7 @@
> #include <linux/sched.h>
> #include <linux/writeback.h>
> #include <linux/jbd.h>
> +#include <linux/blkdev.h>
> #include <linux/ext3_fs.h>
> #include <linux/ext3_jbd.h>
>
> @@ -45,6 +46,8 @@
> int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> {
> struct inode *inode = dentry->d_inode;
> + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> + unsigned long i_state = inode->i_state;
> int ret = 0;
>
> J_ASSERT(ext3_journal_current_handle() == NULL);
> @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> */
> if (ext3_should_journal_data(inode)) {
> ret = ext3_force_commit(inode->i_sb);
> + if (!(journal->j_flags & JFS_BARRIER))
> + goto no_journal_barrier;
> goto out;
> }
>
> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - goto out;
> + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> + goto flush_blkdev;
>
> /*
> * The VFS has written the file data. If the inode is unaltered
> * then we need not start a commit.
> */
> - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_ALL,
> .nr_to_write = 0, /* sys_fsync did this */
> };
> ret = sync_inode(inode, &wbc);
> + if (journal && !(journal->j_flags & JFS_BARRIER))
> + goto no_journal_barrier;
I cannot imagine "journal" will be NULL here.
And we can also optimize here a bit and do "goto out" because here
we know the barrier has been issued.

> }
> +
> +flush_blkdev:
> + if (!(i_state & I_DIRTY_PAGES))
> + goto out;
> +no_journal_barrier:
> + blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> out:
> return ret;
> }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: ext3: call blkdev_issue_flush on fsync

On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > should force a disk flush explicitly when there is dirty data/metadata
> > and the journal didn't emit a write barrier (either because metadata is
> > not being synched or barriers are disabled).
> >
> > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > ---
> Only two minor nits:
>
> > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > @@ -27,6 +27,7 @@
> > #include <linux/sched.h>
> > #include <linux/writeback.h>
> > #include <linux/jbd.h>
> > +#include <linux/blkdev.h>
> > #include <linux/ext3_fs.h>
> > #include <linux/ext3_jbd.h>
> >
> > @@ -45,6 +46,8 @@
> > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > {
> > struct inode *inode = dentry->d_inode;
> > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > + unsigned long i_state = inode->i_state;
> > int ret = 0;
> >
> > J_ASSERT(ext3_journal_current_handle() == NULL);
> > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > */
> > if (ext3_should_journal_data(inode)) {
> > ret = ext3_force_commit(inode->i_sb);
> > + if (!(journal->j_flags & JFS_BARRIER))
> > + goto no_journal_barrier;
> > goto out;
> > }
> >
> > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > - goto out;
> > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > + goto flush_blkdev;
> >
> > /*
> > * The VFS has written the file data. If the inode is unaltered
> > * then we need not start a commit.
> > */
> > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > struct writeback_control wbc = {
> > .sync_mode = WB_SYNC_ALL,
> > .nr_to_write = 0, /* sys_fsync did this */
> > };
> > ret = sync_inode(inode, &wbc);
> > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > + goto no_journal_barrier;
> I cannot imagine "journal" will be NULL here.

I'll try to check whether that is always so just in case.

> And we can also optimize here a bit and do "goto out" because here
> we know the barrier has been issued.

Yep, I was considering the same optimization. By the way, I was
wondering if we should honor ext3 and ext4's "barrier" mount option for
sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
What are your thoughts on this?

Subject: Re: ext3: call blkdev_issue_flush on fsync

On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > should force a disk flush explicitly when there is dirty data/metadata
> > > and the journal didn't emit a write barrier (either because metadata is
> > > not being synched or barriers are disabled).
> > >
> > > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > > ---
> > Only two minor nits:
> >
> > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > @@ -27,6 +27,7 @@
> > > #include <linux/sched.h>
> > > #include <linux/writeback.h>
> > > #include <linux/jbd.h>
> > > +#include <linux/blkdev.h>
> > > #include <linux/ext3_fs.h>
> > > #include <linux/ext3_jbd.h>
> > >
> > > @@ -45,6 +46,8 @@
> > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > {
> > > struct inode *inode = dentry->d_inode;
> > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > + unsigned long i_state = inode->i_state;
> > > int ret = 0;
> > >
> > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > */
> > > if (ext3_should_journal_data(inode)) {
> > > ret = ext3_force_commit(inode->i_sb);
> > > + if (!(journal->j_flags & JFS_BARRIER))
> > > + goto no_journal_barrier;
> > > goto out;
> > > }
> > >
> > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > - goto out;
> > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > + goto flush_blkdev;
> > >
> > > /*
> > > * The VFS has written the file data. If the inode is unaltered
> > > * then we need not start a commit.
> > > */
> > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > struct writeback_control wbc = {
> > > .sync_mode = WB_SYNC_ALL,
> > > .nr_to_write = 0, /* sys_fsync did this */
> > > };
> > > ret = sync_inode(inode, &wbc);
> > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > + goto no_journal_barrier;
> > I cannot imagine "journal" will be NULL here.
>
> I'll try to check whether that is always so just in case.
>
> > And we can also optimize here a bit and do "goto out" because here
> > we know the barrier has been issued.
>
> Yep, I was considering the same optimization. By the way, I was
> wondering if we should honor ext3 and ext4's "barrier" mount option for
> sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".

The last phrase should read " do not force a flush when "barrier=0" ".

Sorry for the noise.

- Fernando

2009-01-19 12:04:01

by Jan Kara

[permalink] [raw]
Subject: Re: ext3: call blkdev_issue_flush on fsync

On Sat 17-01-09 19:00:49, Fernando Luis V?zquez Cao wrote:
> On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis V?zquez Cao wrote:
> > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > On Fri 16-01-09 22:55:01, Fernando Luis V?zquez Cao wrote:
> > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > and the journal didn't emit a write barrier (either because metadata is
> > > > not being synched or barriers are disabled).
> > > >
> > > > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > > > ---
> > > Only two minor nits:
> > >
> > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > > @@ -27,6 +27,7 @@
> > > > #include <linux/sched.h>
> > > > #include <linux/writeback.h>
> > > > #include <linux/jbd.h>
> > > > +#include <linux/blkdev.h>
> > > > #include <linux/ext3_fs.h>
> > > > #include <linux/ext3_jbd.h>
> > > >
> > > > @@ -45,6 +46,8 @@
> > > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > {
> > > > struct inode *inode = dentry->d_inode;
> > > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > + unsigned long i_state = inode->i_state;
> > > > int ret = 0;
> > > >
> > > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > */
> > > > if (ext3_should_journal_data(inode)) {
> > > > ret = ext3_force_commit(inode->i_sb);
> > > > + if (!(journal->j_flags & JFS_BARRIER))
> > > > + goto no_journal_barrier;
> > > > goto out;
> > > > }
> > > >
> > > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > - goto out;
> > > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > + goto flush_blkdev;
> > > >
> > > > /*
> > > > * The VFS has written the file data. If the inode is unaltered
> > > > * then we need not start a commit.
> > > > */
> > > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > struct writeback_control wbc = {
> > > > .sync_mode = WB_SYNC_ALL,
> > > > .nr_to_write = 0, /* sys_fsync did this */
> > > > };
> > > > ret = sync_inode(inode, &wbc);
> > > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > + goto no_journal_barrier;
> > > I cannot imagine "journal" will be NULL here.
> >
> > I'll try to check whether that is always so just in case.
> >
> > > And we can also optimize here a bit and do "goto out" because here
> > > we know the barrier has been issued.
> >
> > Yep, I was considering the same optimization. By the way, I was
> > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
>
> The last phrase should read " do not force a flush when "barrier=0" ".
I was hesitating about this a bit. But I don't think so. The reason is
that POSIX (or any other reasonable specification) mandates that fsync()
should return only after the data is safely on storage. So if we don't
flush blockdevice caches, we effectively violate POSIX and we should never
do that. With barriers the matter is a bit different - that is just a
filesystem specific thing, no standard guarantees anything.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: ext3: call blkdev_issue_flush on fsync

On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > not being synched or barriers are disabled).
> > > > >
> > > > > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > > > > ---
> > > > Only two minor nits:
> > > >
> > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > > > @@ -27,6 +27,7 @@
> > > > > #include <linux/sched.h>
> > > > > #include <linux/writeback.h>
> > > > > #include <linux/jbd.h>
> > > > > +#include <linux/blkdev.h>
> > > > > #include <linux/ext3_fs.h>
> > > > > #include <linux/ext3_jbd.h>
> > > > >
> > > > > @@ -45,6 +46,8 @@
> > > > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > > {
> > > > > struct inode *inode = dentry->d_inode;
> > > > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > + unsigned long i_state = inode->i_state;
> > > > > int ret = 0;
> > > > >
> > > > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > > */
> > > > > if (ext3_should_journal_data(inode)) {
> > > > > ret = ext3_force_commit(inode->i_sb);
> > > > > + if (!(journal->j_flags & JFS_BARRIER))
> > > > > + goto no_journal_barrier;
> > > > > goto out;
> > > > > }
> > > > >
> > > > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > - goto out;
> > > > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > + goto flush_blkdev;
> > > > >
> > > > > /*
> > > > > * The VFS has written the file data. If the inode is unaltered
> > > > > * then we need not start a commit.
> > > > > */
> > > > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > struct writeback_control wbc = {
> > > > > .sync_mode = WB_SYNC_ALL,
> > > > > .nr_to_write = 0, /* sys_fsync did this */
> > > > > };
> > > > > ret = sync_inode(inode, &wbc);
> > > > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > + goto no_journal_barrier;
> > > > I cannot imagine "journal" will be NULL here.
> > >
> > > I'll try to check whether that is always so just in case.
> > >
> > > > And we can also optimize here a bit and do "goto out" because here
> > > > we know the barrier has been issued.
> > >
> > > Yep, I was considering the same optimization. By the way, I was
> > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> >
> > The last phrase should read " do not force a flush when "barrier=0" ".
> I was hesitating about this a bit. But I don't think so. The reason is
> that POSIX (or any other reasonable specification) mandates that fsync()
> should return only after the data is safely on storage. So if we don't
> flush blockdevice caches, we effectively violate POSIX and we should never
> do that. With barriers the matter is a bit different - that is just a
> filesystem specific thing, no standard guarantees anything.

Hi Jan,

Sorry it's taken me so long to get back to you.

Thinking a lit bit more about this issue, it occurred to me that adding
a new mount option à la existing "barrier" is likely to be preferable.
As an example where such an option could make sense, let's consider a
system with battery-backup cache devices. Since the battery-backup
guarantees the data still not committed to the platter will not vanish
in the event of a power down, it should be possible to obtain a
performance gain by optimizing out the device flush on every
fsync()/fdatasync() call.

If there is consensus on the propriety of this approach, I will send
updated patches.

Regards,

Fernando

2009-01-28 09:55:34

by Jan Kara

[permalink] [raw]
Subject: Re: ext3: call blkdev_issue_flush on fsync

On Wed 28-01-09 18:45:13, Fernando Luis V?zquez Cao wrote:
> On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> > On Sat 17-01-09 19:00:49, Fernando Luis V?zquez Cao wrote:
> > > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis V?zquez Cao wrote:
> > > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > > On Fri 16-01-09 22:55:01, Fernando Luis V?zquez Cao wrote:
> > > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > > not being synched or barriers are disabled).
> > > > > >
> > > > > > Signed-off-by: Fernando Luis Vazquez Cao <[email protected]>
> > > > > > ---
> > > > > Only two minor nits:
> > > > >
> > > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > > > > @@ -27,6 +27,7 @@
> > > > > > #include <linux/sched.h>
> > > > > > #include <linux/writeback.h>
> > > > > > #include <linux/jbd.h>
> > > > > > +#include <linux/blkdev.h>
> > > > > > #include <linux/ext3_fs.h>
> > > > > > #include <linux/ext3_jbd.h>
> > > > > >
> > > > > > @@ -45,6 +46,8 @@
> > > > > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > > > {
> > > > > > struct inode *inode = dentry->d_inode;
> > > > > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > > + unsigned long i_state = inode->i_state;
> > > > > > int ret = 0;
> > > > > >
> > > > > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > > > */
> > > > > > if (ext3_should_journal_data(inode)) {
> > > > > > ret = ext3_force_commit(inode->i_sb);
> > > > > > + if (!(journal->j_flags & JFS_BARRIER))
> > > > > > + goto no_journal_barrier;
> > > > > > goto out;
> > > > > > }
> > > > > >
> > > > > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > > - goto out;
> > > > > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > > + goto flush_blkdev;
> > > > > >
> > > > > > /*
> > > > > > * The VFS has written the file data. If the inode is unaltered
> > > > > > * then we need not start a commit.
> > > > > > */
> > > > > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > struct writeback_control wbc = {
> > > > > > .sync_mode = WB_SYNC_ALL,
> > > > > > .nr_to_write = 0, /* sys_fsync did this */
> > > > > > };
> > > > > > ret = sync_inode(inode, &wbc);
> > > > > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > > + goto no_journal_barrier;
> > > > > I cannot imagine "journal" will be NULL here.
> > > >
> > > > I'll try to check whether that is always so just in case.
> > > >
> > > > > And we can also optimize here a bit and do "goto out" because here
> > > > > we know the barrier has been issued.
> > > >
> > > > Yep, I was considering the same optimization. By the way, I was
> > > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> > >
> > > The last phrase should read " do not force a flush when "barrier=0" ".
> > I was hesitating about this a bit. But I don't think so. The reason is
> > that POSIX (or any other reasonable specification) mandates that fsync()
> > should return only after the data is safely on storage. So if we don't
> > flush blockdevice caches, we effectively violate POSIX and we should never
> > do that. With barriers the matter is a bit different - that is just a
> > filesystem specific thing, no standard guarantees anything.
>
> Hi Jan,
>
> Sorry it's taken me so long to get back to you.
>
> Thinking a lit bit more about this issue, it occurred to me that adding
> a new mount option ? la existing "barrier" is likely to be preferable.
> As an example where such an option could make sense, let's consider a
> system with battery-backup cache devices. Since the battery-backup
> guarantees the data still not committed to the platter will not vanish
> in the event of a power down, it should be possible to obtain a
> performance gain by optimizing out the device flush on every
> fsync()/fdatasync() call.
Yes, I came to this conclusion as well when we were discussing how to fix
other filesystems which fail to issue flush in fsync
(https://kerneltrap.org/mailarchive/linux-ext4/2009/1/22/4788004/thread).
I think it makes sence there's a general mount option recognized at VFS
level which would set superblock flag whether flush on fsync is needed or
not. Then you could use this flag in your patch.
So if you have time, please write a (separate) patch introducing this
generic option. Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR