2010-04-30 17:57:16

by djwong

Subject: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hmm. A while ago I was complaining that an evil program that calls fsync() in
a loop will send a continuous stream of write barriers to the hard disk. Ted
theorized that it might be possible to set a flag in ext4_writepage and clear
it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
set (meaning that nothing has been dirtied since the last fsync()) then we
could skip issuing the barrier.
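
A minimal userspace sketch of the flag idea, for illustration only. It is not the patch itself: `struct fs_state`, `note_writeout()` and `fsync_maybe_flush()` are hypothetical names, the cache flush is merely counted rather than issued, and the sketch assumes C11 atomics, using `atomic_exchange()` to test and clear the flag in a single step.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-filesystem state (not the real ext4_sb_info). */
struct fs_state {
	atomic_int unflushed_writes;	/* 1 if data was written out since the last flush */
	int flushes_issued;		/* counts simulated cache flushes */
};

/* Write-out path: note that the disk cache may now hold new data. */
static void note_writeout(struct fs_state *fs)
{
	atomic_store(&fs->unflushed_writes, 1);
}

/*
 * fsync path.  atomic_exchange() reads and clears the flag in one step, so
 * a write-out that sets the flag right after the exchange is still seen by
 * the next fsync.  Returns true if a flush was issued.
 */
static bool fsync_maybe_flush(struct fs_state *fs)
{
	if (atomic_exchange(&fs->unflushed_writes, 0) == 0)
		return false;		/* nothing written since the last flush */
	fs->flushes_issued++;		/* stand-in for blkdev_issue_flush() */
	return true;
}
```

With this shape, a program calling fsync() in a loop without writing anything pays for at most one flush.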

Here's an experimental patch to do something sort of like that. From a quick
run with blktrace, it seems to skip the redundant barriers and improve the ffsb
mail server scores. However, I haven't done extensive power failure testing to
see how much data it can destroy. For that matter I'm not even 100% sure it
correctly does what it aims to do.

Just throwing this out there, though. Nothing's blown up ... yet. :P
---
Signed-off-by: Darrick J. Wong <[email protected]>
---

 fs/ext4/ext4.h  |    2 ++
 fs/ext4/fsync.c |    7 +++++--
 fs/ext4/inode.c |    5 +++++
 3 files changed, 12 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..3b70195 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1025,6 +1025,8 @@ struct ext4_sb_info {

 	/* workqueue for dio unwritten */
 	struct workqueue_struct *dio_unwritten_wq;
+
+	atomic_t unflushed_writes;
 };

 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..441f872 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -52,7 +52,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	journal_t *journal = sbi->s_journal;
 	int ret;
 	tid_t commit_tid;

@@ -102,7 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		    (journal->j_flags & JBD2_BARRIER))
 			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 		jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
+	} else if (journal->j_flags & JBD2_BARRIER && atomic_read(&sbi->unflushed_writes)) {
+		atomic_set(&sbi->unflushed_writes, 0);
 		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+	}
 	return ret;
 }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5381802..e501abd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2718,6 +2718,7 @@ static int ext4_writepage(struct page *page,
 	unsigned int len;
 	struct buffer_head *page_bufs = NULL;
 	struct inode *inode = page->mapping->host;
+	struct ext4_sb_info *sbi = EXT4_SB(page->mapping->host->i_sb);

 	trace_ext4_writepage(inode, page);
 	size = i_size_read(inode);
@@ -2726,6 +2727,8 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_CACHE_SIZE;

+	atomic_set(&sbi->unflushed_writes, 1);
+
 	if (page_has_buffers(page)) {
 		page_bufs = page_buffers(page);
 		if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
@@ -2872,6 +2875,8 @@ static int ext4_da_writepages(struct address_space *mapping,
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;

+	atomic_set(&sbi->unflushed_writes, 1);
+
 	range_cyclic = wbc->range_cyclic;
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index;


2010-06-30 12:48:42

by Theodore Ts'o

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
> > Checking per inode is actually incorrect - we do not want to short cut
> > the need to flush the target storage device's write cache just because a
> > specific file has no dirty pages. If a power hit occurs, having sent
> > the pages from to the storage device is not sufficient.
>
> As long as we're only using the information for fsync doing it per inode
> is the correct thing. We only want to flush the cache if the inode
> (data or metadata) is dirty in some way. Note that this includes writes
> via O_DIRECT which are quite different to track - I've not found the
> original patch in my mbox so I can't comment if this is done right.

I agree.

I wonder if it's worthwhile to think about a new system call which
allows users to provide an array of fd's which should collectively
be fsync'ed out at the same time. Otherwise, we end up issuing
multiple barrier operations in cases where the application needs to
do:

fsync(control_fd);
fsync(data_fd);

- Ted

2010-06-30 13:21:20

by Ric Wheeler

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 08:48 AM, [email protected] wrote:
> On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
>> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
>>> Checking per inode is actually incorrect - we do not want to short cut
>>> the need to flush the target storage device's write cache just because a
>>> specific file has no dirty pages. If a power hit occurs, having sent
>>> the pages from to the storage device is not sufficient.
>>
>> As long as we're only using the information for fsync doing it per inode
>> is the correct thing. We only want to flush the cache if the inode
>> (data or metadata) is dirty in some way. Note that this includes writes
>> via O_DIRECT which are quite different to track - I've not found the
>> original patch in my mbox so I can't comment if this is done right.
>
> I agree.
>
> I wonder if it's worthwhile to think about a new system call which
> allows users to provide an array of fd's which are collectively should
> be fsync'ed out at the same time. Otherwise, we end up issuing
> multiple barrier operations in cases where the application needs to
> do:
>
> fsync(control_fd);
> fsync(data_fd);
>
> - Ted

The problem with issuing a cache flush only when you have dirty metadata or data
is that the state of the page cache does not have any tie to the state of the
volatile write cache of the target storage device.

We do need to have fsync() issue the cache flush command even when there is no
dirty state for the inode in our local page cache in order to flush data that
was pushed out/cleaned and not followed by a flush.

It would definitely be *very* useful to be able to pass an array of fd's that
all need to be fsync()'ed at the same time....

Ric

2010-06-30 13:44:36

by Theodore Ts'o

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>
> The problem with not issuing a cache flush when you have dirty meta
> data or data is that it does not have any tie to the state of the
> volatile write cache of the target storage device.

We already track whether or not there are any metadata updates
associated with the inode; if there are, we force a journal commit,
and this implies a barrier operation.

The case we're talking about here is one where either (a) there is no
journal, or (b) there have been no metadata updates (I'm simplifying a
little here; in fact we track whether there have been fdatasync()- vs
fsync()- worthy metadata updates), and so there hasn't been a journal
commit to do the cache flush.

In this case, we want to track when an fsync() was last issued,
versus when data blocks for a particular inode were last pushed
out to disk.

To use an example I used as motivation for why we might want an
fsync2(int fd[], int flags[], int num) syscall, consider the situation
of:

fsync(control_fd);
fdatasync(data_fd);

The first fsync() will have executed a cache flush operation. So when
we do the fdatasync() (assuming that no metadata needs to be flushed
out to disk), there is no need for the cache flush operation.

If we had an enhanced fsync command, we would also be able to
eliminate a second journal commit in the case where data_fd also had
some metadata that needed to be flushed out to disk.

> It would definitely be *very* useful to have an array of fd's that
> all need fsync()'ed at home time....

Yes, but it would require applications to change their code.

One thing that I would like about a new fsync2() system call is that,
with a flags field, we could add some new, more expressive flags:

#define FSYNC_DATA      0x0001 /* Only flush metadata if needed to access data */
#define FSYNC_NOWAIT    0x0002 /* Initiate the flush operations but don't wait
                                  for them to complete */
#define FSYNC_NOBARRIER 0x0004 /* FS may skip the barrier if not needed for fs
                                  consistency */

etc.
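
To make the proposed calling convention concrete, here is a small userspace emulation. It is only a sketch: `fsync2_emulated()` is a hypothetical name, fsync2() does not exist as a system call, and only the `FSYNC_DATA` flag is honoured. Because it simply loops over fsync()/fdatasync(), it cannot merge barriers or journal commits the way a kernel implementation could; it only shows the shape of the interface.

```c
#define _POSIX_C_SOURCE 200809L	/* for fdatasync() */
#include <unistd.h>

#define FSYNC_DATA 0x0001	/* value taken from the proposal */

/*
 * Hypothetical emulation of fsync2(): sync each descriptor in turn,
 * using fdatasync() where FSYNC_DATA is set and fsync() otherwise.
 * Returns 0 on success, or -1 (errno set by the failing call) at the
 * first failure.
 */
static int fsync2_emulated(const int fd[], const int flags[], int num)
{
	for (int i = 0; i < num; i++) {
		int ret = (flags[i] & FSYNC_DATA) ? fdatasync(fd[i])
						  : fsync(fd[i]);
		if (ret != 0)
			return -1;
	}
	return 0;
}
```

The control_fd/data_fd case would then be a single call with `fd[] = { control_fd, data_fd }` and `flags[] = { 0, FSYNC_DATA }`.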

- Ted

2010-06-30 13:54:42

by Ric Wheeler

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 09:44 AM, [email protected] wrote:
> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>>
>> The problem with not issuing a cache flush when you have dirty meta
>> data or data is that it does not have any tie to the state of the
>> volatile write cache of the target storage device.
>
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.

I think that the state that we want to track is the last time the write cache on
the target device has been flushed. If the last fsync() did do a full barrier,
that would be equivalent :-)

ric

>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
>
>> It would definitely be *very* useful to have an array of fd's that
>> all need fsync()'ed at home time....
>
> Yes, but it would require applications to change their code.
>
> One thing that I would like about a new fsync2() system call is with a
> flags field, we could add some new, more expressive flags:
>
> #define FSYNC_DATA 0x0001 /* Only flush metadata if needed to access data */
> #define FSYNC_NOWAIT 0x0002 /* Initiate the flush operations but don't wait
> for them to complete */
> #define FSYNC_NOBARRER 0x004 /* FS may skip the barrier if not needed for fs
> consistency */
>
> etc.
>
> - Ted

2010-06-30 19:05:21

by Andreas Dilger

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 2010-06-30, at 07:54, Ric Wheeler wrote:
> On 06/30/2010 09:44 AM, [email protected] wrote:
>> We track whether or not there is any metadata updates associated with
>> the inode already; if it does, we force a journal commit, and this
>> implies a barrier operation.
>>
>> The case we're talking about here is one where either (a) there is no
>> journal, or (b) there have been no metadata updates (I'm simplifying a
>> little here; in fact we track whether there have been fdatasync()- vs
>> fsync()- worthy metadata updates), and so there hasn't been a journal
>> commit to do the cache flush.
>>
>> In this case, we want to track when is the last time an fsync() has
>> been issued, versus when was the last time data blocks for a
>> particular inode have been pushed out to disk.
>
> I think that the state that we want to track is the last time the write cache on the target device has been flushed. If the last fsync() did do a full barrier, that would be equivalent :-)

We had a similar problem in Lustre, where we want to ensure the integrity of some data on disk, but don't want to force an extra journal commit/barrier if there was already one since the time the write was submitted and before we need it to be on disk.

We fixed this in a similar manner but it is optimized somewhat. In your case there is a flag on the inode in question, but you should also register a journal commit callback after the IO has been submitted that clears the flag when the journal commits (which also implies a barrier). This avoids a gratuitous barrier if fsync() is called on this (or any other similarly marked) inode after the journal has already issued the barrier.

The best part is that this gives "POSIXly correct" semantics for applications that are issuing the f{,data}sync() on the modified files, without penalizing them again if the journal happened to do this already in the background in aggregate.
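
The flag-plus-commit-callback scheme can be sketched as a small simulation. This is not Lustre or ext4 code: `struct sim_inode`, `struct sim_journal` and the three functions are hypothetical names, the "callback" is modelled as a list of inodes the commit walks, and barriers are only counted.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical inode: needs_flush is set when its data I/O is submitted. */
struct sim_inode {
	bool needs_flush;
};

/* Hypothetical journal: remembers which inodes to clear at the next commit. */
struct sim_journal {
	struct sim_inode *pending[MAX_PENDING];
	size_t npending;
	int barriers;			/* counts simulated barriers */
};

/* After submitting data I/O: mark the inode and register it for the next commit. */
static void submit_data_io(struct sim_journal *j, struct sim_inode *ino)
{
	ino->needs_flush = true;
	/* If the list is full the flag simply stays set until fsync. */
	if (j->npending < MAX_PENDING)
		j->pending[j->npending++] = ino;
}

/* Journal commit: implies a barrier, so every registered inode is covered. */
static void journal_commit(struct sim_journal *j)
{
	j->barriers++;
	for (size_t i = 0; i < j->npending; i++)
		j->pending[i]->needs_flush = false;
	j->npending = 0;
}

/* fsync: pay for a barrier only if no commit has covered this inode yet. */
static void sim_fsync(struct sim_journal *j, struct sim_inode *ino)
{
	if (!ino->needs_flush)
		return;
	j->barriers++;
	ino->needs_flush = false;
}
```

If two inodes submit I/O and a background commit runs before either is fsync()'ed, both fsync() calls return without issuing a further barrier.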

Cheers, Andreas




2010-07-21 17:16:14

by Jan Kara

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hi,

> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> >
> > The problem with not issuing a cache flush when you have dirty meta
> > data or data is that it does not have any tie to the state of the
> > volatile write cache of the target storage device.
>
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.
>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.

The current implementation already avoids the journal commit because of
fdatasync(data_fd). We remember a transaction ID when inode metadata has
last been updated and do not force a transaction commit if it is already
committed. Thus the first fsync might force a transaction commit but the second
fdatasync likely won't.
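
A minimal sketch of the transaction-ID check, assuming 32-bit IDs that can wrap around. `struct tid_inode`, `tid_gt()` as written here and `need_commit()` are illustrative names and not the in-kernel definitions; the signed-difference comparison is a common way to order wrapping sequence numbers.

```c
#include <stdbool.h>

typedef unsigned int tid_t;	/* transaction IDs; assumed to wrap around */

/*
 * True if x is strictly newer than y.  Comparing the signed difference
 * keeps the ordering correct across wraparound (this relies on the usual
 * two's-complement conversion from unsigned to int).
 */
static bool tid_gt(tid_t x, tid_t y)
{
	return (int)(x - y) > 0;
}

/* Hypothetical per-inode record of the last transactions that touched it. */
struct tid_inode {
	tid_t sync_tid;		/* last transaction with any metadata update */
	tid_t datasync_tid;	/* last transaction that matters to fdatasync */
};

/* A commit has to be forced only if the relevant transaction is not yet committed. */
static bool need_commit(const struct tid_inode *ino, tid_t last_committed,
			bool datasync)
{
	tid_t want = datasync ? ino->datasync_tid : ino->sync_tid;

	return tid_gt(want, last_committed);
}
```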

We could actually improve the scheme to work for data as well. I wrote
proof-of-concept patches (attached) and they nicely avoid the second barrier
when doing:
echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
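
The attached patches are not reproduced in this thread, so the following is only a guess at the general shape of a counter-based scheme, written as a simulation. All names (`struct sim_bdev`, `struct gen_inode`, `writeout()`, `gen_fsync()`) are hypothetical, and the sketch is single-threaded: a real implementation would also have to tell apart flushes that started before a write completed from those that started after it.

```c
#include <stdbool.h>

/* Hypothetical block device: counts the cache flushes it has completed. */
struct sim_bdev {
	unsigned long flush_gen;
};

/* Hypothetical inode: remembers the device's count when its data last went out. */
struct gen_inode {
	unsigned long written_gen;
	bool ever_written;
};

/* Data for this inode has been written to the device (but not flushed). */
static void writeout(struct sim_bdev *bdev, struct gen_inode *ino)
{
	ino->written_gen = bdev->flush_gen;
	ino->ever_written = true;
}

/* Returns true if fsync had to issue a flush of its own. */
static bool gen_fsync(struct sim_bdev *bdev, struct gen_inode *ino)
{
	if (!ino->ever_written || bdev->flush_gen > ino->written_gen)
		return false;	/* a later flush already covered this data */
	bdev->flush_gen++;	/* stand-in for a real cache flush */
	return true;
}
```

Replaying the file1/file2 sequence: both write-outs record generation 0, the fsync of file2 flushes and moves the device to generation 1, and the fsync of file1 then sees 1 > 0 and skips its flush.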

Ted, would you be interested in something like this?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
(No filename) (2.15 kB)
0001-block-Introduce-barrier-counters.patch (2.05 kB)
0002-ext4-Send-barriers-on-fsync-only-when-needed.patch (3.04 kB)

2010-08-03 00:09:46

by djwong

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> Hi,
>
> > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > >
> > > The problem with not issuing a cache flush when you have dirty meta
> > > data or data is that it does not have any tie to the state of the
> > > volatile write cache of the target storage device.
> >
> > We track whether or not there is any metadata updates associated with
> > the inode already; if it does, we force a journal commit, and this
> > implies a barrier operation.
> >
> > The case we're talking about here is one where either (a) there is no
> > journal, or (b) there have been no metadata updates (I'm simplifying a
> > little here; in fact we track whether there have been fdatasync()- vs
> > fsync()- worthy metadata updates), and so there hasn't been a journal
> > commit to do the cache flush.
> >
> > In this case, we want to track when is the last time an fsync() has
> > been issued, versus when was the last time data blocks for a
> > particular inode have been pushed out to disk.
> >
> > To use an example I used as motivation for why we might want an
> > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > of:
> >
> > fsync(control_fd);
> > fdatasync(data_fd);
> >
> > The first fsync() will have executed a cache flush operation. So when
> > we do the fdatasync() (assuming that no metadata needs to be flushed
> > out to disk), there is no need for the cache flush operation.
> >
> > If we had an enhanced fsync command, we would also be able to
> > eliminate a second journal commit in the case where data_fd also had
> > some metadata that needed to be flushed out to disk.
> Current implementation already avoids journal commit because of
> fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> last been updated and do not force a transaction commit if it is already
> committed. Thus the first fsync might force a transaction commit but second
> fdatasync likely won't.
> We could actually improve the scheme to work for data as well. I wrote
> a proof-of-concept patches (attached) and they nicely avoid second barrier
> when doing:
> echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
>
> Ted, would you be interested in something like this?

Well... on my fsync-happy workloads, this seems to cut the barrier count down
by about 20%, and speeds it up by about 20%.

I also have a patch to ext4_sync_file that batches the fsync requests together
for a further 20% decrease in barrier IOs, which makes it run another 20%
faster. I'll send that one out shortly, though I've not safety-tested it at
all.

--D

2010-08-03 09:01:58

by Christoph Hellwig

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.

Care to share the test case for this? I'd be especially interested in
how it behaves with non-draining barriers / cache flushes in fsync.

2010-08-03 13:22:28

by Jan Kara

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Mon 02-08-10 17:09:39, Darrick J. Wong wrote:
> On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> > Hi,
> >
> > > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > > >
> > > > The problem with not issuing a cache flush when you have dirty meta
> > > > data or data is that it does not have any tie to the state of the
> > > > volatile write cache of the target storage device.
> > >
> > > We track whether or not there is any metadata updates associated with
> > > the inode already; if it does, we force a journal commit, and this
> > > implies a barrier operation.
> > >
> > > The case we're talking about here is one where either (a) there is no
> > > journal, or (b) there have been no metadata updates (I'm simplifying a
> > > little here; in fact we track whether there have been fdatasync()- vs
> > > fsync()- worthy metadata updates), and so there hasn't been a journal
> > > commit to do the cache flush.
> > >
> > > In this case, we want to track when is the last time an fsync() has
> > > been issued, versus when was the last time data blocks for a
> > > particular inode have been pushed out to disk.
> > >
> > > To use an example I used as motivation for why we might want an
> > > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > > of:
> > >
> > > fsync(control_fd);
> > > fdatasync(data_fd);
> > >
> > > The first fsync() will have executed a cache flush operation. So when
> > > we do the fdatasync() (assuming that no metadata needs to be flushed
> > > out to disk), there is no need for the cache flush operation.
> > >
> > > If we had an enhanced fsync command, we would also be able to
> > > eliminate a second journal commit in the case where data_fd also had
> > > some metadata that needed to be flushed out to disk.
> > Current implementation already avoids journal commit because of
> > fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> > last been updated and do not force a transaction commit if it is already
> > committed. Thus the first fsync might force a transaction commit but second
> > fdatasync likely won't.
> > We could actually improve the scheme to work for data as well. I wrote
> > a proof-of-concept patches (attached) and they nicely avoid second barrier
> > when doing:
> > echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
> >
> > Ted, would you be interested in something like this?
>
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.

Nice, thanks for the measurement.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-03 13:24:58

by Avi Kivity

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 03:48 PM, [email protected] wrote:
>
> I wonder if it's worthwhile to think about a new system call which
> allows users to provide an array of fd's which are collectively should
> be fsync'ed out at the same time. Otherwise, we end up issuing
> multiple barrier operations in cases where the application needs to
> do:
>
> fsync(control_fd);
> fsync(data_fd);
>

The system call exists, it's called io_submit().

--
error compiling committee.c: too many arguments to function

2010-08-04 18:16:26

by djwong

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> > Well... on my fsync-happy workloads, this seems to cut the barrier count down
> > by about 20%, and speeds it up by about 20%.
>
> Care to share the test case for this? I'd be especially interesting on
> how it behaves with non-draining barriers / cache flushes in fsync.

Sure. When I run blktrace with the ffsb profile, I get these results.

With Jan's patch:
barriers  transactions/sec
16212     206
15625     201
10442     269
10870     266
15658     201

Without Jan's patch:
barriers  transactions/sec
20855     177
20963     177
20340     174
20908     177

The two ~270 results are a little odd... if we ignore them, the net gain with
Jan's patch is about a 25% reduction in barriers issued and about a 15%
increase in tps. (If we don't, it's ~30% and ~30%, respectively.) That said,
I was running mkfs between runs, so it's possible that the disk layout could
have shifted a bit. If I turn off the fsync parts of the ffsb profile, the
barrier counts drop to about a couple every second or so, which means that
Jan's patch doesn't have much of an effect. But it does help if someone is
hammering on the filesystem with fsync.
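
The percentages can be re-derived from the two tables with a few lines of C. The arrays hold the figures posted above (reordered so that the two odd ~270 tps runs come last); `mean()` and `pct_change()` are just helper names for this sketch.

```c
#include <stddef.h>

/* Runs with Jan's patch; the last two entries are the odd ~270 tps runs. */
static const double with_barriers[] = { 16212, 15625, 15658, 10442, 10870 };
static const double with_tps[]      = { 206, 201, 201, 269, 266 };

/* Runs without Jan's patch. */
static const double without_barriers[] = { 20855, 20963, 20340, 20908 };
static const double without_tps[]      = { 177, 177, 174, 177 };

/* Arithmetic mean of the first n values. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Percentage change going from baseline to patched (negative = reduction). */
static double pct_change(double baseline, double patched)
{
	return 100.0 * (patched - baseline) / baseline;
}
```

Using only the first three patched runs, `pct_change(mean(without_barriers, 4), mean(with_barriers, 3))` comes to roughly -23.8% and the tps change to roughly +15.0%; using all five runs gives roughly -33.7% and +29.7%, in line with the rounded figures above.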

The ffsb profile is attached below.

--D

-----------

time=300
alignio=1
directio=1

[filesystem0]
location=/mnt/
num_files=100000
num_dirs=1000

reuse=1
# File sizes range from 1kB to 1MB.
size_weight 1KB 10
size_weight 2KB 15
size_weight 4KB 16
size_weight 8KB 16
size_weight 16KB 15
size_weight 32KB 10
size_weight 64KB 8
size_weight 128KB 4
size_weight 256KB 3
size_weight 512KB 2
size_weight 1MB 1

create_blocksize=1048576
[end0]

[threadgroup0]
num_threads=64

readall_weight=4
create_fsync_weight=2
delete_weight=1

append_weight = 1
append_fsync_weight = 1
stat_weight = 1
create_weight = 1
writeall_weight = 1
writeall_fsync_weight = 1
open_close_weight = 1


write_size=64KB
write_blocksize=512KB

read_size=64KB
read_blocksize=512KB

[stats]
enable_stats=1
enable_range=1

msec_range 0.00 0.01
msec_range 0.01 0.02
msec_range 0.02 0.05
msec_range 0.05 0.10
msec_range 0.10 0.20
msec_range 0.20 0.50
msec_range 0.50 1.00
msec_range 1.00 2.00
msec_range 2.00 5.00
msec_range 5.00 10.00
msec_range 10.00 20.00
msec_range 20.00 50.00
msec_range 50.00 100.00
msec_range 100.00 200.00
msec_range 200.00 500.00
msec_range 500.00 1000.00
msec_range 1000.00 2000.00
msec_range 2000.00 5000.00
msec_range 5000.00 10000.00
[end]
[end0]

2010-08-04 23:32:20

by Theodore Ts'o

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Aug 03, 2010 at 04:24:49PM +0300, Avi Kivity wrote:
> On 06/30/2010 03:48 PM, [email protected] wrote:
> >
> >I wonder if it's worthwhile to think about a new system call which
> >allows users to provide an array of fd's which are collectively should
> >be fsync'ed out at the same time. Otherwise, we end up issuing
> >multiple barrier operations in cases where the application needs to
> >do:
> >
> > fsync(control_fd);
> > fsync(data_fd);
> >
>
> The system call exists, it's called io_submit().

Um, not the same thing at all.

- Ted

2010-08-05 02:20:26

by Avi Kivity

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 08/05/2010 02:32 AM, Ted Ts'o wrote:
> On Tue, Aug 03, 2010 at 04:24:49PM +0300, Avi Kivity wrote:
>> On 06/30/2010 03:48 PM, [email protected] wrote:
>>> I wonder if it's worthwhile to think about a new system call which
>>> allows users to provide an array of fd's which are collectively should
>>> be fsync'ed out at the same time. Otherwise, we end up issuing
>>> multiple barrier operations in cases where the application needs to
>>> do:
>>>
>>> fsync(control_fd);
>>> fsync(data_fd);
>>>
>> The system call exists, it's called io_submit().
> Um, not the same thing at all.

Why not? To be clear, I'm talking about an io_submit() with multiple
IO_CMD_FSYNC requests, with a kernel implementation that is able to
batch these requests.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-08-05 16:17:57

by Theodore Ts'o

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 05:20:12AM +0300, Avi Kivity wrote:
>
> Why not? To be clear, I'm talking about an io_submit() with
> multiple IO_CMD_FSYNC requests, with a kernel implementation that is
> able to batch these requests.

IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
to add a new syscall. I find the aio interface to be horribly
complicated, and it would mean that programs would have to link
against libaio, which again isn't my favorite set of interfaces.

All of that being said, I do agree that adding a new IO_CMD_FSYNC,
IO_CMD_FSYNCDATA, IO_CMD_FSYNC_NOBARRIER, and
IO_CMD_FSYNC_DATA_NOBARRIER would be the simplest thing to do from a
kernel implementation perspective.

- Ted

2010-08-05 19:13:54

by Jeff Moyer

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

"Ted Ts'o" <[email protected]> writes:

> On Thu, Aug 05, 2010 at 05:20:12AM +0300, Avi Kivity wrote:
>>
>> Why not? To be clear, I'm talking about an io_submit() with
>> multiple IO_CMD_FSYNC requests, with a kernel implementation that is
>> able to batch these requests.
>
> IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have

Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
what's requested. If I understand correctly, what is requested is a
mechanism to flush out all data for multiple file descriptors and follow
that with a single barrier/flush (and yes, Ted did give a summary of the
commands that would be required to accomplish that).
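
For reference, this is roughly what submitting several fsync requests in one io_submit() call looks like with the raw Linux AIO ABI from `<linux/aio_abi.h>`, without libaio. It is a sketch under assumptions: `prep_fsync_iocbs()`, `fsync_batch()` and `MAX_BATCH` are hypothetical names, whether a given kernel and filesystem actually implement `IOCB_CMD_FSYNC`/`IOCB_CMD_FDSYNC` has to be checked separately, and nothing here makes the kernel follow the batch with a single barrier, which is the part actually being asked for.

```c
#define _GNU_SOURCE		/* for syscall() */
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_BATCH 16

/* Fill one iocb per descriptor with an fsync (or fdatasync) request. */
void prep_fsync_iocbs(struct iocb *cbs, struct iocb **ptrs,
		      const int *fds, int n, int datasync)
{
	for (int i = 0; i < n; i++) {
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = (uint32_t)fds[i];
		cbs[i].aio_lio_opcode = datasync ? IOCB_CMD_FDSYNC
						 : IOCB_CMD_FSYNC;
		cbs[i].aio_data = (uint64_t)i;	/* lets the caller match completions */
		ptrs[i] = &cbs[i];
	}
}

/*
 * Submit all requests with one io_submit() call and wait for them.
 * Returns 0 only if every request was queued and completed with result 0.
 */
int fsync_batch(const int *fds, int n, int datasync)
{
	struct iocb cbs[MAX_BATCH];
	struct iocb *ptrs[MAX_BATCH];
	struct io_event events[MAX_BATCH];
	aio_context_t ctx = 0;
	long submitted, done = 0;
	int ok = 0;

	if (n < 1 || n > MAX_BATCH)
		return -1;
	if (syscall(SYS_io_setup, n, &ctx) < 0)
		return -1;

	prep_fsync_iocbs(cbs, ptrs, fds, n, datasync);
	submitted = syscall(SYS_io_submit, ctx, (long)n, ptrs);

	/* Reap only what was actually queued. */
	while (submitted > 0 && done < submitted) {
		long got = syscall(SYS_io_getevents, ctx, 1L,
				   submitted - done, events, NULL);
		if (got <= 0)
			break;
		for (long i = 0; i < got; i++)
			if (events[i].res == 0)
				ok++;
		done += got;
	}
	syscall(SYS_io_destroy, ctx);
	return ok == n ? 0 : -1;
}
```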

There still remains the question of why this should be tied to the AIO
submission interface.


Cheers,
Jeff

2010-08-05 20:39:45

by Theodore Ts'o

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 03:13:44PM -0400, Jeff Moyer wrote:
> > IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
>
> Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
> what's requested. If I understand correctly, what is requested is a
> mechanism to flush out all data for multiple file descriptors and follow
> that with a single barrier/flush (and yes, Ted did give a summary of the
> commands that would be required to accomplish that).
>
> There still remains the question of why this should be tied to the AIO
> submission interface.

I don't think it should, personally. The only excuse might be if
someone wanted to do an asynchronous fsync(), but I don't think that
makes sense in most cases.

- Ted

2010-08-05 20:45:08

by Jeff Moyer

Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

"Ted Ts'o" <[email protected]> writes:

> On Thu, Aug 05, 2010 at 03:13:44PM -0400, Jeff Moyer wrote:
>> > IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
>>
>> Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
>> what's requested. If I understand correctly, what is requested is a
>> mechanism to flush out all data for multiple file descriptors and follow
>> that with a single barrier/flush (and yes, Ted did give a summary of the
>> commands that would be required to accomplish that).
>>
>> There still remains the question of why this should be tied to the AIO
>> submission interface.
>
> I don't think it should, personally. The only excuse might be if
> someone wanted to do an asynchronous fsync(), but I don't think that
> makes sense in most cases.

In case it wasn't clear, we are in agreement on this.

Cheers,
Jeff