2010-06-29 20:51:20

by djwong

Subject: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hmm. A while ago I was complaining that an evil program that calls fsync() in
a loop will send a continuous stream of write barriers to the hard disk. Ted
theorized that it might be possible to set a flag in ext4_writepage and clear
it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
set (meaning that nothing has been dirtied since the last fsync()) then we
could skip issuing the barrier.

Here's an experimental patch to do something sort of like that. From a quick
run with blktrace, it seems to skip the redundant barriers and improves the ffsb
mail server scores. However, I haven't done extensive power failure testing to
see how much data it can destroy. For that matter I'm not even 100% sure it's
correct at what it aims to do.

This second version of the patch uses the inode state flags and (suboptimally)
also catches directio writes. It might be a better idea to try to coordinate
all the barrier requests across the whole filesystem, though that's a bit more
difficult.

Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/ext4/ext4.h | 1 +
fs/ext4/fsync.c | 5 ++++-
fs/ext4/inode.c | 7 +++++++
3 files changed, 12 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..d2e8e40 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1181,6 +1181,7 @@ enum {
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
+ EXT4_STATE_DIRTY_DATA, /* dirty data, need barrier */
};

#define EXT4_INODE_BIT_FNS(name, field) \
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..96625c3 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -130,8 +130,11 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
+ } else if (journal->j_flags & JBD2_BARRIER &&
+ ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
BLKDEV_IFL_WAIT);
+ ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+ }
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..486d349 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2685,6 +2685,8 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_CACHE_SIZE;

+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
if (page_has_buffers(page)) {
page_bufs = page_buffers(page);
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
@@ -2948,6 +2950,8 @@ static int ext4_da_writepages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;

+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
range_cyclic = wbc->range_cyclic;
if (wbc->range_cyclic) {
index = mapping->writeback_index;
@@ -3996,6 +4000,9 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;

+ if (rw == WRITE)
+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
return ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs);


2010-08-05 16:40:18

by Theodore Ts'o

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
>
> This second version of the patch uses the inode state flags and
> (suboptimally) also catches directio writes. It might be a better
> idea to try to coordinate all the barrier requests across the whole
> filesystem, though that's a bit more difficult.

Hi Darrick,

When I looked at this patch more closely, and thought about it hard,
the fact that this helps the FFSB mail server benchmark surprised me,
and then I realized it's because it doesn't really accurately emulate
a mail server at all. Or at least, not an MTA. In an MTA, only one CPU
will touch a queue file, so there should never be a case of a double
fsync to a single file. This is why I was thinking about
coordinating barrier requests across the whole filesystem --- it helps
out in the case where you have all your CPU threads hammering
/var/spool/mqueue, or /var/spool/exim4/input, and where they are all
creating queue files, and calling fsync() in parallel. This patch
won't help that case.

It will help the case of an MDA --- Mail Delivery Agent --- if you have
multiple e-mails all getting delivered at the same time into the same
/var/mail/<username> file, with an fsync() following after a mail
message is appended to the file. This is a much rarer case, and I
can't think of any other workload where you will have multiple
processes racing against each other and fsync'ing the same inode.
Even in the MDA case, it's rare that you will have one mbox getting so
many deliveries that this case would be hit.

So while I was thinking about accepting this patch, I now find myself
hesitating. There _is_ a minor race in the patch that I noticed,
which I'll point out below, but that's easily fixed. The bigger issue
is it's not clear this patch will actually make a difference in the
real world. I'm trying and failing to think of a real-life application
which is stupid enough to do back-to-back fsync commands, even if it's
because it has multiple threads all trying to write to the file and
fsync to it in an uncoordinated fashion. It would be easy enough to
add instrumentation that would trigger a printk if the patch optimized
out a barrier --- and if someone can point out even one badly written
application --- whether it's mysql, postgresql, a GNOME or KDE
application, db2, Oracle, etc., I'd say sure. But adding even a tiny
amount of extra complexity for something which is _only_ helpful for a
benchmark grates against my soul....
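
For reference, the instrumentation I have in mind is nothing more than an
untested helper along these lines, called from ext4_sync_file() wherever the
EXT4_STATE_DIRTY_DATA check lets us skip the flush (the helper name and the
message are purely illustrative):

static void ext4_note_skipped_barrier(struct inode *inode)
{
	/* Rate-limited note of each barrier the patch skips, so we can
	 * see whether any real workload ever hits this path. */
	if (printk_ratelimit())
		printk(KERN_INFO
		       "ext4: fsync skipped barrier for inode %lu\n",
		       inode->i_ino);
}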

So if you can think of something, please point it out to me. If it
would help ext4 users in real life, I'd be all for it. But at this
point, I'm thinking that perhaps the real issue is that the mail
server benchmark isn't accurately reflecting a real life workload.

Am I missing something?

- Ted

> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 592adf2..96625c3 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -130,8 +130,11 @@ int ext4_sync_file(struct file *file, int datasync)
> blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
> NULL, BLKDEV_IFL_WAIT);
> ret = jbd2_log_wait_commit(journal, commit_tid);
> - } else if (journal->j_flags & JBD2_BARRIER)
> + } else if (journal->j_flags & JBD2_BARRIER &&
> + ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
> blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
> BLKDEV_IFL_WAIT);
> + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
> + }
> return ret;

This is the minor race I was talking about; you should move the
ext4_clear_inode_state() call above blkdev_issue_flush(). If there is
a race, you want to fail safe, by accidentally issuing a second
barrier, instead of possibly skipping a barrier if a page gets dirtied
*after* the blkdev_issue_flush() has taken effect, but *before* we
have a chance to clear the EXT4_STATE_DIRTY_DATA flag.
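
Concretely, something like this (untested):

	} else if (journal->j_flags & JBD2_BARRIER &&
		   ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
		/*
		 * Clear the flag before issuing the flush; if a page is
		 * dirtied while the flush is in flight we may send one
		 * extra barrier, but we will never skip a needed one.
		 */
		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
				   BLKDEV_IFL_WAIT);
	}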

BTW, my apologies for not looking at this sooner, and giving you this
feedback earlier. This summer has been crazy busy, and I didn't have
time until the merge window provided a forcing function to look at
outstanding patches.

2010-08-05 16:45:12

by Theodore Ts'o

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

P.S. If it wasn't clear, I'm still in favor of trying to coordinate
barriers across the whole file system, since that is much more likely
to help use cases that arise in real life.

- Ted

2010-08-06 07:04:30

by djwong

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
> P.S. If it wasn't clear, I'm still in favor of trying to coordinate
> barriers across the whole file system, since that is much more likely
> to help use cases that arise in real life.

Ok. I have a rough sketch of a patch to do that, and I was going to send it
out today, but the test machine caught on fire while I was hammering it with
the fsync tests one last time and ... yeah. I'm fairly sure the patch didn't
cause the fire, but I'll check anyway after I finish cleaning up.

"[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P

(The patch did seem to cut barrier request counts by about 20%, though the
impact on performance was pretty small.)

--D

2010-08-06 07:14:05

by djwong

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 12:40:08PM -0400, Ted Ts'o wrote:
> On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
> >
> > This second version of the patch uses the inode state flags and
> > (suboptimally) also catches directio writes. It might be a better
> > idea to try to coordinate all the barrier requests across the whole
> > filesystem, though that's a bit more difficult.
>
> Hi Darrick,
>
> When I looked at this patch more closely, and thought about it hard,
> the fact that this helps the FFSB mail server benchmark surprised me,
> and then I realized it's because it doesn't really accurately emulate
> a mail server at all. Or at least, not an MTA. In an MTA, only one CPU
> will touch a queue file, so there should never be a case of a double
> fsync to a single file. This is why I was thinking about
> coordinating barrier requests across the whole filesystem --- it helps
> out in the case where you have all your CPU threads hammering
> /var/spool/mqueue, or /var/spool/exim4/input, and where they are all
> creating queue files, and calling fsync() in parallel. This patch
> won't help that case.
>
> It will help the case of an MDA --- Mail Delivery Agent --- if you have
> multiple e-mails all getting delivered at the same time into the same
> /var/mail/<username> file, with an fsync() following after a mail
> message is appended to the file. This is a much rarer case, and I
> can't think of any other workload where you will have multiple
> processes racing against each other and fsync'ing the same inode.
> Even in the MDA case, it's rare that you will have one mbox getting so
> many deliveries that this case would be hit.
>
> So while I was thinking about accepting this patch, I now find myself
> hesitating. There _is_ a minor race in the patch that I noticed,
> which I'll point out below, but that's easily fixed. The bigger issue
> is it's not clear this patch will actually make a difference in the
> real world. I'm trying and failing to think of a real-life application
> which is stupid enough to do back-to-back fsync commands, even if it's
> because it has multiple threads all trying to write to the file and
> fsync to it in an uncoordinated fashion. It would be easy enough to
> add instrumentation that would trigger a printk if the patch optimized
> out a barrier --- and if someone can point out even one badly written
> application --- whether it's mysql, postgresql, a GNOME or KDE
> application, db2, Oracle, etc., I'd say sure. But adding even a tiny
> amount of extra complexity for something which is _only_ helpful for a
> benchmark grates against my soul....
>
> So if you can think of something, please point it out to me. If it
> would help ext4 users in real life, I'd be all for it. But at this
> point, I'm thinking that perhaps the real issue is that the mail
> server benchmark isn't accurately reflecting a real life workload.

Yes, it's a proxy for something else. One of our larger products would like to
use fsync() to flush dirty data out to disk (right now it looks like they use
O_SYNC), but they're concerned that the many threads they use can create an
fsync() storm. So, they wanted to know how to mitigate the effects of those
storms. Not calling fsync() except when they really need to guarantee a disk
write is a good start, but I'd like to get ahead of them to pick off more low
hanging fruit like the barrier coordination and not sending barriers when
there's no dirty data ... before they run into it. :)

--D

2010-08-06 10:17:58

by Ric Wheeler

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 08/06/2010 03:04 AM, Darrick J. Wong wrote:
> On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
>> P.S. If it wasn't clear, I'm still in favor of trying to coordinate
>> barriers across the whole file system, since that is much more likely
>> to help use cases that arise in real life.
> Ok. I have a rough sketch of a patch to do that, and I was going to send it
> out today, but the test machine caught on fire while I was hammering it with
> the fsync tests one last time and ... yeah. I'm fairly sure the patch didn't
> cause the fire, but I'll check anyway after I finish cleaning up.
>
> "[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P
>
> (The patch did seem to cut barrier request counts by about 20%, though the
> impact on performance was pretty small.)
>
> --D

Just a note, one thing that we have been doing is trying to get a reasonable
regression test in place for testing data integrity. That might be useful to
share as we float patches around barrier changes.

Basic test:

(1) Get a box with an external e-sata (or USB) connected drive

(2) Fire off some large load on that drive (Chris Mason had one; some of our QE
engineers have been using fs_mark: fs_mark -d /your_fs/test_dir -S 0 -t 8 -F)

(3) Pull the power cable to that external box.

Of course, you can use any system and drop power, but the above setup will make
sure that we kill the write cache on the device without letting the firmware
destage the cache contents.

The test passes if you can now do the following:

(1) Mount the file system without error

(2) Unmount and force an fsck - that should run without reporting errors as well.

Note that the above does not use fsync in the testing.

Thanks!

Ric

2010-08-06 18:05:06

by Theodore Ts'o

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> Yes, it's a proxy for something else. One of our larger products would like to
> use fsync() to flush dirty data out to disk (right now it looks like they use
> O_SYNC), but they're concerned that the many threads they use can create an
> fsync() storm. So, they wanted to know how to mitigate the effects of those
> storms. Not calling fsync() except when they really need to guarantee a disk
> write is a good start, but I'd like to get ahead of them to pick off more low
> hanging fruit like the barrier coordination and not sending barriers when
> there's no dirty data ... before they run into it. :)

Do they need a barrier operation, or do they just want to initiate the
I/O? One of the reasons I found it hard to believe you would have
multiple threads all fsync()'ing the same file is that keeping the
file consistent is very hard to do in such a scenario. Maintaining
ACID-level consistency without a single thread which coordinates when
commit records get written is, I'm sure, theoretically possible, but in
practice, I wasn't sure any applications would actually be _written_
that way.

If the goal is just to make sure I/O is getting initiated, without
necessarily waiting for assurance that a specific file write has hit
the disk platters, it may be that the Linux-specific
sync_file_range(2) system call might be a far more efficient way of
achieving those ends. Without more details about what this product is
doing, it's hard to say, of course.
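
For illustration, a minimal userspace sketch; it only starts writeback and
gives no ordering or cache-flush guarantee, so it is not a substitute for
fsync() when durability is actually required:

#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off writeback of the dirty pages in [offset, offset+nbytes)
 * without waiting for completion and without any disk cache flush. */
static int start_writeback(int fd, off64_t offset, off64_t nbytes)
{
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}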

- Ted

2010-08-09 19:36:40

by djwong

Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Fri, Aug 06, 2010 at 02:04:54PM -0400, Ted Ts'o wrote:
> On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> > Yes, it's a proxy for something else. One of our larger products would like to
> > use fsync() to flush dirty data out to disk (right now it looks like they use
> > O_SYNC), but they're concerned that the many threads they use can create an
> > fsync() storm. So, they wanted to know how to mitigate the effects of those
> > storms. Not calling fsync() except when they really need to guarantee a disk
> > write is a good start, but I'd like to get ahead of them to pick off more low
> > hanging fruit like the barrier coordination and not sending barriers when
> > there's no dirty data ... before they run into it. :)
>
> Do they need a barrier operation, or do they just want to initiate the
> I/O? One of the reasons I found it hard to believe you would have
> multiple threads all fsync()'ing the same file is that keeping the
> file consistent is very hard to do in such a scenario. Maintaining
> ACID-level consistency without a single thread which coordinates when
> commit records get written is, I'm sure, theoretically possible, but in
> practice, I wasn't sure any applications would actually be _written_
> that way.

> If the goal is just to make sure I/O is getting initiated, without
> necessarily waiting for assurance that a specific file write has hit
> the disk platters, it may be that the Linux-specific
> sync_file_range(2) system call might be a far more efficient way of
> achieving those ends. Without more details about what this product is
> doing, it's hard to say, of course.

I don't know for sure, though given what I've seen of the app behavior I
suspect they simply want the disk cache flushed, and don't need the full
ordering semantics. That said, I do think they want to make sure that data
actually hits the disk platters.

--D

2010-08-09 19:53:31

by djwong

Subject: [RFC v3] ext4: Combine barrier requests coming from fsync

This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for 500us to see if there are any other
threads that are close behind in ext4_sync_file. After that wait, the state
transitions to 2 and the barrier is issued. Once that's done, the state goes
back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.

Without Jan's prior barrier generation patch, I can run my 5min fsync-happy
workload and see the barrier count drop from ~150000 to ~37000, and the
transaction rate increase from ~630 to ~680.

With Jan's patch, I see the barrier count drop from ~18000 to ~12000, and the
transaction rate jumps from ~720 to ~760 when using this patch.

There are some things to clean up -- the wait duration probably ought to be a
mount option and not a module parameter, but this is just a test patch. I'm
also not sure that it won't totally eat the performance of a single thread that
calls fsync frequently, though ... that seems like sort of bad behavior.
If you set the delay time to 0 you get the old behavior.

I'm wondering at this point if this is useful? Ted, is this the sort of fsync
coordination that you were looking for?

---

fs/ext4/ext4.h | 5 +++++
fs/ext4/fsync.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 4 ++++
3 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..e51535a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,11 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* fsync barrier reduction */
+ spinlock_t barrier_flag_lock;
+ int in_barrier;
+ struct completion barrier_completion;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..c5afb45 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -57,6 +57,10 @@ static void ext4_sync_parent(struct inode *inode)
}
}

+static int barrier_wait = 500;
+module_param(barrier_wait, int, 0644);
+MODULE_PARM_DESC(barrier_wait, "fsync barrier wait time (usec)");
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -75,7 +79,8 @@ int ext4_sync_file(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ journal_t *journal = sb->s_journal;
int ret;
tid_t commit_tid;

@@ -130,8 +135,42 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
+ } else if (journal->j_flags & JBD2_BARRIER) {
+ if (!barrier_wait) {
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ goto fast_out;
+ }
+again:
+ spin_lock(&sb->barrier_flag_lock);
+ if (sb->in_barrier == 2) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ goto again;
+ } else if (sb->in_barrier) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ } else {
+ sb->in_barrier = 1;
+ INIT_COMPLETION(sb->barrier_completion);
+ spin_unlock(&sb->barrier_flag_lock);
+
+ udelay(barrier_wait);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 2;
+ spin_unlock(&sb->barrier_flag_lock);
+
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+ complete_all(&sb->barrier_completion);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 0;
+ spin_unlock(&sb->barrier_flag_lock);
+ }
+fast_out:
+ if (!ret)
+ ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+ }
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..0dc96d2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3045,6 +3045,10 @@ no_journal:
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
"Opts: %s", descr, orig_data);

+ EXT4_SB(sb)->in_barrier = 0;
+ spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+ init_completion(&EXT4_SB(sb)->barrier_completion);
+
lock_kernel();
kfree(orig_data);
return 0;

2010-08-09 21:07:26

by Christoph Hellwig

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Can you try with the new barrier implementation in the

[PATCH, RFC] relaxed barriers

series? By making cache flushes just that and not complicated drain barriers,
it should speed this case up a lot.

2010-08-09 21:19:26

by Andreas Dilger

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> This patch attempts to coordinate barrier requests being sent in by fsync. Instead of each fsync call initiating its own barrier, there's now a flag to indicate if (0) no barriers are ongoing, (1) we're delaying a short time to collect other fsync threads, or (2) we're actually in-progress on a barrier.
>
> So, if someone calls ext4_sync_file and no barriers are in progress, the flag shifts from 0->1 and the thread delays for 500us to see if there are any other threads that are close behind in ext4_sync_file. After that wait, the state transitions to 2 and the barrier is issued. Once that's done, the state goes back to 0 and a completion is signalled.

You shouldn't use a fixed delay for the thread. 500us _seems_ reasonable, if you have a single HDD. If you have an SSD, or an NVRAM-backed array, then 2000 IOPS is a serious limitation.

What is done in the JBD2 code is to scale the commit sleep interval based on the average commit time. In fact, the ext4_force_commit-> ...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code to merge journal commits. It looks like we are duplicating some of this machinery in ext4_sync_file() already.
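
Roughly what I have in mind, as an untested sketch (the helper and the
avg_flush_ns field are hypothetical, not existing code):

/* Keep a running average of the flush time and use it as the merge window,
 * the way jbd2 scales its commit sleep from the average commit time. */
static u64 update_fsync_merge_window(u64 *avg_flush_ns, u64 last_flush_ns)
{
	if (*avg_flush_ns)
		*avg_flush_ns = (last_flush_ns + 15 * *avg_flush_ns) / 16;
	else
		*avg_flush_ns = last_flush_ns;

	return *avg_flush_ns;
}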

It seems like a better idea to have a single piece of code to wait to merge the IOs. For the non-journal ext4 filesystems it should implement the wait for merges explicitly, otherwise it should defer the wait to jbd2.

Cheers, Andreas




2010-08-09 23:46:13

by djwong

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 09, 2010 at 05:19:22PM -0400, Andreas Dilger wrote:
> On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> > This patch attempts to coordinate barrier requests being sent in by fsync.
> > Instead of each fsync call initiating its own barrier, there's now a flag
> > to indicate if (0) no barriers are ongoing, (1) we're delaying a short time
> > to collect other fsync threads, or (2) we're actually in-progress on a
> > barrier.
> >
> > So, if someone calls ext4_sync_file and no barriers are in progress, the
> > flag shifts from 0->1 and the thread delays for 500us to see if there are
> > any other threads that are close behind in ext4_sync_file. After that
> > wait, the state transitions to 2 and the barrier is issued. Once that's
> > done, the state goes back to 0 and a completion is signalled.
>
> You shouldn't use a fixed delay for the thread. 500us _seems_ reasonable, if
> you have a single HDD. If you have an SSD, or an NVRAM-backed array, then
> 2000 IOPS is a serious limitation.

2000 fsyncs per second, anyway. I wasn't explicitly trying to limit any other
types of IO.

> What is done in the JBD2 code is to scale the commit sleep interval based on
> the average commit time. In fact, the ext4_force_commit->
> ...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code
> to merge journal commits. It looks like we are duplicating some of this
> machinery in ext4_sync_file() already.

I actually picked 500us arbitrarily because it seemed to work, even for SSDs.
It was a convenient test vehicle, and not much more. That said, I like your
recommendation much better. I'll look into that.

> It seems like a better idea to have a single piece of code to wait to merge
> the IOs. For the non-journal ext4 filesystems it should implement the wait
> for merges explicitly, otherwise it should defer the wait to jbd2.

I wondered if this would have been better off in the block layer than in ext4?
Though I suppose that could imply two kinds of flush: flush-immediately, and
flush-shortly. I intend to try those flush drain elimination patches before I
think about this much more.

--D

2010-08-16 16:15:06

by djwong

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> Can you try with the new barrier implementation in the
>
> [PATCH, RFC] relaxed barriers
>
> series? By making cache flushes just that and not complicated drain barriers,
> it should speed this case up a lot.

Indeed it does! The barrier count increases to about 21000, but I also see
much higher throughput, about 830 transactions per second (versus 12000 and 760
respectively before Tejun's patch).

--D

2010-08-19 02:07:39

by djwong

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 16, 2010 at 09:14:33AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> > Can you try with the new barrier implementation in the
> >
> > [PATCH, RFC] relaxed barriers
> >
> > series? By making cache flushes just that and not complicated drain barriers,
> > it should speed this case up a lot.
>
> Indeed it does! The barrier count increases to about 21000, but I also see
> much higher throughput, about 830 transactions per second (versus 12000 and 760
> respectively before Tejun's patch).

Oddly, I ran the entire suite of tests against a larger set of machines, and
with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
have been trying to put together a new tree based on "replace barrier with
sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
gotten lost in the weeds. :(

I also experienced some sort of crash with Tejun's relaxed barrier patch on one
of my systems. I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

Rather than hold on to (aging) test results any further, I'll be posting a new
fsync coordination patch shortly that includes Andreas' suggestion to use the
average barrier time instead of a static 500us, and a spreadsheet that shows
what happens with various patches, and on a wider range of hardware.

--D

2010-08-19 02:14:51

by djwong

Subject: [RFC v4] ext4: Coordinate fsync requests

This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are
any other threads that are close behind in ext4_sync_file. After that wait,
the state transitions to 2 and the barrier is issued. Once that's done, the
state goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.

In an earlier version of this patch, the "short time" was 500us, adjustable by
a module parameter. Now, it's a mount option, and the mount option has three
values: x = 0, which gets you the old behavior; x = -1, which uses the average
commit time for the delay period; and x > 0, which sets the delay time to
min(x, average commit time).

So I gave this patchset some wider testing in my lab, and came up with the
following spreadsheet:
https://spreadsheets.google.com/ccc?key=0AtDnBlSkyNjodDZ4OEdSZC01X2haVi1ncmpGOXpfNkE&hl=en&authkey=CMznqcgH

These results do not reflect Tejun/Christoph's latest barrier/flush semantic
changes because ... I haven't figured out the correct sequence of 2.6.35+,
tejun's patchset, and hch's patchset. With Tejun's original set of patches,
the performance actually decreased a little bit, so I haven't bothered to
upload those results while I try to put together a new tree.

The comma-separated numbers in the leftmost column indicate which options were
in effect during the run. From left to right, they are (1) whether or not
directio is enabled; (2) the max_fsync_delay parameter (-1 is automatic
tuning, 0 is the behavior prior to my patch, and 500 is the v3 patch);
(3) whether or not Jan Kara's barrier generation counter is disabled; and (4)
whether or not my old dirty flag patch is disabled. The old system behaviors
are in the *,0,1,1 rows, and the everything-on behavior should be in the
*,-1,0,0 rows.

From these results, it seems like a reasonable conclusion that enabling Jan
Kara's barrier generation counter patch (*,*,0,*) does increase the tps count.
It also would appear that enabling the barrier coordination patch with the
automatic barrier delay tuning (*,-1,*,*) also seems to be increasing tps while
reducing barrier counts. It's not so clear that my old dirty flag patch
(*,*,*,0) is doing much anymore.

The *,na,na,na rows reflect what happens if I turn barriers off entirely, so I
can compare the new numbers with the nobarrier behavior.

elm3c44_sas is a 36-spindle SAS storage array (RAID0).
elm3c44_ssd is 4 Intel SSDs tied together in a RAID0 via md.
elm3c65_sata is a single SATA disk.
elm3c71_sas is a 8-spindle SAS storage array (RAID0).
elm3c71_scsi is a 14-spindle SCSI storage array (RAID0).
elm3c75_ide is a single IDE laptop disk.

The storage arrays have battery-backed WC.

Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/ext4/ext4.h | 7 +++++++
fs/ext4/fsync.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 21 ++++++++++++++++++++-
3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..9e8dcc7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,13 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* fsync coordination */
+ spinlock_t barrier_flag_lock;
+ int in_barrier;
+ struct completion barrier_completion;
+ u64 avg_barrier_time;
+ int max_fsync_delay;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..e164ccd 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,12 @@ int ext4_sync_file(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ journal_t *journal = sb->s_journal;
int ret;
tid_t commit_tid;
+ ktime_t before, expires;
+ u64 duration;

J_ASSERT(ext4_journal_current_handle() == NULL);

@@ -130,8 +133,52 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
+ } else if (journal->j_flags & JBD2_BARRIER) {
+ if (!sb->max_fsync_delay) {
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ goto fast_out;
+ }
+again:
+ spin_lock(&sb->barrier_flag_lock);
+ if (sb->in_barrier == 2) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ goto again;
+ } else if (sb->in_barrier) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ } else {
+ sb->in_barrier = 1;
+ INIT_COMPLETION(sb->barrier_completion);
+ spin_unlock(&sb->barrier_flag_lock);
+
+ if (sb->max_fsync_delay > 0 && (sb->max_fsync_delay * 1000) < sb->avg_barrier_time)
+ expires = ktime_add_ns(ktime_get(), sb->max_fsync_delay * 1000);
+ else
+ expires = ktime_add_ns(ktime_get(), sb->avg_barrier_time);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 2;
+ spin_unlock(&sb->barrier_flag_lock);
+
+ before = ktime_get();
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+ duration = ktime_to_ns(ktime_sub(ktime_get(), before));
+ if (likely(sb->avg_barrier_time))
+ sb->avg_barrier_time = (duration + sb->avg_barrier_time * 15) / 16;
+ else
+ sb->avg_barrier_time = duration;
+
+ complete_all(&sb->barrier_completion);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 0;
+ spin_unlock(&sb->barrier_flag_lock);
+ }
+ }
+fast_out:
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..8d04e43 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -953,6 +953,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (!test_opt(sb, DELALLOC))
seq_puts(seq, ",nodelalloc");

+ seq_printf(seq, ",max_fsync_delay=%d", sbi->max_fsync_delay);

if (sbi->s_stripe)
seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
@@ -1160,7 +1161,7 @@ enum {
Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
- Opt_discard, Opt_nodiscard,
+ Opt_discard, Opt_nodiscard, Opt_max_fsync_delay,
};

static const match_table_t tokens = {
@@ -1231,6 +1232,7 @@ static const match_table_t tokens = {
{Opt_dioread_lock, "dioread_lock"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
+ {Opt_max_fsync_delay, "max_fsync_delay=%d"},
{Opt_err, NULL},
};

@@ -1699,6 +1701,16 @@ set_qf_format:
case Opt_dioread_lock:
clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
break;
+ case Opt_max_fsync_delay:
+ if (args[0].from) {
+ if (match_int(&args[0], &option))
+ return 0;
+ } else
+ return 0;
+
+ sbi->max_fsync_delay = option;
+ ext4_msg(sb, KERN_INFO, "Maximum fsync delay %d\n", option);
+ break;
default:
ext4_msg(sb, KERN_ERR,
"Unrecognized mount option \"%s\" "
@@ -2562,6 +2574,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!IS_EXT3_SB(sb))
set_opt(sbi->s_mount_opt, DELALLOC);

+ EXT4_SB(sb)->max_fsync_delay = -1;
+
if (!parse_options((char *) data, sb, &journal_devnum,
&journal_ioprio, NULL, 0))
goto failed_mount;
@@ -3045,6 +3059,11 @@ no_journal:
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
"Opts: %s", descr, orig_data);

+ EXT4_SB(sb)->in_barrier = 0;
+ EXT4_SB(sb)->avg_barrier_time = 0;
+ spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+ init_completion(&EXT4_SB(sb)->barrier_completion);
+
lock_kernel();
kfree(orig_data);
return 0;

2010-08-19 08:53:54

by Christoph Hellwig

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
> Oddly, I ran the entire suite of tests against a larger set of machines, and
> with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
> have been trying to put together a new tree based on "replace barrier with
> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
> gotten lost in the weeds. :(

Tejun's patches don't allow concurrent cache flushes to happen, while
my patch did. Tejun said there are drivers that can't handle empty
flushes with a bio attached, making this necessary.

Tejun, any idea which drivers those would be?

> I also experienced some sort of crash with Tejun's relaxed barrier patch on one
> of my systems. I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

My kernel source doesn't have a BUG_ON on that line, but there is one two
lines above. A req->nr_phys_segments that's zero sounds a bit like
empty flush requests; I'll need to look into it again.

2010-08-19 09:21:49

by Tejun Heo

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Hello,

On 08/19/2010 10:53 AM, Christoph Hellwig wrote:
> On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
>> Oddly, I ran the entire suite of tests against a larger set of machines, and
>> with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
>> have been trying to put together a new tree based on "replace barrier with
>> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
>> gotten lost in the weeds. :(
>
> Tejun's patches don't allow concurrent cache flushes to happen, while
> my patch did. Tejun said there are drivers that can't handly empty
> flushes with a bio attached, making this nessecary.
>
> Tejun, any idea what drivers that would be?

What was the configuration? If dm was involved, both md and dm can
only process a single flush request at a time. Supporting multiple
flushes in flight wouldn't be too difficult. It's just the way things
are implemented with the previous barrier implementation. It can
definitely be improved.

>> I also experienced some sort of crash with Tejun's relaxed barrier
>> patch on one of my systems. I was hitting the BUG_ON in
>> drivers/scsi/scsi_lib.c, line 1115.
>
> My kernel source doesn't have a BUG_ON line there, but only one two
> lines above. A req->nr_phys_segments that's zero sounds a bit like
> empty flush requests, I'll need to look into it again.

Yeah, definitely sounds like a REQ_FS|REQ_FLUSH request causing the problem.
Can you post the kernel log?

Thanks.

--
tejun

2010-08-19 15:52:41

by Tejun Heo

Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Hello, again.

On 08/19/2010 11:17 AM, Tejun Heo wrote:
> What was the configuration? If dm was involved, both md and dm can
> only process single flush request at a time. Supporing multiple
> flushes in flight wouldn't be too difficult. It's just the way things
> are implemented with the previous barrier implementation. It can
> definitely be improved.

Oh, I just realized that the current bio-based dm implementation
doesn't allow other bio's to be processed if a flush is in progress.
So, it's not about not being able to handle multiple flushes. The
queue just stalls while a flush is in progress, and because it also waits
for flush completion by waiting for all commands in progress to finish,
it basically ends up draining and stalling everything.
Shouldn't be too hard to make it better.

Thanks.

--
tejun

2010-08-23 18:31:27

by djwong

Subject: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

Hi all,

I retested the ext4 barrier mitigation patchset against a base of 2.6.36-rc1 +
Tejun's flush_fua tree + Christoph's patches to change FS barrier semantics,
and came up with this new spreadsheet:
http://bit.ly/bWpbsT

Here are the previous 2.6.35 results for convenience: http://bit.ly/c22grd

The machine configurations are the same as with the previous (2.6.35)
spreadsheet. It appears to be the case that Tejun and Christoph's patches to
change barrier use into simpler cache flushes generally improve the speed of
the fsync-happy workload in buffered I/O mode ... if you have a bunch of
spinning disks. Results for the SSD array (elm3c44) and the single disk
systems (elm3c65/elm3c75) decreased slightly. For the case where direct I/O
was used, the patchset improved the results in nearly all cases. The speed
with barriers on is getting closer to the speed with barriers off, thankfully!

Unfortunately, one thing that became /much/ less clear in these new results is
the impact of the other patch sets that we've been working on to make ext4
smarter with regards to barrier/flush use. In most cases I don't really see
the fsync-delay patch having much effect for directio, and it seems to have
wild effects when buffered mode is used. Jan Kara's barrier generation patch
still generally helps with directio loads. I've also concluded that my really
old dirty-flag patchset from ages ago no longer has any effect.

What does everyone else think of these results?

--D