2010-04-30 17:56:33

by djwong

[permalink] [raw]
Subject: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hmm. A while ago I was complaining that an evil program that calls fsync() in
a loop will send a continuous stream of write barriers to the hard disk. Ted
theorized that it might be possible to set a flag in ext4_writepage and clear
it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
set (meaning that nothing has been dirtied since the last fsync()) then we
could skip issuing the barrier.
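
For reference, the pathological pattern being described is just a tight write/fsync loop. A minimal sketch of such a test program (hypothetical, not from the original report):

/* fsync_storm.c: hypothetical reproducer for the fsync-in-a-loop case */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[512] = { 0 };
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	for (;;) {
		if (write(fd, buf, sizeof(buf)) < 0)
			break;
		fsync(fd);	/* each call can trigger a barrier/cache flush */
	}
	close(fd);
	return 0;
}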

Here's an experimental patch to do something sort of like that. From a quick
run with blktrace, it seems to skip the redundant barriers and improve the ffsb
mail server scores. However, I haven't done extensive power failure testing to
see how much data it can destroy. For that matter, I'm not even 100% sure it's
correct in what it aims to do.

Just throwing this out there, though. Nothing's blown up ... yet. :P
---
Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/ext4/ext4.h | 2 ++
fs/ext4/fsync.c | 7 +++++--
fs/ext4/inode.c | 5 +++++
3 files changed, 12 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..3b70195 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1025,6 +1025,8 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ atomic_t unflushed_writes;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..441f872 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -52,7 +52,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ journal_t *journal = sbi->s_journal;
int ret;
tid_t commit_tid;

@@ -102,7 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
(journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
+ } else if (journal->j_flags & JBD2_BARRIER && atomic_read(&sbi->unflushed_writes)) {
+ atomic_set(&sbi->unflushed_writes, 0);
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ }
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5381802..e501abd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2718,6 +2718,7 @@ static int ext4_writepage(struct page *page,
unsigned int len;
struct buffer_head *page_bufs = NULL;
struct inode *inode = page->mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(page->mapping->host->i_sb);

trace_ext4_writepage(inode, page);
size = i_size_read(inode);
@@ -2726,6 +2727,8 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_CACHE_SIZE;

+ atomic_set(&sbi->unflushed_writes, 1);
+
if (page_has_buffers(page)) {
page_bufs = page_buffers(page);
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
@@ -2872,6 +2875,8 @@ static int ext4_da_writepages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;

+ atomic_set(&sbi->unflushed_writes, 1);
+
range_cyclic = wbc->range_cyclic;
if (wbc->range_cyclic) {
index = mapping->writeback_index;


2010-05-04 00:57:50

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, 2010-04-29 at 16:51 -0700, Darrick J. Wong wrote:
> Hmm. A while ago I was complaining that an evil program that calls fsync() in
> a loop will send a continuous stream of write barriers to the hard disk. Ted
> theorized that it might be possible to set a flag in ext4_writepage and clear
> it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
> set (meaning that nothing has been dirtied since the last fsync()) then we
> could skip issuing the barrier.
>
> Here's an experimental patch to do something sort of like that. From a quick
> run with blktrace, it seems to skip the redundant barriers and improves the ffsb
> mail server scores. However, I haven't done extensive power failure testing to
> see how much data it can destroy. For that matter I'm not even 100% sure it's
> correct at what it aims to do.
>
> Just throwing this out there, though. Nothing's blown up ... yet. :P
> ---
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> fs/ext4/ext4.h | 2 ++
> fs/ext4/fsync.c | 7 +++++--
> fs/ext4/inode.c | 5 +++++
> 3 files changed, 12 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bf938cf..3b70195 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1025,6 +1025,8 @@ struct ext4_sb_info {
>
> /* workqueue for dio unwritten */
> struct workqueue_struct *dio_unwritten_wq;
> +
> + atomic_t unflushed_writes;
> };
>

Just wondering, is this a per-filesystem flag? I thought it would be nicer to make
this a per-inode flag: when there is no dirty data in flight for this inode
(instead of the whole fs), there is no need to issue a barrier in
ext4_sync_file().

Mingming
> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 0d0c323..441f872 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -52,7 +52,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> {
> struct inode *inode = dentry->d_inode;
> struct ext4_inode_info *ei = EXT4_I(inode);
> - journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + journal_t *journal = sbi->s_journal;
> int ret;
> tid_t commit_tid;
...

> @@ -102,7 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> (journal->j_flags & JBD2_BARRIER))
> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> jbd2_log_wait_commit(journal, commit_tid);
> - } else if (journal->j_flags & JBD2_BARRIER)
> + } else if (journal->j_flags & JBD2_BARRIER && atomic_read(&sbi->unflushed_writes)) {
> + atomic_set(&sbi->unflushed_writes, 0);
> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> + }
> return ret;
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5381802..e501abd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2718,6 +2718,7 @@ static int ext4_writepage(struct page *page,
> unsigned int len;
> struct buffer_head *page_bufs = NULL;
> struct inode *inode = page->mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(page->mapping->host->i_sb);
>
> trace_ext4_writepage(inode, page);
> size = i_size_read(inode);
> @@ -2726,6 +2727,8 @@ static int ext4_writepage(struct page *page,
> else
> len = PAGE_CACHE_SIZE;
>
> + atomic_set(&sbi->unflushed_writes, 1);
> +
> if (page_has_buffers(page)) {
> page_bufs = page_buffers(page);
> if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
> @@ -2872,6 +2875,8 @@ static int ext4_da_writepages(struct address_space *mapping,
> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> range_whole = 1;
>
> + atomic_set(&sbi->unflushed_writes, 1);
> +
> range_cyclic = wbc->range_cyclic;
> if (wbc->range_cyclic) {
> index = mapping->writeback_index;



2010-05-04 14:16:37

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 05/03/2010 08:57 PM, Mingming Cao wrote:
> On Thu, 2010-04-29 at 16:51 -0700, Darrick J. Wong wrote:
>
>> Hmm. A while ago I was complaining that an evil program that calls fsync() in
>> a loop will send a continuous stream of write barriers to the hard disk. Ted
>> theorized that it might be possible to set a flag in ext4_writepage and clear
>> it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
>> set (meaning that nothing has been dirtied since the last fsync()) then we
>> could skip issuing the barrier.
>>
>> Here's an experimental patch to do something sort of like that. From a quick
>> run with blktrace, it seems to skip the redundant barriers and improves the ffsb
>> mail server scores. However, I haven't done extensive power failure testing to
>> see how much data it can destroy. For that matter I'm not even 100% sure it's
>> correct at what it aims to do.
>>
>> Just throwing this out there, though. Nothing's blown up ... yet. :P
>> ---
>> Signed-off-by: Darrick J. Wong<[email protected]>
>> ---
>>
>> fs/ext4/ext4.h | 2 ++
>> fs/ext4/fsync.c | 7 +++++--
>> fs/ext4/inode.c | 5 +++++
>> 3 files changed, 12 insertions(+), 2 deletions(-)
>>
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index bf938cf..3b70195 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1025,6 +1025,8 @@ struct ext4_sb_info {
>>
>> /* workqueue for dio unwritten */
>> struct workqueue_struct *dio_unwritten_wq;
>> +
>> + atomic_t unflushed_writes;
>> };
>>
>>
> Just wondering is this per filesystem flag? Thought it is nicer to make
> this per -inode flag, when there is no dirty data in fly for this inode
> (instead of the whole fs), there is no need to call barrier in
> ext4_sync_file().
>
> Mingming
>

Checking per inode is actually incorrect - we do not want to short-cut
the need to flush the target storage device's write cache just because a
specific file has no dirty pages. If a power hit occurs, having sent
the pages to the storage device is not sufficient.

I was thinking that it could actually be more general, specifically we
could track the status of the write cache on the entire storage device.
That way, any command (write, etc) to the target device would set the
cache state to needs_flush (or whatever) and the barrier flush would
clear it.
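
As an illustration only, the device-wide tracking described above could look roughly like the sketch below. None of this is an existing kernel interface; the flag placement and helper names are hypothetical, and blkdev_issue_flush() is shown with the two-argument signature used by the patch earlier in this thread.

/* Hypothetical sketch: per-device write-cache state tracking. */
struct bdev_cache_state {
	atomic_t needs_flush;	/* set by any write, cleared by a flush */
};

/* called whenever a write is submitted to the device */
static void note_write_submitted(struct bdev_cache_state *cs)
{
	atomic_set(&cs->needs_flush, 1);
}

/* called from fsync and friends */
static int flush_if_needed(struct bdev_cache_state *cs,
			   struct block_device *bdev)
{
	if (!atomic_xchg(&cs->needs_flush, 0))
		return 0;	/* cache believed clean, skip the flush */
	return blkdev_issue_flush(bdev, NULL);
}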

Probably not worth the complication...

ric


>> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
>> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
>> index 0d0c323..441f872 100644
>> --- a/fs/ext4/fsync.c
>> +++ b/fs/ext4/fsync.c
>> @@ -52,7 +52,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
>> {
>> struct inode *inode = dentry->d_inode;
>> struct ext4_inode_info *ei = EXT4_I(inode);
>> - journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> + journal_t *journal = sbi->s_journal;
>> int ret;
>> tid_t commit_tid;
>>
> ...
>
>
>> @@ -102,7 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> >> (journal->j_flags & JBD2_BARRIER))
>> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>> jbd2_log_wait_commit(journal, commit_tid);
> >> - } else if (journal->j_flags & JBD2_BARRIER)
> >> + } else if (journal->j_flags & JBD2_BARRIER && atomic_read(&sbi->unflushed_writes)) {
>> + atomic_set(&sbi->unflushed_writes, 0);
>> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>> + }
>> return ret;
>> }
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 5381802..e501abd 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2718,6 +2718,7 @@ static int ext4_writepage(struct page *page,
>> unsigned int len;
>> struct buffer_head *page_bufs = NULL;
>> struct inode *inode = page->mapping->host;
>> + struct ext4_sb_info *sbi = EXT4_SB(page->mapping->host->i_sb);
>>
>> trace_ext4_writepage(inode, page);
>> size = i_size_read(inode);
>> @@ -2726,6 +2727,8 @@ static int ext4_writepage(struct page *page,
>> else
>> len = PAGE_CACHE_SIZE;
>>
>> + atomic_set(&sbi->unflushed_writes, 1);
>> +
>> if (page_has_buffers(page)) {
>> page_bufs = page_buffers(page);
>> if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
>> @@ -2872,6 +2875,8 @@ static int ext4_da_writepages(struct address_space *mapping,
> >> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>> range_whole = 1;
>>
>> + atomic_set(&sbi->unflushed_writes, 1);
>> +
>> range_cyclic = wbc->range_cyclic;
>> if (wbc->range_cyclic) {
>> index = mapping->writeback_index;

2010-05-04 15:45:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
> Checking per inode is actually incorrect - we do not want to short cut
> the need to flush the target storage device's write cache just because a
> specific file has no dirty pages. If a power hit occurs, having sent
> the pages from to the storage device is not sufficient.

As long as we're only using the information for fsync, doing it per inode
is the correct thing. We only want to flush the cache if the inode
(data or metadata) is dirty in some way. Note that this includes writes
via O_DIRECT, which are quite different to track - I've not found the
original patch in my mbox so I can't comment on whether this is done right.

It might be a good idea to track this information directly in the
writeback/direct I/O code so that we don't have to reimplement it for
every filesystem, btw.


2010-05-04 19:49:32

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, 2010-05-04 at 10:16 -0400, Ric Wheeler wrote:
> On 05/03/2010 08:57 PM, Mingming Cao wrote:
> > On Thu, 2010-04-29 at 16:51 -0700, Darrick J. Wong wrote:
> >
> >> Hmm. A while ago I was complaining that an evil program that calls fsync() in
> >> a loop will send a continuous stream of write barriers to the hard disk. Ted
> >> theorized that it might be possible to set a flag in ext4_writepage and clear
> >> it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
> >> set (meaning that nothing has been dirtied since the last fsync()) then we
> >> could skip issuing the barrier.
> >>
> >> Here's an experimental patch to do something sort of like that. From a quick
> >> run with blktrace, it seems to skip the redundant barriers and improves the ffsb
> >> mail server scores. However, I haven't done extensive power failure testing to
> >> see how much data it can destroy. For that matter I'm not even 100% sure it's
> >> correct at what it aims to do.
> >>
> >> Just throwing this out there, though. Nothing's blown up ... yet. :P
> >> ---
> >> Signed-off-by: Darrick J. Wong<[email protected]>
> >> ---
> >>
> >> fs/ext4/ext4.h | 2 ++
> >> fs/ext4/fsync.c | 7 +++++--
> >> fs/ext4/inode.c | 5 +++++
> >> 3 files changed, 12 insertions(+), 2 deletions(-)
> >>
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index bf938cf..3b70195 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -1025,6 +1025,8 @@ struct ext4_sb_info {
> >>
> >> /* workqueue for dio unwritten */
> >> struct workqueue_struct *dio_unwritten_wq;
> >> +
> >> + atomic_t unflushed_writes;
> >> };
> >>
> >>
> > Just wondering is this per filesystem flag? Thought it is nicer to make
> > this per -inode flag, when there is no dirty data in fly for this inode
> > (instead of the whole fs), there is no need to call barrier in
> > ext4_sync_file().
> >
> > Mingming
> >
>
> Checking per inode is actually incorrect - we do not want to short cut
> the need to flush the target storage device's write cache just because a
> specific file has no dirty pages. If a power hit occurs, having sent
> the pages from to the storage device is not sufficient.
>

hmm... My understanding is that the ext3/4 implementation of fsync syncs the
whole filesystem: since a jbd2 transaction could include metadata updates
from other files, jbd2 has to commit the latest transactions. But the
caller is fsync(), which should only need to ensure that the specified
inode's dirty data/metadata gets to disk by sending barriers down.

Mingming

> I was thinking that it could actually be more general, specifically we
> could track the status of the write cache on the entire storage device.
> That way, any command (write, etc) to the target device would set the
> cache state to needs_flush (or whatever) and the barrier flush would
> clear it.
>
> Probably not worth the complication...
>
> ric
>
>
> >> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> >> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> >> index 0d0c323..441f872 100644
> >> --- a/fs/ext4/fsync.c
> >> +++ b/fs/ext4/fsync.c
> >> @@ -52,7 +52,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> >> {
> >> struct inode *inode = dentry->d_inode;
> >> struct ext4_inode_info *ei = EXT4_I(inode);
> >> - journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> >> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> >> + journal_t *journal = sbi->s_journal;
> >> int ret;
> >> tid_t commit_tid;
> >>
> > ...
> >
> >
> >> @@ -102,7 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> >> (journal->j_flags & JBD2_BARRIER))
> >> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> >> jbd2_log_wait_commit(journal, commit_tid);
> >> - } else if (journal->j_flags & JBD2_BARRIER)
> >> + } else if (journal->j_flags & JBD2_BARRIER && atomic_read(&sbi->unflushed_writes)) {
> >> + atomic_set(&sbi->unflushed_writes, 0);
> >> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> >> + }
> >> return ret;
> >> }
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 5381802..e501abd 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -2718,6 +2718,7 @@ static int ext4_writepage(struct page *page,
> >> unsigned int len;
> >> struct buffer_head *page_bufs = NULL;
> >> struct inode *inode = page->mapping->host;
> >> + struct ext4_sb_info *sbi = EXT4_SB(page->mapping->host->i_sb);
> >>
> >> trace_ext4_writepage(inode, page);
> >> size = i_size_read(inode);
> >> @@ -2726,6 +2727,8 @@ static int ext4_writepage(struct page *page,
> >> else
> >> len = PAGE_CACHE_SIZE;
> >>
> >> + atomic_set(&sbi->unflushed_writes, 1);
> >> +
> >> if (page_has_buffers(page)) {
> >> page_bufs = page_buffers(page);
> >> if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
> >> @@ -2872,6 +2875,8 @@ static int ext4_da_writepages(struct address_space *mapping,
> >> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> >> range_whole = 1;
> >>
> >> + atomic_set(&sbi->unflushed_writes, 1);
> >> +
> >> range_cyclic = wbc->range_cyclic;
> >> if (wbc->range_cyclic) {
> >> index = mapping->writeback_index;



2010-06-29 20:51:16

by djwong

[permalink] [raw]
Subject: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hmm. A while ago I was complaining that an evil program that calls fsync() in
a loop will send a continuous stream of write barriers to the hard disk. Ted
theorized that it might be possible to set a flag in ext4_writepage and clear
it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
set (meaning that nothing has been dirtied since the last fsync()) then we
could skip issuing the barrier.

Here's an experimental patch to do something sort of like that. From a quick
run with blktrace, it seems to skip the redundant barriers and improve the ffsb
mail server scores. However, I haven't done extensive power failure testing to
see how much data it can destroy. For that matter, I'm not even 100% sure it's
correct in what it aims to do.

This second version of the patch uses the inode state flags and (suboptimally)
also catches directio writes. It might be a better idea to try to coordinate
all the barrier requests across the whole filesystem, though that's a bit more
difficult.

Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/ext4/ext4.h | 1 +
fs/ext4/fsync.c | 5 ++++-
fs/ext4/inode.c | 7 +++++++
3 files changed, 12 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..d2e8e40 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1181,6 +1181,7 @@ enum {
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
+ EXT4_STATE_DIRTY_DATA, /* dirty data, need barrier */
};

#define EXT4_INODE_BIT_FNS(name, field) \
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..96625c3 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -130,8 +130,11 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
+ } else if (journal->j_flags & JBD2_BARRIER &&
+ ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
BLKDEV_IFL_WAIT);
+ ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+ }
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..486d349 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2685,6 +2685,8 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_CACHE_SIZE;

+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
if (page_has_buffers(page)) {
page_bufs = page_buffers(page);
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
@@ -2948,6 +2950,8 @@ static int ext4_da_writepages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;

+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
range_cyclic = wbc->range_cyclic;
if (wbc->range_cyclic) {
index = mapping->writeback_index;
@@ -3996,6 +4000,9 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;

+ if (rw == WRITE)
+ ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
return ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs);


2010-06-30 12:48:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
> > Checking per inode is actually incorrect - we do not want to short cut
> > the need to flush the target storage device's write cache just because a
> > specific file has no dirty pages. If a power hit occurs, having sent
> > the pages from to the storage device is not sufficient.
>
> As long as we're only using the information for fsync doing it per inode
> is the correct thing. We only want to flush the cache if the inode
> (data or metadata) is dirty in some way. Note that this includes writes
> via O_DIRECT which are quite different to track - I've not found the
> original patch in my mbox so I can't comment if this is done right.

I agree.

I wonder if it's worthwhile to think about a new system call which
allows users to provide an array of fd's which should collectively
be fsync'ed out at the same time. Otherwise, we end up issuing
multiple barrier operations in cases where the application needs to
do:

fsync(control_fd);
fsync(data_fd);

- Ted

2010-06-30 13:21:12

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 08:48 AM, [email protected] wrote:
> On Tue, May 04, 2010 at 11:45:53AM -0400, Christoph Hellwig wrote:
>> On Tue, May 04, 2010 at 10:16:37AM -0400, Ric Wheeler wrote:
>>> Checking per inode is actually incorrect - we do not want to short cut
>>> the need to flush the target storage device's write cache just because a
>>> specific file has no dirty pages. If a power hit occurs, having sent
>>> the pages from to the storage device is not sufficient.
>>
>> As long as we're only using the information for fsync doing it per inode
>> is the correct thing. We only want to flush the cache if the inode
>> (data or metadata) is dirty in some way. Note that this includes writes
>> via O_DIRECT which are quite different to track - I've not found the
>> original patch in my mbox so I can't comment if this is done right.
>
> I agree.
>
> I wonder if it's worthwhile to think about a new system call which
> allows users to provide an array of fd's which are collectively should
> be fsync'ed out at the same time. Otherwise, we end up issuing
> multiple barrier operations in cases where the application needs to
> do:
>
> fsync(control_fd);
> fsync(data_fd);
>
> - Ted

The problem with not issuing a cache flush when you have dirty metadata or data
is that it does not have any tie to the state of the volatile write cache of the
target storage device.

We do need to have fsync() issue the cache flush command even when there is no
dirty state for the inode in our local page cache in order to flush data that
was pushed out/cleaned and not followed by a flush.

It would definitely be *very* useful to have an array of fd's that all need to
be fsync()'ed at the same time....

Ric


2010-06-30 13:44:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>
> The problem with not issuing a cache flush when you have dirty meta
> data or data is that it does not have any tie to the state of the
> volatile write cache of the target storage device.

We track whether or not there is any metadata updates associated with
the inode already; if it does, we force a journal commit, and this
implies a barrier operation.

The case we're talking about here is one where either (a) there is no
journal, or (b) there have been no metadata updates (I'm simplifying a
little here; in fact we track whether there have been fdatasync()- vs
fsync()- worthy metadata updates), and so there hasn't been a journal
commit to do the cache flush.

In this case, we want to track when the last fsync() was issued, versus
when the data blocks for a particular inode were last pushed out to
disk.

To use an example I used as motivation for why we might want an
fsync2(int fd[], int flags[], int num) syscall, consider the situation
of:

fsync(control_fd);
fdatasync(data_fd);

The first fsync() will have executed a cache flush operation. So when
we do the fdatasync() (assuming that no metadata needs to be flushed
out to disk), there is no need for the cache flush operation.

If we had an enhanced fsync command, we would also be able to
eliminate a second journal commit in the case where data_fd also had
some metadata that needed to be flushed out to disk.

> It would definitely be *very* useful to have an array of fd's that
> all need fsync()'ed at home time....

Yes, but it would require applications to change their code.

One thing that I would like about a new fsync2() system call is that with a
flags field, we could add some new, more expressive flags:

#define FSYNC_DATA 0x0001 /* Only flush metadata if needed to access data */
#define FSYNC_NOWAIT 0x0002 /* Initiate the flush operations but don't wait
for them to complete */
#define FSYNC_NOBARRIER 0x004 /* FS may skip the barrier if not needed for fs
consistency */

etc.
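
To make the intent concrete, here is a hedged sketch of how an application might use the proposed interface; neither fsync2() nor these flags exist, and the names are taken directly from the proposal above:

/* Hypothetical usage of the proposed fsync2() batching syscall. */
int fds[2]   = { control_fd, data_fd };
int flags[2] = { 0, FSYNC_DATA };	/* data_fd only needs fdatasync-like semantics */
int err = fsync2(fds, flags, 2);	/* a single cache flush covers both files */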

- Ted

2010-06-30 13:54:41

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 09:44 AM, [email protected] wrote:
> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
>>
>> The problem with not issuing a cache flush when you have dirty meta
>> data or data is that it does not have any tie to the state of the
>> volatile write cache of the target storage device.
>
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.

I think that the state that we want to track is the last time the write cache on
the target device has been flushed. If the last fsync() did do a full barrier,
that would be equivalent :-)

ric

>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
>
>> It would definitely be *very* useful to have an array of fd's that
>> all need fsync()'ed at home time....
>
> Yes, but it would require applications to change their code.
>
> One thing that I would like about a new fsync2() system call is with a
> flags field, we could add some new, more expressive flags:
>
> #define FSYNC_DATA 0x0001 /* Only flush metadata if needed to access data */
> #define FSYNC_NOWAIT 0x0002 /* Initiate the flush operations but don't wait
> for them to complete */
> #define FSYNC_NOBARRER 0x004 /* FS may skip the barrier if not needed for fs
> consistency */
>
> etc.
>
> - Ted


2010-06-30 19:05:18

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 2010-06-30, at 07:54, Ric Wheeler wrote:
> On 06/30/2010 09:44 AM, [email protected] wrote:
>> We track whether or not there is any metadata updates associated with
>> the inode already; if it does, we force a journal commit, and this
>> implies a barrier operation.
>>
>> The case we're talking about here is one where either (a) there is no
>> journal, or (b) there have been no metadata updates (I'm simplifying a
>> little here; in fact we track whether there have been fdatasync()- vs
>> fsync()- worthy metadata updates), and so there hasn't been a journal
>> commit to do the cache flush.
>>
>> In this case, we want to track when is the last time an fsync() has
>> been issued, versus when was the last time data blocks for a
>> particular inode have been pushed out to disk.
>
> I think that the state that we want to track is the last time the write cache on the target device has been flushed. If the last fsync() did do a full barrier, that would be equivalent :-)

We had a similar problem in Lustre, where we want to ensure the integrity of some data on disk, but don't want to force an extra journal commit/barrier if there was already one since the time the write was submitted and before we need it to be on disk.

We fixed this in a similar manner, but it is optimized somewhat. In your case there is a flag on the inode in question, but you should also register a journal commit callback after the IO has been submitted that clears the flag when the journal commits (which also implies a barrier). This avoids a gratuitous barrier if fsync() is called on this (or any other similarly marked) inode after the journal has already issued the barrier.
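
A rough sketch of that idea, building on the EXT4_STATE_DIRTY_DATA flag from the v2 patch; the callback registration shown here is a hypothetical hook, not an existing jbd2 interface:

/* Illustrative only: clear the inode's "needs barrier" flag once the
 * journal commit (which itself implies a barrier) has completed. */
static void dirty_data_commit_callback(journal_t *journal, void *data)
{
	struct inode *inode = data;

	ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
}

/* after submitting the inode's data I/O: */
ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
/* hypothetical registration of a per-commit callback */
register_commit_callback(journal, dirty_data_commit_callback, inode);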

The best part is that this gives "POSIXly correct" semantics for applications that are issuing the f{,data}sync() on the modified files, without penalizing them again if the journal happened to do this already in the background in aggregate.

Cheers, Andreas






2010-07-21 17:16:11

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Hi,

> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> >
> > The problem with not issuing a cache flush when you have dirty meta
> > data or data is that it does not have any tie to the state of the
> > volatile write cache of the target storage device.
>
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.
>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
> fsync(control_fd);
> fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation. So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
Current implementation already avoids the journal commit because of
fdatasync(data_fd). We remember the transaction ID when inode metadata has
last been updated and do not force a transaction commit if it is already
committed. Thus the first fsync might force a transaction commit but the second
fdatasync likely won't.
We could actually improve the scheme to work for data as well. I wrote
proof-of-concept patches (attached) and they nicely avoid the second barrier
when doing:
echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1

Ted, would you be interested in something like this?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
0001-block-Introduce-barrier-counters.patch (2.05 kB)
0002-ext4-Send-barriers-on-fsync-only-when-needed.patch (3.04 kB)

2010-08-03 00:09:44

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> Hi,
>
> > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > >
> > > The problem with not issuing a cache flush when you have dirty meta
> > > data or data is that it does not have any tie to the state of the
> > > volatile write cache of the target storage device.
> >
> > We track whether or not there is any metadata updates associated with
> > the inode already; if it does, we force a journal commit, and this
> > implies a barrier operation.
> >
> > The case we're talking about here is one where either (a) there is no
> > journal, or (b) there have been no metadata updates (I'm simplifying a
> > little here; in fact we track whether there have been fdatasync()- vs
> > fsync()- worthy metadata updates), and so there hasn't been a journal
> > commit to do the cache flush.
> >
> > In this case, we want to track when is the last time an fsync() has
> > been issued, versus when was the last time data blocks for a
> > particular inode have been pushed out to disk.
> >
> > To use an example I used as motivation for why we might want an
> > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > of:
> >
> > fsync(control_fd);
> > fdatasync(data_fd);
> >
> > The first fsync() will have executed a cache flush operation. So when
> > we do the fdatasync() (assuming that no metadata needs to be flushed
> > out to disk), there is no need for the cache flush operation.
> >
> > If we had an enhanced fsync command, we would also be able to
> > eliminate a second journal commit in the case where data_fd also had
> > some metadata that needed to be flushed out to disk.
> Current implementation already avoids journal commit because of
> fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> last been updated and do not force a transaction commit if it is already
> committed. Thus the first fsync might force a transaction commit but second
> fdatasync likely won't.
> We could actually improve the scheme to work for data as well. I wrote
> a proof-of-concept patches (attached) and they nicely avoid second barrier
> when doing:
> echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
>
> Ted, would you be interested in something like this?

Well... on my fsync-happy workloads, this seems to cut the barrier count down
by about 20%, and speeds it up by about 20%.

I also have a patch to ext4_sync_files that batches the fsync requests together
for a further 20% decrease in barrier IOs, which makes it run another 20%
faster. I'll send that one out shortly, though I've not safety-tested it at
all.

--D

2010-08-03 09:01:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.

Care to share the test case for this? I'd be especially interested in
how it behaves with non-draining barriers / cache flushes in fsync.


2010-08-03 13:22:27

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Mon 02-08-10 17:09:39, Darrick J. Wong wrote:
> On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> > Hi,
> >
> > > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > > >
> > > > The problem with not issuing a cache flush when you have dirty meta
> > > > data or data is that it does not have any tie to the state of the
> > > > volatile write cache of the target storage device.
> > >
> > > We track whether or not there is any metadata updates associated with
> > > the inode already; if it does, we force a journal commit, and this
> > > implies a barrier operation.
> > >
> > > The case we're talking about here is one where either (a) there is no
> > > journal, or (b) there have been no metadata updates (I'm simplifying a
> > > little here; in fact we track whether there have been fdatasync()- vs
> > > fsync()- worthy metadata updates), and so there hasn't been a journal
> > > commit to do the cache flush.
> > >
> > > In this case, we want to track when is the last time an fsync() has
> > > been issued, versus when was the last time data blocks for a
> > > particular inode have been pushed out to disk.
> > >
> > > To use an example I used as motivation for why we might want an
> > > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > > of:
> > >
> > > fsync(control_fd);
> > > fdatasync(data_fd);
> > >
> > > The first fsync() will have executed a cache flush operation. So when
> > > we do the fdatasync() (assuming that no metadata needs to be flushed
> > > out to disk), there is no need for the cache flush operation.
> > >
> > > If we had an enhanced fsync command, we would also be able to
> > > eliminate a second journal commit in the case where data_fd also had
> > > some metadata that needed to be flushed out to disk.
> > Current implementation already avoids journal commit because of
> > fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> > last been updated and do not force a transaction commit if it is already
> > committed. Thus the first fsync might force a transaction commit but second
> > fdatasync likely won't.
> > We could actually improve the scheme to work for data as well. I wrote
> > a proof-of-concept patches (attached) and they nicely avoid second barrier
> > when doing:
> > echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
> >
> > Ted, would you be interested in something like this?
>
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.
Nice, thanks for measurement.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-03 13:24:57

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 06/30/2010 03:48 PM, [email protected] wrote:
>
> I wonder if it's worthwhile to think about a new system call which
> allows users to provide an array of fd's which are collectively should
> be fsync'ed out at the same time. Otherwise, we end up issuing
> multiple barrier operations in cases where the application needs to
> do:
>
> fsync(control_fd);
> fsync(data_fd);
>

The system call exists, it's called io_submit().

--
error compiling committee.c: too many arguments to function


2010-08-04 18:16:24

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> > Well... on my fsync-happy workloads, this seems to cut the barrier count down
> > by about 20%, and speeds it up by about 20%.
>
> Care to share the test case for this? I'd be especially interesting on
> how it behaves with non-draining barriers / cache flushes in fsync.

Sure. When I run blktrace with the ffsb profile, I get these results:

With Jan's patch:
barriers  transactions/sec
16212     206
15625     201
10442     269
10870     266
15658     201

Without Jan's patch:
barriers  transactions/sec
20855     177
20963     177
20340     174
20908     177

The two ~270 results are a little odd... if we ignore them, the net gain with
Jan's patch is about a 25% reduction in barriers issued and about a 15%
increase in tps. (If we don't, it's ~30% and ~30%, respectively.) That said,
I was running mkfs between runs, so it's possible that the disk layout could
have shifted a bit. If I turn off the fsync parts of the ffsb profile, the
barrier counts drop to about a couple every second or so, which means that
Jan's patch doesn't have much of an effect. But it does help if someone is
hammering on the filesystem with fsync.

The ffsb profile is attached below.

--D

-----------

time=300
alignio=1
directio=1

[filesystem0]
location=/mnt/
num_files=100000
num_dirs=1000

reuse=1
# File sizes range from 1kB to 1MB.
size_weight 1KB 10
size_weight 2KB 15
size_weight 4KB 16
size_weight 8KB 16
size_weight 16KB 15
size_weight 32KB 10
size_weight 64KB 8
size_weight 128KB 4
size_weight 256KB 3
size_weight 512KB 2
size_weight 1MB 1

create_blocksize=1048576
[end0]

[threadgroup0]
num_threads=64

readall_weight=4
create_fsync_weight=2
delete_weight=1

append_weight = 1
append_fsync_weight = 1
stat_weight = 1
create_weight = 1
writeall_weight = 1
writeall_fsync_weight = 1
open_close_weight = 1


write_size=64KB
write_blocksize=512KB

read_size=64KB
read_blocksize=512KB

[stats]
enable_stats=1
enable_range=1

msec_range 0.00 0.01
msec_range 0.01 0.02
msec_range 0.02 0.05
msec_range 0.05 0.10
msec_range 0.10 0.20
msec_range 0.20 0.50
msec_range 0.50 1.00
msec_range 1.00 2.00
msec_range 2.00 5.00
msec_range 5.00 10.00
msec_range 10.00 20.00
msec_range 20.00 50.00
msec_range 50.00 100.00
msec_range 100.00 200.00
msec_range 200.00 500.00
msec_range 500.00 1000.00
msec_range 1000.00 2000.00
msec_range 2000.00 5000.00
msec_range 5000.00 10000.00
[end]
[end0]

2010-08-04 23:32:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Aug 03, 2010 at 04:24:49PM +0300, Avi Kivity wrote:
> On 06/30/2010 03:48 PM, [email protected] wrote:
> >
> >I wonder if it's worthwhile to think about a new system call which
> >allows users to provide an array of fd's which are collectively should
> >be fsync'ed out at the same time. Otherwise, we end up issuing
> >multiple barrier operations in cases where the application needs to
> >do:
> >
> > fsync(control_fd);
> > fsync(data_fd);
> >
>
> The system call exists, it's called io_submit().

Um, not the same thing at all.

- Ted

2010-08-05 02:20:23

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 08/05/2010 02:32 AM, Ted Ts'o wrote:
> On Tue, Aug 03, 2010 at 04:24:49PM +0300, Avi Kivity wrote:
>> On 06/30/2010 03:48 PM, [email protected] wrote:
>>> I wonder if it's worthwhile to think about a new system call which
>>> allows users to provide an array of fd's which are collectively should
>>> be fsync'ed out at the same time. Otherwise, we end up issuing
>>> multiple barrier operations in cases where the application needs to
>>> do:
>>>
>>> fsync(control_fd);
>>> fsync(data_fd);
>>>
>> The system call exists, it's called io_submit().
> Um, not the same thing at all.

Why not? To be clear, I'm talking about an io_submit() with multiple
IO_CMD_FSYNC requests, with a kernel implementation that is able to
batch these requests.
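
For concreteness, batching several fsyncs through the AIO interface would look roughly like the sketch below. io_prep_fsync(), io_setup(), io_submit() and io_getevents() are real libaio calls (link with -laio), but kernel support for the fsync iocb commands has historically been limited, so treat this strictly as an illustration of the interface being discussed.

/* Sketch: submit fsyncs for several fds in one io_submit() call. */
#include <libaio.h>

static int batch_fsync(int *fds, int nr)
{
	io_context_t ctx = 0;
	struct iocb iocbs[nr], *iocbp[nr];
	struct io_event events[nr];
	int i, ret;

	if (io_setup(nr, &ctx) < 0)
		return -1;
	for (i = 0; i < nr; i++) {
		io_prep_fsync(&iocbs[i], fds[i]);
		iocbp[i] = &iocbs[i];
	}
	ret = io_submit(ctx, nr, iocbp);	/* the kernel could batch these */
	if (ret == nr)
		ret = io_getevents(ctx, nr, nr, events, NULL);
	io_destroy(ctx);
	return ret < 0 ? -1 : 0;
}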

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


2010-08-05 16:17:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 05:20:12AM +0300, Avi Kivity wrote:
>
> Why not? To be clear, I'm talking about an io_submit() with
> multiple IO_CMD_FSYNC requests, with a kernel implementation that is
> able to batch these requests.

IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
to add a new syscall. I find the aio interface to be horribly
complicated, and it would mean that programs would have to link
against libaio, which again isn't my favorite set of interfaces.

All of that being said, I do agree that adding a new IO_CMD_FSYNC,
IO_CMD_FSYNCDATA, IO_CMD_FSYNC_NOBARRIER, and
IO_CMD_FSYNCDATA_NOBARRIER would be the simplest thing to do from a
kernel implementation perspective.

- Ted

2010-08-05 16:40:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
>
> This second version of the patch uses the inode state flags and
> (suboptimally) also catches directio writes. It might be a better
> idea to try to coordinate all the barrier requests across the whole
> filesystem, though that's a bit more difficult.

Hi Darrick,

When I looked at this patch more closely, and thought about it hard,
the fact that this helps the FFSB mail server benchmark surprised me,
and then I realized it's because it doesn't really accurately emulate
a mail server at all. Or at least, not an MTA. In an MTA, only one CPU
will touch a queue file, so there should never be a case of a double
fsync to a single file. This is why I was thinking about
coordinating barrier requests across the whole filesystem --- it helps
out in the case where you have all your CPU threads hammering
/var/spool/mqueue, or /var/spool/exim4/input, and where they are all
creating queue files, and calling fsync() in parallel. This patch
won't help that case.

It will help the case of an MDA --- Mail Delivery Agent --- if you have
multiple e-mails all getting delivered at the same time into the same
/var/mail/<username> file, with an fsync() following after a mail
message is appended to the file. This is a much rarer case, and I
can't think of any other workload where you will have multiple
processes racing against each other and fsync'ing the same inode.
Even in the MDA case, it's rare that you will have one mbox getting so
many deliveries that this case would be hit.

So while I was thinking about accepting this patch, I now find myself
hesitating. There _is_ a minor race in the patch that I noticed,
which I'll point out below, but that's easily fixed. The bigger issue
is that it's not clear this patch will actually make a difference in the
real world. I'm trying and failing to think of a real-life application
which is stupid enough to do back-to-back fsync commands, even if it's
because it has multiple threads all trying to write to the file and
fsync to it in an uncoordinated fashion. It would be easy enough to
add instrumentation that would trigger a printk if the patch optimized
out a barrier --- and if someone can point out even one badly written
application --- whether it's mysql, postgresql, a GNOME or KDE
application, db2, Oracle, etc., I'd say sure. But adding even a tiny
amount of extra complexity for something which is _only_ helpful for a
benchmark grates against my soul....

So if you can think of something, please point it out to me. If it
would help ext4 users in real life, I'd be all for it. But at this
point, I'm thinking that perhaps the real issue is that the mail
server benchmark isn't accurately reflecting a real life workload.

Am I missing something?

- Ted

> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 592adf2..96625c3 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -130,8 +130,11 @@ int ext4_sync_file(struct file *file, int datasync)
> blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
> NULL, BLKDEV_IFL_WAIT);
> ret = jbd2_log_wait_commit(journal, commit_tid);
> - } else if (journal->j_flags & JBD2_BARRIER)
> + } else if (journal->j_flags & JBD2_BARRIER &&
> + ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
> blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
> BLKDEV_IFL_WAIT);
> + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
> + }
> return ret;

This is the minor race I was talking about; you should move the
ext4_clear_inode_state() call above blkdev_issue_flush(). If there is
a race, you want to fail safe, by accidentally issuing a second
barrier, instead of possibly skipping a barrier if a page gets dirtied
*after* the blkdev_issue_flush() has taken effect, but *before* we
have a chance to clear the EXT4_STATE_DIRTY_DATA flag.
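
Concretely, the suggested reordering of the quoted hunk would look something like this (illustration only, based on the v2 patch above):

	} else if (journal->j_flags & JBD2_BARRIER &&
		   ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
		/* Clear the flag before flushing: a page dirtied while the
		 * flush is in progress re-sets it, so the worst case is an
		 * extra barrier rather than a skipped one. */
		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
				   BLKDEV_IFL_WAIT);
	}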

BTW, my apologies for not looking at this sooner, and giving you this
feedback earlier. This summer has been crazy busy, and I didn't have
time until the merge window provided a forcing function to look at
outstanding patches.

2010-08-05 16:45:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

P.S. If it wasn't clear, I'm still in favor of trying to coordinate
barriers across the whole file system, since that is much more likely
to help use cases that arise in real life.

- Ted

2010-08-05 19:13:52

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

"Ted Ts'o" <[email protected]> writes:

> On Thu, Aug 05, 2010 at 05:20:12AM +0300, Avi Kivity wrote:
>>
>> Why not? To be clear, I'm talking about an io_submit() with
>> multiple IO_CMD_FSYNC requests, with a kernel implementation that is
>> able to batch these requests.
>
> IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have

Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
what's requested. If I understand correctly, what is requested is a
mechanism to flush out all data for multiple file descriptors and follow
that with a single barrier/flush (and yes, Ted did give a summary of the
commands that would be required to accomplish that).

There still remains the question of why this should be tied to the AIO
submission interface.


Cheers,
Jeff

2010-08-05 20:39:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 03:13:44PM -0400, Jeff Moyer wrote:
> > IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
>
> Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
> what's requested. If I understand correctly, what is requested is a
> mechanism to flush out all data for multiple file descriptors and follow
> that with a single barrier/flush (and yes, Ted did give a summary of the
> commands that would be required to accomplish that).
>
> There still remains the question of why this should be tied to the AIO
> submission interface.

I don't think it should, personally. The only excuse might be if
someone wanted to do an asynchronous fsync(), but I don't think that
makes sense in most cases.

- Ted

2010-08-05 20:45:02

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

"Ted Ts'o" <[email protected]> writes:

> On Thu, Aug 05, 2010 at 03:13:44PM -0400, Jeff Moyer wrote:
>> > IO_CMD_FSYNC doesn't exist right now, but sure, it means we don't have
>>
>> Well, there's IOCB_CMD_FSYNC. But still, this isn't the same thing as
>> what's requested. If I understand correctly, what is requested is a
>> mechanism to flush out all data for multiple file descriptors and follow
>> that with a single barrier/flush (and yes, Ted did give a summary of the
>> commands that would be required to accomplish that).
>>
>> There still remains the question of why this should be tied to the AIO
>> submission interface.
>
> I don't think it should, personally. The only excuse might be if
> someone wanted to do an asynchronous fsync(), but I don't think that
> makes sense in most cases.

In case it wasn't clear, we are in agreement on this.

Cheers,
Jeff

2010-08-06 07:04:27

by djwong

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
> P.S. If it wasn't clear, I'm still in favor of trying to coordinate
> barriers across the whole file system, since that is much more likely
> to help use cases that arise in real life.

Ok. I have a rough sketch of a patch to do that, and I was going to send it
out today, but the test machine caught on fire while I was hammering it with
the fsync tests one last time and ... yeah. I'm fairly sure the patch didn't
cause the fire, but I'll check anyway after I finish cleaning up.

"[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P

(The patch did seem to cut barrier requests counts by about 20% though the
impact on performance was pretty small.)

--D

2010-08-06 07:13:59

by djwong

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Thu, Aug 05, 2010 at 12:40:08PM -0400, Ted Ts'o wrote:
> On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
> >
> > This second version of the patch uses the inode state flags and
> > (suboptimally) also catches directio writes. It might be a better
> > idea to try to coordinate all the barrier requests across the whole
> > filesystem, though that's a bit more difficult.
>
> Hi Darrick,
>
> When I looked at this patch more closely, and thought about it hard,
> the fact that this helps the FFSB mail server benchmark surprised me,
> and then I realized it's because it doesn't really accurately emulate
> a mail server at all. Or at least, not a MTA. In a MTA, only one CPU
> will touch a queue file, so there should never be a case of a double
> fsync to a single file. This is why I was thinking about a
> coordinating barrier requests across the whole filesystem --- it helps
> out in the case where you have all your CPU threads hammering
> /var/spool/mqueue, or /var/spool/exim4/input, and where they are all
> creating queue files, and calling fsync() in parallel. This patch
> won't help that case.
>
> It will help the case of a MDA --- Mail Delivery Agent --- if you have
> multiple e-mails all getting delivered at the same time into the same
> /var/mail/<username> file, with an fsync() following after a mail
> message is appended to the file. This is a much rarer case, and I
> can't think of any other workload where you will have multiple
> processes racing against each other and fsync'ing the same inode.
> Even in the MDA case, it's rare that you will have one mbox getting so
> many deliveries that this case would be hit.
>
> So while I was thinking about accepting this patch, I now find myself
> hesitating. There _is_ a minor race in the patch that I noticed,
> which I'll point out below, but that's easily fixed. The bigger issue
> is it's not clear this patch will actually make a difference in the
> real world. I'm trying and failing to think of a real-life application
> which is stupid enough to do back-to-back fsync commands, even if it's
> because it has multiple threads all trying to write to the file and
> fsync to it in an uncoordinated fashion. It would be easy enough to
> add instrumentation that would trigger a printk if the patch optimized
> out a barrier --- and if someone can point out even one badly written
> application --- whether it's mysql, postgresql, a GNOME or KDE
> application, db2, Oracle, etc., I'd say sure. But adding even a tiny
> amount of extra complexity for something which is _only_ helpful for a
> benchmark grates against my soul....
>
> So if you can think of something, please point it out to me. If it
> would help ext4 users in real life, I'd be all for it. But at this
> point, I'm thinking that perhaps the real issue is that the mail
> server benchmark isn't accurately reflecting a real life workload.

Yes, it's a proxy for something else. One of our larger products would like to
use fsync() to flush dirty data out to disk (right now it looks like they use
O_SYNC), but they're concerned that the many threads they use can create an
fsync() storm. So, they wanted to know how to mitigate the effects of those
storms. Not calling fsync() except when they really need to guarantee a disk
write is a good start, but I'd like to get ahead of them to pick off more low
hanging fruit like the barrier coordination and not sending barriers when
there's no dirty data ... before they run into it. :)
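
(For context, the kind of pattern we're worried about is roughly the sketch
below -- each thread appends to its own file and fsyncs after every record, so
every fsync can turn into its own cache flush. The thread count, file names,
and record size are made up for illustration; this is not the product's actual
code.)

/*
 * Illustrative only: N threads, each appending to its own file and
 * calling fsync() after every record.  Build with -lpthread.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 8
#define NRECORDS 1000

static void *worker(void *arg)
{
	char path[64], buf[4096];
	int fd, i;

	snprintf(path, sizeof(path), "worker-%ld.dat", (long)arg);
	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
	if (fd < 0)
		return NULL;
	for (i = 0; i < NRECORDS; i++) {
		if (write(fd, buf, sizeof(buf)) < 0)
			break;
		fsync(fd);	/* each call may become a separate cache flush */
	}
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}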

--D

2010-08-06 10:17:54

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On 08/06/2010 03:04 AM, Darrick J. Wong wrote:
> On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
>> P.S. If it wasn't clear, I'm still in favor of trying to coordinate
>> barriers across the whole file system, since that is much more likely
>> to help use cases that arise in real life.
> Ok. I have a rough sketch of a patch to do that, and I was going to send it
> out today, but the test machine caught on fire while I was hammering it with
> the fsync tests one last time and ... yeah. I'm fairly sure the patch didn't
> cause the fire, but I'll check anyway after I finish cleaning up.
>
> "[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P
>
> (The patch did seem to cut barrier requests counts by about 20% though the
> impact on performance was pretty small.)
>
> --D

Just a note, one thing that we have been doing is trying to get a reasonable
regression test in place for testing data integrity. That might be useful to
share as we float patches around barrier changes.

Basic test:

(1) Get a box with an external e-sata (or USB) connected drive

(2) Fire off some large load on that drive (Chris Mason had one, some of our QE
engineers have been using fs_mark (fs_mark -d /your_fs/test_dir -S 0 -t 8 -F)

(3) Pull the power cable to that external box.

Of course, you can use any system and drop power, but the above setup will make
sure that we kill the write cache on the device without letting the firmware
destage the cache contents.

The test passes if you can now do the following:

(1) Mount the file system without error

(2) Unmount and force an fsck - that should run without reporting errors as well.

Note that the above does not use fsync in the testing.

Thanks!

Ric


2010-08-06 18:05:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> Yes, it's a proxy for something else. One of our larger products would like to
> use fsync() to flush dirty data out to disk (right now it looks like they use
> O_SYNC), but they're concerned that the many threads they use can create an
> fsync() storm. So, they wanted to know how to mitigate the effects of those
> storms. Not calling fsync() except when they really need to guarantee a disk
> write is a good start, but I'd like to get ahead of them to pick off more low
> hanging fruit like the barrier coordination and not sending barriers when
> there's no dirty data ... before they run into it. :)

Do they need a barrier operation, or do they just want to initiate the
I/O? One of the reasons I found it hard to believe you would have
multiple threads all fsync()'ing the same file is that keeping the
file consistent is very hard to do in such a scenario. Maintaining
ACID-level consistency without a single thread which coordinates when
commit records gets written is I'm sure theoretically possible, but in
practice, I wasn't sure any applications would actually be _written_
that way.

If the goal is just to make sure I/O is getting initiated, without
necessarily waiting for assurance that a specific file write has hit
the disk platters, it may be that the Linux-specific
sync_file_range(2) system call might be a far more efficient way of
achieving those ends. Without more details about what this product is
doing, it's hard to say, of course.
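
(For illustration, a minimal sketch of that kind of "just start the I/O" call;
the whole-file range and the single flag below are only an example, not a
recommendation for the product in question:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/*
 * Ask the kernel to start writeback of a file's dirty pages without
 * waiting for completion and without flushing the disk write cache.
 * This initiates I/O; it is not a durability guarantee like fsync().
 */
int start_writeback(int fd)
{
	/* offset 0, nbytes 0 means "from offset through end of file" */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) < 0) {
		perror("sync_file_range");
		return -1;
	}
	return 0;
}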

- Ted

2010-08-09 19:36:39

by djwong

[permalink] [raw]
Subject: Re: [RFC v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

On Fri, Aug 06, 2010 at 02:04:54PM -0400, Ted Ts'o wrote:
> On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> > Yes, it's a proxy for something else. One of our larger products would like to
> > use fsync() to flush dirty data out to disk (right now it looks like they use
> > O_SYNC), but they're concerned that the many threads they use can create an
> > fsync() storm. So, they wanted to know how to mitigate the effects of those
> > storms. Not calling fsync() except when they really need to guarantee a disk
> > write is a good start, but I'd like to get ahead of them to pick off more low
> > hanging fruit like the barrier coordination and not sending barriers when
> > there's no dirty data ... before they run into it. :)
>
> Do they need a barrier operation, or do they just want to initiate the
> I/O? One of the reasons I found it hard to believe you would have
> multiple threads all fsync()'ing the same file is that keeping the
> file consistent is very hard to do in such a scenario. Maintaining
> ACID-level consistency without a single thread which coordinates when
> commit records gets written is I'm sure theoretically possible, but in
> practice, I wasn't sure any applications would actually be _written_
> that way.

> If the goal is just to make sure I/O is getting initiated, without
> necessarily waiting for assurance that a specific file write has hit
> the disk platters, it may be that the Linux-specific
> sync_file_range(2) system call might be a far more efficient way of
> achieving those ends. Without more details about what this product is
> doing, it's hard to say, of course.

I don't know for sure, though given what I've seen of the app behavior I
suspect they simply want the disk cache flushed, and don't need the full
ordering semantics. That said, I do think they want to make sure that data
actually hits the disk platters.

--D

2010-08-09 19:53:30

by djwong

[permalink] [raw]
Subject: [RFC v3] ext4: Combine barrier requests coming from fsync

This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for 500us to see if there are any other
threads that are close behind in ext4_sync_file. After that wait, the state
transitions to 2 and the barrier is issued. Once that's done, the state goes
back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.
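
(A userspace analogue of that flow, purely to illustrate the leader/follower
idea -- a mutex and condition variable stand in for the spinlock and
completion, printf stands in for the cache flush, and one simplification
versus the actual patch below is noted in the comments:)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static int state;		/* 0 = idle, 1 = collecting, 2 = flushing */
static unsigned long generation;

static void flush_the_device(void)
{
	/* stand-in for the real cache flush */
	printf("one flush issued on behalf of everyone\n");
}

static void coordinated_flush(void)
{
	unsigned long gen;

	pthread_mutex_lock(&lock);
	if (state != 0) {
		/*
		 * Follower: wait for the leader's flush.  (The real patch
		 * re-checks and retries if it arrives while the flush is
		 * already in flight; that detail is omitted here.)
		 */
		gen = generation;
		while (generation == gen)
			pthread_cond_wait(&flush_done, &lock);
		pthread_mutex_unlock(&lock);
		return;
	}
	state = 1;			/* become the leader */
	pthread_mutex_unlock(&lock);

	usleep(500);			/* collect close-behind callers */

	pthread_mutex_lock(&lock);
	state = 2;
	pthread_mutex_unlock(&lock);

	flush_the_device();

	pthread_mutex_lock(&lock);
	state = 0;
	generation++;
	pthread_cond_broadcast(&flush_done);
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	(void)arg;
	coordinated_flush();
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}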

Without Jan's prior barrier generation patch, I can run my 5min fsync-happy
workload and see the barrier count drop from ~150000 to ~37000, and the
transaction rate increase from ~630 to ~680.

With Jan's patch, I see the barrier count drop from ~18000 to ~12000, and the
transaction rate jumps from ~720 to ~760 when using this patch.

There are some things to clean up -- the wait duration probably ought to be a
mount option and not a module parameter, but this is just a test patch. I'm
also not sure that it won't totally eat the performance of a single thread that
calls fsync frequently, though ... that seems like sort of bad behavior.
If you set the delay time to 0 you get the old behavior.

I'm wondering at this point if this is useful? Ted, is this the sort of fsync
coordination that you were looking for?

---

fs/ext4/ext4.h | 5 +++++
fs/ext4/fsync.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 4 ++++
3 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..e51535a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,11 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* fsync barrier reduction */
+ spinlock_t barrier_flag_lock;
+ int in_barrier;
+ struct completion barrier_completion;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..c5afb45 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -57,6 +57,10 @@ static void ext4_sync_parent(struct inode *inode)
}
}

+static int barrier_wait = 500;
+module_param(barrier_wait, int, 0644);
+MODULE_PARM_DESC(barrier_wait, "fsync barrier wait time (usec)");
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -75,7 +79,8 @@ int ext4_sync_file(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ journal_t *journal = sb->s_journal;
int ret;
tid_t commit_tid;

@@ -130,8 +135,42 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
+ } else if (journal->j_flags & JBD2_BARRIER) {
+ if (!barrier_wait) {
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ goto fast_out;
+ }
+again:
+ spin_lock(&sb->barrier_flag_lock);
+ if (sb->in_barrier == 2) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ goto again;
+ } else if (sb->in_barrier) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ } else {
+ sb->in_barrier = 1;
+ INIT_COMPLETION(sb->barrier_completion);
+ spin_unlock(&sb->barrier_flag_lock);
+
+ udelay(barrier_wait);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 2;
+ spin_unlock(&sb->barrier_flag_lock);
+
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+ complete_all(&sb->barrier_completion);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 0;
+ spin_unlock(&sb->barrier_flag_lock);
+ }
+fast_out:
+ if (!ret)
+ ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+ }
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..0dc96d2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3045,6 +3045,10 @@ no_journal:
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
"Opts: %s", descr, orig_data);

+ EXT4_SB(sb)->in_barrier = 0;
+ spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+ init_completion(&EXT4_SB(sb)->barrier_completion);
+
lock_kernel();
kfree(orig_data);
return 0;

2010-08-09 21:07:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Can you try with the new barrier implementation in the

[PATCH, RFC] relaxed barriers

by making cache flushes just that and not complicated drain barrier
it should speed this case up a lot.


2010-08-09 21:19:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> This patch attempts to coordinate barrier requests being sent in by fsync. Instead of each fsync call initiating its own barrier, there's now a flag to indicate if (0) no barriers are ongoing, (1) we're delaying a short time to collect other fsync threads, or (2) we're actually in-progress on a barrier.
>
> So, if someone calls ext4_sync_file and no barriers are in progress, the flag shifts from 0->1 and the thread delays for 500us to see if there are any other threads that are close behind in ext4_sync_file. After that wait, the state transitions to 2 and the barrier is issued. Once that's done, the state goes back to 0 and a completion is signalled.

You shouldn't use a fixed delay for the thread. 500us _seems_ reasonable, if you have a single HDD. If you have an SSD, or an NVRAM-backed array, then 2000 IOPS is a serious limitation.

What is done in the JBD2 code is to scale the commit sleep interval based on the average commit time. In fact, the ext4_force_commit-> ...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code to merge journal commits. It looks like we are duplicating some of this machinery in ext4_sync_file() already.

It seems like a better idea to have a single piece of code to wait to merge the IOs. For the non-journal ext4 filesystems it should implement the wait for merges explicitly, otherwise it should defer the wait to jbd2.

Cheers, Andreas






2010-08-09 23:46:11

by djwong

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 09, 2010 at 05:19:22PM -0400, Andreas Dilger wrote:
> On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> > This patch attempts to coordinate barrier requests being sent in by fsync.
> > Instead of each fsync call initiating its own barrier, there's now a flag
> > to indicate if (0) no barriers are ongoing, (1) we're delaying a short time
> > to collect other fsync threads, or (2) we're actually in-progress on a
> > barrier.
> >
> > So, if someone calls ext4_sync_file and no barriers are in progress, the
> > flag shifts from 0->1 and the thread delays for 500us to see if there are
> > any other threads that are close behind in ext4_sync_file. After that
> > wait, the state transitions to 2 and the barrier is issued. Once that's
> > done, the state goes back to 0 and a completion is signalled.
>
> You shouldn't use a fixed delay for the thread. 500us _seems_ reasonable, if
> you have a single HDD. If you have an SSD, or an NVRAM-backed array, then
> 2000 IOPS is a serious limitation.

2000 fsyncs per second, anyway. I wasn't explicitly trying to limit any other
types of IO.

> What is done in the JBD2 code is to scale the commit sleep interval based on
> the average commit time. In fact, the ext4_force_commit->
> ...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code
> to merge journal commits. It looks like we are duplicating some of this
> machinery in ext4_sync_file() already.

I actually picked 500us arbitrarily because it seemed to work, even for SSDs.
It was a convenient test vehicle, and not much more. That said, I like your
recommendation much better. I'll look into that.

> It seems like a better idea to have a single piece of code to wait to merge
> the IOs. For the non-journal ext4 filesystems it should implement the wait
> for merges explicitly, otherwise it should defer the wait to jbd2.

I wondered if this would have been better off in the block layer than in ext4.
Though I suppose that could imply two kinds of flush: flush-immediately, and
flush-shortly. I intend to try those flush drain elimination patches before I
think about this much more.

--D

2010-08-16 16:15:04

by djwong

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> Can you try with the new barrier implementation in the
>
> [PATCH, RFC] relaxed barriers
>
> by making cache flushes just that and not complicated drain barrier
> it should speed this case up a lot.

Indeed it does! The barrier count increases to about 21000, but I also see
much higher throughput, about 830 transactions per second (versus 12000 and 760
respectively before Tejun's patch).

--D

2010-08-19 02:07:37

by djwong

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Mon, Aug 16, 2010 at 09:14:33AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> > Can you try with the new barrier implementation in the
> >
> > [PATCH, RFC] relaxed barriers
> >
> > by making cache flushes just that and not complicated drain barrier
> > it should speed this case up a lot.
>
> Indeed it does! The barrier count increases to about 21000, but I also see
> much higher throughput, about 830 transactions per second (versus 12000 and 760
> respectively before Tejun's patch).

Oddly, I ran the entire suite of tests against a larger set of machines, and
with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
have been trying to put together a new tree based on "replace barrier with
sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
gotten lost in the weeds. :(

I also experienced some sort of crash with Tejun's relaxed barrier patch on one
of my systems. I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

Rather than hold on to (aging) test results any further, I'll be posting a new
fsync coordination patch shortly that includes Andreas' suggestion to use the
average barrier time instead of a static 500us, and a spreadsheet that shows
what happens with various patches, and on a wider range of hardware.

--D

2010-08-19 02:14:49

by djwong

[permalink] [raw]
Subject: [RFC v4] ext4: Coordinate fsync requests

This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are
any other threads that are close behind in ext4_sync_file. After that wait,
the state transitions to 2 and the barrier is issued. Once that's done, the
state goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.

In an earlier version of this patch, the "short time" was 500us, adjustable by
a module parameter. Now, it's a mount option, and the mount option has three
values: x = 0, which gets you the old behavior; x = -1, which uses the average
commit time for the delay period; and x > 0, which sets the delay time to
min(x, average commit time).
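
(Spelled out, the wait that actually gets used ends up being something like
the helper below, where avg_flush_ns is the measured running average; the
names here are illustrative rather than the exact ones in the patch:)

/*
 * Map the max_fsync_delay mount option onto the wait (in ns) used
 * before issuing the flush.  max_fsync_delay is in microseconds:
 *   0  -> no wait (old behavior)
 *  -1  -> wait the average flush time
 *  >0  -> wait min(max_fsync_delay, average flush time)
 */
static unsigned long long fsync_wait_ns(int max_fsync_delay,
					unsigned long long avg_flush_ns)
{
	if (max_fsync_delay == 0)
		return 0;
	if (max_fsync_delay > 0 &&
	    (unsigned long long)max_fsync_delay * 1000 < avg_flush_ns)
		return (unsigned long long)max_fsync_delay * 1000;
	return avg_flush_ns;
}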

So I gave this patchset some wider testing in my lab, and came up with the
following spreadsheet:
https://spreadsheets.google.com/ccc?key=0AtDnBlSkyNjodDZ4OEdSZC01X2haVi1ncmpGOXpfNkE&hl=en&authkey=CMznqcgH

These results do not reflect Tejun/Christoph's latest barrier/flush semantic
changes because ... I haven't figured out the correct sequence of 2.6.35+,
tejun's patchset, and hch's patchset. With Tejun's original set of patches,
the performance actually decreased a little bit, so I haven't bothered to
upload those results while I try to put together a new tree.

The comma-separated numbers in the leftmost column indicates which options were
in effect during the run. From left to right, they are (1) whether or not
directio is enabled; (2) the max_fsync_delay parameter (-1 is automatic
tuning, 0 is the behavior prior to my patch, and 500 is the v3 patch);
(3) whether or not Jan Kara's barrier generation counter is disabled; and (4)
whether or not my old dirty flag patch is disabled. The old system behaviors
are in the *,0,1,1 rows, and the everything-on behavior should be in the
*,-1,0,0 rows.

From these results, it seems like a reasonable conclusion that enabling Jan
Kara's barrier generation counter patch (*,*,0,*) does increase the tps count.
It also would appear that enabling the barrier coordination patch with the
automatic barrier delay tuning (*,-1,*,*) also seems to be increasing tps while
reducing barrier counts. It's not so clear that my old dirty flag patch
(*,*,*,0) is doing much anymore.

The *,na,na,na rows reflect what happens if I turn barriers off entirely, so I
can compare the new numbers with the nobarrier behavior.

elm3c44_sas is a 36-spindle SAS storage array (RAID0).
elm3c44_ssd is 4 Intel SSDs tied together in a RAID0 via md.
elm3c65_sata is a single SATA disk.
elm3c71_sas is a 8-spindle SAS storage array (RAID0).
elm3c71_scsi is a 14-spindle SCSI storage array (RAID0).
elm3c75_ide is a single IDE laptop disk.

The storage arrays have battery-backed WC.

Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/ext4/ext4.h | 7 +++++++
fs/ext4/fsync.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 21 ++++++++++++++++++++-
3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..9e8dcc7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,13 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* fsync coordination */
+ spinlock_t barrier_flag_lock;
+ int in_barrier;
+ struct completion barrier_completion;
+ u64 avg_barrier_time;
+ int max_fsync_delay;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..e164ccd 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,12 @@ int ext4_sync_file(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ journal_t *journal = sb->s_journal;
int ret;
tid_t commit_tid;
+ ktime_t before, expires;
+ u64 duration;

J_ASSERT(ext4_journal_current_handle() == NULL);

@@ -130,8 +133,52 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
+ } else if (journal->j_flags & JBD2_BARRIER) {
+ if (!sb->max_fsync_delay) {
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ goto fast_out;
+ }
+again:
+ spin_lock(&sb->barrier_flag_lock);
+ if (sb->in_barrier == 2) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ goto again;
+ } else if (sb->in_barrier) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ } else {
+ sb->in_barrier = 1;
+ INIT_COMPLETION(sb->barrier_completion);
+ spin_unlock(&sb->barrier_flag_lock);
+
+ if (sb->max_fsync_delay > 0 && (sb->max_fsync_delay * 1000) < sb->avg_barrier_time)
+ expires = ktime_add_ns(ktime_get(), sb->max_fsync_delay * 1000);
+ else
+ expires = ktime_add_ns(ktime_get(), sb->avg_barrier_time);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 2;
+ spin_unlock(&sb->barrier_flag_lock);
+
+ before = ktime_get();
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+ duration = ktime_to_ns(ktime_sub(ktime_get(), before));
+ if (likely(sb->avg_barrier_time))
+ sb->avg_barrier_time = (duration + sb->avg_barrier_time * 15) / 16;
+ else
+ sb->avg_barrier_time = duration;
+
+ complete_all(&sb->barrier_completion);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 0;
+ spin_unlock(&sb->barrier_flag_lock);
+ }
+ }
+fast_out:
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..8d04e43 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -953,6 +953,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (!test_opt(sb, DELALLOC))
seq_puts(seq, ",nodelalloc");

+ seq_printf(seq, ",max_fsync_delay=%d", sbi->max_fsync_delay);

if (sbi->s_stripe)
seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
@@ -1160,7 +1161,7 @@ enum {
Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
- Opt_discard, Opt_nodiscard,
+ Opt_discard, Opt_nodiscard, Opt_max_fsync_delay,
};

static const match_table_t tokens = {
@@ -1231,6 +1232,7 @@ static const match_table_t tokens = {
{Opt_dioread_lock, "dioread_lock"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
+ {Opt_max_fsync_delay, "max_fsync_delay=%d"},
{Opt_err, NULL},
};

@@ -1699,6 +1701,16 @@ set_qf_format:
case Opt_dioread_lock:
clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
break;
+ case Opt_max_fsync_delay:
+ if (args[0].from) {
+ if (match_int(&args[0], &option))
+ return 0;
+ } else
+ return 0;
+
+ sbi->max_fsync_delay = option;
+ ext4_msg(sb, KERN_INFO, "Maximum fsync delay %d\n", option);
+ break;
default:
ext4_msg(sb, KERN_ERR,
"Unrecognized mount option \"%s\" "
@@ -2562,6 +2574,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!IS_EXT3_SB(sb))
set_opt(sbi->s_mount_opt, DELALLOC);

+ EXT4_SB(sb)->max_fsync_delay = -1;
+
if (!parse_options((char *) data, sb, &journal_devnum,
&journal_ioprio, NULL, 0))
goto failed_mount;
@@ -3045,6 +3059,11 @@ no_journal:
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
"Opts: %s", descr, orig_data);

+ EXT4_SB(sb)->in_barrier = 0;
+ EXT4_SB(sb)->avg_barrier_time = 0;
+ spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+ init_completion(&EXT4_SB(sb)->barrier_completion);
+
lock_kernel();
kfree(orig_data);
return 0;

2010-08-19 08:53:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
> Oddly, I ran the entire suite of tests against a larger set of machines, and
> with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
> have been trying to put together a new tree based on "replace barrier with
> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
> gotten lost in the weeds. :(

Tejun's patches don't allow concurrent cache flushes to happen, while
my patch did. Tejun said there are drivers that can't handle empty
flushes with a bio attached, making this necessary.

Tejun, any idea what drivers that would be?

> I also experienced some sort of crash with Tejun's relaxed barrier patch on one
> of my systems. I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

My kernel source doesn't have a BUG_ON line there, but only one two
lines above. A req->nr_phys_segments that's zero sounds a bit like
empty flush requests, I'll need to look into it again.


2010-08-19 09:21:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Hello,

On 08/19/2010 10:53 AM, Christoph Hellwig wrote:
> On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
>> Oddly, I ran the entire suite of tests against a larger set of machines, and
>> with Tejun's RFC patchset I didn't see nearly as much of an improvement. I
>> have been trying to put together a new tree based on "replace barrier with
>> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
>> gotten lost in the weeds. :(
>
> Tejun's patches don't allow concurrent cache flushes to happen, while
> my patch did. Tejun said there are drivers that can't handle empty
> flushes with a bio attached, making this necessary.
>
> Tejun, any idea what drivers that would be?

What was the configuration? If dm was involved, both md and dm can
only process a single flush request at a time. Supporting multiple
flushes in flight wouldn't be too difficult. It's just the way things
are implemented with the previous barrier implementation. It can
definitely be improved.

>> I also experienced some sort of crash with Tejun's relaxed barrier
>> patch on one of my systems. I was hitting the BUG_ON in
>> drivers/scsi/scsi_lib.c, line 1115.
>
> My kernel source doesn't have a BUG_ON line there, but only one two
> lines above. A req->nr_phys_segments that's zero sounds a bit like
> empty flush requests, I'll need to look into it again.

Yeah, definitely sounds like REQ_FS|REQ_FLUSH request causing problem.
Can you post the kernel log?

Thanks.

--
tejun

2010-08-19 15:52:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v3] ext4: Combine barrier requests coming from fsync

Hello, again.

On 08/19/2010 11:17 AM, Tejun Heo wrote:
> What was the configuration? If dm was involved, both md and dm can
> only process a single flush request at a time. Supporting multiple
> flushes in flight wouldn't be too difficult. It's just the way things
> are implemented with the previous barrier implementation. It can
> definitely be improved.

Oh, I just realized that the current bio-based dm implementation
doesn't allow other bio's to be processed if a flush is in progress.
So, it's not about not being able to handle multiple flushes. The
queue just stalls while a flush is in progress, and it also waits
for flush completion by waiting for all commands in progress to
finish. It basically ends up draining and stalling everything.
Shouldn't be too hard to make it better.

Thanks.

--
tejun

2010-08-23 18:31:24

by djwong

[permalink] [raw]
Subject: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

Hi all,

I retested the ext4 barrier mitigation patchset against a base of 2.6.36-rc1 +
Tejun's flush_fua tree + Christoph's patches to change FS barrier semantics,
and came up with this new spreadsheet:
http://bit.ly/bWpbsT

Here are the previous 2.6.35 results for convenience: http://bit.ly/c22grd

The machine configurations are the same as with the previous (2.6.35)
spreadsheet. It appears to be the case that Tejun and Christoph's patches to
change barrier use into simpler cache flushes generally improve the speed of
the fsync-happy workload in buffered I/O mode ... if you have a bunch of
spinning disks. Results for the SSD array (elm3c44) and the single disk
systems (elm3c65/elm3c75) decreased slightly. For the case where direct I/O
was used, the patchset improved the results in nearly all cases. The speed
with barriers on is getting closer to the speed with barriers off, thankfully!

Unfortunately, one thing that became /much/ less clear in these new results is
the impact of the other patch sets that we've been working on to make ext4
smarter with regards to barrier/flush use. In most cases I don't really see
the fsync-delay patch having much effect for directio, and it seems to have
wild effects when buffered mode is used. Jan Kara's barrier generation patch
still generally helps with directio loads. I've also concluded that my really
old dirty-flag patchset from ages ago no longer has any effect.

What does everyone else think of these results?

--D

2010-09-23 23:25:29

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

Hi all,

I just retested with 2.6.36-rc5 and the same set of patches as before
(flush_fua, fsync_coordination, etc) and have an even larger spreadsheet:
http://bit.ly/ahdhyk

This time, however, I instrumented the kernel to report the amount of time it
takes to complete the flush operation. The test setups elm3a63, elm3c44_sas,
and elm3c71_sas are all arrays that have battery backed write-back cache; it
should not be a huge shock that the average flush time generally stays under
8ms for these setups. elm3c65 and elm3c75_ide are single SATA and IDE
disks (no write cache), and the other setups all feature md-raids backed by
SCSI disks (also no write cache). The flush_times tab in the spreadsheet lists
average, max, and min sync times.

Turning to the ffsb scores, I can see some of the same results that I saw while
testing 2.6.36-rc1 a few weeks ago. Now that I've had the time to look at how
the code works and evaluate a lot more setups, I think I can speculate further
about the cause of the regression that I see with the fsync coordination patch.
Because I'm testing the effects of varying the fsync_delay values, I've bolded
the highest score for each unique (directio, nojan, nodj) configuration, and it
appears that the most winning cases are fsync_delay=0 which corresponds to the
old fsync behavior (every caller issues a flush), and fsync_delay=-1 which
corresponds to a coordination delay equal to the average flush duration.

To try to find an explanation, I started looking for connections between fsync
delay values and average flush times. I noticed that the setups with low (<
8ms) flush times exhibit better performance when fsync coordination is not
attempted, and the setups with higher flush times exhibit better performance
when fsync coordination happens. This also is no surprise, as it seems
perfectly reasonable that the more time consuming a flush is, the more desirous
it is to spend a little time coordinating those flushes across CPUs.

I think a reasonable next step would be to alter this patch so that
ext4_sync_file always measures the duration of the flushes that it issues, but
only enable the coordination steps if it detects the flushes taking more than
about 8ms. One thing I don't know for sure is whether 8ms is a result of 2*HZ
(currently set to 250) or if 8ms is a hardware property.
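
(Concretely, the heuristic could be as simple as the sketch below -- the 8ms
cutoff is just the tentative number from these measurements, and the names are
placeholders rather than actual ext4 code:)

/*
 * Always time the flush, but only bother coordinating with other
 * fsync callers once flushes are observed to be slow.  avg_flush_ns
 * is the running average described earlier.
 */
#define SLOW_FLUSH_CUTOFF_NS	(8ULL * 1000 * 1000)	/* ~8ms, tentative */

static unsigned long long avg_flush_ns;

static int coordination_worthwhile(void)
{
	return avg_flush_ns > SLOW_FLUSH_CUTOFF_NS;
}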

As for safety testing, I've been running power-fail tests on the single-disk
systems with the same ffsb profile. So far I've observed a lot of fsck
complaints about orphaned inodes being truncated ("Truncating orphaned inode
1352607 (uid=0, gid=0, mode=0100700, size=4096)") though this happens
regardless of whether I run with this 2.6.36 test kernel of mine or a plain
vanilla 2.6.35 configuration. I've not seen any serious corruption yet.

So, what do people think of these latest results?

--D

On Mon, Aug 23, 2010 at 11:31:19AM -0700, Darrick J. Wong wrote:
> Hi all,
>
> I retested the ext4 barrier mitigation patchset against a base of 2.6.36-rc1 +
> Tejun's flush_fua tree + Christoph's patches to change FS barrier semantics,
> and came up with this new spreadsheet:
> http://bit.ly/bWpbsT
>
> Here are the previous 2.6.35 results for convenience: http://bit.ly/c22grd
>
> The machine configurations are the same as with the previous (2.6.35)
> spreadsheet. It appears to be the case that Tejun and Christoph's patches to
> change barrier use into simpler cache flushes generally improve the speed of
> the fsync-happy workload in buffered I/O mode ... if you have a bunch of
> spinning disks. Results for the SSD array (elm3c44) and the single disk
> systems (elm3c65/elm3c75) decreased slightly. For the case where direct I/O
> was used, the patchset improved the results in nearly all cases. The speed
> with barriers on is getting closer to the speed with barriers off, thankfully!
>
> Unfortunately, one thing that became /much/ less clear in these new results is
> the impact of the other patch sets that we've been working on to make ext4
> smarter with regards to barrier/flush use. In most cases I don't really see
> the fsync-delay patch having much effect for directio, and it seems to have
> wild effects when buffered mode is used. Jan Kara's barrier generation patch
> still generally helps with directio loads. I've also concluded that my really
> old dirty-flag patchset from ages ago no longer has any effect.
>
> What does everyone else think of these results?
>
> --D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-09-24 06:24:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On 2010-09-23, at 17:25, Darrick J. Wong wrote:
> To try to find an explanation, I started looking for connections between fsync delay values and average flush times. I noticed that the setups with low (< 8ms) flush times exhibit better performance when fsync coordination is not attempted, and the setups with higher flush times exhibit better performance when fsync coordination happens. This also is no surprise, as it seems perfectly reasonable that the more time consuming a flush is, the more desirous it is to spend a little time coordinating those flushes across CPUs.
>
> I think a reasonable next step would be to alter this patch so that ext4_sync_file always measures the duration of the flushes that it issues, but only enable the coordination steps if it detects the flushes taking more than about 8ms. One thing I don't know for sure is whether 8ms is a result of 2*HZ (currently set to 250) or if 8ms is a hardware property.

Note that the JBD/JBD2 code will already dynamically adjust the journal flush interval based on the delay seen when writing the journal commit block. This was done to allow aggregating sync journal operations for slow devices, and allowing fast (no delay) sync on fast devices. See jbd2_journal_stop() for details.

I think the best approach is to just depend on the journal to do this sync aggregation, if at all possible, otherwise use the same mechanism in ext3/4 for fsync operations that do not involve the journal (e.g. nojournal mode, data sync in writeback mode, etc).

Using any fixed threshold is the wrong approach, IMHO.

Cheers, Andreas






2010-09-24 11:44:11

by Ric Wheeler

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On 09/24/2010 02:24 AM, Andreas Dilger wrote:
> On 2010-09-23, at 17:25, Darrick J. Wong wrote:
>> To try to find an explanation, I started looking for connections between fsync delay values and average flush times. I noticed that the setups with low (< 8ms) flush times exhibit better performance when fsync coordination is not attempted, and the setups with higher flush times exhibit better performance when fsync coordination happens. This also is no surprise, as it seems perfectly reasonable that the more time consuming a flush is, the more desirous it is to spend a little time coordinating those flushes across CPUs.
>>
>> I think a reasonable next step would be to alter this patch so that ext4_sync_file always measures the duration of the flushes that it issues, but only enable the coordination steps if it detects the flushes taking more than about 8ms. One thing I don't know for sure is whether 8ms is a result of 2*HZ (currently set to 250) or if 8ms is a hardware property.
> Note that the JBD/JBD2 code will already dynamically adjust the journal flush interval based on the delay seen when writing the journal commit block. This was done to allow aggregating sync journal operations for slow devices, and allowing fast (no delay) sync on fast devices. See jbd2_journal_stop() for details.
>
> I think the best approach is to just depend on the journal to do this sync aggregation, if at all possible, otherwise use the same mechanism in ext3/4 for fsync operations that do not involve the journal (e.g. nojournal mode, data sync in writeback mode, etc).
>
> Using any fixed threshold is the wrong approach, IMHO.
>
> Cheers, Andreas
>
>
>
>
>

I agree - we started on that dynamic batching when we noticed that
single-threaded writes to an array went at something like 720 files/sec (using
fs_mark) and two-threaded writes dropped down to 230 files/sec. That was
directly attributed to the fixed (1 jiffy) wait we used to do.

Josef Bacik worked on the dynamic batching so we would not wait (sometimes
much!) to batch other fsync/flushes into a transaction when it was faster just
to dispatch them.

A related worry I have is that we have other places in the kernel that currently
wait way too long for our current classes of devices....

Thanks,

Ric


2010-09-27 23:01:45

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Fri, Sep 24, 2010 at 12:24:04AM -0600, Andreas Dilger wrote:
> On 2010-09-23, at 17:25, Darrick J. Wong wrote:
> > To try to find an explanation, I started looking for connections between
> > fsync delay values and average flush times. I noticed that the setups with
> > low (< 8ms) flush times exhibit better performance when fsync coordination
> > is not attempted, and the setups with higher flush times exhibit better
> > performance when fsync coordination happens. This also is no surprise, as
> > it seems perfectly reasonable that the more time consuming a flush is, the
> > more desirous it is to spend a little time coordinating those flushes
> > across CPUs.
> >
> > I think a reasonable next step would be to alter this patch so that
> > ext4_sync_file always measures the duration of the flushes that it issues,
> > but only enable the coordination steps if it detects the flushes taking
> > more than about 8ms. One thing I don't know for sure is whether 8ms is a
> > result of 2*HZ (currently set to 250) or if 8ms is a hardware property.
>
> Note that the JBD/JBD2 code will already dynamically adjust the journal flush
> interval based on the delay seen when writing the journal commit block. This
> was done to allow aggregating sync journal operations for slow devices, and
> allowing fast (no delay) sync on fast devices. See jbd2_journal_stop() for
> details.
>
> I think the best approach is to just depend on the journal to do this sync
> aggregation, if at all possible, otherwise use the same mechanism in ext3/4
> for fsync operations that do not involve the journal (e.g. nojournal mode,
> data sync in writeback mode, etc).

I've been informed that there's confusion about how to interpret this
spreadsheet. I'll first provide a few clarifications, then discuss Andreas'
suggestion, which I've coded up and given some light testing.

Zeroth, the kernel is 2.6.36-rc5 with a few patchsets applied:
1. Tejun Heo's conversion of barriers to flush/fua.
2. Jan Kara's barrier generation patch.
3. My old patch to record if there's dirty data in the disk cache.
4. My newer patch to implement fsync coordination in ext4.
5. My newest patch which implements coordination via jbd2.

Patches 2, 3, 4, and 5 all have debugging toggles so I can quickly run
experiments.

First, the "fsync_delay_us" column records the behavior of my (latest) fsync
coordination patch. The raw control values might be a bit confusing, so I
elaborated them a little more in the spreadsheet. The "old fsync behavior"
entries use the current upstream semantics (no coordination, everyone issues
their own flush). "jbd2 fsync" means coordination of fsyncs through jbd2 as
detailed below. "use avg sync time" measures the average time it takes to
issue a flush command, and tells the first thread into ext4_sync_file to wait
that amount of time for other threads to catch up.

Second, the "nojan" column is a control knob I added to Jan Kara's old barrier
generation patch so that I could measure its effects. 0 means always track
barrier generations and don't submit flushes for already-flushed data. 1 means
always issue flushes, regardless of generation counts.

Third, the "nodj" column is a control knob that controls my old
EXT4_STATE_DIRTY_DATA patch. A zero here means that a flush will only be
triggered if ext4_writepage has written some dirty data and there hasn't been
a flush yet. 1 disables this logic.

Fourth, the bolded cells in the table represent the highest transactions per
second count across all fsync_delay_us values when holding the other four
control variables constant. For example, let's take a look at
host=elm3a4,directio=0,nojan=0,nodj=0. There are five fsync_delay_us values
(old, jbd2, avg, 1, 500) and five corresponding results (145.84, 184.06,
181.58, 152.39, 158.19). 184.06 is the highest, hence jbd2 wins and is in bold
face.

Background colors are used to group the rows by fsync_delay_us.

The barriers=0 results are, of course, the transactions per second count when
the fs is mounted with barrier support disabled. This ought to provide a rough
idea of the upper performance limit of each piece of hardware.

------

As for Andreas' suggestion, he wants ext4 to use jbd2 as the coordination point for
all fsync calls. I could be wrong, but I think that the following snippet
ought to do the trick:

h = ext4_journal_start(journal, 0);
ext4_journal_stop(h);
if (jbd2_journal_start_commit(journal, &target))
	jbd2_log_wait_commit(journal, target);

It looks as though this snippet effectively says "Send an empty transaction.
Then, if there are any live or committing transactions, wait for them to
finish", which sounds like what we want. I figured this also means that the
nojan/nodj settings would not have any significant effect on the results, which
seems to be true (though nojan/nodj have had little effect under Tejun's
patchset). So I coded up that patch and gave it a spin on my testing farm.
The results have been added to the 2.6.36-rc5 spreadsheet. Though I have to
say, this seems like an awful lot of overhead just to issue a flush command.

Given a quick look around the jbd2 code, it seems that going through the journal
ought to have a higher overhead cost, which would negatively impact performance
on hardware that features low flush times, and this seems to be true for
elm3a63, elm3c44_sas, and elm3c71_sas in directio=1 mode, where we see rather
large regressions against fsync_delay=avg_sync_time. Curiously, I saw a
dramatic increase in speed for the SSDs when directio=1, which probably relates
to the way SSDs perform writes.

Other than those regressions, the jbd2 fsync coordination is about as fast as
sending the flush directly from ext4. Unfortunately, where there _are_
regressions they seem rather large, which makes this approach (as implemented,
anyway) less attractive. Perhaps there is a better way to do it?

--D

2010-10-08 21:26:10

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>
> Other than those regressions, the jbd2 fsync coordination is about as fast as
> sending the flush directly from ext4. Unfortunately, where there _are_
> regressions they seem rather large, which makes this approach (as implemented,
> anyway) less attractive. Perhaps there is a better way to do it?

Hmm, not much chatter for two weeks. Either I've confused everyone with the
humongous spreadsheet, or ... something?

I've performed some more extensive performance and safety testing with the
fsync coordination patch. The results have been merged into the spreadsheet
that I linked to in the last email, though in general the results have not
really changed much at all.

I see two trends happening here with regards to comparing the use of jbd2 to
coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
The first is that for loads that most benefit from having any kind of fsync
coordination (i.e. storage with slow flushes), the jbd2 approach provides the
same or slightly better performance than the direct approach. However, for
storage with fast flushes, the jbd2 approach seems to cause major slowdowns
even compared to not changing any code at all. To me this would suggest that
ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
cost, because a huge performance regression isn't good.

Other people in my group have been running their own performance comparisons
between no-coordination, jbd2-coordination, and direct-coordination, and what
I'm hearing is that the direct-coordination mode is slightly faster than jbd2
coordination, though either are better than no coordination at all. Happily, I
haven't seen an increase in fsck complaints in my poweroff testing.

Given the nearness of the merge window, perhaps we ought to discuss this on
Monday's ext4 call? In the meantime I'll clean up the fsync coordination patch
so that it doesn't have so many debugging knobs and whistles.

Thanks,

--D

2010-10-08 21:55:37

by Ric Wheeler

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On 10/08/2010 05:26 PM, Darrick J. Wong wrote:
> On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>> Other than those regressions, the jbd2 fsync coordination is about as fast as
>> sending the flush directly from ext4. Unfortunately, where there _are_
>> regressions they seem rather large, which makes this approach (as implemented,
>> anyway) less attractive. Perhaps there is a better way to do it?
> Hmm, not much chatter for two weeks. Either I've confused everyone with the
> humongous spreadsheet, or ... something?
>
> I've performed some more extensive performance and safety testing with the
> fsync coordination patch. The results have been merged into the spreadsheet
> that I linked to in the last email, though in general the results have not
> really changed much at all.
>
> I see two trends happening here with regards to comparing the use of jbd2 to
> coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
> The first is that for loads that most benefit from having any kind of fsync
> coordination (i.e. storage with slow flushes), the jbd2 approach provides the
> same or slightly better performance than the direct approach. However, for
> storage with fast flushes, the jbd2 approach seems to cause major slowdowns
> even compared to not changing any code at all. To me this would suggest that
> ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
> cost, because a huge performance regression isn't good.
>
> Other people in my group have been running their own performance comparisons
> between no-coordination, jbd2-coordination, and direct-coordination, and what
> I'm hearing is that the direct-coordination mode is slightly faster than jbd2
> coordination, though either are better than no coordination at all. Happily, I
> haven't seen an increase in fsck complaints in my poweroff testing.
>
> Given the nearness of the merge window, perhaps we ought to discuss this on
> Monday's ext4 call? In the meantime I'll clean up the fsync coordination patch
> so that it doesn't have so many debugging knobs and whistles.
>
> Thanks,
>
> --D

Hi Darrick,

We have been busily testing various combinations at Red Hat (we being not me
:)), but here is one test that we used a long time back to validate the batching
impact.

You need a slow, poky S-ATA drive - the slower it spins, the better.

A single fs_mark run against that drive should drive some modest number of
files/sec with 1 thread:


[root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -d /test/foo

On my disk, I see:

5 500 20480 31.8 6213

Now run with 4 threads to give the code a chance to coalesce.

On my box, I see it jump up:

5 2000 20480 113.0 25092

And at 8 threads it jumps again:

5 4000 20480 179.0 49480

This work load is very device specific. On a very low latency device (arrays,
high performance SSD), the coalescing "wait" time could be slower than just
dispatching the command. Ext3/4 work done by Josef a few years back was meant to
use high res timers to dynamically adjust that wait to avoid slowing down.

Have we tested the combined patchset with this?

Thanks!

Ric







2010-10-11 14:33:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Fri, Oct 08, 2010 at 02:26:06PM -0700, Darrick J. Wong wrote:
> Given the nearness of the merge window, perhaps we ought to discuss
> this on Monday's ext4 call? In the meantime I'll clean up the fsync
> coordination patch so that it doesn't have so many debugging knobs
> and whistles.

Yes, we should definitely talk about this on today's call.

One of the things that concern me is that, if I'm reading the
spreadsheet correctly, there are some columns (for example,
elm3c44_sas) where comparing I3 (which is what will be in 2.6.36 ---
Jan's patch, but not yours, i.e. the currently existing behaviour)
with I10 (which is, as I understand it, what you are recommending), I
see an FFSB regression from 1,916.54 to 1,690.33.

That's something like a 12% regression. I know that for other
hardware, there are improvements, but the fact that at least for some
hardware we're seeing such a big regression makes me worried that
we're missing something --- or maybe it's just me who is missing
something about your spreadsheet?

- Ted

2010-10-11 20:20:22

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Fri, Oct 08, 2010 at 05:56:12PM -0400, Ric Wheeler wrote:
> On 10/08/2010 05:26 PM, Darrick J. Wong wrote:
>> On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>>> Other than those regressions, the jbd2 fsync coordination is about as fast as
>>> sending the flush directly from ext4. Unfortunately, where there _are_
>>> regressions they seem rather large, which makes this approach (as implemented,
>>> anyway) less attractive. Perhaps there is a better way to do it?
>> Hmm, not much chatter for two weeks. Either I've confused everyone with the
>> humongous spreadsheet, or ... something?
>>
>> I've performed some more extensive performance and safety testing with the
>> fsync coordination patch. The results have been merged into the spreadsheet
>> that I linked to in the last email, though in general the results have not
>> really changed much at all.
>>
>> I see two trends happening here with regards to comparing the use of jbd2 to
>> coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
>> The first is that for loads that most benefit from having any kind of fsync
>> coordination (i.e. storage with slow flushes), the jbd2 approach provides the
>> same or slightly better performance than the direct approach. However, for
>> storage with fast flushes, the jbd2 approach seems to cause major slowdowns
>> even compared to not changing any code at all. To me this would suggest that
>> ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
>> cost, because a huge performance regression isn't good.
>>
>> Other people in my group have been running their own performance comparisons
>> between no-coordination, jbd2-coordination, and direct-coordination, and what
>> I'm hearing is that the direct-coordination mode is slightly faster than jbd2
>> coordination, though either are better than no coordination at all. Happily, I
>> haven't seen an increase in fsck complaints in my poweroff testing.
>>
>> Given the nearness of the merge window, perhaps we ought to discuss this on
>> Monday's ext4 call? In the meantime I'll clean up the fsync coordination patch
>> so that it doesn't have so many debugging knobs and whistles.
>>
>> Thanks,
>>
>> --D
>
> Hi Darrick,
>
> We have been busily testing various combinations at Red Hat (we being not
> me :)), but here is one test that we used a long time back to validate
> the batching impact.
>
> You need a slow, poky S-ATA drive - the slower it spins, the better.
>
> A single fs_mark run against that drive should show a modest number
> of files/sec with 1 thread:
>
>
> [root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -d /test/foo
>
> On my disk, I see (fs_mark columns: FSUse%, Count, Size, Files/sec, App Overhead):
>
> 5 500 20480 31.8 6213
>
> Now run with 4 threads to give the code a chance to coalesce.
>
> On my box, I see it jump up:
>
> 5 2000 20480 113.0 25092
>
> And at 8 threads it jumps again:
>
> 5 4000 20480 179.0 49480
>
> This workload is very device-specific. On a very low-latency device
> (arrays, high performance SSD), the coalescing "wait" time could be
> slower than just dispatching the command. Ext3/4 work done by Josef a few
> years back was meant to use high res timers to dynamically adjust that
> wait to avoid slowing down.

Yeah, elm3c65 and elm3c75 in that spreadsheet are a new pokey SATA disk and a
really old IDE disk, which ought to represent the low end case. elm3c44-sas is
a midrange storage server... which doesn't like the patch so much.

--D

2010-10-12 14:15:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

I still think adding code to every filesystem to optimize for a rather
stupid use case is not a good idea. I dropped out a bit from the
thread in the middle, but what was the real use case for lots of
concurrent fsyncs on the same inode again?

And how much performance do you need? If we go back to the direct
submission of REQ_FLUSH requests from the earlier flush+fua setup,
which was faster on high-end storage, would that be enough for you?

Below is a patch bringing the optimization back.

WARNING: completely untested!


Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c 2010-10-12 10:08:43.777004514 -0400
+++ linux-2.6/block/blk-flush.c 2010-10-12 10:10:37.547016093 -0400
@@ -143,6 +143,17 @@ struct request *blk_do_flush(struct requ
unsigned skip = 0;

/*
+ * Just issue pure flushes directly.
+ */
+ if (!blk_rq_sectors(rq)) {
+ if (!do_preflush) {
+ __blk_end_request_all(rq, 0);
+ return NULL;
+ }
+ return rq;
+ }
+
+ /*
* Special case. If there's data but flush is not necessary,
* the request can be issued directly.
*

2010-10-15 23:39:05

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Tue, Oct 12, 2010 at 04:14:55PM +0200, Christoph Hellwig wrote:
> I still think adding code to every filesystem to optimize for a rather
> stupid use case is not a good idea. I dropped out a bit from the
> thread in the middle, but what was the real use case for lots of
> concurrent fsyncs on the same inode again?

The use case I'm looking at is concurrent fsyncs on /different/ inodes,
actually. We have _n_ different processes, each writing (and fsyncing) its own
separate file on the same filesystem.

iirc, ext4_sync_file is called with the inode mutex held, which prevents
concurrent fsyncs on the same inode.
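
For concreteness, here is a minimal userspace sketch of that kind of workload
(hypothetical file names, sizes, and loop counts; this is not the actual ffsb
or fs_mark configuration from the spreadsheet): each child process appends to
its own file and fsyncs in a loop, so flushes for different inodes arrive at
the device at nearly the same time.

/* Hypothetical reproducer: N processes, each appending to and fsyncing
 * its own file on the same filesystem. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int nproc = (argc > 1) ? atoi(argv[1]) : 8;
        char buf[4096];
        int i;

        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < nproc; i++) {
                if (fork() == 0) {
                        char name[64];
                        int fd, j;

                        snprintf(name, sizeof(name), "fsync-test.%d", i);
                        fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                        if (fd < 0) {
                                perror("open");
                                exit(1);
                        }
                        for (j = 0; j < 1000; j++) {
                                if (write(fd, buf, sizeof(buf)) < 0)
                                        perror("write");
                                if (fsync(fd))  /* each call may trigger a cache flush */
                                        perror("fsync");
                        }
                        close(fd);
                        exit(0);
                }
        }
        for (i = 0; i < nproc; i++)
                wait(NULL);
        return 0;
}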

> And what is the amount of performance you need? If we go back to the
> direct submission of REQ_FLUSH request from the earlier flush+fua
> setups that were faster or high end storage, would that be enough for
> you?
>
> Below is a patch brining the optimization back.
>
> WARNING: completely untested!

So I hacked up a patch to the block layer that measures the time delay between
blk_start_request and blk_finish_request when a flush command is encountered,
and I noticed a rather large discrepancy between the delay as observed by the
block layer and the delay as observed by ext4. In general, the delay ext4 sees
is nearly 2x the delay the block layer sees, so I'll give Christoph's
direct-flush patch (from his earlier mail) a try over the weekend.
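
As a point of comparison for those round-trip numbers, the same kind of
measurement can be taken from the very top of the stack. A hypothetical
userspace sketch that times fsync(2) with CLOCK_MONOTONIC (it measures the
whole path through the VFS, ext4, jbd2, the block layer and the device, not
the instrumentation described above):

/* Hypothetical: per-call fsync latency from userspace.
 * Older glibc may need -lrt for clock_gettime(). */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long ns_since(const struct timespec *t0)
{
        struct timespec t1;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0->tv_sec) * 1000000000LL +
               (t1.tv_nsec - t0->tv_nsec);
}

int main(void)
{
        struct timespec t0;
        char buf[4096] = { 0 };
        int fd = open("fsync-latency.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        int i;

        if (fd < 0)
                return 1;
        for (i = 0; i < 100; i++) {
                if (write(fd, buf, sizeof(buf)) < 0)
                        perror("write");
                clock_gettime(CLOCK_MONOTONIC, &t0);
                fsync(fd);
                printf("fsync %d: %lld ns\n", i, ns_since(&t0));
        }
        close(fd);
        return 0;
}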

--D

2010-10-15 23:40:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Fri, Oct 15, 2010 at 04:39:04PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 12, 2010 at 04:14:55PM +0200, Christoph Hellwig wrote:
> > I still think adding code to every filesystem to optimize for a rather
> > stupid use case is not a good idea. I dropped out a bit from the
> > thread in the middle, but what was the real use case for lots of
> > concurrent fsyncs on the same inode again?
>
> The use case I'm looking at is concurrent fsyncs on /different/ inodes,
> actually. We have _n_ different processes, each writing (and fsyncing) its own
> separate file on the same filesystem.
>
> iirc, ext4_sync_file is called with the inode mutex held, which prevents
> concurrent fsyncs on the same inode.

Indeed. Although we could drop it at least for the cache flush
call. We already do this for block devices.
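
A rough sketch of what that could look like (purely hypothetical: it assumes
i_mutex is held at this point in the fsync path and that nothing after the
flush still depends on holding it continuously; the blkdev_issue_flush()
signature shown is the 2.6.35/36 one, which differs from older kernels):

#include <linux/blkdev.h>
#include <linux/fs.h>

/* Hypothetical: drop i_mutex across the cache flush so fsyncs on other
 * inodes of the same filesystem are not serialized behind this one
 * while the device drains its write cache. */
static void ext4_flush_unlocked(struct inode *inode)
{
        mutex_unlock(&inode->i_mutex);
        blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
                           BLKDEV_IFL_WAIT);
        mutex_lock(&inode->i_mutex);
}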

2010-10-16 00:02:18

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Sat, Oct 16, 2010 at 01:40:41AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 15, 2010 at 04:39:04PM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 12, 2010 at 04:14:55PM +0200, Christoph Hellwig wrote:
> > > I still think adding code to every filesystem to optimize for a rather
> > > stupid use case is not a good idea. I dropped out a bit from the
> > > thread in the middle, but what was the real use case for lots of
> > > concurrent fsyncs on the same inode again?
> >
> > The use case I'm looking at is concurrent fsyncs on /different/ inodes,
> > actually. We have _n_ different processes, each writing (and fsyncing) its own
> > separate file on the same filesystem.
> >
> > iirc, ext4_sync_file is called with the inode mutex held, which prevents
> > concurrent fsyncs on the same inode.
>
> Indeed. Although we could drop it at least for the cache flush
> call. We already do this for block devices.

<nod>

Unfortunately, the patch immediately triggers the BUG at
drivers/scsi/scsi_lib.c:1064:
/*
* BLOCK_PC requests may transfer data, in which case they must have
* a bio attached to them. Or they might contain a SCSI command
* that does not transfer data, in which case they may optionally
* submit a request without an attached bio.
*/
if (req->bio) {
int ret;

BUG_ON(!req->nr_phys_segments);

--D

2010-10-18 22:49:35

by djwong

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

Hi all,

I've updated the spreadsheet at http://bit.ly/ahdhyk to include a new set of
flush time figures. There are two main columns in the "flush_times_in_ms"
sheet. block_flush_rtt_avg is the average amount of time it takes a flush
command to be processed by the block layer, the lower level device driver
(LLDD, usually SCSI), and the hardware. lldd_flush_rtt_avg is the average
amount of time it takes the same flush command to be processed by just the
LLDD and the hardware.

Through this table, I'm looking for a performance characteristic that typifies
storage with a battery-backed write cache (BBWC). As we can see from
lldd_flush_rtt_avg, the BBWC storage features a very low flush time, about 1ms
or less. Everything else, including SSDs, is over that amount. The other odd
result I see is that it takes a significant amount of time to get a flush
command from the top of the block layer to the LLDD, though I suspect some of
that might be waiting for the device to process earlier writes. Christoph has
a patch that looks like it streamlines that, but it triggered various BUG_ONs
when I booted with it, so I took the patch out.

I measured the amount of time it takes to get through the fsync coordination
setup. It doesn't take more than about 2us, even on the old hardware.

The hackish patch I use to collect flush times is attached below, if anyone
wants to run their own tests. I'm not an expert on how the block layer works,
so apologies if it makes your eyes bleed.

I'm wondering how to proceed from here. Right now we seem to want to pick an
fsync coordination strategy based on measured flush times. Assuming that a
fixed version of hch's patch (or anyone else's) doesn't eliminate that need,
is it better for the block layer to export flush time data and let each
filesystem figure things out for itself? Or does the block layer know better,
in which case it should do the coordination instead of the filesystem?
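
To make the first option concrete, here is a hypothetical policy sketch that
consumes the avg_flush_time_ns and flush_samples fields the patch below adds
to struct gendisk. The 1 ms cutoff is just the BBWC boundary observed above,
and the 64-sample warm-up is an arbitrary choice, not something I've tested.

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/time.h>
#include <linux/types.h>

/* Fast (battery-backed) write caches flush in roughly 1 ms or less. */
#define FAST_FLUSH_THRESHOLD_NS (1 * NSEC_PER_MSEC)

/*
 * Hypothetical: should this filesystem bother waiting for other fsyncs
 * to coalesce before issuing a flush?  Uses the per-gendisk average
 * flush time collected by the measurement patch below.
 */
static bool fsync_should_coordinate(struct super_block *sb)
{
        struct gendisk *disk = sb->s_bdev->bd_disk;

        /* Too few samples yet: don't add any wait time. */
        if (disk->flush_samples < 64)
                return false;

        /* Slow flushes make coordination worthwhile; fast ones don't. */
        return disk->avg_flush_time_ns > FAST_FLUSH_THRESHOLD_NS;
}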

--D
---

block: Measure flush round-trip times.

This hacky patch adds some more plumbing to the block layer to measure
round-trip times of flush requests. I've tried to write this patch in such a way
that it is compatible with both bio- and request-based LLDDs. It probably
needs some cleanup before we can use it to decide on a flush coordination
strategy.

Signed-off-by: Darrick J. Wong <[email protected]>
---

block/blk-core.c | 26 ++++++++++++++++++++++++++
block/genhd.c | 35 +++++++++++++++++++++++++++++++++++
drivers/md/dm.c | 21 +++++++++++++++++++++
drivers/md/md.c | 19 +++++++++++++++++++
fs/bio.c | 20 ++++++++++++++++++++
include/linux/blk_types.h | 2 ++
include/linux/blkdev.h | 2 ++
include/linux/genhd.h | 3 +++
8 files changed, 128 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1f75c13..96d5a9f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1523,6 +1523,9 @@ static inline void __generic_make_request(struct bio *bio)

trace_block_bio_queue(q, bio);

+ if (bio->bi_rw & REQ_FLUSH)
+ bio->flush_start_time_ns = ktime_to_ns(ktime_get());
+
ret = q->make_request_fn(q, bio);
} while (ret);

@@ -1936,6 +1939,11 @@ void blk_start_request(struct request *req)
req->next_rq->resid_len = blk_rq_bytes(req->next_rq);

blk_add_timer(req);
+
+#if 1
+ if (req->cmd_flags & REQ_FLUSH)
+ req->flush_start_time_ns = ktime_to_ns(ktime_get());
+#endif
}
EXPORT_SYMBOL(blk_start_request);

@@ -2164,6 +2172,24 @@ EXPORT_SYMBOL_GPL(blk_unprep_request);
*/
static void blk_finish_request(struct request *req, int error)
{
+#if 1
+#define NR_FLUSH_SAMPLES 256
+ if (!error && req->cmd_flags & REQ_FLUSH) {
+ u64 delta;
+
+ delta = ktime_to_ns(ktime_get()) - req->flush_start_time_ns;
+// if (req->rq_disk->flush_samples < NR_FLUSH_SAMPLES)
+ req->rq_disk->flush_samples++;
+ if (!req->rq_disk->avg_flush_time_ns)
+ req->rq_disk->avg_flush_time_ns = delta;
+ else {
+ req->rq_disk->avg_flush_time_ns *= (NR_FLUSH_SAMPLES - 1);
+ req->rq_disk->avg_flush_time_ns += delta;
+ req->rq_disk->avg_flush_time_ns /= NR_FLUSH_SAMPLES;
+ }
+ }
+#endif
+
if (blk_rq_tagged(req))
blk_queue_end_tag(req->q, req);

diff --git a/block/genhd.c b/block/genhd.c
index 59a2db6..195fc06 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -540,6 +540,8 @@ void add_disk(struct gendisk *disk)
*/
disk->major = MAJOR(devt);
disk->first_minor = MINOR(devt);
+ disk->avg_flush_time_ns = 0;
+ disk->flush_samples = 0;

blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
@@ -820,6 +822,35 @@ static ssize_t disk_range_show(struct device *dev,
return sprintf(buf, "%d\n", disk->minors);
}

+static ssize_t disk_flush_samples_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return sprintf(buf, "%llu\n", disk->flush_samples);
+}
+
+static ssize_t disk_avg_flush_time_ns_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return sprintf(buf, "%llu\n", disk->avg_flush_time_ns);
+}
+
+static ssize_t disk_avg_flush_time_ns_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ disk->avg_flush_time_ns = 0;
+ disk->flush_samples = 0;
+
+ return count;
+}
+
+
static ssize_t disk_ext_range_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -872,6 +903,8 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
}

static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
+static DEVICE_ATTR(avg_flush_time_ns, S_IRUGO | S_IWUSR, disk_avg_flush_time_ns_show, disk_avg_flush_time_ns_store);
+static DEVICE_ATTR(flush_samples, S_IRUGO, disk_flush_samples_show, NULL);
static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
static DEVICE_ATTR(ro, S_IRUGO, disk_ro_show, NULL);
@@ -894,6 +927,8 @@ static struct device_attribute dev_attr_fail_timeout =

static struct attribute *disk_attrs[] = {
&dev_attr_range.attr,
+ &dev_attr_avg_flush_time_ns.attr,
+ &dev_attr_flush_samples.attr,
&dev_attr_ext_range.attr,
&dev_attr_removable.attr,
&dev_attr_ro.attr,
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 61be47f..2b654c0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2406,6 +2406,25 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
return r;
}

+static void measure_flushes(struct mapped_device *md)
+{
+ struct dm_table *t;
+ struct dm_dev_internal *dd;
+ struct list_head *devices;
+ u64 max = 0;
+
+ t = dm_get_live_table(md);
+ devices = dm_table_get_devices(t);
+ list_for_each_entry(dd, devices, list) {
+ if (dd->dm_dev.bdev->bd_disk->avg_flush_time_ns > max) {
+ max = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
+ md->disk->avg_flush_time_ns = dd->dm_dev.bdev->bd_disk->avg_flush_time_ns;
+ md->disk->flush_samples = dd->dm_dev.bdev->bd_disk->flush_samples;
+ }
+ }
+ dm_table_put(t);
+}
+
static void process_flush(struct mapped_device *md, struct bio *bio)
{
md->flush_error = 0;
@@ -2431,6 +2450,8 @@ static void process_flush(struct mapped_device *md, struct bio *bio)
return;
}

+ measure_flushes(md);
+
/* issue data + REQ_FUA */
bio->bi_rw &= ~REQ_FLUSH;
__split_and_process_bio(md, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ed075d1..28cabce 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -290,6 +290,24 @@ EXPORT_SYMBOL(mddev_congested);
* Generic flush handling for md
*/

+static void measure_flushes(mddev_t *mddev)
+{
+ mdk_rdev_t *rdev;
+ u64 max = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(Faulty, &rdev->flags)) {
+ if (rdev->bdev->bd_disk->avg_flush_time_ns > max) {
+ max = rdev->bdev->bd_disk->avg_flush_time_ns;
+ mddev->gendisk->avg_flush_time_ns = rdev->bdev->bd_disk->avg_flush_time_ns;
+ mddev->gendisk->flush_samples = rdev->bdev->bd_disk->flush_samples;
+ }
+ }
+ rcu_read_unlock();
+}
+
static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
@@ -298,6 +316,7 @@ static void md_end_flush(struct bio *bio, int err)
rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
+ measure_flushes(mddev);
/* The pre-request flush has finished */
schedule_work(&mddev->flush_work);
}
diff --git a/fs/bio.c b/fs/bio.c
index 8abb2df..1a002c6 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1423,11 +1423,31 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
**/
void bio_endio(struct bio *bio, int error)
{
+ struct request_queue *q = NULL;
+
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;

+#define NR_FLUSH_SAMPLES 256
+ if (bio->bi_bdev)
+ q = bdev_get_queue(bio->bi_bdev);
+ if (!(q && q->prep_rq_fn) && !error && bio->bi_rw & REQ_FLUSH) {
+ u64 delta;
+
+ delta = ktime_to_ns(ktime_get()) - bio->flush_start_time_ns;
+// if (bio->bi_bdev->bd_disk->flush_samples < NR_FLUSH_SAMPLES)
+ bio->bi_bdev->bd_disk->flush_samples++;
+ if (!bio->bi_bdev->bd_disk->avg_flush_time_ns)
+ bio->bi_bdev->bd_disk->avg_flush_time_ns = delta;
+ else {
+ bio->bi_bdev->bd_disk->avg_flush_time_ns *= (NR_FLUSH_SAMPLES - 1);
+ bio->bi_bdev->bd_disk->avg_flush_time_ns += delta;
+ bio->bi_bdev->bd_disk->avg_flush_time_ns /= NR_FLUSH_SAMPLES;
+ }
+ }
+
if (bio->bi_end_io)
bio->bi_end_io(bio, error);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1797994..d68ecc7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,8 @@ struct bio {

bio_destructor_t *bi_destructor; /* destructor */

+ u64 flush_start_time_ns;
+
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5cba7fe..e6dc1e2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -160,6 +160,8 @@ struct request {

/* for bidi */
struct request *next_rq;
+
+ uint64_t flush_start_time_ns;
};

static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5f2f4c4..e9c5f8f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -168,6 +168,9 @@ struct gendisk {
struct blk_integrity *integrity;
#endif
int node_id;
+
+ u64 avg_flush_time_ns;
+ u64 flush_samples;
};

static inline struct gendisk *part_to_disk(struct hd_struct *part)
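
For reference, the per-disk average maintained in the blk_finish_request() and
bio_endio() hunks above is an exponentially weighted moving average with
weight 1/256; a standalone restatement of the update rule (not part of the
patch itself):

#include <linux/types.h>

#define NR_FLUSH_SAMPLES 256

/*
 * new_avg = (old_avg * (N - 1) + sample) / N, i.e. each new flush
 * round-trip contributes 1/256 of its value to the running average.
 * The first sample simply seeds the average.
 */
static u64 update_avg_flush_time_ns(u64 avg, u64 sample)
{
        if (!avg)
                return sample;
        return (avg * (NR_FLUSH_SAMPLES - 1) + sample) / NR_FLUSH_SAMPLES;
}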

2010-10-19 18:28:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

On Mon, Oct 18, 2010 at 03:49:36PM -0700, Darrick J. Wong wrote:
> Through this table, I'm looking for a performance characteristic that typifies
> storage with a battery-backed write cache (BBWC). As we can see from
> lldd_flush_rtt_avg, the BBWC storage features a very low flush time, about 1ms
> or less. Everything else, including SSDs, is over that amount. The other odd
> result I see is that it takes a significant amount of time to get a flush
> command from the top of the block layer to the LLDD, though I suspect some of
> that might be waiting for the device to process earlier writes. Christoph has
> a patch that looks like it streamlines that, but it triggered various BUG_ONs
> when I booted with it, so I took the patch out.

We currently synchronize flush requests. There's no real reason to do
so, except that we'll either need to make drivers accept flush requests
with a bio attached to them, or find a workaround in the block layer to
submit them without a bio and without synchronizing them.

I think that should be the first angle of attack before adding
complexity to filesystems.