2009-03-27 20:24:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 0/3] Ext3 latency improvement patches

The following patches have been posted as providing at least some
partial improvement to the ext3 latency problem that has been
discussed on the 2.6.29 mongo-LKML-thread-that-would-not-die.

I plan to push this to Linus during the 2.6.30 merge window if there
are no huge objections. It is by no means a complete solution, but it
should at least help the situation somewhat.

It is also available via git:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

Comments?

- Ted

Jan Kara (1):
ext3: Avoid starting a transaction in writepage when not necessary

Theodore Ts'o (2):
block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
ext3: Use WRITE_SYNC for commits which are caused by fsync()






2009-03-27 20:24:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

When doing synchronous writes because wbc->sync_mode is set to
WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
don't unduly block system calls such as fsync().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/buffer.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..e7ebd95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1714,6 +1714,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
const unsigned blocksize = 1 << inode->i_blkbits;
int nr_underway = 0;
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);

BUG_ON(!PageLocked(page));

@@ -1805,7 +1806,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
nr_underway++;
}
bh = next;
@@ -1859,7 +1860,7 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
nr_underway++;
}
bh = next;
--
1.5.6.3


2009-03-27 20:24:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 3/3] ext3: Avoid starting a transaction in writepage when not necessary

From: Jan Kara <[email protected]>

We don't have to start a transaction in writepage() when all the blocks
are already properly allocated. Even in ordered mode, either the data has
been written via write() and is thus already on the transaction's list, or
it was written via mmap and then it's random in which transaction it gets
written anyway.

This should help the VM page out dirty memory without blocking on
transaction commits.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext3/inode.c | 19 ++++++++++++++-----
1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..a9c12b1 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1435,6 +1435,10 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
return 0;
}

+static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
+{
+ return !buffer_mapped(bh);
+}
/*
* Note that we always start a transaction even if we're not journalling
* data. This is to preserve ordering: any hole instantiation within
@@ -1505,6 +1509,16 @@ static int ext3_ordered_writepage(struct page *page,
if (ext3_journal_current_handle())
goto out_fail;

+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ } else if (!walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
+ /* Provide NULL instead of get_block so that we catch bugs if buffers weren't really mapped */
+ return block_write_full_page(page, NULL, wbc);
+ }
+ page_bufs = page_buffers(page);
+
+
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));

if (IS_ERR(handle)) {
@@ -1512,11 +1526,6 @@ static int ext3_ordered_writepage(struct page *page,
goto out_fail;
}

- if (!page_has_buffers(page)) {
- create_empty_buffers(page, inode->i_sb->s_blocksize,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
- }
- page_bufs = page_buffers(page);
walk_page_buffers(handle, page_bufs, 0,
PAGE_CACHE_SIZE, NULL, bget_one);

--
1.5.6.3


2009-03-27 20:24:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/3] ext3: Use WRITE_SYNC for commits which are caused by fsync()

If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd/commit.c | 23 +++++++++++++++--------
fs/jbd/transaction.c | 2 ++
include/linux/jbd.h | 5 +++++
3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 3fbffb1..f8077b9 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/bio.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -171,14 +172,15 @@ static int journal_write_commit_record(journal_t *journal,
return (ret == -EIO);
}

-static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
+static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
+ int write_op)
{
int i;

for (i = 0; i < bufs; i++) {
wbuf[i]->b_end_io = end_buffer_write_sync;
/* We use-up our safety reference in submit_bh() */
- submit_bh(WRITE, wbuf[i]);
+ submit_bh(write_op, wbuf[i]);
}
}

@@ -186,7 +188,8 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
* Submit all the data buffers to disk
*/
static int journal_submit_data_buffers(journal_t *journal,
- transaction_t *commit_transaction)
+ transaction_t *commit_transaction,
+ int write_op)
{
struct journal_head *jh;
struct buffer_head *bh;
@@ -225,7 +228,7 @@ write_out_data:
BUFFER_TRACE(bh, "needs blocking lock");
spin_unlock(&journal->j_list_lock);
/* Write out all data to prevent deadlocks */
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);
bufs = 0;
lock_buffer(bh);
spin_lock(&journal->j_list_lock);
@@ -256,7 +259,7 @@ write_out_data:
jbd_unlock_bh_state(bh);
if (bufs == journal->j_wbufsize) {
spin_unlock(&journal->j_list_lock);
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);
bufs = 0;
goto write_out_data;
}
@@ -286,7 +289,7 @@ write_out_data:
}
}
spin_unlock(&journal->j_list_lock);
- journal_do_submit_data(wbuf, bufs);
+ journal_do_submit_data(wbuf, bufs, write_op);

return err;
}
@@ -315,6 +318,7 @@ void journal_commit_transaction(journal_t *journal)
int first_tag = 0;
int tag_flag;
int i;
+ int write_op = WRITE;

/*
* First job: lock down the current transaction and wait for
@@ -347,6 +351,8 @@ void journal_commit_transaction(journal_t *journal)
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

+ if (commit_transaction->t_synchronous_commit)
+ write_op = WRITE_SYNC;
spin_lock(&commit_transaction->t_handle_lock);
while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);
@@ -431,7 +437,8 @@ void journal_commit_transaction(journal_t *journal)
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- err = journal_submit_data_buffers(journal, commit_transaction);
+ err = journal_submit_data_buffers(journal, commit_transaction,
+ write_op);

/*
* Wait for all previously submitted IO to complete.
@@ -660,7 +667,7 @@ start_journal_io:
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;
- submit_bh(WRITE, bh);
+ submit_bh(write_op, bh);
}
cond_resched();

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index e6a1174..ed886e6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1440,6 +1440,8 @@ int journal_stop(handle_t *handle)
}
}

+ if (handle->h_sync)
+ transaction->t_synchronous_commit = 1;
current->journal_info = NULL;
spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 64246dc..2c69431 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -552,6 +552,11 @@ struct transaction_s
*/
int t_handle_count;

+ /*
+ * This transaction is being forced and some process is
+ * waiting for it to finish.
+ */
+ int t_synchronous_commit:1;
};

/**
--
1.5.6.3


2009-03-27 20:51:06

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, 2009-03-27 at 16:24 -0400, Theodore Ts'o wrote:
> The following patches have been posted as providing at least some
> partial improvement to the ext3 latency problem that has been
> discussed on the 2.6.29 mongo-LKML-thread-that-would-not-die.

Ric had asked me about a test program that would show the worst case
ext3 behavior. So I've modified your ext3 program a little. It now
creates a 8G file and forks off another proc to do random IO to that
file.

Then it runs one fsync every 4 seconds and times how long they take.
After the program has been running for 60 seconds, it tries to stop.

On my sata drive with barriers on, even btrfs and xfs saw some
multi-second fsyncs, but ext3 came in at 414s for a single fsync.

Warning: don't run this on a laptop drive, you'll still be waiting for
it next year. This is probably full of little errors, I cut it together
pretty quickly.

-chris


Attachments:
fsync-tester.c (2.73 kB)
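
Since the attachment body isn't reproduced in this archive, here is a
minimal userspace sketch of the test Chris describes -- a hypothetical
illustration, not the actual fsync-tester.c. The file size, the 4-second
fsync interval and the 60-second runtime come from his description;
everything else, including the use of a sparse file instead of writing
the 8G out first, is guesswork to keep the sketch short:

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILE_SIZE ((off_t)8 * 1024 * 1024 * 1024)	/* 8G, per the description */
#define BUF_SIZE 4096
#define RUN_SECONDS 60

static double now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	char buf[BUF_SIZE];
	int fd = open("testfile", O_RDWR | O_CREAT, 0644);
	pid_t writer;
	double start;

	if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 'a', sizeof(buf));

	writer = fork();
	if (writer == 0) {
		/* child: random 4k writes all over the file, forever */
		srand(getpid());
		for (;;) {
			off_t off = ((off_t)rand() % (FILE_SIZE / BUF_SIZE)) * BUF_SIZE;

			if (pwrite(fd, buf, BUF_SIZE, off) < 0)
				exit(1);
		}
	}

	/* parent: one timed fsync every 4 seconds, stop after ~60 seconds */
	start = now();
	while (now() - start < RUN_SECONDS) {
		double t0 = now();

		fsync(fd);
		printf("fsync took %.2f seconds\n", now() - t0);
		fflush(stdout);
		sleep(4);
	}
	kill(writer, SIGKILL);
	waitpid(writer, NULL, 0);
	return 0;
}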

2009-03-27 20:55:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Fri 27-03-09 16:24:29, Theodore Ts'o wrote:
> When doing synchronous writes because wbc->sync_mode is set to
> WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
> don't unduly block system calls such as fsync().
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
Looks sane. You can add "Acked-by: Jan Kara <[email protected]>"

Honza
> ---
> fs/buffer.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..e7ebd95 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1714,6 +1714,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> + int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
>
> BUG_ON(!PageLocked(page));
>
> @@ -1805,7 +1806,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> do {
> struct buffer_head *next = bh->b_this_page;
> if (buffer_async_write(bh)) {
> - submit_bh(WRITE, bh);
> + submit_bh(write_op, bh);
> nr_underway++;
> }
> bh = next;
> @@ -1859,7 +1860,7 @@ recover:
> struct buffer_head *next = bh->b_this_page;
> if (buffer_async_write(bh)) {
> clear_buffer_dirty(bh);
> - submit_bh(WRITE, bh);
> + submit_bh(write_op, bh);
> nr_underway++;
> }
> bh = next;
> --
> 1.5.6.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-27 21:04:09

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, 2009-03-27 at 16:50 -0400, Chris Mason wrote:
> On Fri, 2009-03-27 at 16:24 -0400, Theodore Ts'o wrote:
> > The following patches have been posted as providing at least some
> > partial improvement to the ext3 latency problem that has been
> > discussed on the 2.6.29 mongo-LKML-thread-that-would-not-die.
>
> Ric had asked me about a test program that would show the worst case
> ext3 behavior. So I've modified your ext3 program a little. It now
> creates a 8G file and forks off another proc to do random IO to that
> file.
>
> Then it runs one fsync every 4 seconds and times how long they take.
> After the program has been running for 60 seconds, it tries to stop.
>
> On my sata drive with barriers on, even btrfs and xfs saw some
> multi-second fsyncs, but ext3 came in at 414s for a single fsync.
>
> Warning: don't run this on a laptop drive, you'll still be waiting for
> it next year. This is probably full of little errors, I cut it together
> pretty quickly.
>

My understanding of ext4 delalloc is that once blocks are allocated to
file, we go back to data=ordered.

Ext4 is going pretty slowly for this fsync test (slower than ext3), it
looks like we're going for a very long time in
jbd2_journal_commit_transaction -> write_cache_pages.

-chris



2009-03-27 21:19:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri 27-03-09 17:03:38, Chris Mason wrote:
> On Fri, 2009-03-27 at 16:50 -0400, Chris Mason wrote:
> > On Fri, 2009-03-27 at 16:24 -0400, Theodore Ts'o wrote:
> > > The following patches have been posted as providing at least some
> > > partial improvement to the ext3 latency problem that has been
> > > discussed on the 2.6.29 mongo-LKML-thread-that-would-not-die.
> >
> > Ric had asked me about a test program that would show the worst case
> > ext3 behavior. So I've modified your ext3 program a little. It now
> > creates a 8G file and forks off another proc to do random IO to that
> > file.
> >
> > Then it runs one fsync every 4 seconds and times how long they take.
> > After the program has been running for 60 seconds, it tries to stop.
> >
> > On my sata drive with barriers on, even btrfs and xfs saw some
> > multi-second fsyncs, but ext3 came in at 414s for a single fsync.
> >
> > Warning: don't run this on a laptop drive, you'll still be waiting for
> > it next year. This is probably full of little errors, I cut it together
> > pretty quickly.
> >
>
> My understanding of ext4 delalloc is that once blocks are allocated to
> file, we go back to data=ordered.
Yes.

> Ext4 is going pretty slowly for this fsync test (slower than ext3), it
> looks like we're going for a very long time in
> jbd2_journal_commit_transaction -> write_cache_pages.
Yes, this is how we write out ordered data, and obviously it takes a long
time for such a huge file where you do random IO...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-27 21:31:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, Mar 27, 2009 at 05:03:38PM -0400, Chris Mason wrote:
> > Ric had asked me about a test program that would show the worst case
> > ext3 behavior. So I've modified your ext3 program a little. It now
> > creates a 8G file and forks off another proc to do random IO to that
> > file.
> >
>
> My understanding of ext4 delalloc is that once blocks are allocated to
> file, we go back to data=ordered.

Yes, that's correct.

> Ext4 is going pretty slowly for this fsync test (slower than ext3), it
> looks like we're going for a very long time in
> jbd2_journal_commit_transaction -> write_cache_pages.

One of the things that we can do to optimize this case for ext4 (and
ext3) is that if block has already been written out to disk once, we
don't have to flush it to disk a second time. So if we add a new
buffer_head flag which can distinguish between blocks that have been
newly allocated (and not yet been flushed to disk) versus blocks that
have already been flushed to disk at least once, we wouldn't need to
force I/O for blocks in the latter case.

After all, most of the applications which do random I/O to a file
normally will use fsync() appropriately such that they are rewriting
already allocated blocks. So there really is no reason to flush those
blocks out to disk even in data=ordered mode.

We currently flush *all* blocks out to disk in data=ordered mode
because we don't have a good way of telling the difference between the
two cases.

- Ted
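
For illustration, the flag Ted describes might look something like the
fragments below. This is a hypothetical sketch, not code from any posted
patch: the flag name, the choice of state bit, and the call sites are all
invented for the example.

/* New buffer_head state bit: set when the block is first allocated,
 * cleared once the data has been written back at least once.
 * (Assumes a free bit in the private range.) */
enum { BH_Unflushed = BH_PrivateStart + 9 };

BUFFER_FNS(Unflushed, unflushed)	/* buffer_unflushed(), set_/clear_buffer_unflushed() */

/* In the get_block path, when a block is newly mapped (create == 1): */
	set_buffer_unflushed(bh);

/* In journal_submit_data_buffers(), ordered mode could then skip buffers
 * whose blocks have already hit the disk at least once: */
	if (!buffer_unflushed(bh))
		continue;	/* no need to force this block out for the commit */

/* And once writeback of the buffer completes: */
	clear_buffer_unflushed(bh);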

2009-03-27 21:54:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri 27-03-09 17:30:52, Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 05:03:38PM -0400, Chris Mason wrote:
> > > Ric had asked me about a test program that would show the worst case
> > > ext3 behavior. So I've modified your ext3 program a little. It now
> > > creates a 8G file and forks off another proc to do random IO to that
> > > file.
> > >
> >
> > My understanding of ext4 delalloc is that once blocks are allocated to
> > file, we go back to data=ordered.
>
> Yes, that's correct.
>
> > Ext4 is going pretty slowly for this fsync test (slower than ext3), it
> > looks like we're going for a very long time in
> > jbd2_journal_commit_transaction -> write_cache_pages.
>
> One of the things that we can do to optimize this case for ext4 (and
> ext3) is that if block has already been written out to disk once, we
> don't have to flush it to disk a second time. So if we add a new
> buffer_head flag which can distinguish between blocks that have been
> newly allocated (and not yet been flushed to disk) versus blocks that
> have already been flushed to disk at least once, we wouldn't need to
> force I/O for blocks in the latter case.
>
> After all, most of the applications which do random I/O to a file
> normally will use fsync() appropriately such that they are rewriting
> already allocated blocks. So there really is no reason to flush those
> blocks out to disk even in data=ordered mode.
>
> We currently flush *all* blocks out to disk in data=ordered mode
> because we don't have a good way of telling the difference between the
> two cases.
Yes, but OTOH this will make the problem "my data was lost after crash"
worse... Although rewrites aren't that common either so maybe it won't
be that bad.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-27 22:20:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext3: Use WRITE_SYNC for commits which are caused by fsync()

On Fri 27-03-09 16:24:30, Theodore Ts'o wrote:
> If a commit is triggered by fsync(), set a flag indicating the journal
> blocks associated with the transaction should be flushed out using
> WRITE_SYNC.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
And this one seems fine as well. Acked-by: Jan Kara <[email protected]>

Honza

> ---
> fs/jbd/commit.c | 23 +++++++++++++++--------
> fs/jbd/transaction.c | 2 ++
> include/linux/jbd.h | 5 +++++
> 3 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 3fbffb1..f8077b9 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -20,6 +20,7 @@
> #include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> +#include <linux/bio.h>
>
> /*
> * Default IO end handler for temporary BJ_IO buffer_heads.
> @@ -171,14 +172,15 @@ static int journal_write_commit_record(journal_t *journal,
> return (ret == -EIO);
> }
>
> -static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
> +static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
> + int write_op)
> {
> int i;
>
> for (i = 0; i < bufs; i++) {
> wbuf[i]->b_end_io = end_buffer_write_sync;
> /* We use-up our safety reference in submit_bh() */
> - submit_bh(WRITE, wbuf[i]);
> + submit_bh(write_op, wbuf[i]);
> }
> }
>
> @@ -186,7 +188,8 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
> * Submit all the data buffers to disk
> */
> static int journal_submit_data_buffers(journal_t *journal,
> - transaction_t *commit_transaction)
> + transaction_t *commit_transaction,
> + int write_op)
> {
> struct journal_head *jh;
> struct buffer_head *bh;
> @@ -225,7 +228,7 @@ write_out_data:
> BUFFER_TRACE(bh, "needs blocking lock");
> spin_unlock(&journal->j_list_lock);
> /* Write out all data to prevent deadlocks */
> - journal_do_submit_data(wbuf, bufs);
> + journal_do_submit_data(wbuf, bufs, write_op);
> bufs = 0;
> lock_buffer(bh);
> spin_lock(&journal->j_list_lock);
> @@ -256,7 +259,7 @@ write_out_data:
> jbd_unlock_bh_state(bh);
> if (bufs == journal->j_wbufsize) {
> spin_unlock(&journal->j_list_lock);
> - journal_do_submit_data(wbuf, bufs);
> + journal_do_submit_data(wbuf, bufs, write_op);
> bufs = 0;
> goto write_out_data;
> }
> @@ -286,7 +289,7 @@ write_out_data:
> }
> }
> spin_unlock(&journal->j_list_lock);
> - journal_do_submit_data(wbuf, bufs);
> + journal_do_submit_data(wbuf, bufs, write_op);
>
> return err;
> }
> @@ -315,6 +318,7 @@ void journal_commit_transaction(journal_t *journal)
> int first_tag = 0;
> int tag_flag;
> int i;
> + int write_op = WRITE;
>
> /*
> * First job: lock down the current transaction and wait for
> @@ -347,6 +351,8 @@ void journal_commit_transaction(journal_t *journal)
> spin_lock(&journal->j_state_lock);
> commit_transaction->t_state = T_LOCKED;
>
> + if (commit_transaction->t_synchronous_commit)
> + write_op = WRITE_SYNC;
> spin_lock(&commit_transaction->t_handle_lock);
> while (commit_transaction->t_updates) {
> DEFINE_WAIT(wait);
> @@ -431,7 +437,8 @@ void journal_commit_transaction(journal_t *journal)
> * Now start flushing things to disk, in the order they appear
> * on the transaction lists. Data blocks go first.
> */
> - err = journal_submit_data_buffers(journal, commit_transaction);
> + err = journal_submit_data_buffers(journal, commit_transaction,
> + write_op);
>
> /*
> * Wait for all previously submitted IO to complete.
> @@ -660,7 +667,7 @@ start_journal_io:
> clear_buffer_dirty(bh);
> set_buffer_uptodate(bh);
> bh->b_end_io = journal_end_buffer_io_sync;
> - submit_bh(WRITE, bh);
> + submit_bh(write_op, bh);
> }
> cond_resched();
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index e6a1174..ed886e6 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1440,6 +1440,8 @@ int journal_stop(handle_t *handle)
> }
> }
>
> + if (handle->h_sync)
> + transaction->t_synchronous_commit = 1;
> current->journal_info = NULL;
> spin_lock(&journal->j_state_lock);
> spin_lock(&transaction->t_handle_lock);
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index 64246dc..2c69431 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -552,6 +552,11 @@ struct transaction_s
> */
> int t_handle_count;
>
> + /*
> + * This transaction is being forced and some process is
> + * waiting for it to finish.
> + */
> + int t_synchronous_commit:1;
> };
>
> /**
> --
> 1.5.6.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-27 22:23:49

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/3] ext3: Avoid starting a transaction in writepage when not necessary

On Fri 27-03-09 16:24:31, Theodore Ts'o wrote:
> From: Jan Kara <[email protected]>
>
> We don't have to start a transaction in writepage() when all the blocks
> are a properly allocated. Even in ordered mode either the data has been
> written via write() and they are thus already added to transaction's list
> or the data was written via mmap and then it's random in which transaction
> they get written anyway.
>
> This should help VM to pageout dirty memory without blocking on transaction
> commits.
>
> Signed-off-by: Jan Kara <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
Please, use the patch below instead (and I'd also wait a few days for
Mingo to check whether it also helps him). It also changes data=writeback
mode in the same way and it adheres to coding style...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

---

>From 8c1678ee703b36f3be51c856f4586c1512d98521 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Thu, 26 Mar 2009 13:08:04 +0100
Subject: [PATCH] ext3: Avoid starting a transaction in writepage when not necessary

We don't have to start a transaction in writepage() when all the blocks
are already properly allocated. Even in ordered mode, either the data has
been written via write() and is thus already on the transaction's list, or
it was written via mmap and then it's random in which transaction it gets
written anyway.

This should help the VM page out dirty memory without blocking on
transaction commits.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 31 ++++++++++++++++++++++++++-----
1 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index e230f7a..73f605a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1420,6 +1420,10 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

+static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
+{
+ return !buffer_mapped(bh);
+}
/*
* Note that we always start a transaction even if we're not journalling
* data. This is to preserve ordering: any hole instantiation within
@@ -1490,6 +1494,19 @@ static int ext3_ordered_writepage(struct page *page,
if (ext3_journal_current_handle())
goto out_fail;

+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ page_bufs = page_buffers(page);
+ } else {
+ page_bufs = page_buffers(page);
+ if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, buffer_unmapped)) {
+ /* Provide NULL get_block() to catch bugs if buffers
+ * weren't really mapped */
+ return block_write_full_page(page, NULL, wbc);
+ }
+ }
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));

if (IS_ERR(handle)) {
@@ -1497,11 +1514,6 @@ static int ext3_ordered_writepage(struct page *page,
goto out_fail;
}

- if (!page_has_buffers(page)) {
- create_empty_buffers(page, inode->i_sb->s_blocksize,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
- }
- page_bufs = page_buffers(page);
walk_page_buffers(handle, page_bufs, 0,
PAGE_CACHE_SIZE, NULL, bget_one);

@@ -1549,6 +1561,15 @@ static int ext3_writeback_writepage(struct page *page,
if (ext3_journal_current_handle())
goto out_fail;

+ if (page_has_buffers(page)) {
+ if (!walk_page_buffers(NULL, page_buffers(page), 0,
+ PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
+ /* Provide NULL get_block() to catch bugs if buffers
+ * weren't really mapped */
+ return block_write_full_page(page, NULL, wbc);
+ }
+ }
+
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
--
1.6.0.2


2009-03-27 23:03:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/3] ext3: Avoid starting a transaction in writepage when not necessary

On Fri, Mar 27, 2009 at 11:23:46PM +0100, Jan Kara wrote:
> On Fri 27-03-09 16:24:31, Theodore Ts'o wrote:
> > From: Jan Kara <[email protected]>
> >
> > We don't have to start a transaction in writepage() when all the blocks
> > are a properly allocated. Even in ordered mode either the data has been
> > written via write() and they are thus already added to transaction's list
> > or the data was written via mmap and then it's random in which transaction
> > they get written anyway.
> >
> > This should help VM to pageout dirty memory without blocking on transaction
> > commits.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > Signed-off-by: "Theodore Ts'o" <[email protected]>
> Please, use the patch below instead (and I'd also wait a few days for
> Mingo to check whether it also helps him). It also changes data=writeback
> mode in the same way and it adheres to coding style...

FYI, Looks like Linus has already cherry-picked your first patch from
LKML and dropped it into mainline. My other two patches haven't gone
in yet as far as I can tell.

So we'll need to do a delta patch with the differences between your new
patch and what Linus has already pulled into mainline.

- Ted

2009-03-27 23:09:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, Mar 27, 2009 at 10:54:54PM +0100, Jan Kara wrote:
> Yes, but OTOH this will make the problem "my data was lost after crash"
> worse... Although rewrites aren't that common either so maybe it won't
> be that bad.

The real question is how many applications actually do random access
I/O to rewrite blocks and *don't* use fsync()? Most of the
applications which do block rewrites are databases, and those folks
generally pay very close attention to using synchronization primitives.
It's the people who use the standard I/O interfaces, and who end up
rewriting files by replacing them outright, that tend to lose.

OTOH, the really big databases will tend to use direct I/O, so they
won't be dirtying the page cache anyway. So maybe it's not worth the
effort; it's not clear what workloads will do huge numbers of block
rewrites in the first place. Your test program does, of course, but
the question is whether that happens a lot in real life, whether we
should optimize for it, and whether we can assume that applications
which do this tend to be databases like sqlite that are already quite
anal about calling fsync().

- Ted

2009-03-28 00:14:30

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

Theodore Tso wrote:
> OTOH, the really big databases will tend to use direct I/O, so they
> won't be dirtying the page cache anyway. So maybe it's not worth the

Not necessarily... From what I understand, a lot of the individual
low-level components in cloud storage, such as GoogleFS's chunk
server[1], do not bypass the page cache, even though they do care about
the details of data caching and data consistency.

I am looking at the same areas for my own distributed storage work, and
am finding that the current crop of Linux-specific,
database/server-friendly syscalls permits more application control over
pagecache usage than in past years, decreasing the need for O_DIRECT.
Things like readahead(2), sync_file_range(2), and fadvise(3) really help.

Jeff


[1] http://labs.google.com/papers/gfs-sosp2003.pdf


2009-03-28 00:25:13

by David Rees

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, Mar 27, 2009 at 5:14 PM, Jeff Garzik <[email protected]> wrote:
> Theodore Tso wrote:
>> OTOH, the really big databases will tend to use direct I/O, so they
>> won't be dirtying the page cache anyway. So maybe it's not worth the
>
> Not necessarily... From what I understand, a lot of the individual
> low-level components in cloud storage, such as GoogleFS's chunk server[1] do
> not bypass the page cache, even though they do care about the details of
> data caching and data consistency.

PostgreSQL does not use direct I/O, either (except for the
write-ahead logs, which are written sequentially and only get read
during database recovery). I'm sure that most of MySQL's database
engines also don't.

-Dave

2009-03-30 11:23:37

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Fri, Mar 27, 2009 at 05:30:52PM -0400, Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 05:03:38PM -0400, Chris Mason wrote:
> > > Ric had asked me about a test program that would show the worst case
> > > ext3 behavior. So I've modified your ext3 program a little. It now
> > > creates a 8G file and forks off another proc to do random IO to that
> > > file.
> > >
> >
> > My understanding of ext4 delalloc is that once blocks are allocated to
> > file, we go back to data=ordered.
>
> Yes, that's correct.
>
> > Ext4 is going pretty slowly for this fsync test (slower than ext3), it
> > looks like we're going for a very long time in
> > jbd2_journal_commit_transaction -> write_cache_pages.
>
> One of the things that we can do to optimize this case for ext4 (and
> ext3) is that if block has already been written out to disk once, we
> don't have to flush it to disk a second time. So if we add a new
> buffer_head flag which can distinguish between blocks that have been
> newly allocated (and not yet been flushed to disk) versus blocks that
> have already been flushed to disk at least once, we wouldn't need to
> force I/O for blocks in the latter case.

write_cache_pages should only look at pages which are marked dirty,
right? So why are we writing these pages again and again?

-aneesh

2009-03-30 11:44:52

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

On Mon, 2009-03-30 at 16:53 +0530, Aneesh Kumar K.V wrote:
> On Fri, Mar 27, 2009 at 05:30:52PM -0400, Theodore Tso wrote:
> > On Fri, Mar 27, 2009 at 05:03:38PM -0400, Chris Mason wrote:
> > > > Ric had asked me about a test program that would show the worst case
> > > > ext3 behavior. So I've modified your ext3 program a little. It now
> > > > creates a 8G file and forks off another proc to do random IO to that
> > > > file.
> > > >
> > >
> > > My understanding of ext4 delalloc is that once blocks are allocated to
> > > file, we go back to data=ordered.
> >
> > Yes, that's correct.
> >
> > > Ext4 is going pretty slowly for this fsync test (slower than ext3), it
> > > looks like we're going for a very long time in
> > > jbd2_journal_commit_transaction -> write_cache_pages.
> >
> > One of the things that we can do to optimize this case for ext4 (and
> > ext3) is that if block has already been written out to disk once, we
> > don't have to flush it to disk a second time. So if we add a new
> > buffer_head flag which can distinguish between blocks that have been
> > newly allocated (and not yet been flushed to disk) versus blocks that
> > have already been flushed to disk at least once, we wouldn't need to
> > force I/O for blocks in the latter case.
>
> write_cache_pages should only look at pages which are marked dirty right
> ?. So why are we writing these pages again and again ?

The test program is constantly creating new dirty pages to random
offsets on the disk ;)

-chris



2009-03-30 13:22:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/3] ext3: Avoid starting a transaction in writepage when not necessary

On Fri 27-03-09 19:03:41, Theodore Tso wrote:
> On Fri, Mar 27, 2009 at 11:23:46PM +0100, Jan Kara wrote:
> > On Fri 27-03-09 16:24:31, Theodore Ts'o wrote:
> > > From: Jan Kara <[email protected]>
> > >
> > > We don't have to start a transaction in writepage() when all the blocks
> > > are a properly allocated. Even in ordered mode either the data has been
> > > written via write() and they are thus already added to transaction's list
> > > or the data was written via mmap and then it's random in which transaction
> > > they get written anyway.
> > >
> > > This should help VM to pageout dirty memory without blocking on transaction
> > > commits.
> > >
> > > Signed-off-by: Jan Kara <[email protected]>
> > > Signed-off-by: "Theodore Ts'o" <[email protected]>
> > Please, use the patch below instead (and I'd also wait a few days for
> > Mingo to check whether it also helps him). It also changes data=writeback
> > mode in the same way and it adheres to coding style...
>
> FYI, Looks like Linus has already cherry-picked your first patch from
> LKML and dropped it into mainline. My other two patches haven't gone
> in yet as far as I can tell.
>
> So we'll need to do a delta patch that has the differences from your
> new patch and what Linus has already pulled into mainline.
Thanks for letting me know. Attached is a rediff against current
Linus's tree.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext3-Avoid-starting-a-transaction-in-writepage-when.patch (1.52 kB)

2009-03-30 14:20:34

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 0/3] Ext3 latency improvement patches

David Rees wrote:
> On Fri, Mar 27, 2009 at 5:14 PM, Jeff Garzik <[email protected]> wrote:
>
>> Theodore Tso wrote:
>>
>>> OTOH, the really big databases will tend to use direct I/O, so they
>>> won't be dirtying the page cache anyway. So maybe it's not worth the
>>>
>> Not necessarily... From what I understand, a lot of the individual
>> low-level components in cloud storage, such as GoogleFS's chunk server[1] do
>> not bypass the page cache, even though they do care about the details of
>> data caching and data consistency.
>>
>
> PostgreSQL does not use direct I/O, either (except for the
> write-ahead-logs which are written sequentially and only get read
> during database recovery). I'm sure that most of MySQL's database
> engines, also don't.
>
> -Dave
>

The high-end, traditional databases like DB2 and Oracle definitely do
tend to use direct I/O and carefully manage on their own which pages
are cached and which are not.

They also tend to use database "page sizes" larger than our VM page
size or FS block size and work hard to send large, aligned IO's down to
storage in the correct order so they can be fully recoverable after a
crash (no partially updated DB pages, aka "torn pages").

A lot of the cloud storage people rely on whole files. For example, you
implement RAID at the file level by breaking your file down into K
chunks, each one sent over the network to a different machine. Each chunk
is really a whole file and is sent to disk (hopefully with an fsync()!)
before the transaction is acked. They don't worry about data integrity
for objects smaller than that chunk size.

At least, this is how we did it in Centera - without doing that, you are
definitely open to data loss.

Ric
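
In userspace terms, the "make the chunk durable before acking" ordering
Ric describes boils down to roughly the following pattern (a generic
sketch; the path handling is a placeholder and this is not Centera code):

#include <fcntl.h>
#include <unistd.h>

/* Write one chunk and make it durable before the caller acks the
 * transaction.  Short writes and the fsync() of the containing
 * directory (needed to make a newly created name durable) are glossed
 * over for brevity. */
static int store_chunk(const char *path, const void *data, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);	/* ack the chunk only after this succeeds */
}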




2009-04-07 06:25:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Fri, 27 Mar 2009 16:24:29 -0400 "Theodore Ts'o" <[email protected]> wrote:

> When doing synchronous writes because wbc->sync_mode is set to
> WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
> don't unduly block system calls such as fsync().
>

Who what where why when? How does this patch work?

> ---
> fs/buffer.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..e7ebd95 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1714,6 +1714,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> + int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
>
> BUG_ON(!PageLocked(page));
>
> @@ -1805,7 +1806,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> do {
> struct buffer_head *next = bh->b_this_page;
> if (buffer_async_write(bh)) {
> - submit_bh(WRITE, bh);
> + submit_bh(write_op, bh);
> nr_underway++;
> }
> bh = next;
> @@ -1859,7 +1860,7 @@ recover:
> struct buffer_head *next = bh->b_this_page;
> if (buffer_async_write(bh)) {
> clear_buffer_dirty(bh);
> - submit_bh(WRITE, bh);
> + submit_bh(write_op, bh);
> nr_underway++;
> }
> bh = next;

<spends faaaaaaaaaaaaaaaaaaaaaar too long trying to work out what WRITE_SYNC does>

ytf can't we document these things?

I'm having trouble distinguishing all that code from a pile of crap.

I mean, let's graph it:

WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
-> rq_is_sync() -> does something mysterious in IO schedulers
-> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
-> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!

Is this really so obvious and simple that we can afford to leave
WRITE_SYNC semantics undocumented?

All this makes it impossible to review your patch.

2009-04-07 06:54:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Mon, 6 Apr 2009 23:21:41 -0700 Andrew Morton <[email protected]> wrote:

> I mean, let's graph it:
>
> WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
> -> rq_is_sync() -> does something mysterious in IO schedulers
> -> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
> -> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!

whoop, I found a use of bio_unplug() in __make_request().

So it appears that the intent of your patch is to cause an unplug after
submission of each WB_SYNC_ALL block?

But what about all the other stuff which WRITE_SYNC might or might not
do? What does WRITE_SYNC _actually_ do, and what are the actual
effects of this change??

And what effect will this large stream of unplugs have upon merging?

2009-04-07 07:08:40

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Mon, Apr 06 2009, Andrew Morton wrote:
> On Mon, 6 Apr 2009 23:21:41 -0700 Andrew Morton <[email protected]> wrote:
>
> > I mean, let's graph it:
> >
> > WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
> > -> rq_is_sync() -> does something mysterious in IO schedulers
> > -> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
> > -> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!
>
> whoop, I found a use of bio_unplug() in __make_request().
>
> So it appears that the intent of your patch is to cause an unplug after
> submission of each WB_SYNC_ALL block?
>
> But what about all the other stuff which WRITE_SYNC might or might not
> do? What does WRITE_SYNC _actually_ do, and what are the actual
> effects of this change??
>
> And what effect will this large stream of unplugs have upon merging?

It looks like a good candidate for WRITE_SYNC_PLUG instead, since it
does more than one buffer submission before waiting. It likely won't mean
a whole lot since we'll usually only have a single buffer on that page,
but for < PAGE_CACHE_SIZE block sizes it could easily make a big
difference (4 IOs instead of 1!).

So on the write side, basically we have:

WRITE Normal async write.
WRITE_SYNC_PLUG Sync write, someone will wait on this so don't
treat it as background activity. This is a hint
to the io schedulers. This one does NOT unplug
the queue, either the caller should do it after
submission, or he should make sure that the
wait_on_* callbacks do it for him.
WRITE_SYNC Like WRITE_SYNC_PLUG, but causes immediate
unplug of the queue after submission. Most
uses of this should likely use WRITE_SYNC_PLUG,
at least in the normal IO path.
WRITE_ODIRECT Like WRITE_SYNC, but also passes a hint to the
IO scheduler that we should expect more IO.
This is similar to how a read is treated in the
scheduler, it'll enable anticipation/idling.

Ditto for the SWRITE* variants, which are special hacks for
ll_rw_block() only.

I have killed REQ_UNPLUG; it doesn't make sense to pass it further down
than __make_request(), so the bio flag is enough.

--
Jens Axboe
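
Pulling Andrew's flag chase and Jens's table together, the write variants
work out to roughly the following combinations of bio flags. This is
reconstructed from the discussion above rather than copied from a
particular kernel tree, so the exact definitions may differ:

/* approximate reconstruction, not authoritative */
#define WRITE_SYNC_PLUG	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
#define WRITE_SYNC	(WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
#define WRITE_ODIRECT	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
	/* sync + immediate unplug, but without NOIDLE, so the scheduler
	 * may still idle waiting for more IO from the same task */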


2009-04-07 07:17:30

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07 2009, Jens Axboe wrote:
> On Mon, Apr 06 2009, Andrew Morton wrote:
> > On Mon, 6 Apr 2009 23:21:41 -0700 Andrew Morton <[email protected]> wrote:
> >
> > > I mean, let's graph it:
> > >
> > > WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
> > > -> rq_is_sync() -> does something mysterious in IO schedulers
> > > -> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
> > > -> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!
> >
> > whoop, I found a use of bio_unplug() in __make_request().
> >
> > So it appears that the intent of your patch is to cause an unplug after
> > submission of each WB_SYNC_ALL block?
> >
> > But what about all the other stuff which WRITE_SYNC might or might not
> > do? What does WRITE_SYNC _actually_ do, and what are the actual
> > effects of this change??
> >
> > And what effect will this large stream of unplugs have upon merging?
>
> It looks like a good candidate for WRITE_SYNC_PLUG instead, since it
> does more than one buffer submission before waiting. It likely wont mean
> a whole lot since we'll usually only have a single buffer on that page,
> but for < PAGE_CACHE_SIZE block sizes it could easily make a big
> difference (4 ios instead of 1!).
>
> So on the write side, basically we have:
>
> WRITE Normal async write.
> WRITE_SYNC_PLUG Sync write, someone will wait on this so don't
> treat it as background activity. This is a hint
> to the io schedulers. This one does NOT unplug
> the queue, either the caller should do it after
> submission, or he should make sure that the
> wait_on_* callbacks do it for him.
> WRITE_SYNC Like WRITE_SYNC_PLUG, but causes immediate
> unplug of the queue after submission. Most
> uses of this should likely use WRITE_SYNC_PLUG,
> at least in the normal IO path.
> WRITE_ODIRECT Like WRITE_SYNC, but also passes a hint to the
> IO scheduler that we should expect more IO.
> This is similar to how a read is treated in the
> scheduler, it'll enable anticipation/idling.
>
> Ditto for the SWRITE* variants, which are special hacks for
> ll_rw_block() only.
>
> I have killed REQ_UNPLUG, it doesn't make sense to pass the further down
> than to __make_request(), so the bio flag is enough.

BTW, with the increased amount of sync IO and unplugging, it makes sense
to soon look into some finer granularity of plugging. If we didn't have
so many single-page submission paths it would not be as big a problem,
but we do. And since they still persist so many years after we added
functionality to pass bigger IOs, it likely won't be much better in the
future either.

So we can either look into doing per io context plugging, or doing
something similar to:

plugctx = blk_get_plug_context();
...
submit_bio_plug(rw, bio, plugctx);
...
submit_bio_plug(rw, bio, plugctx);
...
blk_submit_plug_context(plugctx);

and pass that down through wbc, perhaps. Dunno, just a thought.
Basically a work-around for not having a dedicated writepages() that
does the right thing (ext3 anyone?).

--
Jens Axboe

2009-04-07 07:26:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, 7 Apr 2009 09:08:36 +0200 Jens Axboe <[email protected]> wrote:

> On Mon, Apr 06 2009, Andrew Morton wrote:
> > On Mon, 6 Apr 2009 23:21:41 -0700 Andrew Morton <[email protected]> wrote:
> >
> > > I mean, let's graph it:
> > >
> > > WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
> > > -> rq_is_sync() -> does something mysterious in IO schedulers
> > > -> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
> > > -> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!

I think the number of different greps which was needed to find all the
above was excessive. Too many levels of wrappers and helpers.

If there was documentation at the intermediate levels then that would
terminate the search early. But working out the _actual_ semantics of
(say) BIO_RW_SYNCIO is quite hard!

> > whoop, I found a use of bio_unplug() in __make_request().
> >
> > So it appears that the intent of your patch is to cause an unplug after
> > submission of each WB_SYNC_ALL block?
> >
> > But what about all the other stuff which WRITE_SYNC might or might not
> > do? What does WRITE_SYNC _actually_ do, and what are the actual
> > effects of this change??
> >
> > And what effect will this large stream of unplugs have upon merging?
>
> It looks like a good candidate for WRITE_SYNC_PLUG instead,

Perhaps that means that Ted didn't know what his own patch did. I
certainly couldn't work it out. That's a problem, IMO!

> since it
> does more than one buffer submission before waiting. It likely wont mean
> a whole lot since we'll usually only have a single buffer on that page,
> but for < PAGE_CACHE_SIZE block sizes it could easily make a big
> difference (4 ios instead of 1!).

OK.

But what is the advantage in doing this stream of unplugs? For your
average fsync(), we're probably doing tens of thousands per second.
Does it actually help?

I assume that the actual code path for such a buffer becomes a lot
longer because we need to go beyond the queueing layer and perhaps
as far down as the device driver for each block, so the CPU cost will
go up?

> So on the write side, basically we have:

Could we get this in patch form, pretty please?

> WRITE Normal async write.
> WRITE_SYNC_PLUG Sync write, someone will wait on this so don't
> treat it as background activity. This is a hint
> to the io schedulers. This one does NOT unplug
> the queue, either the caller should do it after
> submission, or he should make sure that the
> wait_on_* callbacks do it for him.

The description isn't terribly useful unless the reader is told what
actions the schedulers are expected to take in response to the hint.

> WRITE_SYNC Like WRITE_SYNC_PLUG, but causes immediate
> unplug of the queue after submission. Most
> uses of this should likely use WRITE_SYNC_PLUG,
> at least in the normal IO path.
> WRITE_ODIRECT Like WRITE_SYNC, but also passes a hint to the
> IO scheduler that we should expect more IO.
> This is similar to how a read is treated in the
> scheduler, it'll enable anticipation/idling.

Ditto, somewhat.

> Ditto for the SWRITE* variants, which are special hacks for
> ll_rw_block() only.
>
> I have killed REQ_UNPLUG, it doesn't make sense to pass the further down
> than to __make_request(), so the bio flag is enough.

OK.

2009-04-07 07:57:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07 2009, Andrew Morton wrote:
> On Tue, 7 Apr 2009 09:08:36 +0200 Jens Axboe <[email protected]> wrote:
>
> > On Mon, Apr 06 2009, Andrew Morton wrote:
> > > On Mon, 6 Apr 2009 23:21:41 -0700 Andrew Morton <[email protected]> wrote:
> > >
> > > > I mean, let's graph it:
> > > >
> > > > WRITE_SYNC -> WRITE_SYNC_PLUG -> BIO_RW_SYNCIO -> bio_sync() -> REQ_RW_SYNC -> rw_is_sync() -> does something mysterious in get_request()
> > > > -> rq_is_sync() -> does something mysterious in IO schedulers
> > > > -> BIO_RW_NOIDLE -> bio_noidle() -> REQ_NOIDLE -> rq_noidle() -> does something mysterious in cfq-iosched only
> > > > -> BIO_RW_UNPLUG -> bio_unplug() -> REQ_UNPLUG -> OK, the cognoscenti know what this is supposed to do, but it is unused!
>
> I think the number of different greps which was needed to find all the
> above was excessive. Too many levels of wrappers and helpers.
>
> If there was documentation at the intermediate levels then that would
> terminate the search early. But working out the _actual_ semantics of
> (say) BIO_RW_SYNCIO is quite hard!

I'll add some proper documentation to the list of WRITE* definitions.

> > > whoop, I found a use of bio_unplug() in __make_request().
> > >
> > > So it appears that the intent of your patch is to cause an unplug after
> > > submission of each WB_SYNC_ALL block?
> > >
> > > But what about all the other stuff which WRITE_SYNC might or might not
> > > do? What does WRITE_SYNC _actually_ do, and what are the actual
> > > effects of this change??
> > >
> > > And what effect will this large stream of unplugs have upon merging?
> >
> > It looks like a good candidate for WRITE_SYNC_PLUG instead,
>
> Perhaps that mean that Ted didn't know what his own patch did. I
> certainly couldn't work it out. That's a problem, IMO!

I think Ted knew, we did talk about the fact that it would unplug. But
note that it only really changes the number of unplugs for the cases
where you do multiple bh submissions before doing a
wait_on/lock_buffer() since that'll do the unplug anyway. For those
cases we do want to use WRITE_SYNC_PLUG to defer unplugging to the first
wait/lock of the bh.

> > since it
> > does more than one buffer submission before waiting. It likely wont mean
> > a whole lot since we'll usually only have a single buffer on that page,
> > but for < PAGE_CACHE_SIZE block sizes it could easily make a big
> > difference (4 ios instead of 1!).
>
> OK.
>
> But what is the advantage in doing this stream of unplugs? For your
> average fsync(), we're probably doing tens of thousands per second.
> Does it actually help?
>
> I assume that the actual code path for such a buffer becomes a lot
> longer because we need to go beyond the the queueing layer and perhaps
> as far down as the device driver for each block, so the CPU cost will
> go up?

If you are just doing a single IO submission, whether you do that unplug
in sync_buffer() or explicitly at the IO submission call doesn't make
any difference. In fact, the stack will be shorter from the submission
path by a function call or two.

> > So on the write side, basically we have:
>
> Could we get this in patch form, pretty please?

Certainly, I'll improve it a bit and add it as well.

> > WRITE Normal async write.
> > WRITE_SYNC_PLUG Sync write, someone will wait on this so don't
> > treat it as background activity. This is a hint
> > to the io schedulers. This one does NOT unplug
> > the queue, either the caller should do it after
> > submission, or he should make sure that the
> > wait_on_* callbacks do it for him.
>
> The description isn't terribly useful unless the reader is told what
> actions the schedulers are expected to take in response to the hint.

Well, the caller doesn't really need to know. Basically it boils down to
whether you are going to be waiting on this IO or not. If you are, it's
a sync write. I don't want to add a lot of IO scheduler specifics about
how we treat a sync write differently from an async one.

I'll rewrite the text.

--
Jens Axboe

2009-04-07 08:16:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07 2009, Jens Axboe wrote:
> BTW, with the increased number of sync IO and unplugging, it makes sense
> to soon look into some finer granularity of plugging. If we didn't have
> so many single page submission paths it would not be as big a problem,
> but we do. And since they still persist so many years after we added
> functionality to pass bigger IOs, it likely wont be much better in the
> future either.
>
> So we can either look into doing per io context plugging, or doing
> something similar to:
>
> plugctx = blk_get_plug_context();
> ...
> submit_bio_plug(rw, bio, plugctx);
> ...
> submit_bio_plug(rw, bio, plugctx);
> ...
> blk_submit_plug_context(plugctx);
>
> and pass that down through wbc, perhaps. Dunno, just a thought.
> Basically a work-around for not having a dedicated writepages() that
> does the right thing (ext3 anyone?).

Here's a quick mockup. It compiles, but that's about all the usage it
has seen so far :-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43fdedc..5cf416c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1567,6 +1567,17 @@ void submit_bio(int rw, struct bio *bio)
}
EXPORT_SYMBOL(submit_bio);

+void submit_bio_plug(int rw, struct bio *bio, struct blk_plug_ctx *ctx)
+{
+ if (ctx) {
+ bio->bi_rw |= rw;
+ bio->bi_next = ctx->bio_list;
+ ctx->bio_list = bio;
+ } else
+ submit_bio(rw, bio);
+}
+EXPORT_SYMBOL(submit_bio_plug);
+
/**
* blk_rq_check_limits - Helper function to check a request for the queue limit
* @q: the queue
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..e4313e3 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -101,6 +101,8 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ret->cic_list);
ret->ioc_data = NULL;
+ ret->plug_ctx.bio_list = NULL;
+ ret->plug_ctx.state = 0;
}

return ret;
@@ -171,6 +173,55 @@ void copy_io_context(struct io_context **pdst, struct io_context **psrc)
}
EXPORT_SYMBOL(copy_io_context);

+struct blk_plug_ctx *blk_get_plug_context(void)
+{
+ struct io_context *ioc;
+
+ ioc = current_io_context(GFP_ATOMIC, -1);
+ if (!ioc)
+ return NULL;
+
+ if (!test_and_set_bit_lock(0, &ioc->plug_ctx.state))
+ return &ioc->plug_ctx;
+
+ return NULL;
+}
+
+static void __blk_submit_plug_context(struct blk_plug_ctx *ctx)
+{
+ struct block_device *bdev = NULL;
+ struct bio *bio;
+
+ while ((bio = ctx->bio_list) != NULL) {
+ ctx->bio_list = bio->bi_next;
+ bio->bi_next = NULL;
+
+ if (bdev && bdev != bio->bi_bdev)
+ blk_unplug(bdev_get_queue(bdev));
+
+ if (bio_unplug(bio))
+ bdev = bio->bi_bdev;
+
+ bio->bi_rw &= ~(1 << BIO_RW_UNPLUG);
+
+ submit_bio(bio->bi_rw, bio);
+ }
+}
+
+void blk_submit_plug_context(struct blk_plug_ctx *ctx)
+{
+ if (ctx) {
+ __blk_submit_plug_context(ctx);
+ clear_bit_unlock(0, &ctx->state);
+ }
+}
+
+void blk_flush_plug_context(struct blk_plug_ctx *ctx)
+{
+ if (ctx)
+ __blk_submit_plug_context(ctx);
+}
+
static int __init blk_ioc_init(void)
{
iocontext_cachep = kmem_cache_create("blkdev_ioc",
diff --git a/fs/buffer.c b/fs/buffer.c
index 6e35762..2ed21b8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1698,7 +1698,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh(write_op, bh);
+ submit_bh_plug(write_op, bh, wbc->plug);
nr_underway++;
}
bh = next;
@@ -2884,8 +2884,10 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
bio_put(bio);
}

-int submit_bh(int rw, struct buffer_head * bh)
+static int __submit_bh(int rw, struct buffer_head * bh,
+ struct blk_plug_ctx *ctx)
{
+ gfp_t gfp = ctx ? GFP_ATOMIC : GFP_NOIO;
struct bio *bio;
int ret = 0;

@@ -2910,7 +2912,12 @@ int submit_bh(int rw, struct buffer_head * bh)
* from here on down, it's all bio -- do the initial mapping,
* submit_bio -> generic_make_request may further map this bio around
*/
- bio = bio_alloc(GFP_NOIO, 1);
+ bio = bio_alloc(gfp, 1);
+ if (!bio) {
+ blk_flush_plug_context(ctx);
+ bio = bio_alloc(GFP_NOIO, 1);
+ ctx = NULL;
+ }

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
@@ -2926,7 +2933,8 @@ int submit_bh(int rw, struct buffer_head * bh)
bio->bi_private = bh;

bio_get(bio);
- submit_bio(rw, bio);
+
+ submit_bio_plug(rw, bio, ctx);

if (bio_flagged(bio, BIO_EOPNOTSUPP))
ret = -EOPNOTSUPP;
@@ -2935,6 +2943,16 @@ int submit_bh(int rw, struct buffer_head * bh)
return ret;
}

+int submit_bh(int rw, struct buffer_head *bh)
+{
+ return __submit_bh(rw, bh, NULL);
+}
+
+int submit_bh_plug(int rw, struct buffer_head *bh, struct blk_plug_ctx *ctx)
+{
+ return __submit_bh(rw, bh, ctx);
+}
+
/**
* ll_rw_block: low-level access to block devices (DEPRECATED)
* @rw: whether to %READ or %WRITE or %SWRITE or maybe %READA (readahead)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73bb8..a8eec18 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -183,6 +183,7 @@ void __lock_buffer(struct buffer_head *bh);
void ll_rw_block(int, int, struct buffer_head * bh[]);
int sync_dirty_buffer(struct buffer_head *bh);
int submit_bh(int, struct buffer_head *);
+int submit_bh_plug(int, struct buffer_head *, struct blk_plug_ctx *);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
int bh_uptodate_or_lock(struct buffer_head *bh);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bce40a2..8a0c4b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2117,7 +2117,9 @@ extern void file_move(struct file *f, struct list_head *list);
extern void file_kill(struct file *f);
#ifdef CONFIG_BLOCK
struct bio;
+struct blk_plug_ctx;
extern void submit_bio(int, struct bio *);
+extern void submit_bio_plug(int, struct bio *, struct blk_plug_ctx *);
extern int bdev_read_only(struct block_device *);
#endif
extern int set_blocksize(struct block_device *, int);
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..38c8a2c 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -3,6 +3,7 @@

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
+#include <linux/list.h>

/*
* This is the per-process anticipatory I/O scheduler state.
@@ -59,6 +60,11 @@ struct cfq_io_context {
struct rcu_head rcu_head;
};

+struct blk_plug_ctx {
+ struct bio *bio_list;
+ unsigned long state;
+};
+
/*
* I/O subsystem state of the associated processes. It is refcounted
* and kmalloc'ed. These could be shared between processes.
@@ -83,6 +89,8 @@ struct io_context {
struct radix_tree_root radix_root;
struct hlist_head cic_list;
void *ioc_data;
+
+ struct blk_plug_ctx plug_ctx;
};

static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -105,7 +113,17 @@ void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
+struct blk_plug_ctx *blk_get_plug_context(void);
+void blk_submit_plug_context(struct blk_plug_ctx *);
+void blk_flush_plug_context(struct blk_plug_ctx *);
#else
+static inline void blk_submit_plug_context(struct blk_plug_ctx *ctx)
+{
+}
+static inline struct blk_plug_ctx *blk_get_plug_context(void)
+{
+ return NULL;
+}
static inline void exit_io_context(void)
{
}
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9344547..8b5c14a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -6,6 +6,7 @@

#include <linux/sched.h>
#include <linux/fs.h>
+#include <linux/iocontext.h>

struct backing_dev_info;

@@ -40,6 +41,7 @@ enum writeback_sync_modes {
struct writeback_control {
struct backing_dev_info *bdi; /* If !NULL, only write back this
queue */
+ struct blk_plug_ctx *plug;
enum writeback_sync_modes sync_mode;
unsigned long *older_than_this; /* If !NULL, only write back inodes
older than this */
diff --git a/mm/filemap.c b/mm/filemap.c
index 2e2d38e..d521830 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -218,7 +218,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
if (!mapping_cap_writeback_dirty(mapping))
return 0;

+ wbc.plug = blk_get_plug_context();
ret = do_writepages(mapping, &wbc);
+ blk_submit_plug_context(wbc.plug);
return ret;
}


--
Jens Axboe


2009-04-07 14:19:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07, 2009 at 12:23:13AM -0700, Andrew Morton wrote:
>
> I think the number of different greps which was needed to find all the
> above was excessive. Too many levels of wrappers and helpers.

This is true not only for the block I/O code; it's also true of the
page writeback code --- have you ever tried creating a call tree that
traces all of the descendants of sync_inode()?

One of these days when I have a spare week (hah!), I'd love to create
a huge chart of all of the callers into the various wrapper functions
of the page writeback paths, figure out who needs what, and then
collapse the rest.

> > It looks like a good candidate for WRITE_SYNC_PLUG instead,

Yeah, what we'll need to do now that we have the difference between
WRITE_SYNC and WRITE_SYNC_PLUG is to have a way of signalling
(probably via yet another wbc flag) that we want WRITE_SYNC_PLUG and
not WRITE_SYNC, and that the top-level caller of this whole mess will
be responsible for issuing the unplug. I'll try to get something
whipped up.

- Ted


2009-04-07 19:09:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07, 2009 at 09:57:32AM +0200, Jens Axboe wrote:
> > > It looks like a good candidate for WRITE_SYNC_PLUG instead,
> >

So is this patch sane? (Compile-tested only, since I'm at a
conference at the moment). Am I using the proper abstraction to
unplug the block device? If not, it might be nice to document the
preferred approach for callers into the block layer.

(BTW, I was looking at Documentation/biodoc.txt, and I found some
clearly old documentation bits: "This is just the same as in 2.4 so
far, though per-device unplugging support is anticipated for 2.5." :-)

- Ted

[RFQ] Smart unplugging for page writeback

Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
use WRITE_SYNC_PLUG in block_write_full_page(), and then before we
wait for page writebacks to complete in jbd, jbd2, and filemap, call
blk_unplug() to make sure the writes are on the way to the disk.

diff --git a/fs/buffer.c b/fs/buffer.c
index 977e12a..95b5390 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
const unsigned blocksize = 1 << inode->i_blkbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
+ WRITE_SYNC_PLUG : WRITE);

BUG_ON(!PageLocked(page));

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a8e8513..3e6726f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/bio.h>
+#include <linux/blkdev.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -196,6 +197,7 @@ static int journal_submit_data_buffers(journal_t *journal,
int locked;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
+ struct block_device *fs_bdev = 0;
int err = 0;

/*
@@ -213,6 +215,7 @@ write_out_data:
while (commit_transaction->t_sync_datalist) {
jh = commit_transaction->t_sync_datalist;
bh = jh2bh(jh);
+ fs_bdev = bh->b_bdev;
locked = 0;

/* Get reference just to make sure buffer does not disappear
@@ -290,6 +293,8 @@ write_out_data:
}
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs, write_op);
+ if (fs_bdev)
+ blk_unplug(bdev_get_queue(fs_bdev));

return err;
}
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 282750c..b5448dd 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -26,6 +26,7 @@
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
+#include <linux/blkdev.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -176,6 +177,7 @@ static int journal_wait_on_commit_record(journal_t *journal,

retry:
clear_buffer_dirty(bh);
+ blk_unplug(bdev_get_queue(bh->b_bdev));
wait_on_buffer(bh);
if (buffer_eopnotsupp(bh) && (journal->j_flags & JBD2_BARRIER)) {
printk(KERN_WARNING
@@ -241,10 +243,12 @@ static int journal_submit_data_buffers(journal_t *journal,
struct jbd2_inode *jinode;
int err, ret = 0;
struct address_space *mapping;
+ struct block_device *fs_bdev = 0;

spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
mapping = jinode->i_vfs_inode->i_mapping;
+ fs_bdev = jinode->i_vfs_inode->i_sb->s_bdev;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
/*
@@ -262,6 +266,8 @@ static int journal_submit_data_buffers(journal_t *journal,
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
}
spin_unlock(&journal->j_list_lock);
+ if (fs_bdev)
+ blk_unplug(bdev_get_queue(fs_bdev));
return ret;
}

diff --git a/mm/filemap.c b/mm/filemap.c
index 2e2d38e..eff2ed9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -274,6 +274,10 @@ int wait_on_page_writeback_range(struct address_space *mapping,
if (end < start)
return 0;

+ if (mapping->host && mapping->host->i_sb &&
+ mapping->host->i_sb->s_bdev)
+ blk_unplug(bdev_get_queue(mapping->host->i_sb->s_bdev));
+
pagevec_init(&pvec, 0);
index = start;
while ((index <= end) &&

2009-04-07 19:32:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07 2009, Theodore Tso wrote:
> On Tue, Apr 07, 2009 at 09:57:32AM +0200, Jens Axboe wrote:
> > > > It looks like a good candidate for WRITE_SYNC_PLUG instead,
> > >
>
> So is this patch sane? (Compile-tested only, since I'm at a
> conference at the moment). Am I using the proper abstraction to
> unplug the block device? If not, it might be nice to document the
> preferred approach for callers into the block layer.
>
> (BTW, I was looking at Documentation/biodoc.txt, and I found some
> clearly old documentation bits: "This is just the same as in 2.4 so
> far, though per-device unplugging support is anticipated for 2.5." :-)

Woops, something to update :-)

> [RFQ] Smart unplugging for page writeback
>
> Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
> use WRITE_SYNC_PLUG in block_write_full_page(), and then before we
> wait for page writebacks to complete in jbd, jbd2, and filemap, call
> blk_unplug() to make sure the writes are on the way to the disk.
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 977e12a..95b5390 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> + int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
> + WRITE_SYNC_PLUG : WRITE);
>
> BUG_ON(!PageLocked(page));

This should be all you need; forget the below. The manual unplugging
only really makes sense for very select situations, since it'll be done
when someone waits on the buffer anyway. And even if nobody waits on a
page/bh in flight for that device, the block layer will unplug
implicitly once 3ms have passed or 4 requests are pending.

So at least with the current setup, it'll only really make a difference
if you end up waiting on the buffer 1-2ms after submission and you
haven't submitted enough IO to kick off the unplugging. If you wait
immediately, the time period between your blk_unplug() call and the one
that wait_on_bit() would do is so short that it will not make a
difference. And if you don't wait on the IO, you just want to let it hit
the disk eventually.
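A minimal sketch of that pattern, batching plugged sync writes and
letting the eventual wait do the unplug (nr and bhs[] are purely
illustrative):

	int i;

	/* submit the whole batch without unplugging between buffers */
	for (i = 0; i < nr; i++)
		submit_bh(WRITE_SYNC_PLUG, bhs[i]);

	/*
	 * No explicit blk_unplug() here: __wait_on_buffer() passes
	 * sync_buffer() to wait_on_bit(), which unplugs the right
	 * queue before sleeping.
	 */
	for (i = 0; i < nr; i++)
		wait_on_buffer(bhs[i]);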


> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index a8e8513..3e6726f 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -21,6 +21,7 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/bio.h>
> +#include <linux/blkdev.h>
>
> /*
> * Default IO end handler for temporary BJ_IO buffer_heads.
> @@ -196,6 +197,7 @@ static int journal_submit_data_buffers(journal_t *journal,
> int locked;
> int bufs = 0;
> struct buffer_head **wbuf = journal->j_wbuf;
> + struct block_device *fs_bdev = 0;
> int err = 0;
>
> /*
> @@ -213,6 +215,7 @@ write_out_data:
> while (commit_transaction->t_sync_datalist) {
> jh = commit_transaction->t_sync_datalist;
> bh = jh2bh(jh);
> + fs_bdev = bh->b_bdev;
> locked = 0;
>
> /* Get reference just to make sure buffer does not disappear
> @@ -290,6 +293,8 @@ write_out_data:
> }
> spin_unlock(&journal->j_list_lock);
> journal_do_submit_data(wbuf, bufs, write_op);
> + if (fs_bdev)
> + blk_unplug(bdev_get_queue(fs_bdev));
>
> return err;
> }
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 282750c..b5448dd 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -26,6 +26,7 @@
> #include <linux/writeback.h>
> #include <linux/backing-dev.h>
> #include <linux/bio.h>
> +#include <linux/blkdev.h>
>
> /*
> * Default IO end handler for temporary BJ_IO buffer_heads.
> @@ -176,6 +177,7 @@ static int journal_wait_on_commit_record(journal_t *journal,
>
> retry:
> clear_buffer_dirty(bh);
> + blk_unplug(bdev_get_queue(bh->b_bdev));
> wait_on_buffer(bh);

There's really not a lot of point to doing it this way, since
wait_on_buffer() will immediately do an unplug of the device anyway.

So it's quite on purpose that I didn't do a lot of tracking for this and
manual unplugging; we simply don't need it.

> if (buffer_eopnotsupp(bh) && (journal->j_flags & JBD2_BARRIER)) {
> printk(KERN_WARNING
> @@ -241,10 +243,12 @@ static int journal_submit_data_buffers(journal_t *journal,
> struct jbd2_inode *jinode;
> int err, ret = 0;
> struct address_space *mapping;
> + struct block_device *fs_bdev = 0;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> mapping = jinode->i_vfs_inode->i_mapping;
> + fs_bdev = jinode->i_vfs_inode->i_sb->s_bdev;
> jinode->i_flags |= JI_COMMIT_RUNNING;
> spin_unlock(&journal->j_list_lock);
> /*
> @@ -262,6 +266,8 @@ static int journal_submit_data_buffers(journal_t *journal,
> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
> }
> spin_unlock(&journal->j_list_lock);
> + if (fs_bdev)
> + blk_unplug(bdev_get_queue(fs_bdev));
> return ret;
> }
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2e2d38e..eff2ed9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -274,6 +274,10 @@ int wait_on_page_writeback_range(struct address_space *mapping,
> if (end < start)
> return 0;
>
> + if (mapping->host && mapping->host->i_sb &&
> + mapping->host->i_sb->s_bdev)
> + blk_unplug(bdev_get_queue(mapping->host->i_sb->s_bdev));
> +

Ugh, this one is nasty!

blk_run_address_space(mapping);

would be a lot cleaner.

--
Jens Axboe

2009-04-07 21:44:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07, 2009 at 09:32:39PM +0200, Jens Axboe wrote:
>
> This should be all you need, forget the below. The manual unplugging
> only really makes sense for very select situations, since it'll be done
> when someone waits on the buffer anyway.

Thanks, ok; I missed the "sync_buffer" in the call to wait_on_bit() in
__wait_on_buffer(). It also wasn't obvious to me that
wait_on_writeback_range() ultimately ends up calling aops->sync_page
which for filemap.c will call blk_run_backing_dev() --- and that
blk_run_*() results in the right device queue getting unplugged. So
this would be a nice thing to get documented in
Documentation/biodoc.txt.
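Sketched as a call chain (2.6.29-era names, read off mm/filemap.c and
fs/buffer.c; worth double-checking before it goes into biodoc.txt):

	/*
	 * wait_on_page_writeback_range()
	 *   -> wait_on_page_writeback(page)
	 *     -> wait_on_page_bit(page, PG_writeback)
	 *       -> __wait_on_bit(..., sync_page, ...)
	 *         -> sync_page()                       [mm/filemap.c]
	 *           -> mapping->a_ops->sync_page()     == block_sync_page()
	 *             -> blk_run_backing_dev()         -> queue gets unplugged
	 */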

- Ted

2009-04-07 22:19:43

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
the block device I/O queue between each page that gets flushed out.

The upstream callers of block_write_full_page() which wait for the
writes to finish call wait_on_buffer(), wait_on_writeback_range()
(which ultimately calls sync_page(), which calls
blk_run_backing_dev(), which will unplug the device queue), and so on.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---

We should get this applied to avoid any performance regressions
resulting from commit a64c8610.

fs/buffer.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 977e12a..95b5390 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
const unsigned blocksize = 1 << inode->i_blkbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
+ WRITE_SYNC_PLUG : WRITE);

BUG_ON(!PageLocked(page));

--
1.5.6.3


2009-04-07 23:13:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Tue, 7 Apr 2009 18:19:33 -0400
Theodore Tso <[email protected]> wrote:

> Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
> use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
> the block device I/O queue between each page that gets flushed out.
>
> The upstream callers of block_write_full_page() which wait for the
> writes to finish call wait_on_buffer(), wait_on_writeback_range()
> (which ultimately calls sync_page(), which calls
> blk_run_backing_dev(), which will unplug the device queue), and so on.
>

<sob>

>
> We should get this applied to avoid any performance regressions
> resulting from commit a64c8610.
>
> fs/buffer.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 977e12a..95b5390 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> + int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
> + WRITE_SYNC_PLUG : WRITE);
>
> BUG_ON(!PageLocked(page));

So how does WRITE_SYNC_PLUG differ from WRITE, and what effect does
this change have upon kernel behaviour?



2009-04-07 23:47:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Tue, Apr 07, 2009 at 04:09:44PM -0700, Andrew Morton wrote:
> >
> > The upstream callers of block_write_full_page() which wait for the
> > writes to finish call wait_on_buffer(), wait_on_writeback_range()
> > (which ultimately calls sync_page(), which calls
> > blk_run_backing_dev(), which will unplug the device queue), and so on.
>
> <sob>

No question, this stuff needs to be better documented; the codepaths
involved are scattered between files in the block/, fs/, and mm/
directories, and it's not documented at *all* what a filesystem
developer is supposed to do.

> > const unsigned blocksize = 1 << inode->i_blkbits;
> > int nr_underway = 0;
> > - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> > + int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
> > + WRITE_SYNC_PLUG : WRITE);
> >
> > BUG_ON(!PageLocked(page));
>
> So how does WRITE_SYNC_PLUG differ from WRITE, and what effect does
> this change have upon kernel behaviour?

The difference between WRITE_SYNC_PLUG and WRITE is that from the
perspective of the I/O scheduler, they are prioritized as
"synchronous" operations. Some I/O schedulers, such as AS and CFQ,
prioritize synchronous writes and put them in the same bucket as
synchronous reads, and above asynchronous writes.

Currently, when wbc->sync_mode is WB_SYNC_ALL we are using WRITE_SYNC,
which does an implicit unplug. Switching to WRITE_SYNC_PLUG removes the
implicit unplug, which was the issue you had expressed concern about.
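For reference, a sketch of how the two flags are composed in the
then-current include/linux/fs.h (quoted from memory, so worth
double-checking; the SWRITE_* variants are built the same way):

	#define WRITE_SYNC_PLUG	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
	#define WRITE_SYNC	(WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))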

- Ted

2009-04-08 05:58:55

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Tue, Apr 07 2009, Theodore Tso wrote:
> On Tue, Apr 07, 2009 at 09:32:39PM +0200, Jens Axboe wrote:
> >
> > This should be all you need, forget the below. The manual unplugging
> > only really makes sense for very select situations, since it'll be done
> > when someone waits on the buffer anyway.
>
> Thanks, ok; I missed the "sync_buffer" in the call to wait_on_bit() in
> __wait_on_buffer(). It also wasn't obvious to me that
> wait_on_writeback_range() ultimately ends up calling aops->sync_page
> which for filemap.c will call blk_run_backing_dev() --- and that
> blk_run_*() results in the right device queue getting unplugged. So
> this would be a nice thing to get documented in
> Documentation/biodoc.txt.

I'm not even sure what is documented about this sort of thing in biodoc,
but it's actually somewhat outside the realm of block layer
documentation. The block layer should document that you need to call
unplug if you are going to be waiting for that IO. How the vm code does
it is mostly an implementation detail; the important thing is that it
gets done :-)

--
Jens Axboe


2009-04-08 06:00:34

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Tue, Apr 07 2009, Theodore Tso wrote:
> Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
> use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
> the block device I/O queue between each page that gets flushed out.
>
> The upstream callers of block_write_full_page() which wait for the
> writes to finish call wait_on_buffer(), wait_on_writeback_range()
> (which ultimately calls sync_page(), which calls
> blk_run_backing_dev(), which will unplug the device queue), and so on.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
>
> We should get this applied to avoid any performance regressions
> resulting from commit a64c8610.
>
> fs/buffer.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 977e12a..95b5390 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> struct buffer_head *bh, *head;
> const unsigned blocksize = 1 << inode->i_blkbits;
> int nr_underway = 0;
> - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> + int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
> + WRITE_SYNC_PLUG : WRITE);
>
> BUG_ON(!PageLocked(page));

I think you should comment on why we don't need to do the actual unplug.
See what I added in fs/jbd/commit.c:journal_commit_transaction():

/*
* Use plugged writes here, since we want to submit several
* before we unplug the device. We don't do explicit
* unplugging in here, instead we rely on sync_buffer() doing
* the unplug for us.
*/

--
Jens Axboe


2009-04-08 08:08:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Tue, Apr 07 2009, Andrew Morton wrote:
> On Tue, 7 Apr 2009 18:19:33 -0400
> Theodore Tso <[email protected]> wrote:
>
> > Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
> > use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
> > the block device I/O queue between each page that gets flushed out.
> >
> > The upstream callers of block_write_full_page() which wait for the
> > writes to finish call wait_on_buffer(), wait_on_writeback_range()
> > (which ultimately calls sync_page(), which calls
> > blk_run_backing_dev(), which will unplug the device queue), and so on.
> >
>
> <sob>
>
> >
> > We should get this applied to avoid any performance regressions
> > resulting from commit a64c8610.
> >
> > fs/buffer.c | 3 ++-
> > 1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 977e12a..95b5390 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1646,7 +1646,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> > struct buffer_head *bh, *head;
> > const unsigned blocksize = 1 << inode->i_blkbits;
> > int nr_underway = 0;
> > - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> > + int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
> > + WRITE_SYNC_PLUG : WRITE);
> >
> > BUG_ON(!PageLocked(page));
>
> So how does WRITE_SYNC_PLUG differ from WRITE, and what effect does
> this change have upon kernel behaviour?

How about something like this? Comments welcome. Should we move this to
a dedicated header file? fs.h is amazingly cluttered as it is.

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 562d285..6b6597a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,57 @@ struct inodes_stat_t {
*/
#define FMODE_NOCMTIME ((__force fmode_t)2048)

+/*
+ * The below are the various read and write types that we support. Some of
+ * them include behavioral modifiers that send information down to the
+ * block layer and IO scheduler. Terminology:
+ *
+ * The block layer uses device plugging to defer IO a little bit, in
+ * the hope that we will see more IO very shortly. This increases
+ * coalescing of adjacent IO and thus reduces the number of IOs we
+ * have to send to the device. It also allows for better queuing,
+ * if the IO isn't mergeable. If the caller is going to be waiting
+ * for the IO, then he must ensure that the device is unplugged so
+ * that the IO is dispatched to the driver.
+ *
+ * All IO is handled async in Linux. This is fine for background
+ * writes, but for reads or writes that someone waits for completion
+ * on, we want to notify the block layer and IO scheduler so that they
+ * know about it. That allows them to make better scheduling
+ * decisions. So when the below references 'sync' and 'async', it
+ * is referencing this priority hint.
+ *
+ * With that in mind, the available types are:
+ *
+ * READ A normal read operation. Device will be plugged.
+ * READ_SYNC A synchronous read. Device is not plugged, caller can
+ * immediately wait on this read without caring about
+ * unplugging.
+ * READA Used for read-ahead operations. Lower priority, and the
+ * block layer could (in theory) choose to ignore this
+ * request if it runs into resource problems.
+ * WRITE A normal async write. Device will be plugged.
+ * SWRITE Like WRITE, but a special case for ll_rw_block() that
+ * tells it to lock the buffer first. Normally a buffer
+ * must be locked before doing IO.
+ * WRITE_SYNC_PLUG Synchronous write. Identical to WRITE, but passes down
+ * the hint that someone will be waiting on this IO
+ * shortly.
+ * WRITE_SYNC Like WRITE_SYNC_PLUG, but also unplugs the device
+ * immediately after submission. The write equivalent
+ * of READ_SYNC.
+ * WRITE_ODIRECT Special case write for O_DIRECT only.
+ * SWRITE_SYNC
+ * SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
+ * See SWRITE.
+ * WRITE_BARRIER Like WRITE, but tells the block layer that all
+ * previously submitted writes must be safely on storage
+ * before this one is started. Also guarantees that when
+ * this write is complete, it itself is also safely on
+ * storage. Prevents reordering of writes on both sides
+ * of this IO.
+ *
+ */
#define RW_MASK 1
#define RWA_MASK 2
#define READ 0
@@ -102,6 +153,11 @@ struct inodes_stat_t {
(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
#define SWRITE_SYNC (SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
#define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
+
+/*
+ * These aren't really reads or writes; they pass down information about
+ * parts of the device that are now unused by the file system.
+ */
#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))


--
Jens Axboe


2009-04-08 15:25:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

On Wed, Apr 08, 2009 at 07:58:54AM +0200, Jens Axboe wrote:
>
> I'm not even sure what is documented about this sort of thing in biodoc,
> but it's actually somewhat outside the realm of block layer
> documentation. The block layer should document that you need to call
> unplug if you are going to be waiting for that IO. How the vm code does
> it is implementation detail mostly, the important thing is that it gets
> done :-)

Fair enough; I agree a lot of the problem is that the page writeback
subsystem isn't well documented at all, and that it is crying out for a
review, with possible refactoring and cleanup....

- Ted

2009-04-08 15:26:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Wed, Apr 08, 2009 at 08:00:33AM +0200, Jens Axboe wrote:
>
> I think you should comment on why we don't need to do the actual unplug.
> See what I added in fs/jbd/commit.c:journal_commit_transaction():
>
> /*
> * Use plugged writes here, since we want to submit several
> * before we unplug the device. We don't do explicit
> * unplugging in here, instead we rely on sync_buffer() doing
> * the unplug for us.
> */

OK, agreed. I'll add a comment explaining what is going on in the
patch; better there than in the commit log.

- Ted

2009-04-08 22:37:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Wed, 8 Apr 2009 10:08:44 +0200 Jens Axboe <[email protected]> wrote:

> > So how does WRITE_SYNC_PLUG differ from WRITE, and what effect does
> > this change have upon kernel behaviour?
>
> How about something like this. Comments welcome.

It's lovely.

> Should we move this to
> a dedicated header file? fs.h is amazingly cluttered as it is.

Sometime, perhaps.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 562d285..6b6597a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -87,6 +87,57 @@ struct inodes_stat_t {
> */
> #define FMODE_NOCMTIME ((__force fmode_t)2048)
>
> +/*
> + * The below are the various read and write types that we support. Some of
> + * them include behavioral modifiers that send information down to the
> + * block layer and IO scheduler. Terminology:
> + *
> + * The block layer uses device plugging to defer IO a little bit, in
> + * the hope that we will see more IO very shortly. This increases
> + * coalescing of adjacent IO and thus reduces the number of IOs we
> + * have to send to the device. It also allows for better queuing,
> + * if the IO isn't mergeable. If the caller is going to be waiting
> + * for the IO, then he must ensure that the device is unplugged so
> + * that the IO is dispatched to the driver.
> + *
> + * All IO is handled async in Linux. This is fine for background
> + * writes, but for reads or writes that someone waits for completion
> + * on, we want to notify the block layer and IO scheduler so that they
> + * know about it. That allows them to make better scheduling
> + * decisions. So when the below references 'sync' and 'async', it
> + * is referencing this priority hint.
> + *
> + * With that in mind, the available types are:
> + *
> + * READ A normal read operation. Device will be plugged.
> + * READ_SYNC A synchronous read. Device is not plugged, caller can
> + * immediately wait on this read without caring about
> + * unplugging.
> + * READA Used for read-ahead operations. Lower priority, and the
> + * block layer could (in theory) choose to ignore this
> + * request if it runs into resource problems.
> + * WRITE A normal async write. Device will be plugged.
> + * SWRITE Like WRITE, but a special case for ll_rw_block() that
> + * tells it to lock the buffer first. Normally a buffer
> + * must be locked before doing IO.
> + * WRITE_SYNC_PLUG Synchronous write. Identical to WRITE, but passes down
> + * the hint that someone will be waiting on this IO
> + * shortly.

From the text, I'd expect WRITE_SYNC_PLUG to, err, unplug!

> + * WRITE_SYNC Like WRITE_SYNC_PLUG, but also unplugs the device
> + * immediately after submission. The write equivalent
> + * of READ_SYNC.

But this contradicts my expectation.

So what does WRITE_SYNC_PLUG really do differently from WRITE?

> + * WRITE_ODIRECT Special case write for O_DIRECT only.
> + * SWRITE_SYNC
> + * SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
> + * See SWRITE.
> + * WRITE_BARRIER Like WRITE, but tells the block layer that all
> + * previously submitted writes must be safely on storage
> + * before this one is started. Also guarantees that when
> + * this write is complete, it itself is also safely on
> + * storage. Prevents reordering of writes on both sides
> + * of this IO.
> + *
> + */
> #define RW_MASK 1
> #define RWA_MASK 2
> #define READ 0
> @@ -102,6 +153,11 @@ struct inodes_stat_t {
> (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
> #define SWRITE_SYNC (SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
> #define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
> +
> +/*
> + * These aren't really reads or writes; they pass down information about
> + * parts of the device that are now unused by the file system.
> + */
> #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
> #define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))


2009-04-09 17:59:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

On Wed, Apr 08 2009, Andrew Morton wrote:
> On Wed, 8 Apr 2009 10:08:44 +0200 Jens Axboe <[email protected]> wrote:
>
> > > So how does WRITE_SYNC_PLUG differ from WRITE, and what effect does
> > > this change have upon kernel behaviour?
> >
> > How about something like this. Comments welcome.
>
> It's lovely.
>
> > Should we move this to
> > a dedicated header file? fs.h is amazingly cluttered as it is.
>
> Sometime, perhaps.
>
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 562d285..6b6597a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -87,6 +87,57 @@ struct inodes_stat_t {
> > */
> > #define FMODE_NOCMTIME ((__force fmode_t)2048)
> >
> > +/*
> > + * The below are the various read and write types that we support. Some of
> > + * them include behavioral modifiers that send information down to the
> > + * block layer and IO scheduler. Terminology:
> > + *
> > + * The block layer uses device plugging to defer IO a little bit, in
> > + * the hope that we will see more IO very shortly. This increases
> > + * coalescing of adjacent IO and thus reduces the number of IOs we
> > + * have to send to the device. It also allows for better queuing,
> > + * if the IO isn't mergeable. If the caller is going to be waiting
> > + * for the IO, then he must ensure that the device is unplugged so
> > + * that the IO is dispatched to the driver.
> > + *
> > + * All IO is handled async in Linux. This is fine for background
> > + * writes, but for reads or writes that someone waits for completion
> > + * on, we want to notify the block layer and IO scheduler so that they
> > + * know about it. That allows them to make better scheduling
> > + * decisions. So when the below references 'sync' and 'async', it
> > + * is referencing this priority hint.
> > + *
> > + * With that in mind, the available types are:
> > + *
> > + * READ A normal read operation. Device will be plugged.
> > + * READ_SYNC A synchronous read. Device is not plugged, caller can
> > + * immediately wait on this read without caring about
> > + * unplugging.
> > + * READA Used for read-ahead operations. Lower priority, and the
> > + * block layer could (in theory) choose to ignore this
> > + * request if it runs into resource problems.
> > + * WRITE A normal async write. Device will be plugged.
> > + * SWRITE Like WRITE, but a special case for ll_rw_block() that
> > + * tells it to lock the buffer first. Normally a buffer
> > + * must be locked before doing IO.
> > + * WRITE_SYNC_PLUG Synchronous write. Identical to WRITE, but passes down
> > + * the hint that someone will be waiting on this IO
> > + * shortly.
>
> From the text, I'd expect WRITE_SYNC_PLUG to, err, unplug!

You are still mixing up the sync hint and unplugging. I'll expand it a
bit, I guess.

> > + * WRITE_SYNC Like WRITE_SYNC_PLUG, but also unplugs the device
> > + * immediately after submission. The write equivalent
> > + * of READ_SYNC.
>
> But this contradicts my expectation.
>
> So what does WRITE_SYNC_PLUG really do differently from WRITE?

It tells the IO scheduler that this write is really sync, not async.
The key difference between the two is as written - the sync one will
have someone waiting for its completion. What the IO scheduler does with
the flag is its own business. In reality it means that the write is
expedited, whereas async writes are done somewhat more lazily.

> > + * WRITE_ODIRECT Special case write for O_DIRECT only.
> > + * SWRITE_SYNC
> > + * SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
> > + * See SWRITE.
> > + * WRITE_BARRIER Like WRITE, but tells the block layer that all
> > + * previously submitted writes must be safely on storage
> > + * before this one is started. Also guarantees that when
> > + * this write is complete, it itself is also safely on
> > + * storage. Prevents reordering of writes on both sides
> > + * of this IO.
> > + *
> > + */
> > #define RW_MASK 1
> > #define RWA_MASK 2
> > #define READ 0
> > @@ -102,6 +153,11 @@ struct inodes_stat_t {
> > (SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
> > #define SWRITE_SYNC (SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
> > #define WRITE_BARRIER (WRITE | (1 << BIO_RW_BARRIER))
> > +
> > +/*
> > + * These aren't really reads or writes; they pass down information about
> > + * parts of the device that are now unused by the file system.
> > + */
> > #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
> > #define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
>

--
Jens Axboe