2006-03-09 18:37:58

by Badari Pulavarty

[permalink] [raw]
Subject: [RFC PATCH] ext3 writepage() journal avoidance

Hi,

I am trying to speed up ext3 writepage() by avoiding
journaling in non-block allocation cases. Does this
look reasonable ? So far, my testing is fine. What am
I missing here ?

ext3_ordered_writepage() and ext3_writeback_writepage() starts
a transcation all the time. We need to start a transaction only
if we need to allocate a disk block. For normal writes, disk
block allocation happens in prepare_write(). We can find out
if the disk-block is already mapped, looking at buffers attached
to the page. So, for non-block allocation cases (for non-mapped
writes), we don't need to do all the journal stuff in writepage().

Thanks,
Badari

ext3_ordered_writepage() and ext3_writeback_writepage() starts
a transcation all the time. We need to start a transaction only
if we need to allocate a disk block. For normal writes, disk
block allocation happens in prepare_write(). We can find out
if the disk-block is already mapped, looking at buffers attached
to the page. So, for non-block allocation cases (for non-mapped
writes), we don't need to do all the journal stuff in writepage().

Signed-off-by: Badari Pulavarty <[email protected]>

Index: linux-2.6.16-rc5/fs/ext3/inode.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/ext3/inode.c 2006-03-09 10:19:41.000000000
-0800
+++ linux-2.6.16-rc5/fs/ext3/inode.c 2006-03-09 10:31:12.000000000 -0800
@@ -1201,6 +1201,11 @@ static int bput_one(handle_t *handle, st
return 0;
}

+static int check_bmap(handle_t *handle, struct buffer_head *bh)
+{
+ return !buffer_mapped(bh);
+}
+
static int journal_dirty_data_fn(handle_t *handle, struct buffer_head
*bh)
{
if (buffer_mapped(bh))
@@ -1268,6 +1273,7 @@ static int ext3_ordered_writepage(struct
handle_t *handle = NULL;
int ret = 0;
int err;
+ int need_trans = 1;

J_ASSERT(PageLocked(page));

@@ -1278,6 +1284,27 @@ static int ext3_ordered_writepage(struct
if (ext3_journal_current_handle())
goto out_fail;

+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ } else {
+ /*
+ * Check to see if buffers are mapped to disk blocks.
+ * If disk blocks are already there, no reason for
+ * starting a transaction.
+ */
+ page_bufs = page_buffers(page);
+ need_trans = walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, check_bmap);
+ }
+
+ if (need_trans == 0) {
+ /* No need to allocate disk blocks - just do IO */
+ ret = block_write_full_page(page, ext3_get_block, wbc);
+ goto out;
+ }
+
+ /* We may need to allocate blocks - start a transaction etc. */
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks
(inode));

if (IS_ERR(handle)) {
@@ -1285,10 +1312,6 @@ static int ext3_ordered_writepage(struct
goto out_fail;
}

- if (!page_has_buffers(page)) {
- create_empty_buffers(page, inode->i_sb->s_blocksize,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
- }
page_bufs = page_buffers(page);
walk_page_buffers(handle, page_bufs, 0,
PAGE_CACHE_SIZE, NULL, bget_one);
@@ -1318,6 +1341,7 @@ static int ext3_ordered_writepage(struct
err = ext3_journal_stop(handle);
if (!ret)
ret = err;
+out:
return ret;

out_fail:
@@ -1333,14 +1357,26 @@ static int ext3_writeback_writepage(stru
handle_t *handle = NULL;
int ret = 0;
int err;
+ int need_trans = 1;

if (ext3_journal_current_handle())
goto out_fail;

- handle = ext3_journal_start(inode, ext3_writepage_trans_blocks
(inode));
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_fail;
+ /*
+ * Check to see if the disk blocking is already available.
+ * If so, there is no need to start a transaction.
+ */
+ if (page_has_buffers(page))
+ need_trans = walk_page_buffers(handle, page_buffers(page), 0,
+ PAGE_CACHE_SIZE, NULL, check_bmap);
+
+ if (need_trans) {
+ handle = ext3_journal_start(inode,
+ ext3_writepage_trans_blocks(inode));
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_fail;
+ }
}

if (test_opt(inode->i_sb, NOBH))
@@ -1348,9 +1384,11 @@ static int ext3_writeback_writepage(stru
else
ret = block_write_full_page(page, ext3_get_block, wbc);

- err = ext3_journal_stop(handle);
- if (!ret)
- ret = err;
+ if (need_trans) {
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+ }
return ret;

out_fail:



2006-03-09 23:20:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Badari Pulavarty <[email protected]> wrote:
>
> I am trying to speed up ext3 writepage() by avoiding
> journaling in non-block allocation cases. Does this
> look reasonable ? So far, my testing is fine. What am
> I missing here ?

Nothing. ext3's writepage(), prepare_write() and commit_write() do often
needlessy open and close transactions when we're doing overwrites. It's
something I've meant to look at for a few years, on and off.

I'd expect that prepare_write() and commit_write() are more important than
writepage().

It might be better to test PageMappedToDisk() rather than walking the
buffers. It's certainly faster and it makes optimisation of
prepare_write() and commit_write() easier to handle.

I'm not sure that PageMappedToDisk() gets set in all the right places
though - it's mainly for the `nobh' handling and block_prepare_write()
would need to be taught to set it. I guess that'd be a net win, even if
only ext3 uses it..

Then again, we might be able to speed up block_prepare_write() if
PageMappedToDisk(page).

If we go this way we need to be very very careful to keep PG_mappedtodisk
coherent with the state of the buffers. Tricky. We need to think about
whether block_truncate_page() should be clearing PG_mappedtoisk if we did a
partial truncate.

Don't forget that ext3 supports journalled-mode files on ordered- or
writeback-mounted filesystems, via `chattr +j'. Please be sure to test the
various combinations which that allows when playing with the write paths -
it can trip things up.

Also be sure to test nobh-mode.

2006-03-10 00:46:34

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Andrew Morton wrote:

>Badari Pulavarty <[email protected]> wrote:
>
>>I am trying to speed up ext3 writepage() by avoiding
>>journaling in non-block allocation cases. Does this
>>look reasonable ? So far, my testing is fine. What am
>>I missing here ?
>>
>
>Nothing. ext3's writepage(), prepare_write() and commit_write() do often
>needlessy open and close transactions when we're doing overwrites. It's
>something I've meant to look at for a few years, on and off.
>
>I'd expect that prepare_write() and commit_write() are more important than
>writepage().
>
>
>
>It might be better to test PageMappedToDisk() rather than walking the
>buffers. It's certainly faster and it makes optimisation of
>prepare_write() and commit_write() easier to handle.
>
>I'm not sure that PageMappedToDisk() gets set in all the right places
>though - it's mainly for the `nobh' handling and block_prepare_write()
>would need to be taught to set it. I guess that'd be a net win, even if
>only ext3 uses it..
>
>Then again, we might be able to speed up block_prepare_write() if
>PageMappedToDisk(page).
>
Makes sense. I will take a look.

>
>If we go this way we need to be very very careful to keep PG_mappedtodisk
>coherent with the state of the buffers. Tricky. We need to think about
>whether block_truncate_page() should be clearing PG_mappedtoisk if we did a
>partial truncate.
>
>Don't forget that ext3 supports journalled-mode files on ordered- or
>writeback-mounted filesystems, via `chattr +j'.
>

Wow !! Never knew that. I assume we switch mapping->a_ops for this inode ?

>Please be sure to test the
>various combinations which that allows when playing with the write paths -
>it can trip things up.
>
>Also be sure to test nobh-mode.
>

Sure. Thanks for your reply and valuable suggestions. :)

Thanks,
Badari

>



2006-03-10 01:14:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Badari Pulavarty <[email protected]> wrote:
>
> >Don't forget that ext3 supports journalled-mode files on ordered- or
> >writeback-mounted filesystems, via `chattr +j'.
> >
>
> Wow !! Never knew that. I assume we switch mapping->a_ops for this inode ?

Yes, that's all tucked away in ext3_should_journal_data().

2006-03-10 07:59:30

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance


> I'm not sure that PageMappedToDisk() gets set in all the right places
> though - it's mainly for the `nobh' handling and block_prepare_write()
> would need to be taught to set it. I guess that'd be a net win, even if
> only ext3 uses it..

btw is nobh mature enough yet to become the default, or to just go away
entirely as option ?

2006-03-10 08:26:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Arjan van de Ven <[email protected]> wrote:
>
>
> > I'm not sure that PageMappedToDisk() gets set in all the right places
> > though - it's mainly for the `nobh' handling and block_prepare_write()
> > would need to be taught to set it. I guess that'd be a net win, even if
> > only ext3 uses it..
>
> btw is nobh mature enough yet to become the default, or to just go away
> entirely as option ?

I don't know how much usage it's had, sorry. It's only allowed in
data=writeback mode and not many people seem to use even that.

2006-03-10 08:44:11

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> Arjan van de Ven <[email protected]> wrote:
> >
> >
> > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > would need to be taught to set it. I guess that'd be a net win, even if
> > > only ext3 uses it..
> >
> > btw is nobh mature enough yet to become the default, or to just go away
> > entirely as option ?
>
> I don't know how much usage it's had, sorry. It's only allowed in
> data=writeback mode and not many people seem to use even that.

would you be prepared to turn it on by default in -mm for a bit to see
how it holds up? The concept seems valuable in itself, so much so that I
feel this should be 1) on always by default when possible and 2) isn't
really the kind of thing that should be a long term option; not having
it almost is a -o pleaseAddThisBug option for each bug fixed.

2006-03-10 08:55:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Arjan van de Ven <[email protected]> wrote:
>
> On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> > Arjan van de Ven <[email protected]> wrote:
> > >
> > >
> > > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > > would need to be taught to set it. I guess that'd be a net win, even if
> > > > only ext3 uses it..
> > >
> > > btw is nobh mature enough yet to become the default, or to just go away
> > > entirely as option ?
> >
> > I don't know how much usage it's had, sorry. It's only allowed in
> > data=writeback mode and not many people seem to use even that.
>
> would you be prepared to turn it on by default in -mm for a bit to see
> how it holds up?

spose so. One would have to test it a bit first, make sure that it still
works. Performance testing with PAGE_SIZE much-greater-than blocksize
would be needed.

Unfortunately there's no `-o bh' (nonobh?) to turn it back on again if it
causes problems..

2006-03-10 13:43:42

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> Arjan van de Ven <[email protected]> wrote:
> >
> >
> > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > would need to be taught to set it. I guess that'd be a net win, even if
> > > only ext3 uses it..
> >
> > btw is nobh mature enough yet to become the default, or to just go away
> > entirely as option ?
>
> I don't know how much usage it's had, sorry. It's only allowed in
> data=writeback mode and not many people seem to use even that.

For what it's worth, jfs has been using the nobh_* paths almost since
they where introduced (exclusively, it's not an option), so at least the
vfs pieces have been exercised to some extent.
--
David Kleikamp
IBM Linux Technology Center

2006-03-10 14:59:05

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

Andrew Morton wrote:

>Arjan van de Ven <[email protected]> wrote:
>
>>On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
>> > Arjan van de Ven <[email protected]> wrote:
>> > >
>> > >
>> > > > I'm not sure that PageMappedToDisk() gets set in all the right places
>> > > > though - it's mainly for the `nobh' handling and block_prepare_write()
>> > > > would need to be taught to set it. I guess that'd be a net win, even if
>> > > > only ext3 uses it..
>> > >
>> > > btw is nobh mature enough yet to become the default, or to just go away
>> > > entirely as option ?
>> >
>> > I don't know how much usage it's had, sorry. It's only allowed in
>> > data=writeback mode and not many people seem to use even that.
>>
>> would you be prepared to turn it on by default in -mm for a bit to see
>> how it holds up?
>>
>
>spose so. One would have to test it a bit first, make sure that it still
>works. Performance testing with PAGE_SIZE much-greater-than blocksize
>would be needed.
>
I did nobh option only for writeback mode + only if PAGE_SIZE ==
blocksize case :(
I guess I could enhance it for PAGE_SIZE > blocksize case also.

Doing it for ordered mode, journal mode is hard - due to transactions &
ordering.
As you suggested while ago, we need a new mode. I hate to add new modes
since
no one will be using it (unless we decide to make it default). Thats the
reason why
I spent little time doing nobh option for writeback mode.

>
>Unfortunately there's no `-o bh' (nonobh?) to turn it back on again if it
>causes problems..
>

Can be added easily. I will send out a patch for this.

Thanks,
Badari

>



2006-03-10 16:20:43

by Dave Jones

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, Mar 10, 2006 at 09:43:57AM +0100, Arjan van de Ven wrote:
> On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> > Arjan van de Ven <[email protected]> wrote:
> > >
> > >
> > > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > > would need to be taught to set it. I guess that'd be a net win, even if
> > > > only ext3 uses it..
> > >
> > > btw is nobh mature enough yet to become the default, or to just go away
> > > entirely as option ?
> >
> > I don't know how much usage it's had, sorry. It's only allowed in
> > data=writeback mode and not many people seem to use even that.
>
> would you be prepared to turn it on by default in -mm for a bit to see
> how it holds up? The concept seems valuable in itself, so much so that I
> feel this should be 1) on always by default when possible and 2) isn't
> really the kind of thing that should be a long term option; not having
> it almost is a -o pleaseAddThisBug option for each bug fixed.

It'd be good to get that hammered on, as it doesn't see hardly any testing
based upon the experiments I did sometime last year. It left me with
an unmountable root filesystem :-/

Dave

--
http://www.codemonkey.org.uk

2006-03-10 16:39:15

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, 2006-03-10 at 11:19 -0500, Dave Jones wrote:
> On Fri, Mar 10, 2006 at 09:43:57AM +0100, Arjan van de Ven wrote:
> > On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> > > Arjan van de Ven <[email protected]> wrote:
> > > >
> > > >
> > > > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > > > would need to be taught to set it. I guess that'd be a net win, even if
> > > > > only ext3 uses it..
> > > >
> > > > btw is nobh mature enough yet to become the default, or to just go away
> > > > entirely as option ?
> > >
> > > I don't know how much usage it's had, sorry. It's only allowed in
> > > data=writeback mode and not many people seem to use even that.
> >
> > would you be prepared to turn it on by default in -mm for a bit to see
> > how it holds up? The concept seems valuable in itself, so much so that I
> > feel this should be 1) on always by default when possible and 2) isn't
> > really the kind of thing that should be a long term option; not having
> > it almost is a -o pleaseAddThisBug option for each bug fixed.
>
> It'd be good to get that hammered on, as it doesn't see hardly any testing
> based upon the experiments I did sometime last year. It left me with
> an unmountable root filesystem :-/

Yuck. You are talking about "nobh" option for writeback mode, correct ?
Have any idea on what you were doing ?

Wondering why ? The changes are extremely simple (since I don't handle
pagesize > blocksize + limit truncate handling and fallback to bh code).

Thanks,
Badari

2006-03-10 16:44:27

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, 2006-03-10 at 11:19 -0500, Dave Jones wrote:
> On Fri, Mar 10, 2006 at 09:43:57AM +0100, Arjan van de Ven wrote:
> > On Fri, 2006-03-10 at 00:23 -0800, Andrew Morton wrote:
> > > Arjan van de Ven <[email protected]> wrote:
> > > >
> > > >
> > > > > I'm not sure that PageMappedToDisk() gets set in all the right places
> > > > > though - it's mainly for the `nobh' handling and block_prepare_write()
> > > > > would need to be taught to set it. I guess that'd be a net win, even if
> > > > > only ext3 uses it..
> > > >
> > > > btw is nobh mature enough yet to become the default, or to just go away
> > > > entirely as option ?
> > >
> > > I don't know how much usage it's had, sorry. It's only allowed in
> > > data=writeback mode and not many people seem to use even that.
> >
> > would you be prepared to turn it on by default in -mm for a bit to see
> > how it holds up? The concept seems valuable in itself, so much so that I
> > feel this should be 1) on always by default when possible and 2) isn't
> > really the kind of thing that should be a long term option; not having
> > it almost is a -o pleaseAddThisBug option for each bug fixed.
>
> It'd be good to get that hammered on, as it doesn't see hardly any testing
> based upon the experiments I did sometime last year. It left me with
> an unmountable root filesystem :-/

BTW, were you doing "chattr +j" on any of the files ? (writeback, nobh
mounted filesystems ?). If so, I can see my screw ups :(

Only yesterday, I came to know about this fancy thing. I need to take
a closer look. I can't really do NOBH for this case (which currently it
does).

Thanks,
Badari

2006-03-10 16:52:29

by Dave Jones

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, Mar 10, 2006 at 08:40:47AM -0800, Badari Pulavarty wrote:

> > > > I don't know how much usage it's had, sorry. It's only allowed in
> > > > data=writeback mode and not many people seem to use even that.
> > >
> > > would you be prepared to turn it on by default in -mm for a bit to see
> > > how it holds up? The concept seems valuable in itself, so much so that I
> > > feel this should be 1) on always by default when possible and 2) isn't
> > > really the kind of thing that should be a long term option; not having
> > > it almost is a -o pleaseAddThisBug option for each bug fixed.
> >
> > It'd be good to get that hammered on, as it doesn't see hardly any testing
> > based upon the experiments I did sometime last year. It left me with
> > an unmountable root filesystem :-/
>
> Yuck. You are talking about "nobh" option for writeback mode, correct ?
> Have any idea on what you were doing ?

Actually, I think I may have neglected to make those mounts writeback.
In retrospect, it was silly, I basically forced nobh on for all mounts.
A few boots later my / reached its maximum mount count, and got a fsck,
which moved a bunch of useful things like /lib/ld-linux.so.2 to lost+found.
There was so much mess that it was easier to reinstall the box than
to pick through it. (Thankfully I tested it on a scratch box ;-)

Dave
--
http://www.codemonkey.org.uk

2006-03-10 16:59:29

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC PATCH] ext3 writepage() journal avoidance

On Fri, 2006-03-10 at 11:51 -0500, Dave Jones wrote:
> On Fri, Mar 10, 2006 at 08:40:47AM -0800, Badari Pulavarty wrote:
>
> > > > > I don't know how much usage it's had, sorry. It's only allowed in
> > > > > data=writeback mode and not many people seem to use even that.
> > > >
> > > > would you be prepared to turn it on by default in -mm for a bit to see
> > > > how it holds up? The concept seems valuable in itself, so much so that I
> > > > feel this should be 1) on always by default when possible and 2) isn't
> > > > really the kind of thing that should be a long term option; not having
> > > > it almost is a -o pleaseAddThisBug option for each bug fixed.
> > >
> > > It'd be good to get that hammered on, as it doesn't see hardly any testing
> > > based upon the experiments I did sometime last year. It left me with
> > > an unmountable root filesystem :-/
> >
> > Yuck. You are talking about "nobh" option for writeback mode, correct ?
> > Have any idea on what you were doing ?
>
> Actually, I think I may have neglected to make those mounts writeback.
> In retrospect, it was silly, I basically forced nobh on for all mounts.
> A few boots later my / reached its maximum mount count, and got a fsck,
> which moved a bunch of useful things like /lib/ld-linux.so.2 to lost+found.
> There was so much mess that it was easier to reinstall the box than
> to pick through it. (Thankfully I tested it on a scratch box ;-)

Well, This makes me feel better. I am not going to take full
responsibility for this. :)

You can't force NOBH on all mounts. It gets silently ignored (there is
a message in dmesg) on anything other than "writeback" mode and pagesize
== blocksize.

Something else is going wrong here.

Thanks,
Badari